Message ID: Z-JE2HNY-Tj8qwQw@gondor.apana.org.au (mailing list archive)
State: Not Applicable
Delegated to: Herbert Xu
Series: [GIT,PULL] Crypto Update for 6.15
On Tue, Mar 25, 2025 at 01:53:28PM +0800, Herbert Xu wrote:
>
> crypto: hash - Add request chaining API

Herbert didn't mention that I have nacked this patch, which he is insisting on pushing for some reason instead of my original version, which is much better.

Let me reiterate why "request chaining" is a bad idea and is going to cause problems.

It makes a single hash request able to actually be a list of hash requests. Some of the crypto code operates on the whole list; however, most code still operates only on the first request in the list. It's undocumented and inconsistent which code does which, which is going to cause bugs. The first request in the list is also treated specially in undocumented ways, so submitting a list of requests is not necessarily equivalent to submitting them all individually. Another recipe for bugs.

Each hash request can also contain an entire scatterlist. That's overkill for what is actually needed for multibuffer hashing, which is a simple API that hashes two buffers specified by virtual address. Herbert's API creates lots of unnecessary edge cases, most of which lack any testing. It continues many of the worst practices of the crypto API that we *know* are not working, like requiring per-request memory allocations and optimizing for legacy hardware offload rather than the CPU-based crypto that almost everyone actually uses.

In contrast, my patchset https://lore.kernel.org/r/20250212154718.44255-1-ebiggers@kernel.org/ supports multibuffer hashing in a much better way and has been ready for a year already. It actually works; it has a smaller diffstat; it is faster; it has a much simpler API; and it actually includes all needed pieces, including x86 and arm64 support, dm-verity and fs-verity support, and full documentation and tests.

I've been spending a lot of time fixing the kernel's crypto code over the years. I'm not looking forward to having another set of major issues to fix. And this latest set of issues will be totally unnecessary.

We can do better than this, especially for cryptography code.

Nacked-by: Eric Biggers <ebiggers@kernel.org>

- Eric
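The "simple API that hashes two buffers specified by virtual address" that Eric describes can be sketched as a self-contained userspace model. The hash here is a toy (FNV-1a) and the function name and signature are illustrative only, not the actual kernel API; the point is the API shape: two plain pointers in, two digests out, no request objects and no scatterlists.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Toy stand-in for a real hash (FNV-1a); one 64-bit "digest" per message. */
static uint64_t toy_hash(const uint8_t *data, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ull;
	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 0x100000001b3ull;
	}
	return h;
}

/*
 * Hypothetical two-buffer "finup" interface in the shape described above:
 * two equal-length buffers given by virtual address, two digests out.
 * A real SHA-256 implementation would interleave the two message streams
 * at the instruction level for close to 2x throughput; here the loop just
 * processes them in lockstep to show the interface, not the speedup.
 */
static void toy_hash_finup_2x(const uint8_t *data1, const uint8_t *data2,
			      size_t len, uint64_t *out1, uint64_t *out2)
{
	uint64_t h1 = 0xcbf29ce484222325ull, h2 = h1;

	for (size_t i = 0; i < len; i++) {
		h1 ^= data1[i]; h1 *= 0x100000001b3ull;
		h2 ^= data2[i]; h2 *= 0x100000001b3ull;
	}
	*out1 = h1;
	*out2 = h2;
}
```

A caller such as a verity implementation would hash two data blocks per call; the contract is that the result equals hashing each buffer individually, with no per-request allocation anywhere.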
On Tue, 25 Mar 2025 at 16:25, Eric Biggers <ebiggers@kernel.org> wrote:
>
> Herbert didn't mention that I have nacked this patch, which he is insisting on
> pushing for some reason instead of my original version that is much better.
>
> [...]
>
> Nacked-by: Eric Biggers <ebiggers@kernel.org>

It's sad that it has come to this, but I have to second Eric here: for CPU-based crypto, the flexibility of Herbert's approach has no added value. SHA CPU instructions can be interleaved at the instruction level to get almost a 2x speedup in some cases, and this works very well when operating on equal-sized inputs. However, generalizing this to arbitrary request chains to accommodate async h/w offload introduces a lot of complexity for use cases that are only imaginary.

Given Eric's track record as a contributor to the crypto subsystem and as a maintainer of subsystems that are closely tied to it, I would expect Herbert to take his opinion more seriously, but it is just being ignored. Instead, a lightly tested alternative with no integration into existing users has been merged in its place, with very little input from the community.

So Herbert, please withdraw this pull request, and work with Eric and the rest of us to converge on something that we can all get behind.
On Tue, Mar 25, 2025 at 08:25:41AM -0700, Eric Biggers wrote:
>
> Herbert didn't mention that I have nacked this patch, which he is insisting on
> pushing for some reason instead of my original version that is much better.

Let's see how your version is so much better:

https://lore.kernel.org/all/20250212154718.44255-6-ebiggers@kernel.org/

	-	/* Up to 1 + FS_VERITY_MAX_LEVELS pages may be mapped at once */
	-	BUILD_BUG_ON(1 + FS_VERITY_MAX_LEVELS > KM_MAX_IDX);
	+	/*
	+	 * Up to FS_VERITY_MAX_PENDING_DATA_BLOCKS + FS_VERITY_MAX_LEVELS pages
	+	 * may be mapped at once.
	+	 */
	+	BUILD_BUG_ON(FS_VERITY_MAX_PENDING_DATA_BLOCKS +
	+		     FS_VERITY_MAX_LEVELS > KM_MAX_IDX);

This arbitrary limit is a direct result of your welded-on commitment to an API that supports virtually mapped addresses only. Make no mistake, virtual addresses are simple and easy to use, but the kernel added more complicated constructs for real reasons.

I've gone through your use-case in fsverity/dm-verity, and they never touch the data at all, so the only reason for them to kmap the data at all is to feed it to the Crypto API, which is capable of doing its own kmap, but you elected not to use that because you hate the interface.

In fact it's a recurring theme: the zswap code jumps through multiple hoops to map the data it's working on so that it can feed it to the Crypto API as a virtually mapped pointer, even though it never touches the mapped data at all.

The same thing also happened in ubifs, which I managed to simplify by switching away from kmapped pointers:

https://patchwork.kernel.org/project/linux-crypto/patch/99ae6a15afc1478bab201949dc3dbb2c7634b687.1742034499.git.herbert@gondor.apana.org.au/

	-		addr += UBIFS_BLOCK_SIZE;
	-		if (folio_test_highmem(folio) && (offset_in_page(addr) == 0)) {
	-			kunmap_local(addr - UBIFS_BLOCK_SIZE);
	-			addr = kmap_local_folio(folio, i * UBIFS_BLOCK_SIZE);
	-		}
	+		offset += UBIFS_BLOCK_SIZE;

All this complexity was added because the legacy compression interface only supported virtually mapped addresses. Sure, the ahash/acomp interface was suboptimal for *only* supporting SG lists, which is what this pull request addresses by adding virtual address (and folio) support.

> Let me reiterate why "request chaining" is a bad idea and is going to cause
> problems.

I'm more than willing to discuss with you the implementation details of how the chaining is done and to improve it. However, if you proceed to only issue blanket nacks without providing any constructive feedback, then the only thing I can do is ignore you.

> In contrast, my patchset
> https://lore.kernel.org/r/20250212154718.44255-1-ebiggers@kernel.org/ supports
> multibuffer hashing in a much better way and has been ready for a year already.

Everybody wants to scratch their own itch, but my job as the maintainer is to ensure that the subsystem doesn't collapse into an unmaintainable hodgepodge of individual contributions. Yes, I get that batching is useful for you, but your use case is not unique at all. The compression people are proposing pretty much the same thing:

https://patchwork.kernel.org/project/linux-crypto/patch/20250303084724.6490-15-kanchana.p.sridhar@intel.com/

I don't want to be pressured by you into committing to an interface that works for you only.

> Nacked-by: Eric Biggers <ebiggers@kernel.org>

This pull request doesn't even contain the meat of the hash changes, since I've been busy with the compression work. So this is simply a pre-emptive strike to stop further work from rendering your patches obsolete.

Cheers,
On Wed, Mar 26, 2025 at 09:49:14AM +0800, Herbert Xu wrote:
>
> Let's see how your version is so much better:
>
> https://lore.kernel.org/all/20250212154718.44255-6-ebiggers@kernel.org/

BTW, I absolutely hate how the fs/block layer uses work queues for everything. It's been used as an argument for async being unnecessary, because you can always wait for completion since you're in a work queue.

But this is exactly the wrong way to do asynchronous completion. In fact, now that async support has been removed because of religious opposition to ahash, we now end up with the worst of both worlds, where hashing is punted off to a work queue where it is simply executed on the CPU:

	/**
	 * fsverity_enqueue_verify_work() - enqueue work on the fs-verity workqueue
	 * @work: the work to enqueue
	 *
	 * Enqueue verification work for asynchronous processing.
	 */
	void fsverity_enqueue_verify_work(struct work_struct *work)
	{
		queue_work(fsverity_read_workqueue, work);
	}

The correct way to do async offload is to do it conditionally:

	ret = submit_request(rq);
	if (unlikely(needs_async(ret))) {
		allocate for async path, with fallback to sync
		processing in case of OOM;
		return;
	}

	execute normal synchronous path;

Cheers,
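Herbert's conditional-offload pseudocode above can be fleshed out as a self-contained userspace sketch. Everything here (the request struct, submit_request(), the -EINPROGRESS-style return, the completion callback) is an illustrative stand-in modeling the pattern, not real kernel code:

```c
#include <assert.h>

#ifndef EINPROGRESS
#define EINPROGRESS 115
#endif

struct request {
	int data;
	int result;
	int completed;
};

/* Stand-in for submission to an engine: returns -EINPROGRESS if the
 * request was queued for async completion, 0 if it ran synchronously. */
static int submit_request(struct request *rq, int force_async)
{
	if (force_async)
		return -EINPROGRESS;	/* completion arrives later */
	rq->result = rq->data * 2;	/* pretend the work ran inline */
	rq->completed = 1;
	return 0;
}

/* Completion callback, invoked later on the async path. */
static void on_complete(struct request *rq)
{
	rq->result = rq->data * 2;
	rq->completed = 1;
}

/* The pattern from the email: try sync, take the async path only when
 * the submission actually went async. */
static int process(struct request *rq, int force_async)
{
	int ret = submit_request(rq, force_async);

	if (ret == -EINPROGRESS) {
		/* async path: state stays live until on_complete() runs */
		return -EINPROGRESS;
	}
	/* normal synchronous path: nothing left to wait for */
	return 0;
}
```

The key property is that the common (synchronous) path never allocates, never blocks, and never bounces through a workqueue; only a genuinely async completion pays the deferred-completion cost.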
On Wed, Mar 26, 2025 at 09:49:14AM +0800, Herbert Xu wrote:
> On Tue, Mar 25, 2025 at 08:25:41AM -0700, Eric Biggers wrote:
> >
> > Herbert didn't mention that I have nacked this patch, which he is insisting on
> > pushing for some reason instead of my original version that is much better.
>
> Let's see how your version is so much better:
>
> https://lore.kernel.org/all/20250212154718.44255-6-ebiggers@kernel.org/
>
> 	-	/* Up to 1 + FS_VERITY_MAX_LEVELS pages may be mapped at once */
> 	-	BUILD_BUG_ON(1 + FS_VERITY_MAX_LEVELS > KM_MAX_IDX);
> 	+	/*
> 	+	 * Up to FS_VERITY_MAX_PENDING_DATA_BLOCKS + FS_VERITY_MAX_LEVELS pages
> 	+	 * may be mapped at once.
> 	+	 */
> 	+	BUILD_BUG_ON(FS_VERITY_MAX_PENDING_DATA_BLOCKS +
> 	+		     FS_VERITY_MAX_LEVELS > KM_MAX_IDX);
>
> This arbitrary limit is a direct result of your welded-on commitment
> to an API that supports virtually mapped addresses only. Make no
> mistake, virtual addresses are simple and easy to use, but the kernel
> added more complicated constructs for real reasons.

Umm, so you think someone is going to do multibuffer hashing with more buffers than kmap_local supports (16)? Why? Regardless of the exact API, supporting that case would require kmap(). It's hard to see how it would ever be worth it, even if, theoretically, a CPU were capable of taking advantage of that much instruction-level parallelism (which is implausible with SHA-256 instructions), and ignoring the other issues, like code size bloat and increased memory usage, that a very high interleaving factor would cause.

Of course, in practice this is just going to be used with 2x interleaving, which is what CPUs can actually do with the SHA-256 instructions, and which avoids the various downsides of overly large interleaving factors.

> I've gone through your use-case in fsverity/dm-verity, and they
> never touch the data at all so the only reason for it to kmap the
> data at all is to feed it to the Crypto API, which is capable of
> doing its own kmap but you elected not to use that because you
> hate the interface.

That is incorrect and just shows that you still haven't even read the code. Take a look at cf715f4b7eb521a5bf67d391387b754c2fcde8d2. Switching dm-verity to always "map" the data blocks significantly simplified the dm-verity code (a -138 line diffstat), *even before switching to shash*. So we really do want to just pass virtual addresses to the crypto API too. It's much simpler.

> In fact it's a recurring theme, the zswap code jumps through multiple
> hoops to map the data they're working on so that they can feed it to
> the Crypto API as a virtually mapped pointer, even though they never
> touch the mapped data at all.

Compression and hashing are not the same and use different APIs, so this is a straw man. But I think you are on the wrong track for compression too. What zswap needs is relatively limited: only the compressed data (not the uncompressed data) can be split across pages, and only across 2 pages. A complex API with source and destination scatterlists isn't needed for this use case either, even assuming that the best solution is to make all the compression algorithms support this "natively" (most don't yet, or don't support it efficiently). Other solutions that could be faster include just continuing to linearize the data, or rethinking zswap to not create non-linear compressed data in the first place, e.g. by putting compressed data only in large folios.

And yes, the zswap patchset is using request chaining, but that's because you forced the zswap people to use it. It wasn't their original proposal. And based on the discussions and the various versions of the patchset, they've been having quite a bit of trouble making sense of your API. But again, this is compression, not hashing. They don't use the same API.

> which I managed to simplify by switching away from kmapped pointers:
>
> https://patchwork.kernel.org/project/linux-crypto/patch/99ae6a15afc1478bab201949dc3dbb2c7634b687.1742034499.git.herbert@gondor.apana.org.au/

"Simplify" by a +90 line diffstat. Sure.

> > Let me reiterate why "request chaining" is a bad idea and is going to cause
> > problems.
>
> I'm more than willing to discuss with you the implementation details
> of how the chaining is done and improving it. However, if you proceed
> to only issue blanket nacks without providing any constructive feedback,
> then the only thing I can do is ignore you.

I've given you extensive constructive feedback over the past year, while you've continued to nack my patches for inconsistent and bogus reasons.

> Everybody wants to scratch their itch but my job as the maintainer is
> to ensure that the subsystem doesn't collapse into an unmaintainable
> hodgepodge of individual contributions.

But when there is only one such contribution, why overengineer it with something that is slower, more complex, more error-prone, and harder to maintain? Especially when this is a kernel-internal API that we can change whenever we want to suit what is actually being used in the kernel. And your vague plan to use multibuffer hashing in IPsec doesn't count. I keep explaining why it doesn't actually make sense, and how I've *actually* been optimizing IPsec in other ways that actually matter and actually work, but you haven't been listening.

> This pull request doesn't even contain the meat of the hash changes
> since I've been busy with the compression work. So this is simply
> a pre-emptive strike to stop further work from rendering your patches
> obsolete.

I'd love for your work to make my patches obsolete, but unfortunately your version is just worse. And besides being very incomplete, its main issue is fundamental to the design. So it doesn't really make sense to use it, especially when I'm going to get stuck cleaning up your mess again.

- Eric
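Eric's "just continue to linearize the data" option for zswap can be sketched concretely: compressed data straddling a page boundary is copied into a contiguous bounce buffer before being handed to a virtual-address-only decompressor. This is a hypothetical userspace model (the helper name and the two-page layout are illustrative), not zswap's actual code:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SIZE 4096

/*
 * Illustrative helper: given a compressed buffer that may straddle the
 * boundary between two pages (zswap's worst case, per the discussion:
 * at most 2 pages), return a contiguous view of it.  If the buffer fits
 * in the first page, no copy is needed; otherwise it is linearized into
 * the caller-supplied bounce buffer.
 */
static const uint8_t *linearize_2pages(const uint8_t *page0,
				       const uint8_t *page1,
				       size_t offset, size_t len,
				       uint8_t *bounce)
{
	if (offset + len <= PAGE_SIZE)
		return page0 + offset;		/* already contiguous */

	size_t first = PAGE_SIZE - offset;	/* bytes in the first page */
	memcpy(bounce, page0 + offset, first);
	memcpy(bounce + first, page1, len - first);
	return bounce;
}
```

The copy costs at most one page of memcpy in the straddling case, which is the trade-off weighed against teaching every compression algorithm to consume split inputs "natively".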
On Tue, Mar 25, 2025 at 08:20:38PM -0700, Eric Biggers wrote:
>
> I'd love for your work to make my patches obsolete, but unfortunately your
> version is just worse. And besides it being very incomplete, the main issue is
> fundamental with the design. So it doesn't really make sense to use it,
> especially when I'm going to get stuck cleaning up your mess again.

I haven't even pushed the bulk of my hash work, and yet you're trying to stop my pull request, so it's your way or the highway. I don't see how that is constructive feedback.

Cheers,
On Wed, Mar 26, 2025 at 10:16:10AM +0800, Herbert Xu wrote:
> On Wed, Mar 26, 2025 at 09:49:14AM +0800, Herbert Xu wrote:
>
> BTW, I absolutely hate how the fs/block layer uses work queues
> for everything. It's been used as an argument for async being
> unnecessary because you can always wait for completion since
> you're in a work queue.
>
> [...]
>
> The correct way to do async offload is to do it conditionally:
>
> 	ret = submit_request(rq);
> 	if (unlikely(needs_async(ret))) {
> 		allocate for async path with fallback to sync
> 		processing in case of OOM
> 		return;
> 	}
>
> 	execute normal synchronous path

In the general case, the workqueue is needed anyway, because the work can block (e.g. to read Merkle tree blocks) or can take longer than should be spent in softirq context. But in many cases the workqueue is indeed overkill and hurts I/O performance. For that reason, dm-verity and dm-crypt already support doing the read completion work in softirq context in some cases. It's not enabled by default, though, and isn't implemented in quite the way it should be. Several people, including me, have been looking into improving that.

So I think your observation about the workqueue being unhelpful is generally correct, but fixing that is already partially implemented and is being worked on further. And regardless, it does not have much relevance to the crypto API. Yes, you can't sleep in a softirq, which means you can't wait for an async crypto request to complete (other than by polling). So if you want to do that, you have to go down the workqueue code path. But in practice 99% of users are just using CPU-based crypto, which is synchronous and does not block.

- Eric
On Tue, Mar 25, 2025 at 08:34:04PM -0700, Eric Biggers wrote:
>
> So I think your observation about the workqueue being unhelpful is generally
> correct, but fixing that is already partially implemented and is being worked on
> further. And regardless, this does not have that much relevance to the crypto
> API. Yes, you can't sleep from a softirq, which means you can't wait for an
> async crypto request to complete (other than polling). So if you want to do
> that, you have to go down the workqueue code path. But in practice 99% of users
> are just using the CPU-based crypto that is synchronous and does not block.

The point is that you don't have to wait. Once verity verification is done, all you do is mark the page/folio as up-to-date. That work can be done directly from softirq context. So all you need to do to support async crypto is mark the page/folio as up-to-date from the completion function; no work queues are needed anywhere.

Look, right now you've got this crazy cargo-cult programming paradigm of work queues that is worshipped because it lets you wait for async completion. In reality it is forcing everybody to go async even when they don't need to. Take ext4 as an example:

	ext4 calls verity
		schedule_work(verity_work);
		return asynchronously!

	verity_work:
		do the crypto work
		__read_end_io(bio);

Just get rid of the work queue; it is not needed for async crypto, which you don't even support anymore because you hate the interface so much. Even if we want to support async crypto, all you have to do is move the __read_end_io() call into the async completion function. Voila, no work queues are needed:

	ext4 calls verity

	verity:
		ret = do the crypto work
		if (is_async(ret))
			return asynchronously;
		__read_end_io(bio);
		return synchronously;

	async completion:
		__read_end_io(bio);

Networking has been doing this since 2008; I have no idea why storage insists on the crazy workqueue paradigm.

Cheers,
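Herbert's restructured verity flow can be written out as a compilable toy. The bio struct, do_verity_crypto(), and the callback are all illustrative stand-ins; the point is that the sync path completes the I/O inline and only a genuinely async crypto result defers completion to a callback, with no unconditional workqueue bounce:

```c
#include <assert.h>

#ifndef EINPROGRESS
#define EINPROGRESS 115
#endif

struct bio {
	int ended;
};

static void __read_end_io(struct bio *bio)
{
	bio->ended = 1;
}

/* Stand-in for the hash work: returns -EINPROGRESS when an async engine
 * took the request, 0 when it ran synchronously on the CPU. */
static int do_verity_crypto(struct bio *bio, int engine_is_async)
{
	(void)bio;
	return engine_is_async ? -EINPROGRESS : 0;
}

/* What the async engine would invoke once the hash is done. */
static void verity_async_done(struct bio *bio)
{
	__read_end_io(bio);
}

/*
 * The flow sketched in the email: no schedule_work() anywhere.  The
 * synchronous (CPU crypto) path ends the bio immediately; the async
 * path returns and lets the completion callback end it later.
 */
static int verity_verify(struct bio *bio, int engine_is_async)
{
	int ret = do_verity_crypto(bio, engine_is_async);

	if (ret == -EINPROGRESS)
		return -EINPROGRESS;	/* verity_async_done() finishes */

	__read_end_io(bio);		/* sync path completes inline */
	return 0;
}
```

With CPU-based crypto (the 99% case from the previous message), this degenerates to a plain function call plus an inline completion, which is exactly the behavior the unconditional workqueue version throws away.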