mbox series

[0/6] Multibuffer hashing take two

Message ID cover.1730021644.git.herbert@gondor.apana.org.au (mailing list archive)
Headers show
Series Multibuffer hashing take two | expand

Message

Herbert Xu Oct. 27, 2024, 9:45 a.m. UTC
Multibuffer hashing was a constant sore while it was part of the
kernel.  It was very buggy and unnecessarily complex.  Finally
it was removed when it had been broken for a while without anyone
noticing.

Peace reigned in its absence, until Eric Biggers made a proposal
for its comeback :)

Link: https://lore.kernel.org/all/20240415213719.120673-1-ebiggers@kernel.org/

The issue is that the SHA algorithm (and possibly others) is
inherently not parallelisable.  Therefore the only way to exploit
parallelism on modern CPUs is to hash multiple indendent streams
of data.

Eric's proposal is a simple interface bolted onto shash that takes
two streams of data of identical length.  I thought the limitation
of two was too small, and Eric addressed that in his latest version:

Link: https://lore.kernel.org/all/20241001153718.111665-2-ebiggers@kernel.org/

However, I still disliked the addition of this to shash as it meant
that users would have to spend extra effort in order to accumulate
and maintain multiple streams of data.

My preference is to use ahash as the basis of multibuffer, because
its request object interface is perfectly suited to chaining.

The ahash interface is almost universally hated because of its
use of the SG list.  So to sweeten the deal I have added virtual
address support to ahash, thus rendering the shash interface
redundant.

Note that ahash can already be used synchronously by asking for
sync-only algorithms.  Thus there is no need to handle callbacks
and such *if* you don't want to.

This patch-set introduces two additions to the ahash interface.
First of all request chaining is added so that an arbitrary number
of requests can be submitted in one go.  Incidentally this also
reduces the cost of indirect calls by amortisation.

It then adds virtual address support to ahash.  This allows the
user to supply a virtual address as the input instead of an SG
list.

This is assumed to be not DMA-capable so it is always copied
before it's passed to an existing ahash driver.  New drivers can
elect to take virtual addresses directly.  Of course existing shash
algorithms are able to take virtual addresses without any copying.

The final patch resurrects the old SHA2 AVX2 muiltibuffer code as
a proof of concept that this API works.  The result shows that with
a full complement of 8 requests, this API is able to achieve parity
with the more modern but single-threaded SHA-NI code.

Herbert Xu (6):
  crypto: ahash - Only save callback and data in ahash_save_req
  crypto: hash - Add request chaining API
  crypto: tcrypt - Restore multibuffer ahash tests
  crypto: ahash - Add virtual address support
  crypto: ahash - Set default reqsize from ahash_alg
  crypto: x86/sha2 - Restore multibuffer AVX2 support

 arch/x86/crypto/Makefile                   |   2 +-
 arch/x86/crypto/sha256_mb_mgr_datastruct.S | 304 +++++++++++
 arch/x86/crypto/sha256_ssse3_glue.c        | 523 ++++++++++++++++--
 arch/x86/crypto/sha256_x8_avx2.S           | 596 +++++++++++++++++++++
 crypto/ahash.c                             | 566 ++++++++++++++++---
 crypto/tcrypt.c                            | 227 ++++++++
 include/crypto/algapi.h                    |  10 +
 include/crypto/hash.h                      |  68 ++-
 include/crypto/internal/hash.h             |  17 +-
 include/linux/crypto.h                     |  26 +
 10 files changed, 2209 insertions(+), 130 deletions(-)
 create mode 100644 arch/x86/crypto/sha256_mb_mgr_datastruct.S
 create mode 100644 arch/x86/crypto/sha256_x8_avx2.S

Comments

Eric Biggers Oct. 28, 2024, 7 p.m. UTC | #1
Hi Herbert,

On Sun, Oct 27, 2024 at 05:45:28PM +0800, Herbert Xu wrote:
> Multibuffer hashing was a constant sore while it was part of the
> kernel.  It was very buggy and unnecessarily complex.  Finally
> it was removed when it had been broken for a while without anyone
> noticing.
> 
> Peace reigned in its absence, until Eric Biggers made a proposal
> for its comeback :)
> 
> Link: https://lore.kernel.org/all/20240415213719.120673-1-ebiggers@kernel.org/
> 
> The issue is that the SHA algorithm (and possibly others) is
> inherently not parallelisable.  Therefore the only way to exploit
> parallelism on modern CPUs is to hash multiple indendent streams
> of data.
> 
> Eric's proposal is a simple interface bolted onto shash that takes
> two streams of data of identical length.  I thought the limitation
> of two was too small, and Eric addressed that in his latest version:
> 
> Link: https://lore.kernel.org/all/20241001153718.111665-2-ebiggers@kernel.org/
> 
> However, I still disliked the addition of this to shash as it meant
> that users would have to spend extra effort in order to accumulate
> and maintain multiple streams of data.
> 
> My preference is to use ahash as the basis of multibuffer, because
> its request object interface is perfectly suited to chaining.

As I've explained before, I think your proposed approach is much too complex,
inefficient, and broken compared to my much simpler patchset
https://lore.kernel.org/linux-crypto/20241001153718.111665-1-ebiggers@kernel.org/.
If this is really going to be the API for multibuffer hashing, then I'm not very
interested in using or contributing to it.

The much larger positive diffstat in your patchset alone should speak for
itself, especially when it doesn't include essential pieces that were included
in my smaller patchset, such as self-tests, dm-verity and fs-verity support,
SHA-NI and ARMv8 CE support.  Note that due to the complexity of your API, it
would require far more updates to the tests in order to cover all the new edge
cases.  This patchset also removes the shash support for sha256-avx2, which is
not realistic, as there are still ~100 users of shash in the kernel.

You say that your API is needed so that users don't need to "spend extra effort
in order to accumulate and maintain multiple streams of data."  That's
incorrect, though.  The users, e.g. {dm,fs}-verity, will need to do that anyway
even with your API.  I think this would have been clear if you had tried to
update them to use your API.

With this patchset I am also seeing random crashes in the x86 sha256 glue code,
and all multibuffer SHA256 hashes come back as all-zeroes.  Bugs like this were
predictable, of course.  There's a high amount of complexity inherent in the
ahash request chaining approach, both at the API layer and in the per-algorithm
glue code.  It will be hard to get everything right.  And I am just submitting 8
equal-length requests, so I haven't even tried any of the edge cases that your
proposed API allows, like submitting requests that aren't synced up properly.

I don't think it's worth the time for me to try to debug and fix your code and
add the missing pieces, when we could just choose a much simpler design that
would result in far fewer bugs.  Especially for cryptographic code, choosing
sound designs that minimize the number of bugs should be the highest priority.

I understand that you're trying to contribute something useful, and perhaps
solve a wider set of problems than I set out to solve.  The reality, though, is
that this patchset is creating more problems than it's solving.  Compared to my
patchset, it makes things harder, more error-prone, and less efficient for the
users who actually want the multibuffer hashing, and likewise for wiring it up
to the low-level algorithms.  It also doesn't bring us meaningfully closer to
applying multibuffer crypto in other applications like IPsec where it will be
very difficult to apply, is irrelevant for the most common algorithm used, and
would at best provide less benefit than other easier-to-implement optimizations.

The virtual address to support in ahash is a somewhat nice addition, but it
could go in independently, and ultimately it seems not that useful, given that
users could just use shash or lib/crypto/ instead.  shash will still have less
overhead, and lib/crypto/ even less, despite ahash getting slightly better.
Also, users who actually care about old-school style crypto accelerators need
pages + async processing anyway for optimal performance, especially given this
patchset's proposed approach of handling virtual addresses using a bounce page.

If you're really interested in the AVX2 multibuffer SHA256 for some reason, I'd
be willing to clean up that assembly code and wire it up to the much simpler API
that I proposed.  Despite the favorable microbenchmark result, this would be of
limited use, for various reasons that I've explained before.  But it could be
done if desired, and it would be much simpler than what you have.

- Eric
Herbert Xu Oct. 29, 2024, 4:33 a.m. UTC | #2
On Mon, Oct 28, 2024 at 12:00:45PM -0700, Eric Biggers wrote:
>
> You say that your API is needed so that users don't need to "spend extra effort
> in order to accumulate and maintain multiple streams of data."  That's
> incorrect, though.  The users, e.g. {dm,fs}-verity, will need to do that anyway
> even with your API.  I think this would have been clear if you had tried to
> update them to use your API.

It's a lot easier once you switch them back to dynamically allocated
requests instead of having them on the stack.  Storing the hash state
on the stack of course limits your ability to aggregate the hash
operations since each one is hundreds-of-bytes long.

We could also introduce SYNC_AHASH_REQ_ON_STACK like we do for
skcipher but I think we should move away from that for the cases
where aggregation makes sense.

Note that when I say switching back to ahash, I'm not talking about
switching back to an asynchronous API.  If you're happy with the
synchronous offerings then you're totally free to use ahash with
synchronous-only implementations, just like skcipher.

> With this patchset I am also seeing random crashes in the x86 sha256 glue code,
> and all multibuffer SHA256 hashes come back as all-zeroes.  Bugs like this were

Guilty as charged.  I haven't tested this at all apart from timing
the speed.

However, adding a proper test is trivial.  We already have the
ahash_requests ready to go so they just have to be chained together
and submitted en-masse.

> If you're really interested in the AVX2 multibuffer SHA256 for some reason, I'd
> be willing to clean up that assembly code and wire it up to the much simpler API
> that I proposed.  Despite the favorable microbenchmark result, this would be of
> limited use, for various reasons that I've explained before.  But it could be
> done if desired, and it would be much simpler than what you have.

No I have zero interest in AVX2.  I simply picked that because
it was already in the kernel git history and I wasn't certain
whether my CPU is recent enough to see much of a benefit from
your SHA-NI code.

Cheers,