[v2,1/8] crypto: shash - add support for finup2x

Message ID 20240422203544.195390-2-ebiggers@kernel.org (mailing list archive)
State New
Series Optimize dm-verity and fsverity using multibuffer hashing

Commit Message

Eric Biggers April 22, 2024, 8:35 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

Most cryptographic hash functions are serialized, in the sense that they
have an internal block size and the blocks must be processed serially.
(BLAKE3 is a notable exception that has tree-based hashing built-in, but
all the more common choices such as the SHAs and BLAKE2 are serialized.
ParallelHash and Sakura are parallel hashes based on SHA3, but SHA3 is
much slower than SHA256 in software even with the ARMv8 SHA3 extension.)

This limits the performance of computing a single hash.  Yet, computing
multiple hashes simultaneously does not have this limitation.  Modern
CPUs are superscalar and often can execute independent instructions in
parallel.  As a result, on many modern CPUs, it is possible to hash two
equal-length messages in about the same time as a single message, if all
the instructions are interleaved.

Meanwhile, a very common use case for hashing in the Linux kernel is
dm-verity and fs-verity.  Both use a Merkle tree that has a fixed block
size, usually 4096 bytes with an empty or 32-byte salt prepended.  The
hash algorithm is usually SHA-256.  Usually, many blocks need to be
hashed at a time.  This is an ideal scenario for multibuffer hashing.
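
(For reference, each per-block digest here is roughly of the form

    digest = SHA-256(salt || data_block)

where "||" denotes concatenation and data_block is typically 4096 bytes; the
exact encoding details of dm-verity and fs-verity are omitted.)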

Linux actually used to support SHA-256 multibuffer hashing on x86_64,
before it was removed by commit ab8085c130ed ("crypto: x86 - remove SHA
multibuffer routines and mcryptd").  However, it was integrated with the
crypto API in a weird way, where it behaved as an asynchronous hash that
queued up and executed all requests on a global queue.  This made it
very complex, buggy, and virtually unusable.

This patch takes a new approach of just adding an API
crypto_shash_finup2x() that synchronously computes the hash of two
equal-length messages, starting from a common state that represents the
(possibly empty) common prefix shared by the two messages.

The new API is part of the "shash" algorithm type, as it does not make
sense in "ahash".  It does a "finup" operation rather than a "digest"
operation in order to support the salt that is used by dm-verity and
fs-verity.  There is no fallback implementation that does two regular
finups if the underlying algorithm doesn't support finup2x, since users
probably will want to avoid the overhead of queueing up multiple hashes
when multibuffer hashing won't actually be used anyway.
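
As a rough illustration only (not part of this patch), a verity-style caller
could combine the new calls with the existing shash API along these lines.
The helper below and its parameters are hypothetical, and the exported state
of the (possibly empty) salt prefix is assumed to have been saved earlier
with crypto_shash_export():

    #include <crypto/hash.h>

    /*
     * Hypothetical helper: hash 'nblocks' fixed-size blocks, two at a time
     * when the algorithm supports finup2x.  'saltstate' holds the hash
     * state exported after hashing the (possibly empty) salt prefix.
     */
    static int hash_blocks(struct shash_desc *desc, const void *saltstate,
                           const u8 *blocks, unsigned int block_size,
                           unsigned int nblocks, u8 *digests,
                           unsigned int digest_size)
    {
            unsigned int i = 0;
            int err;

            if (crypto_shash_supports_finup2x(desc->tfm)) {
                    for (; i + 2 <= nblocks; i += 2) {
                            /* Restore the common (salted) prefix state. */
                            err = crypto_shash_import(desc, saltstate);
                            if (err)
                                    return err;
                            err = crypto_shash_finup2x(desc,
                                            &blocks[i * block_size],
                                            &blocks[(i + 1) * block_size],
                                            block_size,
                                            &digests[i * digest_size],
                                            &digests[(i + 1) * digest_size]);
                            if (err)
                                    return err;
                    }
            }
            /* Leftover block, or all blocks if finup2x is unsupported. */
            for (; i < nblocks; i++) {
                    err = crypto_shash_import(desc, saltstate);
                    if (err)
                            return err;
                    err = crypto_shash_finup(desc, &blocks[i * block_size],
                                             block_size,
                                             &digests[i * digest_size]);
                    if (err)
                            return err;
            }
            return 0;
    }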

For now the API only supports 2-way interleaving, as the usefulness and
practicality seems to drop off dramatically after 2.  The arm64 CPUs I
tested don't support more than 2 concurrent SHA-256 hashes.  On x86_64,
AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a
microbenchmark of the sha256rnds2 instruction), and it's been reported
that the highest SHA-256 throughput on Intel processors comes from using
AVX512 to compute 16 hashes in parallel.  However, higher interleaving
factors would involve tradeoffs such as no longer being able to cache
the round constants in registers, further increasing the code size (both
source and binary), further increasing the amount of state that users
need to keep track of, and causing there to be more "leftover" hashes.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 include/crypto/hash.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

Herbert Xu May 3, 2024, 10:18 a.m. UTC | #1
Eric Biggers <ebiggers@kernel.org> wrote:
>
> For now the API only supports 2-way interleaving, as the usefulness and
> practicality seems to drop off dramatically after 2.  The arm64 CPUs I
> tested don't support more than 2 concurrent SHA-256 hashes.  On x86_64,
> AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a
> microbenchmark of the sha256rnds2 instruction), and it's been reported
> that the highest SHA-256 throughput on Intel processors comes from using
> AVX512 to compute 16 hashes in parallel.  However, higher interleaving
> factors would involve tradeoffs such as no longer being able to cache
> the round constants in registers, further increasing the code size (both
> source and binary), further increasing the amount of state that users
> need to keep track of, and causing there to be more "leftover" hashes.

I think the lack of extensibility is the biggest problem with this
API.  Now I confess I too have used the magic number 2 in the
lskcipher patch-set, but there I think at least it was more
justifiable based on the set of algorithms we currently support.

Here I think the evidence for limiting this to 2 is weak.  And the
amount of work to extend this beyond 2 would mean ripping this API
out again.

So let's get this right from the start.  Rather than shoehorning
this into shash, how about we add this to ahash instead where an
async return is a natural part of the API?

In fact, if we do it there we don't need to make any major changes
to the API.  You could simply add an optional flag to the
request flags to indicate that more requests will be forthcoming
immediately.

The algorithm could then either delay the current request if it
is supported, or process it immediately as is the case now.

Cheers,
Eric Biggers May 3, 2024, 3:28 p.m. UTC | #2
On Fri, May 03, 2024 at 06:18:32PM +0800, Herbert Xu wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > For now the API only supports 2-way interleaving, as the usefulness and
> > practicality seems to drop off dramatically after 2.  The arm64 CPUs I
> > tested don't support more than 2 concurrent SHA-256 hashes.  On x86_64,
> > AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a
> > microbenchmark of the sha256rnds2 instruction), and it's been reported
> > that the highest SHA-256 throughput on Intel processors comes from using
> > AVX512 to compute 16 hashes in parallel.  However, higher interleaving
> > factors would involve tradeoffs such as no longer being able to cache
> > the round constants in registers, further increasing the code size (both
> > source and binary), further increasing the amount of state that users
> > need to keep track of, and causing there to be more "leftover" hashes.
> 
> I think the lack of extensibility is the biggest problem with this
> API.  Now I confess I too have used the magic number 2 in the
> lskcipher patch-set, but there I think at least it was more
> justifiable based on the set of algorithms we currently support.
> 
> Here I think the evidence for limiting this to 2 is weak.  And the
> amount of work to extend this beyond 2 would mean ripping this API
> out again.
> 
> So let's get this right from the start.  Rather than shoehorning
> this into shash, how about we add this to ahash instead where an
> async return is a natural part of the API?
> 
> In fact, if we do it there we don't need to make any major changes
> to the API.  You could simply add an optional flag to the
> request flags to indicate that more requests will be forthcoming
> immediately.
> 
> The algorithm could then either delay the current request if it
> is supported, or process it immediately as is the case now.
> 

The kernel already had ahash-based multibuffer hashing years ago.  It failed
spectacularly, as it was extremely complex, buggy, slow, and potentially
insecure as it mixed requests from different contexts.  Sure, it could have been
improved slightly by adding flush support, but most issues would have remained.

Synchronous hashing really is the right model here.  One of the main performance
issues we are having with dm-verity and fs-verity is the scheduling hops
associated with the workqueues on which the dm-verity and fs-verity work runs.
If there was another scheduling hop from the worker task to another task to do
the actual hashing, that would be even worse and would defeat the point of doing
multibuffer hashing.  And with the ahash based API this would be difficult to
avoid, as when an individual request gets submitted and put on a queue somewhere
it would lose the information about the original submitter, so when it finally
gets hashed it might be by another task (which the original task would then have
to wait for).  I guess the submitter could provide some sort of tag that causes
the request to be placed on a dedicated queue that would eventually be processed
only by the same task (which might also be needed for security reasons anyway,
due to all the CPU side channels), but adding tag support to the API and
supporting an arbitrary number of queues would add a lot of complexity.

And then there's the issue of request lengths.  With one at a time submission
via 'ahash_request', each request would have its own length.  Having to support
multibuffer hashing of different length requests would add a massive amount of
complexity and edge cases that are difficult to get correct, as was shown by the
old ahash based code.  This suggests that either the API needs to enforce that
all the lengths are the same, or it needs to provide a clean API (my patch)
where the caller just provides a single length that applies to all messages.

So the synchronous API really seems like the right approach, whereas shoehorning
it into the asynchronous hash API would result in something much more complex
and not actually useful for the intended use cases.

If you're concerned about the hardcoding to 2x specifically, how about the
following API instead:

    int crypto_shash_finup_mb(struct shash_desc *desc,
                              const u8 *datas[], unsigned int len,
                              u8 *outs[], int num_msgs)

This would allow extension to higher interleaving factors.
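
For example, with two messages the call would look something like this (just a
sketch; the wrapper and its names are placeholders):

    /* Hypothetical wrapper showing the proposed calling convention for two
     * equal-length messages. */
    static int hash_pair_mb(struct shash_desc *desc,
                            const u8 *msg0, const u8 *msg1,
                            unsigned int len, u8 *out0, u8 *out1)
    {
            const u8 *datas[2] = { msg0, msg1 };
            u8 *outs[2] = { out0, out1 };

            return crypto_shash_finup_mb(desc, datas, len, outs, 2);
    }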

I do suspect that anything higher than 2x isn't going to be very practical for
in-kernel use cases, where code size, latency, and per-request memory usage tend
to be very important.  Regardless, this would make the API able to support
higher interleaving factors.

- Eric
Eric Biggers May 7, 2024, 12:33 a.m. UTC | #3
On Fri, May 03, 2024 at 08:28:10AM -0700, Eric Biggers wrote:
> On Fri, May 03, 2024 at 06:18:32PM +0800, Herbert Xu wrote:
> > Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > For now the API only supports 2-way interleaving, as the usefulness and
> > > practicality seems to drop off dramatically after 2.  The arm64 CPUs I
> > > tested don't support more than 2 concurrent SHA-256 hashes.  On x86_64,
> > > AMD's Zen 4 can do 4 concurrent SHA-256 hashes (at least based on a
> > > microbenchmark of the sha256rnds2 instruction), and it's been reported
> > > that the highest SHA-256 throughput on Intel processors comes from using
> > > AVX512 to compute 16 hashes in parallel.  However, higher interleaving
> > > factors would involve tradeoffs such as no longer being able to cache
> > > the round constants in registers, further increasing the code size (both
> > > source and binary), further increasing the amount of state that users
> > > need to keep track of, and causing there to be more "leftover" hashes.
> > 
> > I think the lack of extensibility is the biggest problem with this
> > API.  Now I confess I too have used the magic number 2 in the
> > lskcipher patch-set, but there I think at least it was more
> > justifiable based on the set of algorithms we currently support.
> > 
> > Here I think the evidence for limiting this to 2 is weak.  And the
> > amount of work to extend this beyond 2 would mean ripping this API
> > out again.
> > 
> > So let's get this right from the start.  Rather than shoehorning
> > this into shash, how about we add this to ahash instead where an
> > async return is a natural part of the API?
> > 
> > In fact, if we do it there we don't need to make any major changes
> > to the API.  You could simply add an optional flag to the
> > request flags to indicate that more requests will be forthcoming
> > immediately.
> > 
> > The algorithm could then either delay the current request if it
> > is supported, or process it immediately as is the case now.
> > 
> 
> The kernel already had ahash-based multibuffer hashing years ago.  It failed
> spectacularly, as it was extremely complex, buggy, slow, and potentially
> insecure as it mixed requests from different contexts.  Sure, it could have been
> improved slightly by adding flush support, but most issues would have remained.
> 
> Synchronous hashing really is the right model here.  One of the main performance
> issues we are having with dm-verity and fs-verity is the scheduling hops
> associated with the workqueues on which the dm-verity and fs-verity work runs.
> If there was another scheduling hop from the worker task to another task to do
> the actual hashing, that would be even worse and would defeat the point of doing
> multibuffer hashing.  And with the ahash based API this would be difficult to
> avoid, as when an individual request gets submitted and put on a queue somewhere
> it would lose the information about the original submitter, so when it finally
> gets hashed it might be by another task (which the original task would then have
> to wait for).  I guess the submitter could provide some sort of tag that causes
> the request to be placed on a dedicated queue that would eventually be processed
> only by the same task (which might also be needed for security reasons anyway,
> due to all the CPU side channels), but adding tag support to the API and
> supporting an arbitrary number of queues would add a lot of complexity.
> 
> And then there's the issue of request lengths.  With one at a time submission
> via 'ahash_request', each request would have its own length.  Having to support
> multibuffer hashing of different length requests would add a massive amount of
> complexity and edge cases that are difficult to get correct, as was shown by the
> old ahash based code.  This suggests that either the API needs to enforce that
> all the lengths are the same, or it needs to provide a clean API (my patch)
> where the caller just provides a single length that applies to all messages.
> 
> So the synchronous API really seems like the right approach, whereas shoehorning
> it into the asynchronous hash API would result in something much more complex
> and not actually useful for the intended use cases.
> 
> If you're concerned about the hardcoding to 2x specifically, how about the
> following API instead:
> 
>     int crypto_shash_finup_mb(struct shash_desc *desc,
>                               const u8 *datas[], unsigned int len,
>                               u8 *outs[], int num_msgs)
> 
> This would allow extension to higher interleaving factors.
> 
> I do suspect that anything higher than 2x isn't going to be very practical for
> in-kernel use cases, where code size, latency, and per-request memory usage tend
> to be very important.  Regardless, this would make the API able to support
> higher interleaving factors.

I've sent out a new version that makes the change to crypto_shash_finup_mb().

- Eric

Patch

diff --git a/include/crypto/hash.h b/include/crypto/hash.h
index 0014bdd81ab7..66d93c940861 100644
--- a/include/crypto/hash.h
+++ b/include/crypto/hash.h
@@ -177,10 +177,13 @@  struct shash_desc {
  * @finup: see struct ahash_alg
  * @digest: see struct ahash_alg
  * @export: see struct ahash_alg
  * @import: see struct ahash_alg
  * @setkey: see struct ahash_alg
+ * @finup2x: **[optional]** Finish calculating the digests of two equal-length
+ *	     messages, interleaving the instructions to potentially achieve
+ *	     better performance than hashing each message individually.
  * @init_tfm: Initialize the cryptographic transformation object.
  *	      This function is called only once at the instantiation
  *	      time, right after the transformation context was
  *	      allocated. In case the cryptographic hardware has
  *	      some special requirements which need to be handled
@@ -208,10 +211,12 @@  struct shash_alg {
 		      unsigned int len, u8 *out);
 	int (*export)(struct shash_desc *desc, void *out);
 	int (*import)(struct shash_desc *desc, const void *in);
 	int (*setkey)(struct crypto_shash *tfm, const u8 *key,
 		      unsigned int keylen);
+	int (*finup2x)(struct shash_desc *desc, const u8 *data1,
+		       const u8 *data2, unsigned int len, u8 *out1, u8 *out2);
 	int (*init_tfm)(struct crypto_shash *tfm);
 	void (*exit_tfm)(struct crypto_shash *tfm);
 	int (*clone_tfm)(struct crypto_shash *dst, struct crypto_shash *src);
 
 	unsigned int descsize;
@@ -749,10 +754,15 @@  static inline unsigned int crypto_shash_digestsize(struct crypto_shash *tfm)
 static inline unsigned int crypto_shash_statesize(struct crypto_shash *tfm)
 {
 	return crypto_shash_alg(tfm)->statesize;
 }
 
+static inline bool crypto_shash_supports_finup2x(struct crypto_shash *tfm)
+{
+	return crypto_shash_alg(tfm)->finup2x != NULL;
+}
+
 static inline u32 crypto_shash_get_flags(struct crypto_shash *tfm)
 {
 	return crypto_tfm_get_flags(crypto_shash_tfm(tfm));
 }
 
@@ -842,10 +852,34 @@  int crypto_shash_digest(struct shash_desc *desc, const u8 *data,
  * Return: 0 on success; < 0 if an error occurred.
  */
 int crypto_shash_tfm_digest(struct crypto_shash *tfm, const u8 *data,
 			    unsigned int len, u8 *out);
 
+/**
+ * crypto_shash_finup2x() - finish hashing two equal-length messages
+ * @desc: the hash state that will be forked for the two messages.  This
+ *	  contains the state after hashing a (possibly-empty) common prefix of
+ *	  the two messages.
+ * @data1: the first message (not including any common prefix from @desc)
+ * @data2: the second message (not including any common prefix from @desc)
+ * @len: length of @data1 and @data2 in bytes
+ * @out1: output buffer for first message digest
+ * @out2: output buffer for second message digest
+ *
+ * Users must check crypto_shash_supports_finup2x(tfm) before calling this.
+ *
+ * Context: Any context.
+ * Return: 0 on success; a negative errno value on failure.
+ */
+static inline int crypto_shash_finup2x(struct shash_desc *desc,
+				       const u8 *data1, const u8 *data2,
+				       unsigned int len, u8 *out1, u8 *out2)
+{
+	return crypto_shash_alg(desc->tfm)->finup2x(desc, data1, data2, len,
+						    out1, out2);
+}
+
 /**
  * crypto_shash_export() - extract operational state for message digest
  * @desc: reference to the operational state handle whose state is exported
  * @out: output buffer of sufficient size that can hold the hash state
  *