
[v2,6/6] crypto: lib/sha - Combine round constants and message schedule

Message ID 20201020203957.3512851-7-nivedita@alum.mit.edu (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Series crypto: lib/sha256 - cleanup/optimization

Commit Message

Arvind Sankar Oct. 20, 2020, 8:39 p.m. UTC
Putting the round constants and the message schedule arrays together in
one structure saves one register, which can be a significant benefit on
register-constrained architectures. On x86-32 (tested on Broadwell
Xeon), this gives a 10% performance benefit.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: David Laight <David.Laight@ACULAB.COM>
---
 lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

Comments

David Laight Oct. 20, 2020, 9:36 p.m. UTC | #1
From: Arvind Sankar
> Sent: 20 October 2020 21:40
> 
> Putting the round constants and the message schedule arrays together in
> one structure saves one register, which can be a significant benefit on
> register-constrained architectures. On x86-32 (tested on Broadwell
> Xeon), this gives a 10% performance benefit.

I'm actually stunned it makes that much difference.
The object code must be truly horrid (before and after).

There are probably other strange tweaks that give a similar
improvement.

	David

Arvind Sankar Oct. 21, 2020, 3:16 p.m. UTC | #2
On Tue, Oct 20, 2020 at 09:36:00PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 20 October 2020 21:40
> > 
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
> 
> I'm actually stunned it makes that much difference.
> The object code must be truly horrid (before and after).
> 
> There are probably other strange tweaks that give a similar
> improvement.
> 
> 	David
> 

Hm yes, I took a closer look at the generated code, and gcc seems to be
doing something completely braindead. Before this change, it actually
copies 8 words at a time from SHA256_K onto the stack, and uses those
stack temporaries for the calculation. So this patch is giving a benefit
just because it only does the copy once instead of every time around the
loop.

It doesn't even really need a register to hold the address of SHA256_K,
since this isn't PIC code: it could access the table directly as
SHA256_K(%ecx) if it just multiplied the loop counter i by 4.
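
For illustration only (not part of this patch): with a rolled loop and a
runtime index, SHA256_K[i] can in principle be addressed with an indexed
memory operand on a non-PIC build, so no register has to be dedicated to the
table base (whether a given gcc actually emits that is another matter). The
sketch below assumes the a-h locals, W[] and the e0/e1/Ch/Maj helpers already
present in sha256_transform(); the real code unrolls by eight via
SHA256_ROUND.

	for (i = 0; i < 64; i++) {
		/* standard round: t1 from e/f/g/h + K[i] + W[i], t2 from a/b/c */
		u32 t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];
		u32 t2 = e0(a) + Maj(a, b, c);

		/* rotate the working variables instead of renaming them */
		h = g; g = f; f = e; e = d + t1;
		d = c; c = b; b = a; a = t1 + t2;
	}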
Eric Biggers Oct. 22, 2020, 4:34 a.m. UTC | #3
On Tue, Oct 20, 2020 at 04:39:57PM -0400, Arvind Sankar wrote:
> Putting the round constants and the message schedule arrays together in
> one structure saves one register, which can be a significant benefit on
> register-constrained architectures. On x86-32 (tested on Broadwell
> Xeon), this gives a 10% performance benefit.
> 
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Suggested-by: David Laight <David.Laight@ACULAB.COM>
> ---
>  lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
>  1 file changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index 3a8802d5f747..985cd0560d79 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
>  	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
>  };
>  
> +struct KW {
> +	u32 K[64];
> +	u32 W[64];
> +};

Note that this doubles the stack usage from 256 to 512 bytes.  That's pretty
large for kernel code, especially when compiler options can increase the stack
usage well beyond the "expected" value.

So unless this gives a big performance improvement on architectures other than
32-bit x86 (which people don't really care about these days), we probably
shouldn't do this.

FWIW, it's possible to reduce the length of 'W' to 16 words by computing the
next W value just before each round 16-63, or by computing the next W values in
batches of 16 before rounds 16, 32, and 48.  (This is similar to what lib/sha1.c
does for SHA-1.)  In a quick userspace benchmark that seems to reduce
performance by about 25% on x86_64, though.

- Eric
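
A minimal sketch of the 16-word rolling schedule described above (an
illustration, not kernel code): W[] keeps only the most recent 16 schedule
words, and each of rounds 16-63 overwrites the oldest slot on the fly, much
as lib/sha1.c does for SHA-1.  The names here (ror32, s0, s1, sha256_next_w)
are local to the sketch; s0/s1 are the standard FIPS-180 small-sigma
functions.

	#include <stdint.h>
	typedef uint32_t u32;

	#define ror32(x, n)	(((x) >> (n)) | ((x) << (32 - (n))))
	#define s0(x)		(ror32(x, 7)  ^ ror32(x, 18) ^ ((x) >> 3))
	#define s1(x)		(ror32(x, 17) ^ ror32(x, 19) ^ ((x) >> 10))

	/* Compute schedule word i (16 <= i < 64) into a 16-entry ring buffer. */
	static inline u32 sha256_next_w(u32 W[16], int i)
	{
		u32 w = s1(W[(i - 2) & 15]) + W[(i - 7) & 15] +
			s0(W[(i - 15) & 15]) + W[(i - 16) & 15];

		W[i & 15] = w;
		return w;
	}

Each of rounds 16-63 would then use sha256_next_w(W, i) + SHA256_K[i] where
the current code uses W[i] + SHA256_K[i]; as noted above, this approach
measured about 25% slower on x86_64 in a quick userspace test.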
David Laight Oct. 22, 2020, 8:20 a.m. UTC | #4
From: Eric Biggers
> Sent: 22 October 2020 05:35
> 
> On Tue, Oct 20, 2020 at 04:39:57PM -0400, Arvind Sankar wrote:
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
> >
> > Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> > Suggested-by: David Laight <David.Laight@ACULAB.COM>
> > ---
> >  lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
> >  1 file changed, 28 insertions(+), 21 deletions(-)
> >
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index 3a8802d5f747..985cd0560d79 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
> >  	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> >  };
> >
> > +struct KW {
> > +	u32 K[64];
> > +	u32 W[64];
> > +};
> 
> Note that this doubles the stack usage from 256 to 512 bytes.  That's pretty
> large for kernel code, especially when compiler options can increase the stack
> usage well beyond the "expected" value.
> 
> So unless this gives a big performance improvement on architectures other than
> 32-bit x86 (which people don't really care about these days), we probably
> shouldn't do this.

IIRC the gain came from an odd side effect - which can probably
be obtained (for some compiler versions) by other means.

> FWIW, it's possible to reduce the length of 'W' to 16 words by computing the
> next W value just before each round 16-63,

I was looking at that.
You'd need to do the first 16 rounds, then rounds 16-63 in a second
loop, to avoid the conditional.
The problem is that it needs too many registers.
You'd need registers for 16 W values, the 8 a-h and a few spare.

...

Looking closely each round is like:
        t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
        t2 = e0(a) + Maj(a, b, c);
        h = t1 + t2;   // Not used for a few rounds
        d += t1;       // Needed next round
So only needs 4 of the state variables (e, f, g, h).
The next round uses d, e, f and g.
So with extreme care d and h can use the same register.
Although I'm not sure how you'd get the compiler to do it.
Possibly making state[] volatile (or using READ/WRITE_ONCE).
So the last two lines become:
        state[7] = t1 + t2;
        d = state[3] + t1;
That might stop the 32-bit x86 build from spilling registers.

	David


Patch

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 3a8802d5f747..985cd0560d79 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -29,6 +29,11 @@  static const u32 SHA256_K[] = {
 	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
 };
 
+struct KW {
+	u32 K[64];
+	u32 W[64];
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
 	return z ^ (x & (y ^ z));
@@ -56,39 +61,39 @@  static inline void BLEND_OP(int I, u32 *W)
 
 #define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {		\
 	u32 t1, t2;						\
-	t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];	\
+	t1 = h + e1(e) + Ch(e, f, g) + KW->K[i] + KW->W[i];	\
 	t2 = e0(a) + Maj(a, b, c);				\
 	d += t1;						\
 	h = t1 + t2;						\
 } while (0)
 
-static void sha256_transform(u32 *state, const u8 *input, u32 *W)
+static void sha256_transform(u32 *state, const u8 *input, struct KW *KW)
 {
 	u32 a, b, c, d, e, f, g, h;
 	int i;
 
 	/* load the input */
 	for (i = 0; i < 16; i += 8) {
-		LOAD_OP(i + 0, W, input);
-		LOAD_OP(i + 1, W, input);
-		LOAD_OP(i + 2, W, input);
-		LOAD_OP(i + 3, W, input);
-		LOAD_OP(i + 4, W, input);
-		LOAD_OP(i + 5, W, input);
-		LOAD_OP(i + 6, W, input);
-		LOAD_OP(i + 7, W, input);
+		LOAD_OP(i + 0, KW->W, input);
+		LOAD_OP(i + 1, KW->W, input);
+		LOAD_OP(i + 2, KW->W, input);
+		LOAD_OP(i + 3, KW->W, input);
+		LOAD_OP(i + 4, KW->W, input);
+		LOAD_OP(i + 5, KW->W, input);
+		LOAD_OP(i + 6, KW->W, input);
+		LOAD_OP(i + 7, KW->W, input);
 	}
 
 	/* now blend */
 	for (i = 16; i < 64; i += 8) {
-		BLEND_OP(i + 0, W);
-		BLEND_OP(i + 1, W);
-		BLEND_OP(i + 2, W);
-		BLEND_OP(i + 3, W);
-		BLEND_OP(i + 4, W);
-		BLEND_OP(i + 5, W);
-		BLEND_OP(i + 6, W);
-		BLEND_OP(i + 7, W);
+		BLEND_OP(i + 0, KW->W);
+		BLEND_OP(i + 1, KW->W);
+		BLEND_OP(i + 2, KW->W);
+		BLEND_OP(i + 3, KW->W);
+		BLEND_OP(i + 4, KW->W);
+		BLEND_OP(i + 5, KW->W);
+		BLEND_OP(i + 6, KW->W);
+		BLEND_OP(i + 7, KW->W);
 	}
 
 	/* load the state into our registers */
@@ -115,7 +120,7 @@  void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 {
 	unsigned int partial, done;
 	const u8 *src;
-	u32 W[64];
+	struct KW KW;
 
 	partial = sctx->count & 0x3f;
 	sctx->count += len;
@@ -129,13 +134,15 @@  void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 			src = sctx->buf;
 		}
 
+		memcpy(KW.K, SHA256_K, sizeof(KW.K));
+
 		do {
-			sha256_transform(sctx->state, src, W);
+			sha256_transform(sctx->state, src, &KW);
 			done += 64;
 			src = data + done;
 		} while (done + 63 < len);
 
-		memzero_explicit(W, sizeof(W));
+		memzero_explicit(KW.W, sizeof(KW.W));
 
 		partial = 0;
 	}