| Message ID | 20201020203957.3512851-7-nivedita@alum.mit.edu |
| --- | --- |
| State | Superseded |
| Delegated to: | Herbert Xu |
| Series | crypto: lib/sha256 - cleanup/optimization |
From: Arvind Sankar
> Sent: 20 October 2020 21:40
>
> Putting the round constants and the message schedule arrays together in
> one structure saves one register, which can be a significant benefit on
> register-constrained architectures. On x86-32 (tested on Broadwell
> Xeon), this gives a 10% performance benefit.

I'm actually stunned it makes that much difference.
The object code must be truly horrid (before and after).

There are probably other strange tweaks that give a similar
improvement.

	David
On Tue, Oct 20, 2020 at 09:36:00PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 20 October 2020 21:40
> >
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
>
> I'm actually stunned it makes that much difference.
> The object code must be truly horrid (before and after).
>
> There are probably other strange tweaks that give a similar
> improvement.

Hm yes, I took a closer look at the generated code, and gcc seems to be
doing something completely braindead. Before this change, it actually
copies 8 words at a time from SHA256_K onto the stack, and uses those
stack temporaries for the calculation. So this patch is giving a benefit
just because it does the copy once instead of every time around the
loop.

It doesn't even really need a register to hold SHA256_K, since this
isn't PIC code: it could access the table directly as SHA256_K(%ecx) if
it just multiplied the loop counter i by 4.
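As a rough illustration of the direct-indexing idea above, here is a
hypothetical userspace sketch (not from the thread: the function name
sum_kw and the stdint types are made up; the only point is the single
byte-scaled offset):

/*
 * Hypothetical sketch: walk K and W with one byte-scaled offset, so a
 * non-PIC x86-32 build can use "movl SHA256_K(%ecx), %eax" style
 * addressing with no extra base register and no stack copy of K.
 * sum_kw() is a stand-in for the real round body.
 */
#include <stdint.h>
#include <stddef.h>

extern const uint32_t SHA256_K[64];

uint32_t sum_kw(const uint32_t W[64])
{
	uint32_t acc = 0;
	size_t off;

	/* off is the scaled index (i * 4) the mail suggests keeping
	 * in a register such as %ecx. */
	for (off = 0; off < 64 * sizeof(uint32_t); off += sizeof(uint32_t)) {
		acc += *(const uint32_t *)((const uint8_t *)SHA256_K + off);
		acc += *(const uint32_t *)((const uint8_t *)W + off);
	}
	return acc;
}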
On Tue, Oct 20, 2020 at 04:39:57PM -0400, Arvind Sankar wrote:
> Putting the round constants and the message schedule arrays together in
> one structure saves one register, which can be a significant benefit on
> register-constrained architectures. On x86-32 (tested on Broadwell
> Xeon), this gives a 10% performance benefit.
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Suggested-by: David Laight <David.Laight@ACULAB.COM>
> ---
>  lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
>  1 file changed, 28 insertions(+), 21 deletions(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index 3a8802d5f747..985cd0560d79 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
> 	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
>  };
>
> +struct KW {
> +	u32 K[64];
> +	u32 W[64];
> +};

Note that this doubles the stack usage from 256 to 512 bytes. That's
pretty large for kernel code, especially when compiler options can
increase the stack usage well beyond the "expected" value.

So unless this gives a big performance improvement on architectures
other than 32-bit x86 (which people don't really care about these
days), we probably shouldn't do this.

FWIW, it's possible to reduce the length of 'W' to 16 words by
computing the next W value just before each round 16-63, or by
computing the next W values in batches of 16 before rounds 16, 32, and
48. (This is similar to what lib/sha1.c does for SHA-1.) In a quick
userspace benchmark that seems to reduce performance by about 25% on
x86_64, though.

- Eric
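A sketch of that 16-word variant (a hypothetical userspace rendering,
not the lib/sha1.c code itself: s0/s1 are the standard SHA-256 schedule
sigma functions, and W is treated as a 16-entry ring buffer updated in
place):

#include <stdint.h>

static inline uint32_t ror32(uint32_t v, int n)
{
	return (v >> n) | (v << (32 - n));
}

/* Standard SHA-256 message-schedule sigma functions. */
static inline uint32_t s0(uint32_t x)
{
	return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3);
}

static inline uint32_t s1(uint32_t x)
{
	return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10);
}

/*
 * For rounds 16-63: W[i & 15] still holds W[i - 16], so updating it in
 * place computes W[i] = s1(W[i-2]) + W[i-7] + s0(W[i-15]) + W[i-16]
 * with only 16 words of schedule live at any time.
 */
static inline uint32_t W_next(uint32_t W[16], int i)
{
	W[i & 15] += s1(W[(i - 2) & 15]) + W[(i - 7) & 15] +
		     s0(W[(i - 15) & 15]);
	return W[i & 15];
}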
From: Eric Biggers
> Sent: 22 October 2020 05:35
>
> On Tue, Oct 20, 2020 at 04:39:57PM -0400, Arvind Sankar wrote:
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
> >
> > Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> > Suggested-by: David Laight <David.Laight@ACULAB.COM>
> > ---
> >  lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
> >  1 file changed, 28 insertions(+), 21 deletions(-)
> >
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index 3a8802d5f747..985cd0560d79 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
> > 	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> >  };
> >
> > +struct KW {
> > +	u32 K[64];
> > +	u32 W[64];
> > +};
>
> Note that this doubles the stack usage from 256 to 512 bytes. That's
> pretty large for kernel code, especially when compiler options can
> increase the stack usage well beyond the "expected" value.
>
> So unless this gives a big performance improvement on architectures
> other than 32-bit x86 (which people don't really care about these
> days), we probably shouldn't do this.

IIRC the gain came from an odd side effect - which can probably be got
(for some compiler versions) by other means.

> FWIW, it's possible to reduce the length of 'W' to 16 words by
> computing the next W value just before each round 16-63,

I was looking at that. You'd need to do the first 16 rounds, then
rounds 16-63 in a second loop, to avoid the conditional.

The problem is that it needs too many registers. You'd need registers
for the 16 W values, the 8 state variables a-h, and a few spare.

Looking closely, each round is like:

	t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
	t2 = e0(a) + Maj(a, b, c);
	h = t1 + t2;	// Not used for a few rounds
	d += t1;	// Needed next round

so t1 only needs 4 of the state variables (e, f, g, h), and the next
round uses d, e, f and g. So with extreme care, d and h can share the
same register - although I'm not sure how you'd get the compiler to do
it. Possibly by making state[] volatile (or using READ_ONCE/WRITE_ONCE),
so that the last two lines become:

	state[7] = t1 + t2;
	d = state[3] + t1;

That might stop x86 (32-bit) spilling registers.

	David
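Spelled out, the round David sketches might look like the fragment
below (purely illustrative, assuming state[] mirrors the rotating
variables, with state[3] holding d and state[7] holding h; whether the
compiler then actually keeps d and h in one register is exactly the
open question above):

	/* Round 0: write the new h straight back to state[] (it is
	 * not read again for several rounds) and reload d from memory
	 * only as it is about to become the next round's e. */
	t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
	t2 = e0(a) + Maj(a, b, c);
	WRITE_ONCE(state[7], t1 + t2);
	d = READ_ONCE(state[3]) + t1;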
diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 3a8802d5f747..985cd0560d79 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
 	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
 };
 
+struct KW {
+	u32 K[64];
+	u32 W[64];
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
 	return z ^ (x & (y ^ z));
@@ -56,39 +61,39 @@ static inline void BLEND_OP(int I, u32 *W)
 
 #define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {		\
 	u32 t1, t2;						\
-	t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];	\
+	t1 = h + e1(e) + Ch(e, f, g) + KW->K[i] + KW->W[i];	\
 	t2 = e0(a) + Maj(a, b, c);				\
 	d += t1;						\
 	h = t1 + t2;						\
 } while (0)
 
-static void sha256_transform(u32 *state, const u8 *input, u32 *W)
+static void sha256_transform(u32 *state, const u8 *input, struct KW *KW)
 {
 	u32 a, b, c, d, e, f, g, h;
 	int i;
 
 	/* load the input */
 	for (i = 0; i < 16; i += 8) {
-		LOAD_OP(i + 0, W, input);
-		LOAD_OP(i + 1, W, input);
-		LOAD_OP(i + 2, W, input);
-		LOAD_OP(i + 3, W, input);
-		LOAD_OP(i + 4, W, input);
-		LOAD_OP(i + 5, W, input);
-		LOAD_OP(i + 6, W, input);
-		LOAD_OP(i + 7, W, input);
+		LOAD_OP(i + 0, KW->W, input);
+		LOAD_OP(i + 1, KW->W, input);
+		LOAD_OP(i + 2, KW->W, input);
+		LOAD_OP(i + 3, KW->W, input);
+		LOAD_OP(i + 4, KW->W, input);
+		LOAD_OP(i + 5, KW->W, input);
+		LOAD_OP(i + 6, KW->W, input);
+		LOAD_OP(i + 7, KW->W, input);
 	}
 
 	/* now blend */
 	for (i = 16; i < 64; i += 8) {
-		BLEND_OP(i + 0, W);
-		BLEND_OP(i + 1, W);
-		BLEND_OP(i + 2, W);
-		BLEND_OP(i + 3, W);
-		BLEND_OP(i + 4, W);
-		BLEND_OP(i + 5, W);
-		BLEND_OP(i + 6, W);
-		BLEND_OP(i + 7, W);
+		BLEND_OP(i + 0, KW->W);
+		BLEND_OP(i + 1, KW->W);
+		BLEND_OP(i + 2, KW->W);
+		BLEND_OP(i + 3, KW->W);
+		BLEND_OP(i + 4, KW->W);
+		BLEND_OP(i + 5, KW->W);
+		BLEND_OP(i + 6, KW->W);
+		BLEND_OP(i + 7, KW->W);
 	}
 
 	/* load the state into our registers */
@@ -115,7 +120,7 @@ void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 {
 	unsigned int partial, done;
 	const u8 *src;
-	u32 W[64];
+	struct KW KW;
 
 	partial = sctx->count & 0x3f;
 	sctx->count += len;
@@ -129,13 +134,15 @@ void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 		src = sctx->buf;
 	}
 
+	memcpy(KW.K, SHA256_K, sizeof(KW.K));
+
 	do {
-		sha256_transform(sctx->state, src, W);
+		sha256_transform(sctx->state, src, &KW);
 		done += 64;
 		src = data + done;
 	} while (done + 63 < len);
 
-	memzero_explicit(W, sizeof(W));
+	memzero_explicit(KW.W, sizeof(KW.W));
 
 	partial = 0;
 }
Putting the round constants and the message schedule arrays together in
one structure saves one register, which can be a significant benefit on
register-constrained architectures. On x86-32 (tested on Broadwell
Xeon), this gives a 10% performance benefit.

Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Suggested-by: David Laight <David.Laight@ACULAB.COM>
---
 lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 21 deletions(-)