Message ID | 20201025143119.1054168-7-nivedita@alum.mit.edu (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Herbert Xu |
Series | crypto: lib/sha256 - cleanup/optimization |
From: Arvind Sankar
> Sent: 25 October 2020 14:31
>
> Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> (tested on Broadwell Xeon) while not increasing code size too much.

I can't believe unrolling the BLEND loop makes any difference.
Unrolling the LOAD one might - but you don't need 8 times, once should be
more than enough.

The LOAD loop needs a memory read, memory write and BSWAP per iteration.
The loop control is add + compare + jmp.
On Sandy Bridge and later the compare and jmp become a single u-op, so the
loop has the read, the write (they can happen together) and 3 other u-ops.
That won't run at 1 clock per iteration on Sandy Bridge.

However, unroll just once and you need 4 non-memory u-ops per loop
iteration. That might run at 2 clocks per 8 bytes.

Fiddling the loop to remove the compare (i.e. run the index from -64 to 0)
should merge the 'add' and 'jnz' into a single u-op. That might be enough
to get the 'rolled up' loop to run in 1 clock on Sandy Bridge, and
certainly on slightly later CPUs.

That is theoretical for Intel CPUs from Sandy Bridge onwards. I've an
i7-7700 (Kaby Lake?) that I believe has an extra instruction pipeline and
might run the initial loop in 1 clock. I don't have any recent AMD CPUs,
nor any ARM or PPC ones, but fully out-of-order CPUs are likely to be
similar.

One of the other test systems I've got is an Atom C2758. This is 8-core
but mostly in-order. Running sha256_transform() on it tends to give one
of two TSC counts, one of which is double the other! That is pretty
consistent even for 100 iterations.

WRT patch 5:
On the C2758 the original unrolled code is slightly faster.
On the i7-7700 the 8x unroll is a bit faster 'hot cache', but slower
'cold cache' - probably because of the d-cache loads for K[].

Non-x86 architectures might need to use d-cache reads for the 32-bit 'K'
constants even in the unrolled loop. x86 can use 'lea' with a 32-bit
offset to avoid data reads, so the cold-cache case for the old code may
be similar.

Interestingly, I had to write an asm ror32() to get reasonable code (in
userspace); the C version the kernel uses didn't work.

	David
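A minimal, purely illustrative sketch of the loop-control idea above (not
from the patch, and shown in word units rather than the byte offsets David
mentions): counting a negative index up to zero lets the compiler end the
loop on an add/jnz pair, which recent Intel cores can fuse into one u-op,
instead of add + cmp + jcc. LOAD_OP() is the existing helper from
lib/crypto/sha256.c.

	/* load the input: same body, only the loop control changes */
	for (i = -16; i != 0; i++)
		LOAD_OP(i + 16, W, input);

And a rough example of the kind of asm ror32() mentioned at the end; the
exact version is not shown in the thread, so this x86 inline-assembly
helper is only an assumption of what it might look like:

	static inline u32 ror32_asm(u32 word, unsigned int shift)
	{
		/* rotate right by a variable count; x86 takes the count in %cl */
		asm("rorl %%cl, %0" : "+r" (word) : "c" (shift));
		return word;
	}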
On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 14:31
> >
> > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > (tested on Broadwell Xeon) while not increasing code size too much.
>
> I can't believe unrolling the BLEND loop makes any difference.

It's actually the BLEND loop that accounts for almost all of the
difference. The LOAD loop doesn't matter much in general: even replacing
it with a plain memcpy() only increases performance by 3-4%. But
unrolling it is low cost in code size terms, and clang actually does it
without being asked.

> WRT patch 5:
> On the C2758 the original unrolled code is slightly faster.
> On the i7-7700 the 8x unroll is a bit faster 'hot cache', but slower
> 'cold cache' - probably because of the d-cache loads for K[].
>
> Non-x86 architectures might need to use d-cache reads for the 32-bit 'K'
> constants even in the unrolled loop. x86 can use 'lea' with a 32-bit
> offset to avoid data reads, so the cold-cache case for the old code may
> be similar.

Not sure I follow: in the old code, the K's are 32-bit immediates, so
they should come from the i-cache whether an add or an lea is used?

Why is the cold-cache case relevant anyway? If the code is only being
executed a couple of times or so, i.e. you're hashing a single, say,
64-128 byte input once in a blue moon, the performance of the hash
doesn't really matter, no?

> Interestingly, I had to write an asm ror32() to get reasonable code (in
> userspace); the C version the kernel uses didn't work.
>
> David
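For context, the LOAD step being discussed is one big-endian load per
32-bit word; the helper in lib/crypto/sha256.c looks roughly like the
following, which is why a plain memcpy() is not a drop-in replacement on
little-endian machines:

	static inline void LOAD_OP(int I, u32 *W, const u8 *input)
	{
		/* fetch message word I with a byte swap on little-endian CPUs */
		W[I] = get_unaligned_be32((__u32 *)input + I);
	}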
From: Arvind Sankar
> Sent: 25 October 2020 20:18
>
> On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > From: Arvind Sankar
> > > Sent: 25 October 2020 14:31
> > >
> > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > (tested on Broadwell Xeon) while not increasing code size too much.
> >
> > I can't believe unrolling the BLEND loop makes any difference.
>
> It's actually the BLEND loop that accounts for almost all of the
> difference. The LOAD loop doesn't matter much in general: even replacing
> it with a plain memcpy() only increases performance by 3-4%. But
> unrolling it is low cost in code size terms, and clang actually does it
> without being asked.

(memcpy() is wrong - it misses the byte swaps.)

That's odd, the BLEND loop is about 20 instructions. I wouldn't expect
unrolling to help - unless you manage to use 16 registers for the active
W[] values.

> > WRT patch 5:
> > On the C2758 the original unrolled code is slightly faster.
> > On the i7-7700 the 8x unroll is a bit faster 'hot cache', but slower
> > 'cold cache' - probably because of the d-cache loads for K[].
> >
> > Non-x86 architectures might need to use d-cache reads for the 32-bit 'K'
> > constants even in the unrolled loop. x86 can use 'lea' with a 32-bit
> > offset to avoid data reads, so the cold-cache case for the old code may
> > be similar.
>
> Not sure I follow: in the old code, the K's are 32-bit immediates, so
> they should come from the i-cache whether an add or an lea is used?

I was thinking of other instruction sets that end up using pc-relative
addressing for constants. It might only happen for 64-bit ones though.

> Why is the cold-cache case relevant anyway? If the code is only being
> executed a couple of times or so, i.e. you're hashing a single, say,
> 64-128 byte input once in a blue moon, the performance of the hash
> doesn't really matter, no?

I was measuring the cold-cache case because I could. I didn't note the
actual figures, but it was 8-10 times slower than the hot-cache case.

While sha256 is likely to be run hot-cache (on a big buffer), the
cold-cache timings are probably relevant for things like memcpy().
I remember seeing a very long divide function for sparc32 that was
probably only a gain in a benchmark loop - it would have displaced a lot
of the working set from the i-cache!

	David
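For reference, the BLEND step is the SHA-256 message-schedule expansion;
in lib/crypto/sha256.c it is roughly the following. Each sigma is two
rotates, a shift and two XORs, plus three adds and the loads/store, which
is roughly where the ~20-instruction estimate comes from.

	/* SHA-256 small sigma functions used by the message schedule */
	#define s0(x)	(ror32(x,  7) ^ ror32(x, 18) ^ (x >> 3))
	#define s1(x)	(ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10))

	/* expand word I of the schedule from four earlier words */
	#define BLEND_OP(I, W)	(W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16])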
On Sun, Oct 25, 2020 at 11:23:52PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 20:18
> >
> > On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > > From: Arvind Sankar
> > > > Sent: 25 October 2020 14:31
> > > >
> > > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > > (tested on Broadwell Xeon) while not increasing code size too much.
> > >
> > > I can't believe unrolling the BLEND loop makes any difference.
> >
> > It's actually the BLEND loop that accounts for almost all of the
> > difference. The LOAD loop doesn't matter much in general: even replacing
> > it with a plain memcpy() only increases performance by 3-4%. But
> > unrolling it is low cost in code size terms, and clang actually does it
> > without being asked.
>
> (memcpy() is wrong - it misses the byte swaps.)

I know it's wrong; the point is that it's impossible to gain very much
from optimizing the LOAD loop, because it doesn't account for much of
the total time.

> That's odd, the BLEND loop is about 20 instructions. I wouldn't expect
> unrolling to help - unless you manage to use 16 registers for the active
> W[] values.

I am not sure about what's going on inside the hardware, but even with
a straightforward assembly version that just reads out of memory the way
the calculation is specified, unrolling the BLEND loop 8x speeds up the
performance by 7-8%.

The compiler is actually pretty bad here: just translating everything
into assembler with no attempt to optimize anything gets a 10-12%
speedup over the C version.
On Sun, 25 Oct 2020 at 15:31, Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> (tested on Broadwell Xeon) while not increasing code size too much.
>
> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
> Reviewed-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  lib/crypto/sha256.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> index e2e29d9b0ccd..cdef37c05972 100644
> --- a/lib/crypto/sha256.c
> +++ b/lib/crypto/sha256.c
> @@ -76,12 +76,28 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
>         int i;
>
>         /* load the input */
> -       for (i = 0; i < 16; i++)
> -               LOAD_OP(i, W, input);
> +       for (i = 0; i < 16; i += 8) {
> +               LOAD_OP(i + 0, W, input);
> +               LOAD_OP(i + 1, W, input);
> +               LOAD_OP(i + 2, W, input);
> +               LOAD_OP(i + 3, W, input);
> +               LOAD_OP(i + 4, W, input);
> +               LOAD_OP(i + 5, W, input);
> +               LOAD_OP(i + 6, W, input);
> +               LOAD_OP(i + 7, W, input);
> +       }
>
>         /* now blend */
> -       for (i = 16; i < 64; i++)
> -               BLEND_OP(i, W);
> +       for (i = 16; i < 64; i += 8) {
> +               BLEND_OP(i + 0, W);
> +               BLEND_OP(i + 1, W);
> +               BLEND_OP(i + 2, W);
> +               BLEND_OP(i + 3, W);
> +               BLEND_OP(i + 4, W);
> +               BLEND_OP(i + 5, W);
> +               BLEND_OP(i + 6, W);
> +               BLEND_OP(i + 7, W);
> +       }
>
>         /* load the state into our registers */
>         a = state[0]; b = state[1]; c = state[2]; d = state[3];
> --
> 2.26.2
>
From: Arvind Sankar
> Sent: 25 October 2020 23:54
...
> > That's odd, the BLEND loop is about 20 instructions. I wouldn't expect
> > unrolling to help - unless you manage to use 16 registers for the active
> > W[] values.
>
> I am not sure about what's going on inside the hardware, but even with
> a straightforward assembly version that just reads out of memory the way
> the calculation is specified, unrolling the BLEND loop 8x speeds up the
> performance by 7-8%.
>
> The compiler is actually pretty bad here: just translating everything
> into assembler with no attempt to optimize anything gets a 10-12%
> speedup over the C version.

I'm not seeing anything particularly stupid.

The loop body (excluding loop control) is 23 instructions, which doubles
to 46 if I unroll once. Unrolling 4 times does save a couple of
instructions per iteration.

The only horrid part of the code is the long dependency chain at the end
when the values get xor'ed together. gcc is very bad at that: it converts
(a + b) + (c + d) to (((a + b) + c) + d), which takes an extra clock.

Unrolling 4 times gives almost all the gain, but it really shouldn't be
needed at all.

	David
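To illustrate the associativity point, here is a hypothetical variant of
BLEND_OP (not something proposed in the thread) that spells out the
pairwise grouping with temporaries, so the final sum is a two-level tree
of adds rather than a serial chain. A compiler is still free to
reassociate it, so this is only a sketch of the intent.

	/* hypothetical: balance the adds in the message-schedule update */
	#define BLEND_OP_PAIRED(I, W) do {			\
		u32 t0 = s1(W[(I) - 2])  + W[(I) - 7];		\
		u32 t1 = s0(W[(I) - 15]) + W[(I) - 16];		\
		W[I] = t0 + t1;					\
	} while (0)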
diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index e2e29d9b0ccd..cdef37c05972 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -76,12 +76,28 @@ static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 	int i;
 
 	/* load the input */
-	for (i = 0; i < 16; i++)
-		LOAD_OP(i, W, input);
+	for (i = 0; i < 16; i += 8) {
+		LOAD_OP(i + 0, W, input);
+		LOAD_OP(i + 1, W, input);
+		LOAD_OP(i + 2, W, input);
+		LOAD_OP(i + 3, W, input);
+		LOAD_OP(i + 4, W, input);
+		LOAD_OP(i + 5, W, input);
+		LOAD_OP(i + 6, W, input);
+		LOAD_OP(i + 7, W, input);
+	}
 
 	/* now blend */
-	for (i = 16; i < 64; i++)
-		BLEND_OP(i, W);
+	for (i = 16; i < 64; i += 8) {
+		BLEND_OP(i + 0, W);
+		BLEND_OP(i + 1, W);
+		BLEND_OP(i + 2, W);
+		BLEND_OP(i + 3, W);
+		BLEND_OP(i + 4, W);
+		BLEND_OP(i + 5, W);
+		BLEND_OP(i + 6, W);
+		BLEND_OP(i + 7, W);
+	}
 
 	/* load the state into our registers */
 	a = state[0]; b = state[1]; c = state[2]; d = state[3];