Message ID | 20240411162359.39073-5-ebiggers@kernel.org (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Herbert Xu |
Headers | show |
Series | crypto: x86/sha256-ni - cleanup and optimization | expand |
"Eric Biggers" <ebiggers@kernel.org> wrote: > Instead of loading the message words into both MSG and \m0 and then > adding the round constants to MSG, load the message words into \m0 and > the round constants into MSG and then add \m0 to MSG. This shortens the > source code slightly. It changes the instructions slightly, but it > doesn't affect binary code size and doesn't seem to affect performance. At last the final change: write the macro straightforward and SIMPLE, closely matching NIST.FIPS.180-4.pdf and their order of operations. @@ ... +.macro sha256 m0 :req, m1 :req, m2 :req, m3 :req +.if \@ < 4 + movdqu \@*16(DATA_PTR), \m0 + pshufb SHUF_MASK, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} +.else + # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)} + # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)} + # \m2 = {w(\@*16-8), w(\@*16-7), w(\@*16-6), w(\@*16-5)} + # \m3 = {w(\@*16-4), w(\@*16-3), w(\@*16-2), w(\@*16-1)} + sha256msg1 \m1, \m0 + movdqa \m3, TMP + palignr $4, \m2, TMP + paddd TMP, \m0 + sha256msg2 \m3, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} +.endif + movdqa (\@-8)*16(SHA256CONSTANTS), MSG + paddd \m0, MSG + sha256rnds2 STATE0, STATE1 # STATE1 = {f', e', b', a'} + punpckhqdq MSG, MSG + sha256rnds2 STATE1, STATE0 # STATE0 = {f", e", b", a"}, + # STATE1 = {h", g", d", c"} +.endm JFTR: you may simplify this further using .altmacro and generate \m0 to \m3 as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within the macro, thus getting rid of its 4 arguments. @@ ... +.rept 4 # 4*4*4 rounds + sha256 MSG0, MSG1, MSG2, MSG3 + sha256 MSG1, MSG2, MSG3, MSG0 + sha256 MSG2, MSG3, MSG0, MSG1 + sha256 MSG3, MSG0, MSG1, MSG2 +.endr Now that all code written by Tim Chen and Sean Gulley is gone, remove their copyright notice and insert your and my name instead. regards Stefan PS: see <https://skanthak.homepage.t-online.de/fips-180.html> (which I still polish) not just for this implementation. PPS: if MASM had a counter like \@, I'd used it there.
On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote: > "Eric Biggers" <ebiggers@kernel.org> wrote: > > > Instead of loading the message words into both MSG and \m0 and then > > adding the round constants to MSG, load the message words into \m0 and > > the round constants into MSG and then add \m0 to MSG. This shortens the > > source code slightly. It changes the instructions slightly, but it > > doesn't affect binary code size and doesn't seem to affect performance. > > At last the final change: write the macro straightforward and SIMPLE, > closely matching NIST.FIPS.180-4.pdf and their order of operations. > > @@ ... > +.macro sha256 m0 :req, m1 :req, m2 :req, m3 :req > +.if \@ < 4 > + movdqu \@*16(DATA_PTR), \m0 > + pshufb SHUF_MASK, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} > +.else > + # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)} > + # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)} > + # \m2 = {w(\@*16-8), w(\@*16-7), w(\@*16-6), w(\@*16-5)} > + # \m3 = {w(\@*16-4), w(\@*16-3), w(\@*16-2), w(\@*16-1)} > + sha256msg1 \m1, \m0 > + movdqa \m3, TMP > + palignr $4, \m2, TMP > + paddd TMP, \m0 > + sha256msg2 \m3, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} > +.endif > + movdqa (\@-8)*16(SHA256CONSTANTS), MSG > + paddd \m0, MSG > + sha256rnds2 STATE0, STATE1 # STATE1 = {f', e', b', a'} > + punpckhqdq MSG, MSG > + sha256rnds2 STATE1, STATE0 # STATE0 = {f", e", b", a"}, > + # STATE1 = {h", g", d", c"} > +.endm > > JFTR: you may simplify this further using .altmacro and generate \m0 to \m3 > as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within > the macro, thus getting rid of its 4 arguments. > > @@ ... > +.rept 4 # 4*4*4 rounds > + sha256 MSG0, MSG1, MSG2, MSG3 > + sha256 MSG1, MSG2, MSG3, MSG0 > + sha256 MSG2, MSG3, MSG0, MSG1 > + sha256 MSG3, MSG0, MSG1, MSG2 > +.endr Could you please send a real patch, following Documentation/process/submitting-patches.rst? It's hard to understand what you're proposing here. > Now that all code written by Tim Chen and Sean Gulley is gone, > remove their copyright notice and insert your and my name instead. Well, their code has been cleaned up. We have to keep copyright notices around unless we're certain they can go. > > regards > Stefan > > PS: see <https://skanthak.homepage.t-online.de/fips-180.html> > (which I still polish) not just for this implementation. > > PPS: if MASM had a counter like \@, I'd used it there. Thanks, - Eric
"Eric Biggers" <ebiggers@kernel.org> wrote: > On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote: [...] >> At last the final change: write the macro straightforward and SIMPLE, >> closely matching NIST.FIPS.180-4.pdf and their order of operations. >> >> @@ ... >> +.macro sha256 m0 :req, m1 :req, m2 :req, m3 :req >> +.if \@ < 4 >> + movdqu \@*16(DATA_PTR), \m0 >> + pshufb SHUF_MASK, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} >> +.else >> + # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)} >> + # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)} >> + # \m2 = {w(\@*16-8), w(\@*16-7), w(\@*16-6), w(\@*16-5)} >> + # \m3 = {w(\@*16-4), w(\@*16-3), w(\@*16-2), w(\@*16-1)} >> + sha256msg1 \m1, \m0 >> + movdqa \m3, TMP >> + palignr $4, \m2, TMP >> + paddd TMP, \m0 >> + sha256msg2 \m3, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} >> +.endif >> + movdqa (\@-8)*16(SHA256CONSTANTS), MSG >> + paddd \m0, MSG >> + sha256rnds2 STATE0, STATE1 # STATE1 = {f', e', b', a'} >> + punpckhqdq MSG, MSG >> + sha256rnds2 STATE1, STATE0 # STATE0 = {f", e", b", a"}, >> + # STATE1 = {h", g", d", c"} >> +.endm >> >> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3 >> as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within >> the macro, thus getting rid of its 4 arguments. >> >> @@ ... >> +.rept 4 # 4*4*4 rounds >> + sha256 MSG0, MSG1, MSG2, MSG3 >> + sha256 MSG1, MSG2, MSG3, MSG0 >> + sha256 MSG2, MSG3, MSG0, MSG1 >> + sha256 MSG3, MSG0, MSG1, MSG2 >> +.endr > > Could you please send a real patch, following > Documentation/process/submitting-patches.rst? It's hard to understand what > you're proposing here. 1) I replace your macro (which unfortunately follows Tim Chens twisted code) COMPLETELY with a clean and simple implementation: message schedule first, update of state variables last. You don't need ".if \i >= 12 && \i < 60"/".if \i >= 4 && \i < 52" at all! 2) I replace the .irp which invokes your macro with a .rept: my macro uses \@ instead of an argument for the round number. <https://sourceware.org/binutils/docs/as.html#Macro> | \@ | as maintains a counter of how many macros it has executed in this pseudo- | variable; you can copy that number to your output with '\@', but only | within a macro definition. That's all. regards Stefan
On Tue, Apr 16, 2024 at 12:04:56AM +0200, Stefan Kanthak wrote: > "Eric Biggers" <ebiggers@kernel.org> wrote: > > > On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote: > [...] > >> At last the final change: write the macro straightforward and SIMPLE, > >> closely matching NIST.FIPS.180-4.pdf and their order of operations. > >> > >> @@ ... > >> +.macro sha256 m0 :req, m1 :req, m2 :req, m3 :req > >> +.if \@ < 4 > >> + movdqu \@*16(DATA_PTR), \m0 > >> + pshufb SHUF_MASK, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} > >> +.else > >> + # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)} > >> + # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)} > >> + # \m2 = {w(\@*16-8), w(\@*16-7), w(\@*16-6), w(\@*16-5)} > >> + # \m3 = {w(\@*16-4), w(\@*16-3), w(\@*16-2), w(\@*16-1)} > >> + sha256msg1 \m1, \m0 > >> + movdqa \m3, TMP > >> + palignr $4, \m2, TMP > >> + paddd TMP, \m0 > >> + sha256msg2 \m3, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} > >> +.endif > >> + movdqa (\@-8)*16(SHA256CONSTANTS), MSG > >> + paddd \m0, MSG > >> + sha256rnds2 STATE0, STATE1 # STATE1 = {f', e', b', a'} > >> + punpckhqdq MSG, MSG > >> + sha256rnds2 STATE1, STATE0 # STATE0 = {f", e", b", a"}, > >> + # STATE1 = {h", g", d", c"} > >> +.endm > >> > >> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3 > >> as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within > >> the macro, thus getting rid of its 4 arguments. > >> > >> @@ ... > >> +.rept 4 # 4*4*4 rounds > >> + sha256 MSG0, MSG1, MSG2, MSG3 > >> + sha256 MSG1, MSG2, MSG3, MSG0 > >> + sha256 MSG2, MSG3, MSG0, MSG1 > >> + sha256 MSG3, MSG0, MSG1, MSG2 > >> +.endr > > > > Could you please send a real patch, following > > Documentation/process/submitting-patches.rst? It's hard to understand what > > you're proposing here. > > 1) I replace your macro (which unfortunately follows Tim Chens twisted code) > COMPLETELY with a clean and simple implementation: message schedule first, > update of state variables last. > You don't need ".if \i >= 12 && \i < 60"/".if \i >= 4 && \i < 52" at all! It's probably intentional that the code does the message schedule computations a bit ahead of time. This might improve performance by reducing the time spent waiting for the message schedule. It would be worth trying a few different variants on different CPUs and seeing how they actually perform in practice, though. > > 2) I replace the .irp which invokes your macro with a .rept: my macro uses \@ > instead of an argument for the round number. > > <https://sourceware.org/binutils/docs/as.html#Macro> The \@ feature is a bit obscure and maybe is best avoided. - Eric
"Eric Biggers" <ebiggers@kernel.org> wrote: > On Tue, Apr 16, 2024 at 12:04:56AM +0200, Stefan Kanthak wrote: >> "Eric Biggers" <ebiggers@kernel.org> wrote: >> >> > On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote: >> [...] >> >> At last the final change: write the macro straightforward and SIMPLE, >> >> closely matching NIST.FIPS.180-4.pdf and their order of operations. >> >> >> >> @@ ... >> >> +.macro sha256 m0 :req, m1 :req, m2 :req, m3 :req >> >> +.if \@ < 4 >> >> + movdqu \@*16(DATA_PTR), \m0 >> >> + pshufb SHUF_MASK, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} >> >> +.else >> >> + # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)} >> >> + # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)} >> >> + # \m2 = {w(\@*16-8), w(\@*16-7), w(\@*16-6), w(\@*16-5)} >> >> + # \m3 = {w(\@*16-4), w(\@*16-3), w(\@*16-2), w(\@*16-1)} >> >> + sha256msg1 \m1, \m0 >> >> + movdqa \m3, TMP >> >> + palignr $4, \m2, TMP >> >> + paddd TMP, \m0 >> >> + sha256msg2 \m3, \m0 # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)} >> >> +.endif >> >> + movdqa (\@-8)*16(SHA256CONSTANTS), MSG >> >> + paddd \m0, MSG >> >> + sha256rnds2 STATE0, STATE1 # STATE1 = {f', e', b', a'} >> >> + punpckhqdq MSG, MSG >> >> + sha256rnds2 STATE1, STATE0 # STATE0 = {f", e", b", a"}, >> >> + # STATE1 = {h", g", d", c"} >> >> +.endm >> >> >> >> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3 >> >> as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within >> >> the macro, thus getting rid of its 4 arguments. >> >> >> >> @@ ... >> >> +.rept 4 # 4*4*4 rounds >> >> + sha256 MSG0, MSG1, MSG2, MSG3 >> >> + sha256 MSG1, MSG2, MSG3, MSG0 >> >> + sha256 MSG2, MSG3, MSG0, MSG1 >> >> + sha256 MSG3, MSG0, MSG1, MSG2 >> >> +.endr >> > >> > Could you please send a real patch, following >> > Documentation/process/submitting-patches.rst? It's hard to understand what >> > you're proposing here. >> >> 1) I replace your macro (which unfortunately follows Tim Chens twisted code) >> COMPLETELY with a clean and simple implementation: message schedule first, >> update of state variables last. >> You don't need ".if \i >= 12 && \i < 60"/".if \i >= 4 && \i < 52" at all! > > It's probably intentional that the code does the message schedule computations a > bit ahead of time. This might improve performance by reducing the time spent > waiting for the message schedule. While this is a valid point, Intel states in <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html> only for SHA-1: | In other words, the rounds processing is the critical path and the latency of | sha1rnds4 determines the performance of SHA-1 calculations. For SHA-256 no such explicit statement is given: did the authors consider it not worthwhile? JFTR: while Tim Chen's code (following the paper) executes 3 instructions and 1 sha256msg2 between every 2 sha256rnds2, my macro executes them back to back, so my code would be slower if their latency determines performance. > It would be worth trying a few different variants on different CPUs and seeing > how they actually perform in practice, though. Right. I noticed no difference on Zen2+ and Zen3; Intel CPUs with SHA-NI are not available to me (I didn't bother to use __asm__ on Compiler Explorer). Stefan
diff --git a/arch/x86/crypto/sha256_ni_asm.S b/arch/x86/crypto/sha256_ni_asm.S index ffc9f1c75c15..d515a55a3bc1 100644 --- a/arch/x86/crypto/sha256_ni_asm.S +++ b/arch/x86/crypto/sha256_ni_asm.S @@ -76,17 +76,15 @@ #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 .macro do_4rounds i, m0, m1, m2, m3 .if \i < 16 - movdqu \i*4(DATA_PTR), MSG - pshufb SHUF_MASK, MSG - movdqa MSG, \m0 -.else - movdqa \m0, MSG + movdqu \i*4(DATA_PTR), \m0 + pshufb SHUF_MASK, \m0 .endif - paddd (\i-32)*4(SHA256CONSTANTS), MSG + movdqa (\i-32)*4(SHA256CONSTANTS), MSG + paddd \m0, MSG sha256rnds2 STATE0, STATE1 .if \i >= 12 && \i < 60 movdqa \m0, TMP palignr $4, \m3, TMP paddd TMP, \m1