[v2,4/4] crypto: x86/sha256-ni - simplify do_4rounds

Message ID	20240411162359.39073-5-ebiggers@kernel.org (mailing list archive)
State	Accepted
Delegated to:	Herbert Xu
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CBC7C78C6C; Thu, 11 Apr 2024 16:25:45 +0000 (UTC) From: Eric Biggers <ebiggers@kernel.org> To: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Stefan Kanthak <stefan.kanthak@nexgo.de> Subject: [PATCH v2 4/4] crypto: x86/sha256-ni - simplify do_4rounds Date: Thu, 11 Apr 2024 09:23:59 -0700 Message-ID: <20240411162359.39073-5-ebiggers@kernel.org> In-Reply-To: <20240411162359.39073-1-ebiggers@kernel.org> References: <20240411162359.39073-1-ebiggers@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	crypto: x86/sha256-ni - cleanup and optimization \| expand [v2,0/4] crypto: x86/sha256-ni - cleanup and optimization [v2,1/4] crypto: x86/sha256-ni - convert to use rounds macros [v2,2/4] crypto: x86/sha256-ni - rename some register aliases [v2,3/4] crypto: x86/sha256-ni - optimize code size [v2,4/4] crypto: x86/sha256-ni - simplify do_4rounds

Eric Biggers April 11, 2024, 4:23 p.m. UTC

From: Eric Biggers <ebiggers@google.com>

Instead of loading the message words into both MSG and \m0 and then
adding the round constants to MSG, load the message words into \m0 and
the round constants into MSG and then add \m0 to MSG.  This shortens the
source code slightly.  It changes the instructions slightly, but it
doesn't affect binary code size and doesn't seem to affect performance.

Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/sha256_ni_asm.S | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Stefan Kanthak April 15, 2024, 8:41 p.m. UTC | #1

"Eric Biggers" <ebiggers@kernel.org> wrote:

> Instead of loading the message words into both MSG and \m0 and then
> adding the round constants to MSG, load the message words into \m0 and
> the round constants into MSG and then add \m0 to MSG.  This shortens the
> source code slightly.  It changes the instructions slightly, but it
> doesn't affect binary code size and doesn't seem to affect performance.

At last the final change: write the macro straightforward and SIMPLE,
closely matching NIST.FIPS.180-4.pdf and their order of operations.

@@ ...
+.macro  sha256  m0 :req, m1 :req, m2 :req, m3 :req
+.if \@ < 4
+        movdqu  \@*16(DATA_PTR), \m0
+        pshufb  SHUF_MASK, \m0          # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
+.else
+                                        # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)}
+                                        # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)}
+                                        # \m2 = {w(\@*16-8),  w(\@*16-7),  w(\@*16-6),  w(\@*16-5)}
+                                        # \m3 = {w(\@*16-4),  w(\@*16-3),  w(\@*16-2),  w(\@*16-1)}
+        sha256msg1 \m1, \m0
+        movdqa     \m3, TMP
+        palignr    $4, \m2, TMP
+        paddd      TMP, \m0
+        sha256msg2 \m3, \m0             # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
+.endif
+        movdqa      (\@-8)*16(SHA256CONSTANTS), MSG
+        paddd       \m0, MSG
+        sha256rnds2 STATE0, STATE1      # STATE1 = {f', e', b', a'}
+        punpckhqdq  MSG, MSG
+        sha256rnds2 STATE1, STATE0      # STATE0 = {f", e", b", a"},
+                                        # STATE1 = {h", g", d", c"}
+.endm

JFTR: you may simplify this further using .altmacro and generate \m0 to \m3
      as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within
      the macro, thus getting rid of its 4 arguments.

@@ ...
+.rept 4                                 # 4*4*4 rounds
+        sha256  MSG0, MSG1, MSG2, MSG3
+        sha256  MSG1, MSG2, MSG3, MSG0
+        sha256  MSG2, MSG3, MSG0, MSG1
+        sha256  MSG3, MSG0, MSG1, MSG2
+.endr

Now that all code written by Tim Chen and Sean Gulley is gone,
remove their copyright notice and insert your and my name instead.

regards
Stefan

PS: see <https://skanthak.homepage.t-online.de/fips-180.html>
    (which I still polish) not just for this implementation.

PPS: if MASM had a counter like \@, I'd used it there.

Eric Biggers April 15, 2024, 9:21 p.m. UTC | #2

On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote:
> "Eric Biggers" <ebiggers@kernel.org> wrote:
> 
> > Instead of loading the message words into both MSG and \m0 and then
> > adding the round constants to MSG, load the message words into \m0 and
> > the round constants into MSG and then add \m0 to MSG.  This shortens the
> > source code slightly.  It changes the instructions slightly, but it
> > doesn't affect binary code size and doesn't seem to affect performance.
> 
> At last the final change: write the macro straightforward and SIMPLE,
> closely matching NIST.FIPS.180-4.pdf and their order of operations.
> 
> @@ ...
> +.macro  sha256  m0 :req, m1 :req, m2 :req, m3 :req
> +.if \@ < 4
> +        movdqu  \@*16(DATA_PTR), \m0
> +        pshufb  SHUF_MASK, \m0          # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
> +.else
> +                                        # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)}
> +                                        # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)}
> +                                        # \m2 = {w(\@*16-8),  w(\@*16-7),  w(\@*16-6),  w(\@*16-5)}
> +                                        # \m3 = {w(\@*16-4),  w(\@*16-3),  w(\@*16-2),  w(\@*16-1)}
> +        sha256msg1 \m1, \m0
> +        movdqa     \m3, TMP
> +        palignr    $4, \m2, TMP
> +        paddd      TMP, \m0
> +        sha256msg2 \m3, \m0             # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
> +.endif
> +        movdqa      (\@-8)*16(SHA256CONSTANTS), MSG
> +        paddd       \m0, MSG
> +        sha256rnds2 STATE0, STATE1      # STATE1 = {f', e', b', a'}
> +        punpckhqdq  MSG, MSG
> +        sha256rnds2 STATE1, STATE0      # STATE0 = {f", e", b", a"},
> +                                        # STATE1 = {h", g", d", c"}
> +.endm
> 
> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3
>       as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within
>       the macro, thus getting rid of its 4 arguments.
> 
> @@ ...
> +.rept 4                                 # 4*4*4 rounds
> +        sha256  MSG0, MSG1, MSG2, MSG3
> +        sha256  MSG1, MSG2, MSG3, MSG0
> +        sha256  MSG2, MSG3, MSG0, MSG1
> +        sha256  MSG3, MSG0, MSG1, MSG2
> +.endr

Could you please send a real patch, following
Documentation/process/submitting-patches.rst?  It's hard to understand what
you're proposing here.

> Now that all code written by Tim Chen and Sean Gulley is gone,
> remove their copyright notice and insert your and my name instead.

Well, their code has been cleaned up.  We have to keep copyright notices around
unless we're certain they can go.

> 
> regards
> Stefan
> 
> PS: see <https://skanthak.homepage.t-online.de/fips-180.html>
>     (which I still polish) not just for this implementation.
> 
> PPS: if MASM had a counter like \@, I'd used it there.

Thanks,

- Eric

Stefan Kanthak April 15, 2024, 10:04 p.m. UTC | #3

"Eric Biggers" <ebiggers@kernel.org> wrote:

> On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote:
[...]
>> At last the final change: write the macro straightforward and SIMPLE,
>> closely matching NIST.FIPS.180-4.pdf and their order of operations.
>> 
>> @@ ...
>> +.macro  sha256  m0 :req, m1 :req, m2 :req, m3 :req
>> +.if \@ < 4
>> +        movdqu  \@*16(DATA_PTR), \m0
>> +        pshufb  SHUF_MASK, \m0          # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
>> +.else
>> +                                        # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)}
>> +                                        # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)}
>> +                                        # \m2 = {w(\@*16-8),  w(\@*16-7),  w(\@*16-6),  w(\@*16-5)}
>> +                                        # \m3 = {w(\@*16-4),  w(\@*16-3),  w(\@*16-2),  w(\@*16-1)}
>> +        sha256msg1 \m1, \m0
>> +        movdqa     \m3, TMP
>> +        palignr    $4, \m2, TMP
>> +        paddd      TMP, \m0
>> +        sha256msg2 \m3, \m0             # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
>> +.endif
>> +        movdqa      (\@-8)*16(SHA256CONSTANTS), MSG
>> +        paddd       \m0, MSG
>> +        sha256rnds2 STATE0, STATE1      # STATE1 = {f', e', b', a'}
>> +        punpckhqdq  MSG, MSG
>> +        sha256rnds2 STATE1, STATE0      # STATE0 = {f", e", b", a"},
>> +                                        # STATE1 = {h", g", d", c"}
>> +.endm
>> 
>> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3
>>       as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within
>>       the macro, thus getting rid of its 4 arguments.
>> 
>> @@ ...
>> +.rept 4                                 # 4*4*4 rounds
>> +        sha256  MSG0, MSG1, MSG2, MSG3
>> +        sha256  MSG1, MSG2, MSG3, MSG0
>> +        sha256  MSG2, MSG3, MSG0, MSG1
>> +        sha256  MSG3, MSG0, MSG1, MSG2
>> +.endr
> 
> Could you please send a real patch, following
> Documentation/process/submitting-patches.rst?  It's hard to understand what
> you're proposing here.

1) I replace your macro (which unfortunately follows Tim Chens twisted code)
   COMPLETELY with a clean and simple implementation: message schedule first,
   update of state variables last.
   You don't need ".if \i >= 12 && \i < 60"/".if \i >= 4 && \i < 52" at all!

2) I replace the .irp which invokes your macro with a .rept: my macro uses \@
   instead of an argument for the round number.

<https://sourceware.org/binutils/docs/as.html#Macro>

| \@
|    as maintains a counter of how many macros it has executed in this pseudo-
|    variable; you can copy that number to your output with '\@', but only
|    within a macro definition. 

That's all.

regards
Stefan

Eric Biggers April 15, 2024, 10:46 p.m. UTC | #4

On Tue, Apr 16, 2024 at 12:04:56AM +0200, Stefan Kanthak wrote:
> "Eric Biggers" <ebiggers@kernel.org> wrote:
> 
> > On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote:
> [...]
> >> At last the final change: write the macro straightforward and SIMPLE,
> >> closely matching NIST.FIPS.180-4.pdf and their order of operations.
> >> 
> >> @@ ...
> >> +.macro  sha256  m0 :req, m1 :req, m2 :req, m3 :req
> >> +.if \@ < 4
> >> +        movdqu  \@*16(DATA_PTR), \m0
> >> +        pshufb  SHUF_MASK, \m0          # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
> >> +.else
> >> +                                        # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)}
> >> +                                        # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)}
> >> +                                        # \m2 = {w(\@*16-8),  w(\@*16-7),  w(\@*16-6),  w(\@*16-5)}
> >> +                                        # \m3 = {w(\@*16-4),  w(\@*16-3),  w(\@*16-2),  w(\@*16-1)}
> >> +        sha256msg1 \m1, \m0
> >> +        movdqa     \m3, TMP
> >> +        palignr    $4, \m2, TMP
> >> +        paddd      TMP, \m0
> >> +        sha256msg2 \m3, \m0             # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
> >> +.endif
> >> +        movdqa      (\@-8)*16(SHA256CONSTANTS), MSG
> >> +        paddd       \m0, MSG
> >> +        sha256rnds2 STATE0, STATE1      # STATE1 = {f', e', b', a'}
> >> +        punpckhqdq  MSG, MSG
> >> +        sha256rnds2 STATE1, STATE0      # STATE0 = {f", e", b", a"},
> >> +                                        # STATE1 = {h", g", d", c"}
> >> +.endm
> >> 
> >> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3
> >>       as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within
> >>       the macro, thus getting rid of its 4 arguments.
> >> 
> >> @@ ...
> >> +.rept 4                                 # 4*4*4 rounds
> >> +        sha256  MSG0, MSG1, MSG2, MSG3
> >> +        sha256  MSG1, MSG2, MSG3, MSG0
> >> +        sha256  MSG2, MSG3, MSG0, MSG1
> >> +        sha256  MSG3, MSG0, MSG1, MSG2
> >> +.endr
> > 
> > Could you please send a real patch, following
> > Documentation/process/submitting-patches.rst?  It's hard to understand what
> > you're proposing here.
> 
> 1) I replace your macro (which unfortunately follows Tim Chens twisted code)
>    COMPLETELY with a clean and simple implementation: message schedule first,
>    update of state variables last.
>    You don't need ".if \i >= 12 && \i < 60"/".if \i >= 4 && \i < 52" at all!

It's probably intentional that the code does the message schedule computations a
bit ahead of time.  This might improve performance by reducing the time spent
waiting for the message schedule.

It would be worth trying a few different variants on different CPUs and seeing
how they actually perform in practice, though.

> 
> 2) I replace the .irp which invokes your macro with a .rept: my macro uses \@
>    instead of an argument for the round number.
> 
> <https://sourceware.org/binutils/docs/as.html#Macro>

The \@ feature is a bit obscure and maybe is best avoided.

- Eric

Stefan Kanthak April 16, 2024, 12:17 a.m. UTC | #5

"Eric Biggers" <ebiggers@kernel.org> wrote:

> On Tue, Apr 16, 2024 at 12:04:56AM +0200, Stefan Kanthak wrote:
>> "Eric Biggers" <ebiggers@kernel.org> wrote:
>> 
>> > On Mon, Apr 15, 2024 at 10:41:07PM +0200, Stefan Kanthak wrote:
>> [...]
>> >> At last the final change: write the macro straightforward and SIMPLE,
>> >> closely matching NIST.FIPS.180-4.pdf and their order of operations.
>> >> 
>> >> @@ ...
>> >> +.macro  sha256  m0 :req, m1 :req, m2 :req, m3 :req
>> >> +.if \@ < 4
>> >> +        movdqu  \@*16(DATA_PTR), \m0
>> >> +        pshufb  SHUF_MASK, \m0          # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
>> >> +.else
>> >> +                                        # \m0 = {w(\@*16-16), w(\@*16-15), w(\@*16-14), w(\@*16-13)}
>> >> +                                        # \m1 = {w(\@*16-12), w(\@*16-11), w(\@*16-10), w(\@*16-9)}
>> >> +                                        # \m2 = {w(\@*16-8),  w(\@*16-7),  w(\@*16-6),  w(\@*16-5)}
>> >> +                                        # \m3 = {w(\@*16-4),  w(\@*16-3),  w(\@*16-2),  w(\@*16-1)}
>> >> +        sha256msg1 \m1, \m0
>> >> +        movdqa     \m3, TMP
>> >> +        palignr    $4, \m2, TMP
>> >> +        paddd      TMP, \m0
>> >> +        sha256msg2 \m3, \m0             # \m0 = {w(\@*16), w(\@*16+1), w(\@*16+2), w(\@*16+3)}
>> >> +.endif
>> >> +        movdqa      (\@-8)*16(SHA256CONSTANTS), MSG
>> >> +        paddd       \m0, MSG
>> >> +        sha256rnds2 STATE0, STATE1      # STATE1 = {f', e', b', a'}
>> >> +        punpckhqdq  MSG, MSG
>> >> +        sha256rnds2 STATE1, STATE0      # STATE0 = {f", e", b", a"},
>> >> +                                        # STATE1 = {h", g", d", c"}
>> >> +.endm
>> >> 
>> >> JFTR: you may simplify this further using .altmacro and generate \m0 to \m3
>> >>       as MSG%(4-\@&3), MSG%(5-\@&3), MSG%(6-\@&3) and MSG%(7-\@&3) within
>> >>       the macro, thus getting rid of its 4 arguments.
>> >> 
>> >> @@ ...
>> >> +.rept 4                                 # 4*4*4 rounds
>> >> +        sha256  MSG0, MSG1, MSG2, MSG3
>> >> +        sha256  MSG1, MSG2, MSG3, MSG0
>> >> +        sha256  MSG2, MSG3, MSG0, MSG1
>> >> +        sha256  MSG3, MSG0, MSG1, MSG2
>> >> +.endr
>> > 
>> > Could you please send a real patch, following
>> > Documentation/process/submitting-patches.rst?  It's hard to understand what
>> > you're proposing here.
>> 
>> 1) I replace your macro (which unfortunately follows Tim Chens twisted code)
>>    COMPLETELY with a clean and simple implementation: message schedule first,
>>    update of state variables last.
>>    You don't need ".if \i >= 12 && \i < 60"/".if \i >= 4 && \i < 52" at all!
> 
> It's probably intentional that the code does the message schedule computations a
> bit ahead of time.  This might improve performance by reducing the time spent
> waiting for the message schedule.

While this is a valid point, Intel states in
<https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html>
only for SHA-1:

| In other words, the rounds processing is the critical path and the latency of
| sha1rnds4 determines the performance of SHA-1 calculations.

For SHA-256 no such explicit statement is given: did the authors consider it not
worthwhile?

JFTR: while Tim Chen's code (following the paper) executes 3 instructions and 1
      sha256msg2 between every 2 sha256rnds2, my macro executes them back to back,
      so my code would be slower if their latency determines performance.

> It would be worth trying a few different variants on different CPUs and seeing
> how they actually perform in practice, though.

Right. I noticed no difference on Zen2+ and Zen3; Intel CPUs with SHA-NI are not
available to me (I didn't bother to use __asm__ on Compiler Explorer).

Stefan

[v2,4/4] crypto: x86/sha256-ni - simplify do_4rounds

Commit Message

Comments

Patch