[crypto,v3,0/2] reduce code size from blake2s on m68k and other small platforms

Message ID	20220111220506.742067-1-Jason@zx2c4.com (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: "Jason A. Donenfeld" <Jason@zx2c4.com> To: linux-crypto@vger.kernel.org, netdev@vger.kernel.org, wireguard@lists.zx2c4.com, linux-kernel@vger.kernel.org, bpf@vger.kernel.org, geert@linux-m68k.org, tytso@mit.edu, gregkh@linuxfoundation.org, jeanphilippe.aumasson@gmail.com, ardb@kernel.org Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Subject: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms Date: Tue, 11 Jan 2022 23:05:04 +0100 Message-Id: <20220111220506.742067-1-Jason@zx2c4.com> In-Reply-To: <20220111181037.632969-1-Jason@zx2c4.com> References: <20220111181037.632969-1-Jason@zx2c4.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	reduce code size from blake2s on m68k and other small platforms \| expand [crypto,v3,0/2] reduce code size from blake2s on m68k and other small platforms [crypto,v3,1/2] lib/crypto: blake2s: move hmac construction into wireguard [crypto,v3,2/2] lib/crypto: sha1: re-roll loops to reduce code size

Jason A. Donenfeld Jan. 11, 2022, 10:05 p.m. UTC

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
resulting in even larger code size reduction and much better
performance. The reason I'm sending yet a third version in such a short
amount of time is because the trick here feels obvious and substantial
enough that I'd hate for Geert to waste time measuring the impact of the
previous commit.

Thanks,
Jason

Jason A. Donenfeld (2):
  lib/crypto: blake2s: move hmac construction into wireguard
  lib/crypto: sha1: re-roll loops to reduce code size

 drivers/net/wireguard/noise.c | 45 ++++++++++++++---
 include/crypto/blake2s.h      |  3 --
 lib/crypto/blake2s-selftest.c | 31 ------------
 lib/crypto/blake2s.c          | 37 --------------
 lib/sha1.c                    | 95 ++++++-----------------------------
 5 files changed, 53 insertions(+), 158 deletions(-)

Geert Uytterhoeven Jan. 12, 2022, 10:59 a.m. UTC | #1

Hi Jason,

On Tue, Jan 11, 2022 at 11:05 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Geert emailed me this afternoon concerned about blake2s codesize on m68k
> and other small systems. We identified two effective ways of chopping
> down the size. One of them moves some wireguard-specific things into
> wireguard proper. The other one adds a slower codepath for small
> machines to blake2s. This worked, and was v1 of this patchset, but I
> wasn't so much of a fan. Then someone pointed out that the generic C
> SHA-1 implementation is still unrolled, which is a *lot* of extra code.
> Simply rerolling that saves about as much as v1 did. So, we instead do
> that in this patchset. SHA-1 is being phased out, and soon it won't
> be included at all (hopefully). And nothing performance-oriented has
> anything to do with it anyway.
>
> The result of these two patches mitigates Geert's feared code size
> increase for 5.17.
>
> v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
> resulting in even larger code size reduction and much better
> performance. The reason I'm sending yet a third version in such a short
> amount of time is because the trick here feels obvious and substantial
> enough that I'd hate for Geert to waste time measuring the impact of the
> previous commit.
>
> Thanks,
> Jason
>
> Jason A. Donenfeld (2):
>   lib/crypto: blake2s: move hmac construction into wireguard
>   lib/crypto: sha1: re-roll loops to reduce code size

Thanks for the series!

On m68k:
add/remove: 1/4 grow/shrink: 0/1 up/down: 4/-4232 (-4228)
Function                                     old     new   delta
__ksymtab_blake2s256_hmac                     12       -     -12
blake2s_init.constprop                        94       -     -94
blake2s256_hmac                              302       -    -302
sha1_transform                              4402     582   -3820
Total: Before=4230537, After=4226309, chg -0.10%

Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

Jason A. Donenfeld Jan. 12, 2022, 1:18 p.m. UTC | #2

Hi Geert,

On Wed, Jan 12, 2022 at 12:00 PM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> Thanks for the series!
>
> On m68k:
> add/remove: 1/4 grow/shrink: 0/1 up/down: 4/-4232 (-4228)
> Function                                     old     new   delta
> __ksymtab_blake2s256_hmac                     12       -     -12
> blake2s_init.constprop                        94       -     -94
> blake2s256_hmac                              302       -    -302
> sha1_transform                              4402     582   -3820
> Total: Before=4230537, After=4226309, chg -0.10%
>
> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Excellent, thanks for the breakdown. So this shaves off ~4k, which was
about what we were shooting for here, so I think indeed this series
accomplishes its goal of counteracting the addition of BLAKE2s.
Hopefully Herbert will apply this series for 5.17.

Jason

Herbert Xu Jan. 18, 2022, 6:42 a.m. UTC | #3

Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Excellent, thanks for the breakdown. So this shaves off ~4k, which was
> about what we were shooting for here, so I think indeed this series
> accomplishes its goal of counteracting the addition of BLAKE2s.
> Hopefully Herbert will apply this series for 5.17.

As the patches that triggered this weren't part of the crypto
tree, this will have to go through the random tree if you want
them for 5.17.

Otherwise if you're happy to wait then I can pull them through
cryptodev.

Cheers,

Jason A. Donenfeld Jan. 18, 2022, 11:43 a.m. UTC | #4

On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> As the patches that triggered this weren't part of the crypto
> tree, this will have to go through the random tree if you want
> them for 5.17.

Sure, will do.

David Laight Jan. 18, 2022, 12:44 p.m. UTC | #5

From: Jason A. Donenfeld
> Sent: 18 January 2022 11:43
> 
> On 1/18/22, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > As the patches that triggered this weren't part of the crypto
> > tree, this will have to go through the random tree if you want
> > them for 5.17.
> 
> Sure, will do.

I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8

Some things I've noticed;

1) There is no point having all the inline functions.
   Far better to have real functions to do the work.
   Given the cost of hashing 64 bytes of data the extra
   function call won't matter.
   Indeed for repeated calls it will help because the required
   code will be in the I-cache.

2) The compiles I tried do manage to remove the blake2_sigma[][]
   when unrolling everything - which is a slight gain for the full
   unroll. But I doubt it is that significant if the access can
   get sensibly optimised.
   For non-x86 that might require all the values by multiplied by 4.

3) Although G() is a massive register dependency chain the compiler
   knows that G(,[0-3],) are independent and can execute in parallel.
   This does help execution time on multi-issue cpu (like x86).
   With care it ought to be possible to use the same code for G(,[4-7],)
   without stopping the compiler interleaving all the instructions.

4) I strongly suspect that using a loop for the rounds will have
   minimal impact on performance - especially if the first call is
   'cold cache'.
   But I've not got time to test the code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Jason A. Donenfeld Jan. 18, 2022, 12:50 p.m. UTC | #6

On Tue, Jan 18, 2022 at 1:45 PM David Laight <David.Laight@aculab.com> wrote:
> I've rammed the code through godbolt... https://godbolt.org/z/Wv64z9zG8
>
> Some things I've noticed;

It seems like you've done a lot of work here but...

>    But I've not got time to test the code.

But you're not going to take it all the way. So it unfortunately
amounts to mailing list armchair optimization. That's too bad because
it really seems like you might be onto something worth seeing through.
As I've mentioned a few times now, I've dropped the blake2s
optimization patch, and I won't be developing that further. But it
appears as though you've really been captured by it, so I urge you:
please send a real patch with benchmarks on various platforms! (And CC
me on the patch.) Faster reference code would really be terrific.

Jason

[crypto,v3,0/2] reduce code size from blake2s on m68k and other small platforms

Message

Comments