lib/crypto: blake2s: fix a CFI failure

Message ID 20220119082447.1675-1-miles.chen@mediatek.com (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu
Series lib/crypto: blake2s: fix a CFI failure

Commit Message

Miles Chen Jan. 19, 2022, 8:24 a.m. UTC
From: Miles Chen <miles.chen@mediatek.com>

With CONFIG_CFI_CLANG=y, we observe a CFI failure of
blake2s_compress_generic.

Reverting commit 6048fdcc5f26 ("lib/crypto: blake2s: include as built-in")
resolves the problem, so I looked into that patch and I think the failure
is caused by the weak symbol used for blake2s_compress().

To fix it, remove the weak symbol and use CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S
to select between blake2s_compress_arch() and blake2s_compress_generic().
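
For reference, the weak alias in lib/crypto/blake2s-generic.c being removed
here looks like this:

void blake2s_compress(struct blake2s_state *state, const u8 *block,
		      size_t nblocks, const u32 inc)
		      __weak __alias(blake2s_compress_generic);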

log:
[    0.000000][    T0] Kernel panic - not syncing: CFI failure (target: blake2s_compress_generic+0x0/0x1444)
[    0.000000][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-mainline-06981-g076c855b846e #1
[    0.000000][    T0] Hardware name: MT6873 (DT)
[    0.000000][    T0] Call trace:
[    0.000000][    T0]  dump_backtrace+0xfc/0x1dc
[    0.000000][    T0]  dump_stack_lvl+0xa8/0x11c
[    0.000000][    T0]  panic+0x194/0x464
[    0.000000][    T0]  __cfi_check_fail+0x54/0x58
[    0.000000][    T0]  __cfi_slowpath_diag+0x354/0x4b0
[    0.000000][    T0]  blake2s_update+0x14c/0x178
[    0.000000][    T0]  _extract_entropy+0xf4/0x29c
[    0.000000][    T0]  crng_initialize_primary+0x24/0x94
[    0.000000][    T0]  rand_initialize+0x2c/0x6c
[    0.000000][    T0]  start_kernel+0x2f8/0x65c
[    0.000000][    T0]  __primary_switched+0xc4/0x7be4
[    0.000000][    T0] Rebooting in 5 seconds..

Fixes: 6048fdcc5f26 ("lib/crypto: blake2s: include as built-in")
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
---
 lib/crypto/blake2s-generic.c | 3 +--
 lib/crypto/blake2s.c         | 6 ++++++
 2 files changed, 7 insertions(+), 2 deletions(-)

Comments

Jason A. Donenfeld Jan. 19, 2022, 9 a.m. UTC | #1
Hi Miles,

Thanks for the patch. Could you let me know which architecture and
compiler this was broken on? If I had to guess, I'd wager arm32, and
you hit this by enabling optimized blake2s?

If so, I'm not sure the problem is with weak symbols. Why should CFI
break weak symbols? Rather, perhaps the issue is that the function is
defined in blake2s-core.S? Are there some CFI macros we need for that
definition?

Jason
Jason A. Donenfeld Jan. 19, 2022, 9:09 a.m. UTC | #2
Hey again,

Actually... It looks like the issue is that in this file:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/crypto/internal/blake2s.h

this line

typedef void (*blake2s_compress_t)(struct blake2s_state *state,
				   const u8 *block, size_t nblocks, u32 inc);

should become

typedef void (*blake2s_compress_t)(struct blake2s_state *state,
				   const u8 *block, size_t nblocks, const u32 inc);

Does making that change fix things for you?

Thanks,
Jason
Ard Biesheuvel Jan. 19, 2022, 9:09 a.m. UTC | #3
(+ Sami, Eric)

On Wed, 19 Jan 2022 at 10:00, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Miles,
>
> Thanks for the patch. Could you let me know which architecture and
> compiler this was broken on? If I had to guess, I'd wager arm32, and
> you hit this by enabling optimized blake2s?
>
> If so, I'm not sure the problem is with weak symbols. Why should CFI
> break weak symbols? Rather, perhaps the issue is that the function is
> defined in blake2s-core.S? Are there some CFI macros we need for that
> definition?
>

We should try to understand why CFI thinks the prototypes of the two
symbols are different. There are still a number of issues with CFI, so
papering over them by reverting stuff that we want for good reasons is
not the way to go imo.

In the short term, you can work around it by avoiding the indirect
call to blake2s_compress, e.g.,

diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..fef2ff678431 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -16,9 +16,15 @@
 #include <linux/init.h>
 #include <linux/bug.h>

+static void __blake2s_compress(struct blake2s_state *state, const u8 *block,
+                              size_t nblocks, const u32 inc)
+{
+       return blake2s_compress(state, block, nblocks, inc);
+}
+
 void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
 {
-       __blake2s_update(state, in, inlen, blake2s_compress);
+       __blake2s_update(state, in, inlen, __blake2s_compress);
 }
 EXPORT_SYMBOL(blake2s_update);
Ard Biesheuvel Jan. 19, 2022, 9:13 a.m. UTC | #4
On Wed, 19 Jan 2022 at 10:09, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> (+ Sami, Eric)
>
> On Wed, 19 Jan 2022 at 10:00, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Hi Miles,
> >
> > Thanks for the patch. Could you let me know which architecture and
> > compiler this was broken on? If I had to guess, I'd wager arm32, and
> > you hit this by enabling optimized blake2s?
> >
> > If so, I'm not sure the problem is with weak symbols. Why should CFI
> > break weak symbols? Rather, perhaps the issue is that the function is
> > defined in blake2s-core.S? Are there some CFI macros we need for that
> > definition?
> >
>
> We should try to understand why CFI thinks the prototypes of the two
> symbols are different. There are still a number of issues with CFI, so
> papering over them by reverting stuff that we want for good reasons is
> not the way to go imo.
>
> In the short term, you can work around it by avoiding the indirect
> call to blake2s_compress, e.g.,
>
> diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
> index 93f2ae051370..fef2ff678431 100644
> --- a/lib/crypto/blake2s.c
> +++ b/lib/crypto/blake2s.c
> @@ -16,9 +16,15 @@
>  #include <linux/init.h>
>  #include <linux/bug.h>
>
> +static void __blake2s_compress(struct blake2s_state *state, const u8 *block,
> +                              size_t nblocks, const u32 inc)
> +{
> +       return blake2s_compress(state, block, nblocks, inc);
> +}
> +
>  void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
>  {
> -       __blake2s_update(state, in, inlen, blake2s_compress);
> +       __blake2s_update(state, in, inlen, __blake2s_compress);
>  }
>  EXPORT_SYMBOL(blake2s_update);

Ehm, maybe not. As Jason points out, the typedef does not have quite
the right type, so that is most likely the culprit, and this
workaround would trigger CFI in exactly the same way.

Interestingly, the compiler does not seem to mind, right? Or are you
seeing any build time warnings on the reference to blake2s_compress in
the call to __blake2s_update() ?
Miles Chen Jan. 19, 2022, 9:16 a.m. UTC | #5
>typedef void (*blake2s_compress_t)(struct blake2s_state *state,
>				   const u8 *block, size_t nblocks, u32 inc);
>
>should become
>
>typedef void (*blake2s_compress_t)(struct blake2s_state *state,
>				   const u8 *block, size_t nblocks, const u32 inc);
>
>Does making that change fix things for you?
>
>Thanks,
>Jason

Thanks for your fast response.
It does not work. I tried this before reverting the commit 6048fdcc5f26.

Miles
Miles Chen Jan. 19, 2022, 9:24 a.m. UTC | #6
hi,

>Thanks for the patch. Could you let me know which architecture and
>compiler this was broken on? If I had to guess, I'd wager arm32, and
>you hit this by enabling optimized blake2s?

Actually, I am merging the android-common tree and testing on our device.
I use arm64 and clang-r437112b.

I'm not sure which option is the right one; grepping 'BLAKE' in .config shows:

CONFIG_CRYPTO_BLAKE2B=y
# CONFIG_CRYPTO_BLAKE2S is not set
# CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y

and... I found that my patch breaks the arm32 build, sorry for that.

>If so, I'm not sure the problem is with weak symbols. Why should CFI
>break weak symbols? Rather, perhaps the issue is that the function is
>defined in blake2s-core.S? Are there some CFI macros we need for that
>definition?
Miles Chen Jan. 19, 2022, 9:43 a.m. UTC | #7
>Ehm, maybe not. As Jason points out, the typedef does not have quite
>the right type, so that is most likely the culprit, and this
>workaround would trigger CFI in exactly the same way.
>
>Interestingly, the compiler does not seem to mind, right? Or are you
>seeing any build time warnings on the reference to blake2s_compress in
>the call to __blake2s_update() ?

No, no warnings. -Wcast-function-type is globally enabled and I cleaned up
some cast-function-type errors, but there are no warnings in
__blake2s_update().


Miles
Jason A. Donenfeld Jan. 19, 2022, 9:55 a.m. UTC | #8
Hi Miles,

I'm actually not able to reproduce your oops. I'm using vanilla clang
13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
Kernel seems to run fine.

Are there other settings that are needed to trigger this? Do you see
it in upstream clang or just the Android fork of clang?

Jason
Miles Chen Jan. 19, 2022, 10:06 a.m. UTC | #9
Hi,

>Hi Miles,
>
>I'm actually not able to reproduce your oops. I'm using vanilla clang
>13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
>Kernel seems to run fine.
>
>
>Are there other settings that are needed to trigger this? Do you see
>it in upstream clang or just the Android fork of clang?
>
I will try another clang (the previous version I use).
I am using Android fork of clang and there is a clang upgrade in this merge.

Miles
Miles Chen Jan. 19, 2022, 10:10 a.m. UTC | #10
Hi, 

> We should try to understand why CFI thinks the prototypes of the two
> symbols are different. There are still a number of issues with CFI, so
> papering over them by reverting stuff that we want for good reasons is
> not the way to go imo.
> 
> In the short term, you can work around it by avoiding the indirect
> call to blake2s_compress, e.g.,

Thanks for the patch. I tried it and the issue remains.
As Jason said, he cannot reproduce this issue. I will try another version
of clang next.

Miles
Jason A. Donenfeld Jan. 19, 2022, 10:11 a.m. UTC | #11
Hi Miles,

Okay. Keep me posted.

Just FYI, as mentioned, I'm unable to reproduce this, and you haven't
provided any further minimized guidance on how I might reproduce this,
so it'll sit in the "not a bug" bin until I have another clue on how
to reproduce. Alternatively, Nick and Nathan are now on this thread
and they usually have good luck teasing out compiler issues and such,
so maybe they'll have an idea. But I'm afraid with the information I
currently have, I'm at a dead end.

Jason
Ard Biesheuvel Jan. 19, 2022, 10:13 a.m. UTC | #12
On Wed, 19 Jan 2022 at 11:06, Miles Chen <miles.chen@mediatek.com> wrote:
>
> Hi,
>
> >Hi Miles,
> >
> >I'm actually not able to reproduce your oops. I'm using vanilla clang
> >13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
> >Kernel seems to run fine.
> >
> >
> >Are there other settings that are needed to trigger this? Do you see
> >it in upstream clang or just the Android fork of clang?
> >
> I will try another clang (the previous version I use).
> I am using Android fork of clang and there is a clang upgrade in this merge.
>

One thing that could be worth a try is to make __blake2s_update() and
__blake2s_final() __always_inline rather than just inline, which by
itself does not appear to be sufficient for the code to get inlined.
(If it were, the indirect call should have disappeared as well)

Given that indirect calls suck on x86, we should probably apply that
change in any case, regardless of CFI.
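
For illustration, the change being suggested would look roughly like this
(untested sketch against include/crypto/internal/blake2s.h; __blake2s_final()
would get the same treatment):

-static inline void __blake2s_update(struct blake2s_state *state,
-				    const u8 *in, size_t inlen,
-				    blake2s_compress_t compress)
+static __always_inline void __blake2s_update(struct blake2s_state *state,
+					     const u8 *in, size_t inlen,
+					     blake2s_compress_t compress)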
Jason A. Donenfeld Jan. 19, 2022, 10:20 a.m. UTC | #13
On 1/19/22, Ard Biesheuvel <ardb@kernel.org> wrote:
> On Wed, 19 Jan 2022 at 11:06, Miles Chen <miles.chen@mediatek.com> wrote:
>>
>> Hi,
>>
>> >Hi Miles,
>> >
>> >I'm actually not able to reproduce your oops. I'm using vanilla clang
>> >13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
>> >Kernel seems to run fine.
>> >
>> >
>> >Are there other settings that are needed to trigger this? Do you see
>> >it in upstream clang or just the Android fork of clang?
>> >
>> I will try another clang (the previous version I use).
>> I am using Android fork of clang and there is a clang upgrade in this
>> merge.
>>
>
> One thing that could be worth a try is to make __blake2s_update() and
> __blake2s_final() __always_inline rather than just inline, which by
> itself does not appear to be sufficient for the code to get inlined.
> (If it were, the indirect call should have disappeared as well)
>
> Given that indirect calls suck on x86, we should probably apply that
> change in any case, regardless of CFI.
>

Had the same thought at first, but then looking at the original stack
trace, it looks like the __ function is inlined:

[    0.000000][    T0]  __cfi_slowpath_diag+0x354/0x4b0
[    0.000000][    T0]  blake2s_update+0x14c/0x178
[    0.000000][    T0]  _extract_entropy+0xf4/0x29c

So that makes me think that the issue really does involve calling
through the weak alias. But why should weak alias calling trigger CFI?
Compiler bug? Some other subtlety we're missing?
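
A stand-alone sketch of the shape in question (hypothetical, not kernel
code, names invented) would be:

struct state { int x; };

void compress_generic(struct state *s, int n)
{
	s->x += n;
}

/* the public name exists only as a weak alias of the generic version */
void compress(struct state *s, int n)
	__attribute__((weak, alias("compress_generic")));

void update(struct state *s, int n)
{
	void (*fn)(struct state *, int) = compress;

	fn(s, n);	/* with Clang CFI, this indirect call is the one checked */
}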

Jason
Ard Biesheuvel Jan. 19, 2022, 10:35 a.m. UTC | #14
On Wed, 19 Jan 2022 at 11:20, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On 1/19/22, Ard Biesheuvel <ardb@kernel.org> wrote:
> > On Wed, 19 Jan 2022 at 11:06, Miles Chen <miles.chen@mediatek.com> wrote:
> >>
> >> Hi,
> >>
> >> >Hi Miles,
> >> >
> >> >I'm actually not able to reproduce your oops. I'm using vanilla clang
> >> >13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
> >> >Kernel seems to run fine.
> >> >
> >> >
> >> >Are there other settings that are needed to trigger this? Do you see
> >> >it in upstream clang or just the Android fork of clang?
> >> >
> >> I will try another clang (the previous version I use).
> >> I am using Android fork of clang and there is a clang upgrade in this
> >> merge.
> >>
> >
> > One thing that could be worth a try is to make __blake2s_update() and
> > __blake2s_final() __always_inline rather than just inline, which by
> > itself does not appear to be sufficient for the code to get inlined.
> > (If it were, the indirect call should have disappeared as well)
> >
> > Given that indirect calls suck on x86, we should probably apply that
> > change in any case, regardless of CFI.
> >
>
> Had the same thought at first, but then looking at the original stack
> trace, it looks like the __ function is inlined:
>
> [    0.000000][    T0]  __cfi_slowpath_diag+0x354/0x4b0
> [    0.000000][    T0]  blake2s_update+0x14c/0x178
> [    0.000000][    T0]  _extract_entropy+0xf4/0x29c
>

Indeed. How odd. I hope this doesn't happen with the x86 backend
because that would be plain silly. On arm64, it doesn't actually
matter in terms of performance, it just needs one additional callee
save register to preserve the function pointer across calls.
Jason A. Donenfeld Jan. 19, 2022, 10:56 a.m. UTC | #15
On 1/19/22, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi Miles,
>
> Okay. Keep me posted.
>
> Just FYI, as mentioned, I'm unable to reproduce this, and you haven't
> provided any further minimized guidance on how I might reproduce this,
> so it'll sit in the "not a bug" bin until I have another clue on how
> to reproduce. Alternatively, Nick and Nathan are now on this thread
> and they usually have good luck teasing out compiler issues and such,
> so maybe they'll have an idea. But I'm afraid with the information I
> currently have, I'm at a dead end.
>
> Jason
>

We're back in business! I was able to reproduce this using FullLTO
rather than ThinLTO.
Jason A. Donenfeld Jan. 19, 2022, 12:14 p.m. UTC | #16
The below kludge of a patch fixes the issue. Still unclear whether we
should go with something like this or get clang fixed or what.

diff --git a/arch/arm/crypto/blake2s-shash.c b/arch/arm/crypto/blake2s-shash.c
index 17c1c3bfe2f5..be8cde5f1719 100644
--- a/arch/arm/crypto/blake2s-shash.c
+++ b/arch/arm/crypto/blake2s-shash.c
@@ -13,12 +13,12 @@
 static int crypto_blake2s_update_arm(struct shash_desc *desc,
 				     const u8 *in, unsigned int inlen)
 {
-	return crypto_blake2s_update(desc, in, inlen, blake2s_compress);
+	return crypto_blake2s_update(desc, in, inlen);
 }

 static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out)
 {
-	return crypto_blake2s_final(desc, out, blake2s_compress);
+	return crypto_blake2s_final(desc, out);
 }

 #define BLAKE2S_ALG(name, driver_name, digest_size)			\
diff --git a/arch/x86/crypto/blake2s-shash.c b/arch/x86/crypto/blake2s-shash.c
index f9e2fecdb761..c81ffedb4865 100644
--- a/arch/x86/crypto/blake2s-shash.c
+++ b/arch/x86/crypto/blake2s-shash.c
@@ -18,12 +18,12 @@
 static int crypto_blake2s_update_x86(struct shash_desc *desc,
 				     const u8 *in, unsigned int inlen)
 {
-	return crypto_blake2s_update(desc, in, inlen, blake2s_compress);
+	return crypto_blake2s_update(desc, in, inlen);
 }

 static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out)
 {
-	return crypto_blake2s_final(desc, out, blake2s_compress);
+	return crypto_blake2s_final(desc, out);
 }

 #define BLAKE2S_ALG(name, driver_name, digest_size)			\
diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index 72fe480f9bd6..050874588a84 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
  */

+#define FORCE_BLAKE2S_GENERIC
 #include <crypto/internal/blake2s.h>
 #include <crypto/internal/hash.h>

@@ -15,12 +16,12 @@
 static int crypto_blake2s_update_generic(struct shash_desc *desc,
 					 const u8 *in, unsigned int inlen)
 {
-	return crypto_blake2s_update(desc, in, inlen, blake2s_compress_generic);
+	return crypto_blake2s_update(desc, in, inlen);
 }

 static int crypto_blake2s_final_generic(struct shash_desc *desc, u8 *out)
 {
-	return crypto_blake2s_final(desc, out, blake2s_compress_generic);
+	return crypto_blake2s_final(desc, out);
 }

 #define BLAKE2S_ALG(name, driver_name, digest_size)			\
diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index d39cfa0d333e..fec7eead93fc 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -24,14 +24,14 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state)
 	state->f[0] = -1;
 }

-typedef void (*blake2s_compress_t)(struct blake2s_state *state,
-				   const u8 *block, size_t nblocks, u32 inc);
-
 /* Helper functions for BLAKE2s shared by the library and shash APIs */

+#ifdef FORCE_BLAKE2S_GENERIC
+#define blake2s_compress blake2s_compress_generic
+#endif
+
 static inline void __blake2s_update(struct blake2s_state *state,
-				    const u8 *in, size_t inlen,
-				    blake2s_compress_t compress)
+				    const u8 *in, size_t inlen)
 {
 	const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;

@@ -39,7 +39,7 @@ static inline void __blake2s_update(struct blake2s_state *state,
 		return;
 	if (inlen > fill) {
 		memcpy(state->buf + state->buflen, in, fill);
-		(*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+		blake2s_compress(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
 		state->buflen = 0;
 		in += fill;
 		inlen -= fill;
@@ -47,7 +47,7 @@ static inline void __blake2s_update(struct blake2s_state *state,
 	if (inlen > BLAKE2S_BLOCK_SIZE) {
 		const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
 		/* Hash one less (full) block than strictly possible */
-		(*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+		blake2s_compress(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
 		in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
 		inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
 	}
@@ -55,13 +55,12 @@ static inline void __blake2s_update(struct blake2s_state *state,
 	state->buflen += inlen;
 }

-static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
-				   blake2s_compress_t compress)
+static inline void __blake2s_final(struct blake2s_state *state, u8 *out)
 {
 	blake2s_set_lastblock(state);
 	memset(state->buf + state->buflen, 0,
 	       BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
-	(*compress)(state, state->buf, 1, state->buflen);
+	blake2s_compress(state, state->buf, 1, state->buflen);
 	cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
 	memcpy(out, state->h, state->outlen);
 }
@@ -98,21 +97,19 @@ static inline int crypto_blake2s_init(struct shash_desc *desc)
 }

 static inline int crypto_blake2s_update(struct shash_desc *desc,
-					const u8 *in, unsigned int inlen,
-					blake2s_compress_t compress)
+					const u8 *in, unsigned int inlen)
 {
 	struct blake2s_state *state = shash_desc_ctx(desc);

-	__blake2s_update(state, in, inlen, compress);
+	__blake2s_update(state, in, inlen);
 	return 0;
 }

-static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
-				       blake2s_compress_t compress)
+static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
 {
 	struct blake2s_state *state = shash_desc_ctx(desc);

-	__blake2s_final(state, out, compress);
+	__blake2s_final(state, out);
 	return 0;
 }

diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 9364f79937b8..a13f01ff53a7 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -18,14 +18,14 @@

 void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
 {
-	__blake2s_update(state, in, inlen, blake2s_compress);
+	__blake2s_update(state, in, inlen);
 }
 EXPORT_SYMBOL(blake2s_update);

 void blake2s_final(struct blake2s_state *state, u8 *out)
 {
 	WARN_ON(IS_ENABLED(DEBUG) && !out);
-	__blake2s_final(state, out, blake2s_compress);
+	__blake2s_final(state, out);
 	memzero_explicit(state, sizeof(*state));
 }
 EXPORT_SYMBOL(blake2s_final);
Ard Biesheuvel Jan. 19, 2022, 12:18 p.m. UTC | #17
On Wed, 19 Jan 2022 at 13:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> The below kludge of a patch fixes the issue. Still unclear whether we
> should go with something like this or get clang fixed or what.
>
> diff --git a/arch/arm/crypto/blake2s-shash.c b/arch/arm/crypto/blake2s-shash.c
> index 17c1c3bfe2f5..be8cde5f1719 100644
> --- a/arch/arm/crypto/blake2s-shash.c
> +++ b/arch/arm/crypto/blake2s-shash.c
> @@ -13,12 +13,12 @@
>  static int crypto_blake2s_update_arm(struct shash_desc *desc,
>                                      const u8 *in, unsigned int inlen)
>  {
> -       return crypto_blake2s_update(desc, in, inlen, blake2s_compress);
> +       return crypto_blake2s_update(desc, in, inlen);
>  }
>
>  static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out)
>  {
> -       return crypto_blake2s_final(desc, out, blake2s_compress);
> +       return crypto_blake2s_final(desc, out);
>  }
>
>  #define BLAKE2S_ALG(name, driver_name, digest_size)                    \
> diff --git a/arch/x86/crypto/blake2s-shash.c b/arch/x86/crypto/blake2s-shash.c
> index f9e2fecdb761..c81ffedb4865 100644
> --- a/arch/x86/crypto/blake2s-shash.c
> +++ b/arch/x86/crypto/blake2s-shash.c
> @@ -18,12 +18,12 @@
>  static int crypto_blake2s_update_x86(struct shash_desc *desc,
>                                      const u8 *in, unsigned int inlen)
>  {
> -       return crypto_blake2s_update(desc, in, inlen, blake2s_compress);
> +       return crypto_blake2s_update(desc, in, inlen);
>  }
>
>  static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out)
>  {
> -       return crypto_blake2s_final(desc, out, blake2s_compress);
> +       return crypto_blake2s_final(desc, out);
>  }
>
>  #define BLAKE2S_ALG(name, driver_name, digest_size)                    \
> diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
> index 72fe480f9bd6..050874588a84 100644
> --- a/crypto/blake2s_generic.c
> +++ b/crypto/blake2s_generic.c
> @@ -5,6 +5,7 @@
>   * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>   */
>
> +#define FORCE_BLAKE2S_GENERIC
>  #include <crypto/internal/blake2s.h>
>  #include <crypto/internal/hash.h>
>

I'd prefer it if we could avoid magic #define's like this. We could
fix it up locally to crypto/internal/blake2s.h just by doing something
like the below.

diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
index d39cfa0d333e..9e52c07c54cc 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -39,7 +39,11 @@ static inline void __blake2s_update(struct blake2s_state *state,
                return;
        if (inlen > fill) {
                memcpy(state->buf + state->buflen, in, fill);
-               (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+               if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+                       (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+               else
+                       blake2s_compress_generic(state, state->buf, 1,
+                                                BLAKE2S_BLOCK_SIZE);
                state->buflen = 0;
                in += fill;
                inlen -= fill;
@@ -47,7 +51,11 @@ static inline void __blake2s_update(struct blake2s_state *state,
        if (inlen > BLAKE2S_BLOCK_SIZE) {
                const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
                /* Hash one less (full) block than strictly possible */
-               (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+               if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+                       (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+               else
+                       blake2s_compress_generic(state, in, nblocks - 1,
+                                                BLAKE2S_BLOCK_SIZE);
                in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
                inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
        }
@@ -61,7 +69,10 @@ static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
        blake2s_set_lastblock(state);
        memset(state->buf + state->buflen, 0,
               BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
-       (*compress)(state, state->buf, 1, state->buflen);
+       if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
+               (*compress)(state, state->buf, 1, state->buflen);
+       else
+               blake2s_compress_generic(state, state->buf, 1, state->buflen);
        cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
Miles Chen Jan. 19, 2022, 12:34 p.m. UTC | #18
> We're back in business! I was able to reproduce this using FullLTO
> rather than ThinLTO.

Great!

I tried two clang (r437112b/r437112) but the issue remains.


Miles
Jason A. Donenfeld Jan. 19, 2022, 1:34 p.m. UTC | #19
On Wed, Jan 19, 2022 at 1:19 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> I'd prefer it if we could avoid magic #define's like this.

I'll send something that just replaces it with a simple bool.
David Laight Jan. 19, 2022, 2:40 p.m. UTC | #20
From: Ard Biesheuvel
> Sent: 19 January 2022 12:19
...
> -               (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> +               if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
> +                       (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> +               else
> +                       blake2s_compress_generic(state, in, nblocks - 1,
> +                                                BLAKE2S_BLOCK_SIZE);

Isn't that a candidate for a 'static call' ?

And, maybe all these inlined functions should be real functions?
No point having all the bloat on every call site.
Much better to call a real function and use the cached instructions.
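
(For reference, a static-call version could look roughly like the sketch
below - hypothetical, assuming an arch entry point named
blake2s_compress_arch(), not code from this thread:)

#include <linux/static_call.h>

DEFINE_STATIC_CALL(blake2s_compress_call, blake2s_compress_generic);

/* e.g. at boot, an arch implementation retargets the call once: */
static_call_update(blake2s_compress_call, blake2s_compress_arch);

/* call sites then use a patchable direct call rather than a pointer: */
static_call(blake2s_compress_call)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);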

Although, having looked at the source and the generated code for
x86-64 and arm64 I'm not sure I'd want to try to generate
optimised assembler for it.
(Unless you can find an instruction that does exactly what the code wants.)

Basically the compiler can merge the instructions for 4 of the
G() expansions so that they can execute in parallel on a multi-issue
cpu. Doing that by hand will be error prone.
Each G() expansion is pretty much a register dependency chain,
not much chance of parallel execution.
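
(For reference, the G() quarter-round in question looks roughly like this,
paraphrased from lib/crypto/blake2s-generic.c:)

#define G(r, i, a, b, c, d) do { \
	a += b + m[blake2s_sigma[r][2 * i + 0]]; \
	d = ror32(d ^ a, 16); \
	c += d; \
	b = ror32(b ^ c, 12); \
	a += b + m[blake2s_sigma[r][2 * i + 1]]; \
	d = ror32(d ^ a, 8); \
	c += d; \
	b = ror32(b ^ c, 7); \
} while (0)

Each statement depends on the one before it, hence the dependency chain.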

There are clearly optimisations for the top/bottom of the loop.
But they can be done to the generic C version.

The real problem is lack of registers - the code needs 16 for the
v[] array plus a few extras.
So some have to spill to stack.

The unrolled code is about 1200 instructions on arm64 and x86-64.
Each of the 10 rounds reads all 16 of the u32 input values.
So that is about 8 (1200/160) instructions for each read.
Which means there is plenty of memory bandwidth for other
reads.

So 'rolling up' the rounds - which adds in the blake2s_sigma[] reads -
could easily be 'almost free'.
Certainly on x86, where you are just (well, should be just) adding an
extra memory uop for each input buffer read.

I'm not sure the 8 G() calls can be folded into two sets of 4
while still getting the compiler to interleave the generated code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jason A. Donenfeld Jan. 19, 2022, 3:03 p.m. UTC | #21
Hi David,

On Wed, Jan 19, 2022 at 3:41 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Ard Biesheuvel
> > Sent: 19 January 2022 12:19
> ...
> > -               (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> > +               if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
> > +                       (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> > +               else
> > +                       blake2s_compress_generic(state, in, nblocks - 1,
> > +                                                BLAKE2S_BLOCK_SIZE);
>
> Isn't that a candidate for a 'static call' ?
>
> And, maybe all these inlined functions should be real functions?
> No point having all the bloat on every call site.
> Much better to call a real function and used the cached instructions.

Not a good candidate for static call, as this doesn't actually need to
change at runtime ever. It's using a function pointer here out of
laziness to keep the same body of the function, like a compile-time
template. You can sort of squint and imagine the C++. Unfortunately,
CFI felt differently and still treats it as an indirect call.

https://lore.kernel.org/linux-crypto/20220119135450.564115-1-Jason@zx2c4.com/
fixes it up to use a boolean instead, which will certainly be inlined
away. So that's definitely an improvement on what's there now.
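
Roughly, the idea is to dispatch on a bool instead of a function pointer -
a sketch with an assumed helper name, not the exact code at the link above:

static inline void blake2s_compress_one(struct blake2s_state *state,
					const u8 *block, size_t nblocks,
					u32 inc, bool force_generic)
{
	/* both branches are direct calls, so there is no indirect call to check */
	if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S) && !force_generic)
		blake2s_compress(state, block, nblocks, inc);
	else
		blake2s_compress_generic(state, block, nblocks, inc);
}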

For 5.18, I think it's probable that all of this stuff goes away
anyway, and we don't need the templated helpers at all. So perhaps my
patch will serve as an okay stop gap. Alternatively, maybe the clang
people will say, "oh no, our bug" and then fix it in their
neighborhood. According to
https://github.com/ClangBuiltLinux/linux/issues/1567 it looks like
that could be the case.

> There are clearly optimisations for the top/bottom of the loop.
> But they can be done to the generic C version.

Optimizing the generic C version would be quite nice, as it'd help all
platforms.

Jason

Patch

diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
index 75ccb3e633e6..22fa3ea1689e 100644
--- a/lib/crypto/blake2s-generic.c
+++ b/lib/crypto/blake2s-generic.c
@@ -38,8 +38,7 @@  static inline void blake2s_increment_counter(struct blake2s_state *state,
 }
 
 void blake2s_compress(struct blake2s_state *state, const u8 *block,
-		      size_t nblocks, const u32 inc)
-		      __weak __alias(blake2s_compress_generic);
+		      size_t nblocks, const u32 inc);
 
 void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
 			      size_t nblocks, const u32 inc)
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..4055aa593ec4 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -16,6 +16,12 @@ 
 #include <linux/init.h>
 #include <linux/bug.h>
 
+#if IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)
+#  define blake2s_compress blake2s_compress_arch
+#else
+#  define blake2s_compress blake2s_compress_generic
+#endif
+
 void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
 {
 	__blake2s_update(state, in, inlen, blake2s_compress);