Message ID | 20220119082447.1675-1-miles.chen@mediatek.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Herbert Xu |
Headers | show |
Series | lib/crypto: blake2s: fix a CFI failure | expand |
Hi Miles, Thanks for the patch. Could you let me know which architecture and compiler this was broken on? If I had to guess, I'd wager arm32, and you hit this by enabling optimized blake2s? If so, I'm not sure the problem is with weak symbols. Why should CFI break weak symbols? Rather, perhaps the issue is that the function is defined in blake2s-core.S? Are there some CFI macros we need for that definition? Jason
Hey again, Actually... It looks like the issue is that in this file: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/crypto/internal/blake2s.h this line typedef void (*blake2s_compress_t)(struct blake2s_state *state, const u8 *block, size_t nblocks, u32 inc); should become typedef void (*blake2s_compress_t)(struct blake2s_state *state, const u8 *block, size_t nblocks, const u32 inc); Does making that change fix things for you? Thanks, Jason
(+ Sami, Eric) On Wed, 19 Jan 2022 at 10:00, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Miles, > > Thanks for the patch. Could you let me know which architecture and > compiler this was broken on? If I had to guess, I'd wager arm32, and > you hit this by enabling optimized blake2s? > > If so, I'm not sure the problem is with weak symbols. Why should CFI > break weak symbols? Rather, perhaps the issue is that the function is > defined in blake2s-core.S? Are there some CFI macros we need for that > definition? > We should try to understand why CFI thinks the prototypes of the two symbols are different. There are still a number of issues with CFI, so papering over them by reverting stuff that we want for good reasons is not the way to go imo. In the short term, you can work around it by avoiding the indirect call to blake2s_compress, e.g., diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c index 93f2ae051370..fef2ff678431 100644 --- a/lib/crypto/blake2s.c +++ b/lib/crypto/blake2s.c @@ -16,9 +16,15 @@ #include <linux/init.h> #include <linux/bug.h> +static void __blake2s_compress(struct blake2s_state *state, const u8 *block, + size_t nblocks, const u32 inc) +{ + return blake2s_compress(state, block, nblocks, inc); +} + void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen) { - __blake2s_update(state, in, inlen, blake2s_compress); + __blake2s_update(state, in, inlen, __blake2s_compress); } EXPORT_SYMBOL(blake2s_update);
On Wed, 19 Jan 2022 at 10:09, Ard Biesheuvel <ardb@kernel.org> wrote: > > (+ Sami, Eric) > > On Wed, 19 Jan 2022 at 10:00, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > Hi Miles, > > > > Thanks for the patch. Could you let me know which architecture and > > compiler this was broken on? If I had to guess, I'd wager arm32, and > > you hit this by enabling optimized blake2s? > > > > If so, I'm not sure the problem is with weak symbols. Why should CFI > > break weak symbols? Rather, perhaps the issue is that the function is > > defined in blake2s-core.S? Are there some CFI macros we need for that > > definition? > > > > We should try to understand why CFI thinks the prototypes of the two > symbols are different. There are still a number of issues with CFI, so > papering over them by reverting stuff that we want for good reasons is > not the way to go imo. > > In the short term, you can work around it by avoiding the indirect > call to blake2s_compress, e.g., > > diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c > index 93f2ae051370..fef2ff678431 100644 > --- a/lib/crypto/blake2s.c > +++ b/lib/crypto/blake2s.c > @@ -16,9 +16,15 @@ > #include <linux/init.h> > #include <linux/bug.h> > > +static void __blake2s_compress(struct blake2s_state *state, const u8 *block, > + size_t nblocks, const u32 inc) > +{ > + return blake2s_compress(state, block, nblocks, inc); > +} > + > void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen) > { > - __blake2s_update(state, in, inlen, blake2s_compress); > + __blake2s_update(state, in, inlen, __blake2s_compress); > } > EXPORT_SYMBOL(blake2s_update); Ehm, maybe not. As Jason points out, the typedef does not have quite the right type, so that is most likely the culprit, and this workaround would trigger CFI in exactly the same way. Interestingly, the compiler does not seem to mind, right? Or are you seeing any build time warnings on the reference to blake2s_compress in the call to __blake2s_update() ?
>typedef void (*blake2s_compress_t)(struct blake2s_state *state, > const u8 *block, size_t nblocks, u32 inc); > >should become > >typedef void (*blake2s_compress_t)(struct blake2s_state *state, > const u8 *block, size_t nblocks, const u32 inc); > >Does making that change fix things for you? > >Thanks, >Jason Thanks for your fast response. It does not work. I tried this before reverting the commit 6048fdcc5f26. Miles
hi, >Thanks for the patch. Could you let me know which architecture and >compiler this was broken on? If I had to guess, I'd wager arm32, and >you hit this by enabling optimized blake2s? Actually, I am merging android-common tree and test our device. I use arm64 and clang-r437112b. I'm not sure which option is the right one, grep 'BLAKE' .config shows: CONFIG_CRYPTO_BLAKE2B=y # CONFIG_CRYPTO_BLAKE2S is not set # CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y and... I found that my patch breaks arm32 build, sorry for that. >If so, I'm not sure the problem is with weak symbols. Why should CFI >break weak symbols? Rather, perhaps the issue is that the function is >defined in blake2s-core.S? Are there some CFI macros we need for that >definition?
>Ehm, maybe not. As Jason points out, the typedef does not have quite >the right type, so that is most likely the culprit, and this >workaround would trigger CFI in exactly the same way. > >Interestingly, the compiler does not seem to mind, right? Or are you >seeing any build time warnings on the reference to blake2s_compress in >the call to __blake2s_update() ? No, no warnings as the -Wcast-function-type is globally enabled. I cleaned some cast-function-type errors but no warnings in __blake2s_update. Miles
Hi Miles, I'm actually not able to reproduce your oops. I'm using vanilla clang 13, cross compiling for arm64, with thin LTO enabled and CFI enabled. Kernel seems to run fine. Are there other settings that are needed to trigger this? Do you see it in upstream clang or just the Android fork of clang? Jason
Hi, >Hi Miles, > >I'm actually not able to reproduce your oops. I'm using vanilla clang >13, cross compiling for arm64, with thin LTO enabled and CFI enabled. >Kernel seems to run fine. > > >Are there other settings that are needed to trigger this? Do you see >it in upstream clang or just the Android fork of clang? > I will try another clang (the previous version I use). I am using Android fork of clang and there is a clang upgrade in this merge. Miles
Hi, > We should try to understand why CFI thinks the prototypes of the two > symbols are different. There are still a number of issues with CFI, so > papering over them by reverting stuff that we want for good reasons is > not the way to go imo. > > In the short term, you can work around it by avoiding the indirect > call to blake2s_compress, e.g., Thanks for the patch. I tried it and the issue remains. As Jason said, he cannot reproduce this issue. I will try another version of clang next. Miles
Hi Miles, Okay. Keep me posted. Just FYI, as mentioned, I'm unable to reproduce this, and you haven't provided any further minimized guidance on how I might reproduce this, so it'll sit in the "not a bug" bin until I have another clue on how to reproduce. Alternatively, Nick and Nathan are now on this thread and they usually have good luck teasing out compiler issues and such, so maybe they'll have an idea. But I'm afraid with the information I currently have, I'm at a dead end. Jason
On Wed, 19 Jan 2022 at 11:06, Miles Chen <miles.chen@mediatek.com> wrote: > > Hi, > > >Hi Miles, > > > >I'm actually not able to reproduce your oops. I'm using vanilla clang > >13, cross compiling for arm64, with thin LTO enabled and CFI enabled. > >Kernel seems to run fine. > > > > > >Are there other settings that are needed to trigger this? Do you see > >it in upstream clang or just the Android fork of clang? > > > I will try another clang (the previous version I use). > I am using Android fork of clang and there is a clang upgrade in this merge. > One thing that could be worth a try is to make __blake2s_update() and __blake2s_final() __always_inline rather than just inline, which by itself does not appear to be sufficient for the code to get inlined. (If it were, the indirect call should have disappeared as well) Given that indirect calls suck on x86, we should probably apply that change in any case, regardless of CFI.
On 1/19/22, Ard Biesheuvel <ardb@kernel.org> wrote: > On Wed, 19 Jan 2022 at 11:06, Miles Chen <miles.chen@mediatek.com> wrote: >> >> Hi, >> >> >Hi Miles, >> > >> >I'm actually not able to reproduce your oops. I'm using vanilla clang >> >13, cross compiling for arm64, with thin LTO enabled and CFI enabled. >> >Kernel seems to run fine. >> > >> > >> >Are there other settings that are needed to trigger this? Do you see >> >it in upstream clang or just the Android fork of clang? >> > >> I will try another clang (the previous version I use). >> I am using Android fork of clang and there is a clang upgrade in this >> merge. >> > > One thing that could be worth a try is to make __blake2s_update() and > __blake2s_final() __always_inline rather than just inline, which by > itself does not appear to be sufficient for the code to get inlined. > (If it were, the indirect call should have disappeared as well) > > Given that indirect calls suck on x86, we should probably apply that > change in any case, regardless of CFI. > Had the same thought at first, but then looking at the original stack trace, it looks like the __ function is inlined: [ 0.000000][ T0] __cfi_slowpath_diag+0x354/0x4b0 [ 0.000000][ T0] blake2s_update+0x14c/0x178 [ 0.000000][ T0] _extract_entropy+0xf4/0x29c So that makes me think that the issue really does involve calling through the weak alias. But why should weak alias calling trigger CFI? Compiler bug? Some other subtlety we're missing? Jason
On Wed, 19 Jan 2022 at 11:20, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > On 1/19/22, Ard Biesheuvel <ardb@kernel.org> wrote: > > On Wed, 19 Jan 2022 at 11:06, Miles Chen <miles.chen@mediatek.com> wrote: > >> > >> Hi, > >> > >> >Hi Miles, > >> > > >> >I'm actually not able to reproduce your oops. I'm using vanilla clang > >> >13, cross compiling for arm64, with thin LTO enabled and CFI enabled. > >> >Kernel seems to run fine. > >> > > >> > > >> >Are there other settings that are needed to trigger this? Do you see > >> >it in upstream clang or just the Android fork of clang? > >> > > >> I will try another clang (the previous version I use). > >> I am using Android fork of clang and there is a clang upgrade in this > >> merge. > >> > > > > One thing that could be worth a try is to make __blake2s_update() and > > __blake2s_final() __always_inline rather than just inline, which by > > itself does not appear to be sufficient for the code to get inlined. > > (If it were, the indirect call should have disappeared as well) > > > > Given that indirect calls suck on x86, we should probably apply that > > change in any case, regardless of CFI. > > > > Had the same thought at first, but then looking at the original stack > trace, it looks like the __ function is inlined: > > [ 0.000000][ T0] __cfi_slowpath_diag+0x354/0x4b0 > [ 0.000000][ T0] blake2s_update+0x14c/0x178 > [ 0.000000][ T0] _extract_entropy+0xf4/0x29c > Indeed. How odd. I hope this doesn't happen with the x86 backend because that would be plain silly. On arm64, it doesn't actually matter in terms of performance, it just needs one additional callee save register to preserve the function pointer across calls.
On 1/19/22, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > Hi Miles, > > Okay. Keep me posted. > > Just FYI, as mentioned, I'm unable to reproduce this, and you haven't > provided any further minimized guidance on how I might reproduce this, > so it'll sit in the "not a bug" bin until I have another clue on how > to reproduce. Alternatively, Nick and Nathan are now on this thread > and they usually have good luck teasing out compiler issues and such, > so maybe they'll have an idea. But I'm afraid with the information I > currently have, I'm at a dead end. > > Jason > We're back in business! I was able to reproduce this using FullLTO rather than ThinLTO.
The below kludge of a patch fixes the issue. Still unclear whether we should go with something like this or get clang fixed or what. diff --git a/arch/arm/crypto/blake2s-shash.c b/arch/arm/crypto/blake2s-shash.c index 17c1c3bfe2f5..be8cde5f1719 100644 --- a/arch/arm/crypto/blake2s-shash.c +++ b/arch/arm/crypto/blake2s-shash.c @@ -13,12 +13,12 @@ static int crypto_blake2s_update_arm(struct shash_desc *desc, const u8 *in, unsigned int inlen) { - return crypto_blake2s_update(desc, in, inlen, blake2s_compress); + return crypto_blake2s_update(desc, in, inlen); } static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out) { - return crypto_blake2s_final(desc, out, blake2s_compress); + return crypto_blake2s_final(desc, out); } #define BLAKE2S_ALG(name, driver_name, digest_size) \ diff --git a/arch/x86/crypto/blake2s-shash.c b/arch/x86/crypto/blake2s-shash.c index f9e2fecdb761..c81ffedb4865 100644 --- a/arch/x86/crypto/blake2s-shash.c +++ b/arch/x86/crypto/blake2s-shash.c @@ -18,12 +18,12 @@ static int crypto_blake2s_update_x86(struct shash_desc *desc, const u8 *in, unsigned int inlen) { - return crypto_blake2s_update(desc, in, inlen, blake2s_compress); + return crypto_blake2s_update(desc, in, inlen); } static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out) { - return crypto_blake2s_final(desc, out, blake2s_compress); + return crypto_blake2s_final(desc, out); } #define BLAKE2S_ALG(name, driver_name, digest_size) \ diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c index 72fe480f9bd6..050874588a84 100644 --- a/crypto/blake2s_generic.c +++ b/crypto/blake2s_generic.c @@ -5,6 +5,7 @@ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ +#define FORCE_BLAKE2S_GENERIC #include <crypto/internal/blake2s.h> #include <crypto/internal/hash.h> @@ -15,12 +16,12 @@ static int crypto_blake2s_update_generic(struct shash_desc *desc, const u8 *in, unsigned int inlen) { - return crypto_blake2s_update(desc, in, inlen, blake2s_compress_generic); + return crypto_blake2s_update(desc, in, inlen); } static int crypto_blake2s_final_generic(struct shash_desc *desc, u8 *out) { - return crypto_blake2s_final(desc, out, blake2s_compress_generic); + return crypto_blake2s_final(desc, out); } #define BLAKE2S_ALG(name, driver_name, digest_size) \ diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h index d39cfa0d333e..fec7eead93fc 100644 --- a/include/crypto/internal/blake2s.h +++ b/include/crypto/internal/blake2s.h @@ -24,14 +24,14 @@ static inline void blake2s_set_lastblock(struct blake2s_state *state) state->f[0] = -1; } -typedef void (*blake2s_compress_t)(struct blake2s_state *state, - const u8 *block, size_t nblocks, u32 inc); - /* Helper functions for BLAKE2s shared by the library and shash APIs */ +#ifdef FORCE_BLAKE2S_GENERIC +#define blake2s_compress blake2s_compress_generic +#endif + static inline void __blake2s_update(struct blake2s_state *state, - const u8 *in, size_t inlen, - blake2s_compress_t compress) + const u8 *in, size_t inlen) { const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen; @@ -39,7 +39,7 @@ static inline void __blake2s_update(struct blake2s_state *state, return; if (inlen > fill) { memcpy(state->buf + state->buflen, in, fill); - (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE); + blake2s_compress(state, state->buf, 1, BLAKE2S_BLOCK_SIZE); state->buflen = 0; in += fill; inlen -= fill; @@ -47,7 +47,7 @@ static inline void __blake2s_update(struct blake2s_state *state, if (inlen > BLAKE2S_BLOCK_SIZE) { const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE); /* Hash one less (full) block than strictly possible */ - (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); + blake2s_compress(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); in += BLAKE2S_BLOCK_SIZE * (nblocks - 1); inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1); } @@ -55,13 +55,12 @@ static inline void __blake2s_update(struct blake2s_state *state, state->buflen += inlen; } -static inline void __blake2s_final(struct blake2s_state *state, u8 *out, - blake2s_compress_t compress) +static inline void __blake2s_final(struct blake2s_state *state, u8 *out) { blake2s_set_lastblock(state); memset(state->buf + state->buflen, 0, BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */ - (*compress)(state, state->buf, 1, state->buflen); + blake2s_compress(state, state->buf, 1, state->buflen); cpu_to_le32_array(state->h, ARRAY_SIZE(state->h)); memcpy(out, state->h, state->outlen); } @@ -98,21 +97,19 @@ static inline int crypto_blake2s_init(struct shash_desc *desc) } static inline int crypto_blake2s_update(struct shash_desc *desc, - const u8 *in, unsigned int inlen, - blake2s_compress_t compress) + const u8 *in, unsigned int inlen) { struct blake2s_state *state = shash_desc_ctx(desc); - __blake2s_update(state, in, inlen, compress); + __blake2s_update(state, in, inlen); return 0; } -static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out, - blake2s_compress_t compress) +static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out) { struct blake2s_state *state = shash_desc_ctx(desc); - __blake2s_final(state, out, compress); + __blake2s_final(state, out); return 0; } diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c index 9364f79937b8..a13f01ff53a7 100644 --- a/lib/crypto/blake2s.c +++ b/lib/crypto/blake2s.c @@ -18,14 +18,14 @@ void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen) { - __blake2s_update(state, in, inlen, blake2s_compress); + __blake2s_update(state, in, inlen); } EXPORT_SYMBOL(blake2s_update); void blake2s_final(struct blake2s_state *state, u8 *out) { WARN_ON(IS_ENABLED(DEBUG) && !out); - __blake2s_final(state, out, blake2s_compress); + __blake2s_final(state, out); memzero_explicit(state, sizeof(*state)); } EXPORT_SYMBOL(blake2s_final);
On Wed, 19 Jan 2022 at 13:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > The below kludge of a patch fixes the issue. Still unclear whether we > should go with something like this or get clang fixed or what. > > diff --git a/arch/arm/crypto/blake2s-shash.c b/arch/arm/crypto/blake2s-shash.c > index 17c1c3bfe2f5..be8cde5f1719 100644 > --- a/arch/arm/crypto/blake2s-shash.c > +++ b/arch/arm/crypto/blake2s-shash.c > @@ -13,12 +13,12 @@ > static int crypto_blake2s_update_arm(struct shash_desc *desc, > const u8 *in, unsigned int inlen) > { > - return crypto_blake2s_update(desc, in, inlen, blake2s_compress); > + return crypto_blake2s_update(desc, in, inlen); > } > > static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out) > { > - return crypto_blake2s_final(desc, out, blake2s_compress); > + return crypto_blake2s_final(desc, out); > } > > #define BLAKE2S_ALG(name, driver_name, digest_size) \ > diff --git a/arch/x86/crypto/blake2s-shash.c b/arch/x86/crypto/blake2s-shash.c > index f9e2fecdb761..c81ffedb4865 100644 > --- a/arch/x86/crypto/blake2s-shash.c > +++ b/arch/x86/crypto/blake2s-shash.c > @@ -18,12 +18,12 @@ > static int crypto_blake2s_update_x86(struct shash_desc *desc, > const u8 *in, unsigned int inlen) > { > - return crypto_blake2s_update(desc, in, inlen, blake2s_compress); > + return crypto_blake2s_update(desc, in, inlen); > } > > static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out) > { > - return crypto_blake2s_final(desc, out, blake2s_compress); > + return crypto_blake2s_final(desc, out); > } > > #define BLAKE2S_ALG(name, driver_name, digest_size) \ > diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c > index 72fe480f9bd6..050874588a84 100644 > --- a/crypto/blake2s_generic.c > +++ b/crypto/blake2s_generic.c > @@ -5,6 +5,7 @@ > * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All > Rights Reserved. > */ > > +#define FORCE_BLAKE2S_GENERIC > #include <crypto/internal/blake2s.h> > #include <crypto/internal/hash.h> > I'd prefer it if we could avoid magic #define's like this. We could fix it up locally to crypto/internal/blake2s.h just by doing something like the below. diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h index d39cfa0d333e..9e52c07c54cc 100644 --- a/include/crypto/internal/blake2s.h +++ b/include/crypto/internal/blake2s.h @@ -39,7 +39,11 @@ static inline void __blake2s_update(struct blake2s_state *state, return; if (inlen > fill) { memcpy(state->buf + state->buflen, in, fill); - (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE); + if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)) + (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE); + else + blake2s_compress_generic(state, state->buf, 1, + BLAKE2S_BLOCK_SIZE); state->buflen = 0; in += fill; inlen -= fill; @@ -47,7 +51,11 @@ static inline void __blake2s_update(struct blake2s_state *state, if (inlen > BLAKE2S_BLOCK_SIZE) { const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE); /* Hash one less (full) block than strictly possible */ - (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); + if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)) + (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); + else + blake2s_compress_generic(state, in, nblocks - 1, + BLAKE2S_BLOCK_SIZE); in += BLAKE2S_BLOCK_SIZE * (nblocks - 1); inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1); } @@ -61,7 +69,10 @@ static inline void __blake2s_final(struct blake2s_state *state, u8 *out, blake2s_set_lastblock(state); memset(state->buf + state->buflen, 0, BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */ - (*compress)(state, state->buf, 1, state->buflen); + if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)) + (*compress)(state, state->buf, 1, state->buflen); + else + blake2s_compress_generic(state, state->buf, 1, state->buflen); cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
> We're back in business! I was able to reproduce this using FullLTO > rather than ThinLTO. Great! I tried two clang (r437112b/r437112) but the issue remains. Miles
On Wed, Jan 19, 2022 at 1:19 PM Ard Biesheuvel <ardb@kernel.org> wrote: > > I'd prefer it if we could avoid magic #define's like this. I'll send something that just replaces it with a simple bool.
From: Ard Biesheuvel > Sent: 19 January 2022 12:19 ... > - (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); > + if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)) > + (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); > + else > + blake2s_compress_generic(state, in, nblocks - 1, > + BLAKE2S_BLOCK_SIZE); Isn't that a candidate for a 'static call' ? And, maybe all these inlined functions should be real functions? No point having all the bloat on every call site. Much better to call a real function and used the cached instructions. Although, having looked at the source and the generated code for x86-64 and arm64 I'm not sure I'd want to try to generate optimised assembler for it. (Unless you can a instruction that does exactly what the code wants.) Basically the compiler can merge the instructions for 4 of the G() expansions so that they can execute in parallel on a multi-issue cpu. Doing that by hand will be error prone. Each G() expansion is pretty much a register dependency chain, not much chance of parallel execution. There are clearly optimisations for the top/bottom of the loop. But they can be done to the generic C version. The real problem is lack of registers - the code needs 16 for the v[] array plus a few extras. So some have to spill to stack. The unrolled code is about 1200 instructions on arm64 and x86-64. Each of the 10 rounds reads all 16 of the u32 input values. So that is about 8 (1200/160) instructions for each read. Which means there is plenty of memory bandwidth for other reads. So 'rolling up' the rounds - which adds in the blake2s_sigma[] reads could easily be 'almost free'. Certainly on x86 where you are just (well should be just) adding an extra memory uop for each input buffer reads. I'm not sure the 8 G() calls can be folded into two sets of 4 while still getting the compiler to interleave the generated code. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Hi David, On Wed, Jan 19, 2022 at 3:41 PM David Laight <David.Laight@aculab.com> wrote: > > From: Ard Biesheuvel > > Sent: 19 January 2022 12:19 > ... > > - (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); > > + if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)) > > + (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE); > > + else > > + blake2s_compress_generic(state, in, nblocks - 1, > > + BLAKE2S_BLOCK_SIZE); > > Isn't that a candidate for a 'static call' ? > > And, maybe all these inlined functions should be real functions? > No point having all the bloat on every call site. > Much better to call a real function and used the cached instructions. Not a good candidate for static call, as this doesn't actually need to change at runtime ever. It's using a function pointer here out of laziness to keep the same body of the function, like a compile-time template. You can sort of squint and imagine the C++. Unfortunately, CFI felt differently and still treats it as an indirect call. https://lore.kernel.org/linux-crypto/20220119135450.564115-1-Jason@zx2c4.com/ fixes it up to use a boolean instead, which will certainly be inlined away. So that's definitely an improvement on what's there now. For 5.18, I think it's probable that all of this stuff goes away anyway, and we don't need the templated helpers at all. So perhaps my patch will serve as an okay stop gap. Alternatively, maybe the clang people will say, "oh no, our bug" and then fix it in their neighborhood. According to https://github.com/ClangBuiltLinux/linux/issues/1567 it looks like that could be the case. > There are clearly optimisations for the top/bottom of the loop. > But they can be done to the generic C version. Optimizing the generic C version would be quite nice, as it'd help all platforms. Jason
diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c index 75ccb3e633e6..22fa3ea1689e 100644 --- a/lib/crypto/blake2s-generic.c +++ b/lib/crypto/blake2s-generic.c @@ -38,8 +38,7 @@ static inline void blake2s_increment_counter(struct blake2s_state *state, } void blake2s_compress(struct blake2s_state *state, const u8 *block, - size_t nblocks, const u32 inc) - __weak __alias(blake2s_compress_generic); + size_t nblocks, const u32 inc); void blake2s_compress_generic(struct blake2s_state *state, const u8 *block, size_t nblocks, const u32 inc) diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c index 93f2ae051370..4055aa593ec4 100644 --- a/lib/crypto/blake2s.c +++ b/lib/crypto/blake2s.c @@ -16,6 +16,12 @@ #include <linux/init.h> #include <linux/bug.h> +#if IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S) +# define blake2s_compress blake2s_compress_arch +#else +# define blake2s_compress blake2s_compress_generic +#endif + void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen) { __blake2s_update(state, in, inlen, blake2s_compress);