Message ID | ZtQgVOnK6WzdIDlU@gondor.apana.org.au (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Herbert Xu |
Headers | show |
Series | crypto: api - Fix generic algorithm self-test races | expand |
On Sun, Sep 01, 2024 at 04:05:40PM +0800, Herbert Xu wrote: > On Fri, Aug 30, 2024 at 10:51:54AM -0700, Eric Biggers wrote: > > > > Given below in defconfig form, use 'make olddefconfig' to apply. The failures > > are nondeterministic and sometimes there are different ones, for example: > > > > [ 0.358017] alg: skcipher: failed to allocate transform for cbc(twofish-generic): -2 > > [ 0.358365] alg: self-tests for cbc(twofish) using cbc(twofish-generic) failed (rc=-2) > > [ 0.358535] alg: skcipher: failed to allocate transform for cbc(camellia-generic): -2 > > [ 0.358918] alg: self-tests for cbc(camellia) using cbc(camellia-generic) failed (rc=-2) > > [ 0.371533] alg: skcipher: failed to allocate transform for xts(ecb(aes-generic)): -2 > > [ 0.371922] alg: self-tests for xts(aes) using xts(ecb(aes-generic)) failed (rc=-2) > > > > Modules are not enabled, maybe that matters (I haven't checked yet). > > Yes I think that was the key. This triggers a massive self-test > run which executes in parallel and reveals a few race conditions > in the system. I think it boils down to the following scenario: > > Base algorithm X-generic, X-optimised > Template Y > Optimised algorithm Y-X-optimised > > Everything gets registered, and then the self-tests are started. > When Y-X-optimised gets tested, it requests the creation of the > generic Y(X-generic). Which then itself undergoes testing. > > The race is that after Y(X-generic) gets registered, but just > before it gets tested, X-optimised finally finishes self-testing > which then causes all spawns of X-generic to be destroyed. So > by the time the self-test for Y(X-generic) comes along, it can > no longer find the algorithm. This error then bubbles up all > the way up to the self-test of Y-X-optimised which then fails. > > Note that there is some complexity that I've omitted here because > when the generic self-test fails to find Y(X-generic) it actually > triggers the construction of it again which then fails for various > other reasons (these are not important because the construction > should *not* be triggered at this point). > > So in a way the error is expected, and we should probably remove > the pr_err for the case where ENOENT is returned for the algorithm > that we're currently testing. > > The solution is two-fold. First when an algorithm undergoes > self-testing it should not trigger its construction. Secondly > if an instance larval fails to materialise due to it being destroyed > by a more optimised algorithm coming along, it should obviously > retry the construction. > > Remove the check in __crypto_alg_lookup that stops a larval from > matching new requests based on differences in the mask. It is better > to block new requests even if it is wrong and then simply retry the > lookup. If this ends up being the wrong larval it will sort iself > out during the retry. > > Reduce the CRYPTO_ALG_TYPE_MASK bits in type during larval creation > as otherwise LSKCIPHER algorithms may not match SKCIPHER larvals. > > Also block the instance creation during self-testing in the function > crypto_larval_lookup by checking for CRYPTO_ALG_TESTED in the mask > field. > > Finally change the return value when crypto_alg_lookup fails in > crypto_larval_wait to EAGAIN to redo the lookup. > > Fixes: 37da5d0ffa7b ("crypto: api - Do not wait for tests during registration") > Reported-by: Eric Biggers <ebiggers@kernel.org> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > > diff --git a/crypto/api.c b/crypto/api.c > index bbe29d438815..bfd177a4313a 100644 > --- a/crypto/api.c > +++ b/crypto/api.c > @@ -70,11 +70,6 @@ static struct crypto_alg *__crypto_alg_lookup(const char *name, u32 type, > if ((q->cra_flags ^ type) & mask) > continue; > > - if (crypto_is_larval(q) && > - !crypto_is_test_larval((struct crypto_larval *)q) && > - ((struct crypto_larval *)q)->mask != mask) > - continue; > - > exact = !strcmp(q->cra_driver_name, name); > fuzzy = !strcmp(q->cra_name, name); > if (!exact && !(fuzzy && q->cra_priority > best)) > @@ -113,6 +108,8 @@ struct crypto_larval *crypto_larval_alloc(const char *name, u32 type, u32 mask) > if (!larval) > return ERR_PTR(-ENOMEM); > > + type &= ~CRYPTO_ALG_TYPE_MASK | (mask ?: CRYPTO_ALG_TYPE_MASK); > + > larval->mask = mask; > larval->alg.cra_flags = CRYPTO_ALG_LARVAL | type; > larval->alg.cra_priority = -1; > @@ -229,7 +226,7 @@ static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg) > type = alg->cra_flags & ~(CRYPTO_ALG_LARVAL | CRYPTO_ALG_DEAD); > mask = larval->mask; > alg = crypto_alg_lookup(alg->cra_name, type, mask) ?: > - ERR_PTR(-ENOENT); > + ERR_PTR(-EAGAIN); > } else if (IS_ERR(alg)) > ; > else if (crypto_is_test_larval(larval) && > @@ -308,8 +305,12 @@ static struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, > > if (!IS_ERR_OR_NULL(alg) && crypto_is_larval(alg)) > alg = crypto_larval_wait(alg); > - else if (!alg) > + else if (alg) > + ; > + else if (!(mask & CRYPTO_ALG_TESTED)) > alg = crypto_larval_add(name, type, mask); > + else > + alg = ERR_PTR(-ENOENT); > > return alg; > } With both this patch "crypto: api - Fix generic algorithm self-test races" and your other patch "crypto: algboss - Pass instance creation error up" applied, I'm still getting errors occasionally, e.g.: [ 5.155587] alg: skcipher: failed to allocate transform for cbc(sm4-generic): -2 [ 5.155954] alg: self-tests for cbc(sm4) using cbc(sm4-generic) failed (rc=-2) [ 5.372511] alg: aead: failed to allocate transform for gcm_base(ctr(aes-generic),ghash-generic): -2 [ 5.372861] alg: self-tests for gcm(aes) using gcm_base(ctr(aes-generic),ghash-generic) failed (rc=-2) I can't follow your explanation of what is going on here and what the fix is. Would it make any sense to just revert the commits that introduced this problem? - Eric
On Mon, Sep 02, 2024 at 10:05:54AM -0700, Eric Biggers wrote: > > With both this patch "crypto: api - Fix generic algorithm self-test races" and > your other patch "crypto: algboss - Pass instance creation error up" applied, > I'm still getting errors occasionally, e.g.: > > [ 5.155587] alg: skcipher: failed to allocate transform for cbc(sm4-generic): -2 > [ 5.155954] alg: self-tests for cbc(sm4) using cbc(sm4-generic) failed (rc=-2) > [ 5.372511] alg: aead: failed to allocate transform for gcm_base(ctr(aes-generic),ghash-generic): -2 > [ 5.372861] alg: self-tests for gcm(aes) using gcm_base(ctr(aes-generic),ghash-generic) failed (rc=-2) > > I can't follow your explanation of what is going on here and what the fix is. > Would it make any sense to just revert the commits that introduced this problem? As I said earlier, these errors are expected. What's happening is this: __ecb-sm4-aesni-avx gets registered (but not tested) cbc(sm4-generic) gets registered (but not tested) __ecb-sm4-aesni-avx finishes testing with lskcipher this is equivalent to crypto_cipher sm4 so it triggers the destruction of all instances of sm4 cbc(sm4-generic) gets marked as dead cbc(sm4-generic) fails self-test because it's already dead (ENOENT) It's harmless because whatever that is asking for cbc(sm4-generic) (in this case it's the extra-test mechanism) will simply retry the allocation which will then succeed. I will send a patch to disable the warning when allocating X returns ENOENT while we're testing X itself. This can always happen if X gets killed for the reason mentioned above and it's perfectly harmless. It's just that the race window was tiny previously because testing occurred immediately after registration. But now we've magnified that window many times with asynchronous testing. Cheers,
On Tue, Sep 03, 2024 at 07:07:38AM +0800, Herbert Xu wrote: > On Mon, Sep 02, 2024 at 10:05:54AM -0700, Eric Biggers wrote: > > > > With both this patch "crypto: api - Fix generic algorithm self-test races" and > > your other patch "crypto: algboss - Pass instance creation error up" applied, > > I'm still getting errors occasionally, e.g.: > > > > [ 5.155587] alg: skcipher: failed to allocate transform for cbc(sm4-generic): -2 > > [ 5.155954] alg: self-tests for cbc(sm4) using cbc(sm4-generic) failed (rc=-2) > > [ 5.372511] alg: aead: failed to allocate transform for gcm_base(ctr(aes-generic),ghash-generic): -2 > > [ 5.372861] alg: self-tests for gcm(aes) using gcm_base(ctr(aes-generic),ghash-generic) failed (rc=-2) > > > > I can't follow your explanation of what is going on here and what the fix is. > > Would it make any sense to just revert the commits that introduced this problem? > > As I said earlier, these errors are expected. What's happening > is this: > > __ecb-sm4-aesni-avx gets registered (but not tested) > > cbc(sm4-generic) gets registered (but not tested) > > __ecb-sm4-aesni-avx finishes testing > with lskcipher this is equivalent to crypto_cipher sm4 > so it triggers the destruction of all instances of sm4 > > cbc(sm4-generic) gets marked as dead > > cbc(sm4-generic) fails self-test because it's already dead (ENOENT) > > It's harmless because whatever that is asking for cbc(sm4-generic) > (in this case it's the extra-test mechanism) will simply retry the > allocation which will then succeed. > > I will send a patch to disable the warning when allocating X returns > ENOENT while we're testing X itself. This can always happen if X > gets killed for the reason mentioned above and it's perfectly harmless. > > It's just that the race window was tiny previously because testing > occurred immediately after registration. But now we've magnified > that window many times with asynchronous testing. > The tests are still failing on upstream: [ 0.343845] alg: self-tests for rfc4106(gcm(aes)) using rfc4106(gcm_base(ctr(aes-generic),ghash-generic)) failed (rc=-2) To me it still seems like commit 37da5d0ffa7b ("crypto: api - Do not wait for tests during registration") is just broken and should be reverted. Besides the test failures, it looks like there's no longer any guarantee that algorithms are actually available now that their module is loaded. E.g. consider if someone does 'modprobe aesni-intel' and then immediately creates a dm-crypt device. Now it sounds like the AES-NI algorithms might not have finished being tested yet and the generic algorithms can be used instead, resulting in a performance regression. I understand that you want to try to fix the edge cases in "fallback" ciphers. But "fallback" ciphers have always seemed like a bad design due to how they use the crypto API recursively. I think the algorithms that use them should generally be migrated off of them, e.g. as I did in commit f235bc11cc95 ("crypto: arm/aes-neonbs - go back to using aes-arm directly"). That fixed the problem in aes-neonbs that seems to have triggered this work in the first place. - Eric
On Sat, Oct 05, 2024 at 03:24:48PM -0700, Eric Biggers wrote: > > The tests are still failing on upstream: > > [ 0.343845] alg: self-tests for rfc4106(gcm(aes)) using rfc4106(gcm_base(ctr(aes-generic),ghash-generic)) failed (rc=-2) You're right. I only disabled the warnings at the point of allocation, the overall self-test warning is still there. Let me get rid of them too. > Besides the test failures, it looks like there's no longer any guarantee that > algorithms are actually available now that their module is loaded. That would indeed be a bug. But I haven't seen it in practice. Although the s390 folks were reporting some weird errors with dm-crypt, they have recently disappeared. If you do see an actual failure please report it and then I'll consider reverting it until it's fixed. > E.g. consider if someone does 'modprobe aesni-intel' and then immediately > creates a dm-crypt device. Now it sounds like the AES-NI algorithms might not > have finished being tested yet and the generic algorithms can be used instead, > resulting in a performance regression. That is not the case. After modprobe returns, the algorithm is guaranteed to have been registered. Yes it is untested, but that should not be a problem because a test larval will have been created and all users looking for that algorithm will be waiting on that test larval. > I understand that you want to try to fix the edge cases in "fallback" ciphers. > But "fallback" ciphers have always seemed like a bad design due to how they use > the crypto API recursively. I think the algorithms that use them should > generally be migrated off of them, e.g. as I did in commit f235bc11cc95 > ("crypto: arm/aes-neonbs - go back to using aes-arm directly"). That fixed the > problem in aes-neonbs that seems to have triggered this work in the first place. Yes getting rid of fallbacks is nice, but this it not the reason why we're making self-test asynchronous. The primary issue with synchronous self-tests is the modprobe dead-lock. Cheers,
On Sun, Oct 06, 2024 at 08:53:28AM +0800, Herbert Xu wrote: > On Sat, Oct 05, 2024 at 03:24:48PM -0700, Eric Biggers wrote: > > > > The tests are still failing on upstream: > > > > [ 0.343845] alg: self-tests for rfc4106(gcm(aes)) using rfc4106(gcm_base(ctr(aes-generic),ghash-generic)) failed (rc=-2) > > You're right. I only disabled the warnings at the point of > allocation, the overall self-test warning is still there. Let > me get rid of them too. > > > Besides the test failures, it looks like there's no longer any guarantee that > > algorithms are actually available now that their module is loaded. > > That would indeed be a bug. But I haven't seen it in practice. > Although the s390 folks were reporting some weird errors with > dm-crypt, they have recently disappeared. > > If you do see an actual failure please report it and then I'll > consider reverting it until it's fixed. > > > E.g. consider if someone does 'modprobe aesni-intel' and then immediately > > creates a dm-crypt device. Now it sounds like the AES-NI algorithms might not > > have finished being tested yet and the generic algorithms can be used instead, > > resulting in a performance regression. > > That is not the case. After modprobe returns, the algorithm is > guaranteed to have been registered. Yes it is untested, but that > should not be a problem because a test larval will have been created > and all users looking for that algorithm will be waiting on that > test larval. I'm not sure about that, since the code that looks up algorithms only looks for algorithms that already have the CRYPTO_ALG_TESTED flag. > > I understand that you want to try to fix the edge cases in "fallback" ciphers. > > But "fallback" ciphers have always seemed like a bad design due to how they use > > the crypto API recursively. I think the algorithms that use them should > > generally be migrated off of them, e.g. as I did in commit f235bc11cc95 > > ("crypto: arm/aes-neonbs - go back to using aes-arm directly"). That fixed the > > problem in aes-neonbs that seems to have triggered this work in the first place. > > Yes getting rid of fallbacks is nice, but this it not the reason why > we're making self-test asynchronous. The primary issue with synchronous > self-tests is the modprobe dead-lock. That problem is caused by the use of fallback ciphers, though. - Eric
On Sat, Oct 05, 2024 at 08:06:18PM -0700, Eric Biggers wrote: > > I'm not sure about that, since the code that looks up algorithms only looks for > algorithms that already have the CRYPTO_ALG_TESTED flag. For normal lookups (one without CRYPTO_ALG_TESTED set in the mask field), the core API will first look for a tested algorithm, and if that fails then it will look for an untested algorithm. The second step should find the larval and then sleep on that until it's done. > That problem is caused by the use of fallback ciphers, though. Sure that particular deadlock may have been due to a fallback, but such dependencies exist outside of fallbacks too. Especially now that we have the fuzz testing which will dynamically load the generic algorithms, it's easy to envisage a scenario where one module registers an algorithm, which then triggers modprobe's on the generic implementation of the same algorithm that then dead-locks. PS it looks like there is an actual report of things breaking with async testing in mv_cesa so I might revert/disable the async testing after all. Thanks,
On Mon, Oct 07, 2024 at 12:32:22PM +0800, Herbert Xu wrote: > > For normal lookups (one without CRYPTO_ALG_TESTED set in the mask > field), the core API will first look for a tested algorithm, and > if that fails then it will look for an untested algorithm. The > second step should find the larval and then sleep on that until it's > done. Actually that's not quite right. The test larval is registered with TESTED set so normal lookups will latch onto that and then wait. Cheers,
On Mon, Oct 07, 2024 at 12:32:22PM +0800, Herbert Xu wrote: > > PS it looks like there is an actual report of things breaking with > async testing in mv_cesa so I might revert/disable the async testing > after all. It looks like it wasn't a bug in the async self-test. Instead this appears to be a real bug that was discovered by the async testing (because we now run all the tests at the same time, thus testing the whether the driver deals with parallel requests or not). This is a bit accidental, because the driver in question registered multiple hash algorithms. Had it only registered one, then nothing would have changed. Is this something that we could improve in testmgr? Perhaps we can add a bit of parallelism ourselves to cover the case where a driver only registers one hash algorithm. Cheers,
diff --git a/crypto/api.c b/crypto/api.c index bbe29d438815..bfd177a4313a 100644 --- a/crypto/api.c +++ b/crypto/api.c @@ -70,11 +70,6 @@ static struct crypto_alg *__crypto_alg_lookup(const char *name, u32 type, if ((q->cra_flags ^ type) & mask) continue; - if (crypto_is_larval(q) && - !crypto_is_test_larval((struct crypto_larval *)q) && - ((struct crypto_larval *)q)->mask != mask) - continue; - exact = !strcmp(q->cra_driver_name, name); fuzzy = !strcmp(q->cra_name, name); if (!exact && !(fuzzy && q->cra_priority > best)) @@ -113,6 +108,8 @@ struct crypto_larval *crypto_larval_alloc(const char *name, u32 type, u32 mask) if (!larval) return ERR_PTR(-ENOMEM); + type &= ~CRYPTO_ALG_TYPE_MASK | (mask ?: CRYPTO_ALG_TYPE_MASK); + larval->mask = mask; larval->alg.cra_flags = CRYPTO_ALG_LARVAL | type; larval->alg.cra_priority = -1; @@ -229,7 +226,7 @@ static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg) type = alg->cra_flags & ~(CRYPTO_ALG_LARVAL | CRYPTO_ALG_DEAD); mask = larval->mask; alg = crypto_alg_lookup(alg->cra_name, type, mask) ?: - ERR_PTR(-ENOENT); + ERR_PTR(-EAGAIN); } else if (IS_ERR(alg)) ; else if (crypto_is_test_larval(larval) && @@ -308,8 +305,12 @@ static struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, if (!IS_ERR_OR_NULL(alg) && crypto_is_larval(alg)) alg = crypto_larval_wait(alg); - else if (!alg) + else if (alg) + ; + else if (!(mask & CRYPTO_ALG_TESTED)) alg = crypto_larval_add(name, type, mask); + else + alg = ERR_PTR(-ENOENT); return alg; }