
[0/5] crypto: Implement cmac based on cbc skcipher

Message ID: 20200818082410.GA24497@gondor.apana.org.au

Message

Herbert Xu Aug. 18, 2020, 8:24 a.m. UTC
On Sun, Aug 02, 2020 at 12:06:16PM +0300, Ard Biesheuvel wrote:
> Ben reports that CCM using AES-NI instructions performs pathologically
> poorly, which is due to the overhead of preserving/restoring the SIMD
> state, which is repeated after every 16 bytes of input when executing
> the CBCMAC portion of the algorithm.
> 
> So let's clone the arm64 implementation of cbcmac(aes), which takes
> care to only preserve/restore the SIMD state after processing the
> whole input. Since cmac(aes) and xcbc(aes) can reuse most of the code,
> let's expose those as well.
> 
> Cc: Ben Greear <greearb@candelatech.com>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  arch/x86/crypto/Makefile           |   2 +-
>  arch/x86/crypto/aesni-intel.h      |  39 +++
>  arch/x86/crypto/aesni-intel_glue.c |  42 +---
>  arch/x86/crypto/aesni-intel_mac.c  | 257 ++++++++++++++++++++
>  4 files changed, 306 insertions(+), 34 deletions(-)

We should just use the accelerated cbc skcipher.

Cheers,

Comments

Ard Biesheuvel Aug. 18, 2020, 8:31 a.m. UTC | #1
On Tue, 18 Aug 2020 at 10:24, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sun, Aug 02, 2020 at 12:06:16PM +0300, Ard Biesheuvel wrote:
> > Ben reports that CCM using AES-NI instructions performs pathologically
> > poorly, which is due to the overhead of preserving/restoring the SIMD
> > state, which is repeated after every 16 bytes of input when executing
> > the CBCMAC portion of the algorithm.
> >
> > So let's clone the arm64 implementation of cbcmac(aes), which takes
> > care to only preserve/restore the SIMD state after processing the
> > whole input. Since cmac(aes) and xcbc(aes) can reuse most of the code,
> > let's expose those as well.
> >
> > Cc: Ben Greear <greearb@candelatech.com>
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  arch/x86/crypto/Makefile           |   2 +-
> >  arch/x86/crypto/aesni-intel.h      |  39 +++
> >  arch/x86/crypto/aesni-intel_glue.c |  42 +---
> >  arch/x86/crypto/aesni-intel_mac.c  | 257 ++++++++++++++++++++
> >  4 files changed, 306 insertions(+), 34 deletions(-)
>
> We should just use the accelerated cbc skcipher.
>

What do you mean? You cannot implement cbcmac using a cbc skcipher
unless you provide a scratch buffer of arbitrary size as the
destination, in order to capture the skcipher output IV as the MAC.
Herbert Xu Aug. 18, 2020, 1:51 p.m. UTC | #2
On Tue, Aug 18, 2020 at 10:31:39AM +0200, Ard Biesheuvel wrote:
>
> What do you mean? You cannot implement cbcmac using a cbc skcipher
> unless you provide a scratch buffer of arbitrary size as the
> destination, in order to capture the skcipher output IV as the MAC.

Please have a look at patch 6.  The trick is to use an SG list
chained to itself.
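
Schematically (an illustrative sketch only, not the actual patch 6 code;
the helper name is made up and error handling is abbreviated):

#include <crypto/aes.h>
#include <crypto/skcipher.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/string.h>

/*
 * Sketch: compute cbcmac(aes) with a cbc(aes) skcipher.  The destination
 * SG table chains back onto itself, so every ciphertext block lands in
 * the same 16-byte scratch buffer and the final CBC state -- the MAC --
 * comes back in the output IV.  len must be a multiple of AES_BLOCK_SIZE.
 */
static int cbcmac_via_cbc_sketch(struct crypto_skcipher *cbc_aes,
				 struct scatterlist *src, unsigned int len,
				 u8 mac[AES_BLOCK_SIZE])
{
	u8 scratch[AES_BLOCK_SIZE];	/* real code would not put this on the stack */
	struct scatterlist dst[2];
	struct skcipher_request *req;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	sg_init_table(dst, 2);
	sg_set_buf(&dst[0], scratch, AES_BLOCK_SIZE);
	sg_chain(dst, 2, dst);		/* dst chains back to its own start */

	req = skcipher_request_alloc(cbc_aes, GFP_KERNEL);
	if (!req)
		return -ENOMEM;

	memset(mac, 0, AES_BLOCK_SIZE);	/* CBC-MAC starts from a zero IV */
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, src, dst, len, mac);

	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
	skcipher_request_free(req);
	return err;			/* on success, mac[] holds the CBC-MAC */
}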

Cheers,
Ben Greear Aug. 18, 2020, 1:56 p.m. UTC | #3
On 8/18/20 6:51 AM, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 10:31:39AM +0200, Ard Biesheuvel wrote:
>>
>> What do you mean? You cannot implement cbcmac using a cbc skcipher
>> unless you provide a scratch buffer of arbitrary size as the
>> destination, in order to capture the skcipher output IV as the MAC.
> 
> Please have a look at patch 6.  The trick is to use an SG list
> chained to itself.

Herbert, thanks for working on this.  If I apply the patches you posted,
that is expected to provide wifi aes decryption speedup similar to what
the original patch I sent does?  Or, are additional patches needed?

Thanks,
Ben
Herbert Xu Aug. 18, 2020, 2:05 p.m. UTC | #4
On Tue, Aug 18, 2020 at 06:56:12AM -0700, Ben Greear wrote:
>
> Herbert, thanks for working on this.  If I apply the patches you posted,
> that is expected to provide wifi aes decryption speedup similar to what
> the original patch I sent does?  Or, are additional patches needed?

It depends on whether the wifi code uses the async ahash interface;
if not, it probably won't make much of an impact at all.

I just checked and if the code you're using is net/mac80211/aes_cmac.c
then it's using shash so it won't really benefit from this.  However,
there's no reason why mac80211 can't be converted over to async as
we did for IPsec.
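
For reference, the synchronous shash pattern that such code uses looks
roughly like this (simplified, not the literal mac80211 source):

#include <crypto/hash.h>

/*
 * Simplified sketch of a synchronous shash digest: it runs inline on the
 * calling CPU, so an async implementation cannot be selected and a SIMD
 * one has to fall back whenever the FPU is unavailable.
 */
static int cmac_digest_sync_sketch(struct crypto_shash *tfm,
				   const u8 *data, unsigned int len, u8 *out)
{
	SHASH_DESC_ON_STACK(desc, tfm);

	desc->tfm = tfm;
	return crypto_shash_digest(desc, data, len, out);
}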

Cheers,
Ben Greear Aug. 18, 2020, 2:17 p.m. UTC | #5
On 8/18/20 7:05 AM, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 06:56:12AM -0700, Ben Greear wrote:
>>
>> Herbert, thanks for working on this.  If I apply the patches you posted,
>> that is expected to provide wifi aes decryption speedup similar to what
>> the original patch I sent does?  Or, are additional patches needed?
> 
> It depends on whether the wifi code uses the async ahash interface;
> if not, it probably won't make much of an impact at all.
> 
> I just checked and if the code you're using is net/mac80211/aes_cmac.c
> then it's using shash so it won't really benefit from this.  However,
> there's no reason why mac80211 can't be converted over to async as
> we did for IPsec.

I think converting it to async is beyond what I have time to work on
and not sure if it would be accepted upstream anyway.

Is there any easy way to use your work to make shash fast for aesni?  I
basically just want it to perform as well as it used to with my patch.

Thanks,
Ben
Herbert Xu Aug. 18, 2020, 10:15 p.m. UTC | #6
On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
>
> Is there any easy way to use your work to make shash fast for aesni?  I
> basically just want it to perform as well as it used to with my patch.

Yes.  We could add a sync version of aesni that simply falls back
to aes-generic when simd is unavailable.
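
Roughly like this (a hypothetical sketch: the function name is made up,
and aesni_enc() is the asm block routine internal to aesni-intel_glue.c):

#include <crypto/aes.h>
#include <crypto/internal/simd.h>
#include <asm/fpu/api.h>
#include <linux/linkage.h>

asmlinkage void aesni_enc(const void *ctx, u8 *out, const u8 *in);

/*
 * Sketch of a sync AES block encrypt: use AES-NI when the FPU is usable,
 * otherwise fall back to the generic AES library code.  The key schedule
 * must satisfy the AES-NI alignment requirements.
 */
static void aes_encrypt_sync_sketch(const struct crypto_aes_ctx *key,
				    u8 dst[AES_BLOCK_SIZE],
				    const u8 src[AES_BLOCK_SIZE])
{
	if (crypto_simd_usable()) {
		kernel_fpu_begin();
		aesni_enc(key, dst, src);	/* AES-NI asm routine */
		kernel_fpu_end();
	} else {
		aes_encrypt(key, dst, src);	/* generic C fallback */
	}
}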

Cheers,
Herbert Xu Aug. 18, 2020, 10:27 p.m. UTC | #7
On Wed, Aug 19, 2020 at 08:15:50AM +1000, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
> >
> > Is there any easy way to use your work to make shash fast for aesni?  I
> > basically just want it to perform as well as it used to with my patch.
> 
> Yes.  We could add a sync version of aesni that simply falls back
> to aes-generic when simd is unavailable.

But I think before anyone attempts this we should explore making
mac80211 async like IPsec.  Is there any fundamental reason why
that is not possible? Have the wireless people expressed any
objections to making this async before?

Cheers,
Ben Greear Aug. 18, 2020, 10:31 p.m. UTC | #8
On 8/18/20 3:27 PM, Herbert Xu wrote:
> On Wed, Aug 19, 2020 at 08:15:50AM +1000, Herbert Xu wrote:
>> On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
>>>
>>> Is there any easy way to use your work to make shash fast for aesni?  I
>>> basically just want it to perform as well as it used to with my patch.
>>
>> Yes.  We could add a sync version of aesni that simply falls back
>> to aes-generic when simd is unavailable.
> 
> But I think before anyone attempts this we should explore making
> mac80211 async like IPsec.  Is there any fundamental reason why
> that is not possible? Have the wireless people expressed any
> objections to making this async before?

I don't think it has been discussed recently, but mac80211 is already
a complicated beast, so if this added any significant complexity
it might not be worth it.

Truth is though, I know very little about what changes would be
needed to make it do async decrypt, so maybe it would be a simple
matter?

Thanks,
Ben
Herbert Xu Aug. 18, 2020, 10:33 p.m. UTC | #9
On Tue, Aug 18, 2020 at 03:31:10PM -0700, Ben Greear wrote:
>
> I don't think it has been discussed recently, but mac80211 is already
> a complicated beast, so if this added any significant complexity
> it might not be worth it.

Any bulk data path should be using the async interface, otherwise
performance will seriously suffer should SIMD be unavailable.  I
think someone should look at converting wireless to async like IPsec.
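
For comparison, the async ahash calling convention looks roughly like
this (a sketch; the completion handler and the per-packet context passed
to it are hypothetical):

#include <crypto/hash.h>
#include <linux/scatterlist.h>
#include <linux/skbuff.h>

/*
 * Sketch of the async ahash pattern: the request may return -EINPROGRESS
 * and complete later via the callback, which is what lets accelerated or
 * offloaded implementations be used from the bulk data path.
 */
static void mac_done_sketch(struct crypto_async_request *base, int err)
{
	/*
	 * base->data is the hypothetical per-packet state passed to
	 * ahash_request_set_callback(); resume packet processing here
	 * and free the ahash request.
	 */
}

static int mac_digest_async_sketch(struct crypto_ahash *tfm,
				   struct scatterlist *sg, unsigned int len,
				   u8 *out, struct sk_buff *skb)
{
	struct ahash_request *req = ahash_request_alloc(tfm, GFP_ATOMIC);

	if (!req)
		return -ENOMEM;

	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   mac_done_sketch, skb);
	ahash_request_set_crypt(req, sg, out, len);

	return crypto_ahash_digest(req);	/* 0, -EINPROGRESS or an error */
}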

> Truth is though, I know very little about what changes would be
> needed to make it do async decrypt, so maybe it would be a simple
> matter?

IPsec was actually quite straightforward.

Cheers,
Ben Greear Aug. 18, 2020, 10:39 p.m. UTC | #10
On 8/18/20 3:33 PM, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 03:31:10PM -0700, Ben Greear wrote:
>>
>> I don't think it has been discussed recently, but mac80211 is already
>> a complicated beast, so if this added any significant complexity
>> it might not be worth it.
> 
> Any bulk data path should be using the async interface, otherwise
> performance will seriously suffer should SIMD be unavailable.  I
> think someone should look at converting wireless to async like IPsec.

Most users in most cases are using hw crypt, so that is likely why
it hasn't gotten a huge amount of effort to optimize the software
crypt path.

If someone wants to give this async api a try for mac80211, I can
test, and I can sponsor the work, but I don't have time to try
to implement it myself.

Thanks,
Ben

> 
>> Truth is though, I know very little about what changes would be
>> needed to make it do async decrypt, so maybe it would be a simple
>> matter?
> 
> IPsec was actually quite straightforward.
> 
> Cheers,
>
Ard Biesheuvel Aug. 20, 2020, 6:58 a.m. UTC | #11
On Wed, 19 Aug 2020 at 00:39, Ben Greear <greearb@candelatech.com> wrote:
>
> On 8/18/20 3:33 PM, Herbert Xu wrote:
> > On Tue, Aug 18, 2020 at 03:31:10PM -0700, Ben Greear wrote:
> >>
> >> I don't think it has been discussed recently, but mac80211 is already
> >> a complicated beast, so if this added any significant complexity
> >> it might not be worth it.
> >
> > Any bulk data path should be using the async interface, otherwise
> > performance will seriously suffer should SIMD be unavailable.  I
> > think someone should look at converting wireless to async like IPsec.
>
> Most users in most cases are using hw crypt, so that is likely why
> it hasn't gotten a huge amount of effort to optimize the software
> crypt path.
>

As I understand it, it is highly unusual for these code paths to be
exercised in the first place. All mac80211 hardware anyone still cares
about supports CCMP offload, and so only in special cases like Ben's
do we need the software implementation. Also, in Ben's case, it is on
a hot path whereas obsolete hardware that does not implement CCMP
offload does not support anything over 11g speeds to begin with.

Then, there is the additional issue where the FPU preserve/restore
appears to be disproportionately expensive on the actual SoC in
question.

My preferred approach for cbcmac(aes-ni) would be to mirror the arm64
implementation exactly, which means going through the data only a single
time and interleaving the CTR and CBCMAC operations at the AES round
level. My cbcmac ahash approach (v2) is plan B as far as I am concerned,
but it turns out to be flawed and I haven't had time to look into this.

But if we look at the actual issue at hand, we might also look into
amortizing the FPU preserve/restore over multiple invocations of a
cipher. I proposed a patch a while ago that makes cipher an internal
crypto API abstraction, and we could easily add pre/post hooks that
preserve/restore the FPU in this case, in which case we would not need
any changes at higher levels.
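
Schematically, the amortization amounts to something like this (a sketch
only; aesni_enc() is the internal AES-NI asm routine, and a real version
would also cap how long the FPU is held across large inputs):

#include <crypto/aes.h>
#include <crypto/algapi.h>
#include <asm/fpu/api.h>
#include <linux/linkage.h>

asmlinkage void aesni_enc(const void *ctx, u8 *out, const u8 *in);

/*
 * Sketch of an amortized cbcmac update: the FPU state is preserved and
 * restored once per update instead of once per 16-byte block, whether
 * that is achieved via cipher pre/post hooks or a dedicated glue file.
 */
static void cbcmac_update_amortized_sketch(const struct crypto_aes_ctx *key,
					   u8 dg[AES_BLOCK_SIZE],
					   const u8 *data, unsigned int nblocks)
{
	unsigned int i;

	kernel_fpu_begin();		/* once for the whole update */
	for (i = 0; i < nblocks; i++) {
		crypto_xor(dg, data + i * AES_BLOCK_SIZE, AES_BLOCK_SIZE);
		aesni_enc(key, dg, dg);
	}
	kernel_fpu_end();
}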


> If someone wants to give this async api a try for mac80211, I can
> test, and I can sponsor the work, but I don't have time to try
> to implement it myself.
>
> Thanks,
> Ben
>
> >
> >> Truth is though, I know very little about what changes would be
> >> needed to make it do async decrypt, so maybe it would be a simple
> >> matter?
> >
> > IPsec was actually quite straightforward.
> >
> > Cheers,
> >
>
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
Herbert Xu Aug. 20, 2020, 7:01 a.m. UTC | #12
On Thu, Aug 20, 2020 at 08:58:15AM +0200, Ard Biesheuvel wrote:
>
> But if we look at the actual issue at hand, we might also look into
> amortizing the FPU preserve/restore over multiple invocations of a
> cipher. I proposed a patch a while ago that makes cipher an internal
> crypto API abstraction, and we could easily add pre/post hooks that
> preserve/restore the FPU in this case, in which case we would not need
> any changes at higher levels.

I think any use of SIMD crypto_cipher on bulk data is just wrong.
Because the performance degradation when SIMD cannot be used is
too great for this to make sense.

So optimising the FPU overhead is attacking the wrong problem.

Cheers,
Ard Biesheuvel Aug. 20, 2020, 7:04 a.m. UTC | #13
On Thu, 20 Aug 2020 at 09:01, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Aug 20, 2020 at 08:58:15AM +0200, Ard Biesheuvel wrote:
> >
> > But if we look at the actual issue at hand, we might also look into
> > amortizing the FPU preserve/restore over multiple invocations of a
> > cipher. I proposed a patch a while ago that makes cipher an internal
> > crypto API abstraction, and we could easily add pre/post hooks that
> > preserve/restore the FPU in this case, in which case we would not need
> > any changes at higher levels.
>
> I think any use of SIMD crypto_cipher on bulk data is just wrong.
> Because the performance degradation when SIMD cannot be used is
> too great for this to make sense.
>
> So optimising the FPU overhead is attacking the wrong problem.
>

I don't disagree with that, especially given all the effort that went
into optimizing FPU preserve/restore on both arm64 and x86. But the
bottom line is that this is what is causing the degradation in Ben's
case, so we cannot disregard it.
Herbert Xu Aug. 20, 2020, 7:06 a.m. UTC | #14
On Thu, Aug 20, 2020 at 09:04:26AM +0200, Ard Biesheuvel wrote:
>
> I don't disagree with that, especially given all the effort that went
> into optimizing FPU preserve/restore on both arm64 and x86. But the
> bottom line is that this is what is causing the degradation in Ben's
> case, so we cannot disregard it.

If he's having problems with the performance when SIMD is in use
due to preserve/restore, I'd hate to see his numbers when SIMD is
not available.

IOW if this really matters to him, then wireless code needs to switch
over to ahash.

Solving half of the problem simply makes no sense.

Cheers,
Ard Biesheuvel Aug. 20, 2020, 7:19 a.m. UTC | #15
On Thu, 20 Aug 2020 at 09:06, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Aug 20, 2020 at 09:04:26AM +0200, Ard Biesheuvel wrote:
> >
> > I don't disagree with that, especially given all the effort that went
> > into optimizing FPU preserve/restore on both arm64 and x86. But the
> > bottom line is that this is what is causing the degradation in Ben's
> > case, so we cannot disregard it.
>
> If he's having problems with the performance when SIMD is in use
> due to preserve/restore, I'd hate to see his numbers when SIMD is
> not available.
>

Actually, I'm not so sure that they will be so much worse. The
expensive FPU preserve/restore occurs for every 16 bytes of data
processed by the AES cipher, which I'd estimate to take ~10 cycles per
byte for an unaccelerated implementation. But table based AES should
be avoided, especially for MAC algorithms where the plaintext may be
known to an attacker who is after the key.

However, the CCMP handling is invoked from softirq context or from
task context, and so SIMD is generally available unless the softirq
happens to be taken over the back of a hardirq that interrupted a task
running in the kernel that was using the SIMD already. IOW, this
happens so rarely in practice that I would not expect it to be
noticeable in the performance stats.

> IOW if this really matters to him, then wireless code needs to switch
> over to ahash.
>
> Solving half of the problem simply makes no sense.
>

My v2 attempt at cbcmac(aesni) implements an ahash, but a synchronous
one. This means we can amortize the FPU preserve/restore over the
entire scatterlist, instead of relying on the ahash walk to present
the data in virtually mapped chunks.

I'd still like to explore this approach, but I simply haven't had the
spare cycles to spend on this.
Herbert Xu Aug. 20, 2020, 7:29 a.m. UTC | #16
On Thu, Aug 20, 2020 at 09:19:16AM +0200, Ard Biesheuvel wrote:
>
> Actually, I'm not so sure that they will be so much worse. The
> expensive FPU preserve/restore occurs for every 16 bytes of data
> processed by the AES cipher, which I'd estimate to take ~10 cycles per
> byte for an unaccelerated implementation. But table based AES should
> be avoided, especially for MAC algorithms where the plaintext may be
> known to an attacker who is after the key.

On my machine the performance difference on a 1472-byte request
between SIMD and generic is 2161 vs. 7558 (cycles).
> 
> However, the CCMP handling is invoked from softirq context or from
> task context, and so SIMD is generally available unless the softirq
> happens to be taken over the back of a hardirq that interrupted a task
> running in the kernel that was using the SIMD already. IOW, this
> happens so rarely in practice that I would not expect it to be
> noticeable in the performance stats.

What if the same machine was doing TLS/IPsec sends at full throttle?
That would be exactly the wrong time to slow down softirqs four-fold,
no?

> My v2 attempt at cbcmac(aesni) implements an ahash, but a synchronous
> one. This means we can amortize the FPU preserve/restore over the
> entire scatterlist, instead of relying on the ahash walk to present
> the data in virtually mapped chunks.
> 
> I'd still like to explore this approach, but I simply haven't had the
> spare cycles to spend on this.

I don't have an issue with your patch per se.  But please make it so
that it has the async path like everything else.  Also wireless uses
shash so it can't use an ahash anyway even if it is sync.

Cheers,
Ard Biesheuvel Aug. 20, 2020, 7:33 a.m. UTC | #17
On Thu, 20 Aug 2020 at 09:29, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Aug 20, 2020 at 09:19:16AM +0200, Ard Biesheuvel wrote:
> >
> > Actually, I'm not so sure that they will be so much worse. The
> > expensive FPU preserve/restore occurs for every 16 bytes of data
> > processed by the AES cipher, which I'd estimate to take ~10 cycles per
> > byte for an unaccelerated implementation. But table based AES should
> > be avoided, especially for MAC algorithms where the plaintext may be
> > known to an attacker who is after the key.
>
> On my machine the performance difference on a 1472-byte request
> between SIMD and generic is 2161 vs. 7558 (cycles).

Sure. But your machine does not have the pathological FPU
preserve/restore performance.

> >
> > However, the CCMP handling is invoked from softirq context or from
> > task context, and so SIMD is generally available unless the softirq
> > happens to be taken over the back of a hardirq that interrupted a task
> > running in the kernel that was using the SIMD already. IOW, this
> > happens so rarely in practice that I would not expect it to be
> > noticeable in the performance stats.
>
> What if the same machine was doing TLS/IPsec sends at full throttle?
> That would be exactly the wrong time to slow down softirqs four-fold,
> no?
>

Fair point.

> > My v2 attempt at cbcmac(aesni) implements an ahash, but a synchronous
> > one. This means we can amortize the FPU preserve/restore over the
> > entire scatterlist, instead of relying on the ahash walk to present
> > the data in virtually mapped chunks.
> >
> > I'd still like to explore this approach, but I simply haven't had the
> > spare cycles to spend on this.
>
> I don't have an issue with your patch per se.  But please make it so
> that it has the async path like everything else.  Also wireless uses
> shash so it can't use an ahash anyway even if it is sync.
>

The mac80211 CCMP code uses a synchronous ccm aead, which gets backed
by a skcipher+ahash combo by the ccm template. So a synchronous ahash
is fine for this particular case.
Herbert Xu Aug. 20, 2020, 7:44 a.m. UTC | #18
On Thu, Aug 20, 2020 at 09:33:21AM +0200, Ard Biesheuvel wrote:
>
> > On my machine the performance difference on a 1472-byte request
> > between SIMD and generic is 2161 vs. 7558 (cycles).
> 
> Sure. But your machine does not have the pathological FPU
> preserve/restore performance.

Why does that matter? These are numbers for cbc-aesni which means
just a single preserve/restore for the whole request.

Or are you saying on Ben's machine cbc-aesni would have worse
performance vs. aes-generic?
 
> The mac80211 CCMP code uses a synchronous ccm aead, which gets backed
> by a skcipher+ahash combo by the ccm template. So a synchronous ahash
> is fine for this particular case.

OK I was just grepping for cmac so didn't see this.

For this case, I think it's even more important that it be converted
over to async because its sending path is also in user context just
like IPsec.

So simply by sending wireless packets you can hog the CPU while
doing SIMD in kernel context which would then kill the receive
path if you're using the generic fallback.

Cheers,
Ard Biesheuvel Aug. 20, 2020, 7:48 a.m. UTC | #19
On Thu, 20 Aug 2020 at 09:44, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Aug 20, 2020 at 09:33:21AM +0200, Ard Biesheuvel wrote:
> >
> > > On my machine the performance difference on a 1472-byte request
> > > between SIMD and generic is 2161 vs. 7558 (cycles).
> >
> > Sure. But your machine does not have the pathological FPU
> > preserve/restore performance.
>
> Why does that matter? These are numbers for cbc-aesni which means
> just a single preserve/restore for the whole request.
>

No, that is the whole problem. The CCM template has a CBCMAC
implementation that wraps the bare cipher, which means it invokes
crypto_cipher_encrypt_one() for each 16 bytes of input, and each of
those calls involves a FPU preserve/restore.
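
In simplified form, the per-block loop looks like this (a sketch of the
shape of the code, not the literal crypto/ccm.c source):

#include <crypto/algapi.h>
#include <linux/crypto.h>

/*
 * Simplified shape of the ccm template's cbcmac update: every 16-byte
 * block goes through crypto_cipher_encrypt_one(), and with aes-aesni
 * underneath each of those calls does its own FPU preserve/restore.
 */
static void cbcmac_update_per_block_sketch(struct crypto_cipher *tfm, u8 *dg,
					   const u8 *data, unsigned int nblocks)
{
	unsigned int i;

	for (i = 0; i < nblocks; i++) {
		crypto_xor(dg, data + i * 16, 16);
		crypto_cipher_encrypt_one(tfm, dg, dg);	/* FPU save/restore inside */
	}
}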

> Or are you saying on Ben's machine cbc-aesni would have worse
> performance vs. aes-generic?
>

Yes, given the pathological overhead of FPU preserve/restore for every
block of 16 bytes processed by the cbcmac wrapper.

> > The mac80211 CCMP code uses a synchronous ccm aead, which gets backed
> > by a skcipher+ahash combo by the ccm template. So a synchronous ahash
> > is fine for this particular case.
>
> OK I was just grepping for cmac so didn't see this.
>
> For this case, I think it's even more important that it be converted
> over to async because its sending path is also in user context just
> like IPsec.
>

Indeed.

cmac() is not really relevant for performance, afaict. Only cbcmac()
is used for bulk data.

> So simply by sending wireless packets you can hog the CPU while
> doing SIMD in kernel context which would then kill the receive
> path if you're using the generic fallback.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu Aug. 20, 2020, 7:53 a.m. UTC | #20
On Thu, Aug 20, 2020 at 09:48:02AM +0200, Ard Biesheuvel wrote:
>
> > Or are you saying on Ben's machine cbc-aesni would have worse
> > performance vs. aes-generic?
> >
> 
> Yes, given the pathological overhead of FPU preserve/restore for every
> block of 16 bytes processed by the cbcmac wrapper.

I'm sceptical.  Do we have numbers showing this? You can get them
from tcrypt with my patch:

	https://patchwork.kernel.org/patch/11701343/

Just do

	modprobe tcrypt mode=400 alg='cmac(aes-aesni)' klen=16
	modprobe tcrypt mode=400 alg='cmac(aes-generic)' klen=16

> cmac() is not really relevant for performance, afaict. Only cbcmac()
> is used for bulk data.

Sure but it's trivial to extend my cmac patch to support cbcmac.

Cheers,
Ard Biesheuvel Aug. 20, 2020, 7:56 a.m. UTC | #21
On Thu, 20 Aug 2020 at 09:54, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Thu, Aug 20, 2020 at 09:48:02AM +0200, Ard Biesheuvel wrote:
> >
> > > Or are you saying on Ben's machine cbc-aesni would have worse
> > > performance vs. aes-generic?
> > >
> >
> > Yes, given the pathological overhead of FPU preserve/restore for every
> > block of 16 bytes processed by the cbcmac wrapper.
>
> I'm sceptical.  Do we have numbers showing this? You can get them
> from tcrypt with my patch:
>
>         https://patchwork.kernel.org/patch/11701343/
>
> Just do
>
>         modprobe tcrypt mode=400 alg='cmac(aes-aesni)' klen=16
>         modprobe tcrypt mode=400 alg='cmac(aes-generic)' klen=16
>
> > cmac() is not really relevant for performance, afaict. Only cbcmac()
> > is used for bulk data.
>
> Sure but it's trivial to extend my cmac patch to support cbcmac.
>


Sure.

Ben, care to have a go at the above on your hardware? It would help us
get to the bottom of this issue.
Ben Greear Aug. 20, 2020, 1:54 p.m. UTC | #22
On 8/20/20 12:56 AM, Ard Biesheuvel wrote:
> On Thu, 20 Aug 2020 at 09:54, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>>
>> On Thu, Aug 20, 2020 at 09:48:02AM +0200, Ard Biesheuvel wrote:
>>>
>>>> Or are you saying on Ben's machine cbc-aesni would have worse
>>>> performance vs. aes-generic?
>>>>
>>>
>>> Yes, given the pathological overhead of FPU preserve/restore for every
>>> block of 16 bytes processed by the cbcmac wrapper.
>>
>> I'm sceptical.  Do we have numbers showing this? You can get them
>> from tcrypt with my patch:
>>
>>          https://patchwork.kernel.org/patch/11701343/
>>
>> Just do
>>
>>          modprobe tcrypt mode=400 alg='cmac(aes-aesni)' klen=16
>>          modprobe tcrypt mode=400 alg='cmac(aes-generic)' klen=16
>>
>>> cmac() is not really relevant for performance, afaict. Only cbcmac()
>>> is used for bulk data.
>>
>> Sure but it's trivial to extend my cmac patch to support cbcmac.
>>
> 
> 
> Sure.
> 
> Ben, care to have a go at the above on your hardware? It would help us
> get to the bottom of this issue.

Here's a run on an:  Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz

                testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
[  259.397756] tcrypt: test  0 (   16 byte blocks,   16 bytes per update,   1 updates):    244 cycles/operation,   15 cycles/byte
[  259.397759] tcrypt: test  1 (   64 byte blocks,   16 bytes per update,   4 updates):   1052 cycles/operation,   16 cycles/byte
[  259.397765] tcrypt: test  2 (   64 byte blocks,   64 bytes per update,   1 updates):    641 cycles/operation,   10 cycles/byte
[  259.397768] tcrypt: test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   3909 cycles/operation,   15 cycles/byte
[  259.397786] tcrypt: test  4 (  256 byte blocks,   64 bytes per update,   4 updates):   2602 cycles/operation,   10 cycles/byte
[  259.397797] tcrypt: test  5 (  256 byte blocks,  256 bytes per update,   1 updates):   2211 cycles/operation,    8 cycles/byte
[  259.397807] tcrypt: test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):  15453 cycles/operation,   15 cycles/byte
[  259.397872] tcrypt: test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):   8863 cycles/operation,    8 cycles/byte
[  259.397910] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):   8442 cycles/operation,    8 cycles/byte
[  259.397946] tcrypt: test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):  43542 cycles/operation,   21 cycles/byte
[  259.398110] tcrypt: test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):  17649 cycles/operation,    8 cycles/byte
[  259.398184] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):  21255 cycles/operation,   10 cycles/byte
[  259.398267] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):  16322 cycles/operation,    7 cycles/byte
[  259.398335] tcrypt: test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  60301 cycles/operation,   14 cycles/byte
[  259.398585] tcrypt: test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):  34413 cycles/operation,    8 cycles/byte
[  259.398728] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):  32894 cycles/operation,    8 cycles/byte
[  259.398865] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):  32521 cycles/operation,    7 cycles/byte
[  259.399000] tcrypt: test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates): 120415 cycles/operation,   14 cycles/byte
[  259.399550] tcrypt: test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):  68635 cycles/operation,    8 cycles/byte
[  259.399834] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):  83770 cycles/operation,   10 cycles/byte
[  259.400157] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):  65075 cycles/operation,    7 cycles/byte
[  259.400427] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):  65085 cycles/operation,    7 cycles/byte
[  294.171336]
                testing speed of async cmac(aes-generic) (cmac(aes-generic))
[  294.171340] tcrypt: test  0 (   16 byte blocks,   16 bytes per update,   1 updates):    275 cycles/operation,   17 cycles/byte
[  294.171343] tcrypt: test  1 (   64 byte blocks,   16 bytes per update,   4 updates):   1191 cycles/operation,   18 cycles/byte
[  294.171350] tcrypt: test  2 (   64 byte blocks,   64 bytes per update,   1 updates):    738 cycles/operation,   11 cycles/byte
[  294.171354] tcrypt: test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   4386 cycles/operation,   17 cycles/byte
[  294.171374] tcrypt: test  4 (  256 byte blocks,   64 bytes per update,   4 updates):   2915 cycles/operation,   11 cycles/byte
[  294.171387] tcrypt: test  5 (  256 byte blocks,  256 bytes per update,   1 updates):   2464 cycles/operation,    9 cycles/byte
[  294.171398] tcrypt: test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):  17558 cycles/operation,   17 cycles/byte
[  294.171472] tcrypt: test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):  14022 cycles/operation,   13 cycles/byte
[  294.171530] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):   9022 cycles/operation,    8 cycles/byte
[  294.171569] tcrypt: test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):  38107 cycles/operation,   18 cycles/byte
[  294.171722] tcrypt: test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):  18083 cycles/operation,    8 cycles/byte
[  294.171798] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):  17260 cycles/operation,    8 cycles/byte
[  294.171870] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):  17415 cycles/operation,    8 cycles/byte
[  294.171943] tcrypt: test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates):  66005 cycles/operation,   16 cycles/byte
[  294.172217] tcrypt: test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):  36035 cycles/operation,    8 cycles/byte
[  294.172366] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):  42812 cycles/operation,   10 cycles/byte
[  294.172533] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):  53415 cycles/operation,   13 cycles/byte
[  294.172745] tcrypt: test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates): 133326 cycles/operation,   16 cycles/byte
[  294.173297] tcrypt: test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates):  90271 cycles/operation,   11 cycles/byte
[  294.173646] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates):  68703 cycles/operation,    8 cycles/byte
[  294.173931] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates):  67951 cycles/operation,    8 cycles/byte
[  294.174213] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates):  68370 cycles/operation,    8 cycles/byte


On my slow apu2 board with processor: AMD GX-412TC SOC

               testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
[   51.750514] tcrypt: test  0 (   16 byte blocks,   16 bytes per update,   1 updates):    600 cycles/operation,   37 cycles/byte
[   51.750532] tcrypt: test  1 (   64 byte blocks,   16 bytes per update,   4 updates):   2063 cycles/operation,   32 cycles/byte
[   51.750582] tcrypt: test  2 (   64 byte blocks,   64 bytes per update,   1 updates):   1326 cycles/operation,   20 cycles/byte
[   51.750619] tcrypt: test  3 (  256 byte blocks,   16 bytes per update,  16 updates):  11190 cycles/operation,   43 cycles/byte
[   51.750775] tcrypt: test  4 (  256 byte blocks,   64 bytes per update,   4 updates):   4935 cycles/operation,   19 cycles/byte
[   51.750840] tcrypt: test  5 (  256 byte blocks,  256 bytes per update,   1 updates):   8652 cycles/operation,   33 cycles/byte
[   51.750948] tcrypt: test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):  43430 cycles/operation,   42 cycles/byte
[   51.751488] tcrypt: test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):  23589 cycles/operation,   23 cycles/byte
[   51.751810] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):  18759 cycles/operation,   18 cycles/byte
[   51.752027] tcrypt: test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):  79699 cycles/operation,   38 cycles/byte
[   51.753035] tcrypt: test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):  39900 cycles/operation,   19 cycles/byte
[   51.753559] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):  38390 cycles/operation,   18 cycles/byte
[   51.754057] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):  40888 cycles/operation,   19 cycles/byte
[   51.754615] tcrypt: test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates): 143019 cycles/operation,   34 cycles/byte
[   51.756369] tcrypt: test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates):  89046 cycles/operation,   21 cycles/byte
[   51.757527] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):  77992 cycles/operation,   19 cycles/byte
[   51.758526] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates):  76021 cycles/operation,   18 cycles/byte
[   51.759442] tcrypt: test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates): 312260 cycles/operation,   38 cycles/byte
[   51.763195] tcrypt: test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates): 176472 cycles/operation,   21 cycles/byte
[   51.765255] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates): 169565 cycles/operation,   20 cycles/byte
[   51.767321] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates): 164968 cycles/operation,   20 cycles/byte
[   51.769256] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates): 165096 cycles/operation,   20 cycles/byte

               testing speed of async cmac(aes-generic) (cmac(aes-generic))
[   97.835925] tcrypt: test  0 (   16 byte blocks,   16 bytes per update,   1 updates):    665 cycles/operation,   41 cycles/byte
[   97.835945] tcrypt: test  1 (   64 byte blocks,   16 bytes per update,   4 updates):   2430 cycles/operation,   37 cycles/byte
[   97.836016] tcrypt: test  2 (   64 byte blocks,   64 bytes per update,   1 updates):   1656 cycles/operation,   25 cycles/byte
[   97.836044] tcrypt: test  3 (  256 byte blocks,   16 bytes per update,  16 updates):   9014 cycles/operation,   35 cycles/byte
[   97.836259] tcrypt: test  4 (  256 byte blocks,   64 bytes per update,   4 updates):  13444 cycles/operation,   52 cycles/byte
[   97.836399] tcrypt: test  5 (  256 byte blocks,  256 bytes per update,   1 updates):   8960 cycles/operation,   35 cycles/byte
[   97.836515] tcrypt: test  6 ( 1024 byte blocks,   16 bytes per update,  64 updates):  51594 cycles/operation,   50 cycles/byte
[   97.837151] tcrypt: test  7 ( 1024 byte blocks,  256 bytes per update,   4 updates):  28105 cycles/operation,   27 cycles/byte
[   97.837497] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):  31365 cycles/operation,   30 cycles/byte
[   97.837865] tcrypt: test  9 ( 2048 byte blocks,   16 bytes per update, 128 updates):  86111 cycles/operation,   42 cycles/byte
[   97.838927] tcrypt: test 10 ( 2048 byte blocks,  256 bytes per update,   8 updates):  60021 cycles/operation,   29 cycles/byte
[   97.839628] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update,   2 updates):  56311 cycles/operation,   27 cycles/byte
[   97.840308] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update,   1 updates):  50877 cycles/operation,   24 cycles/byte
[   97.840943] tcrypt: test 13 ( 4096 byte blocks,   16 bytes per update, 256 updates): 174028 cycles/operation,   42 cycles/byte
[   97.843205] tcrypt: test 14 ( 4096 byte blocks,  256 bytes per update,  16 updates): 103243 cycles/operation,   25 cycles/byte
[   97.844524] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update,   4 updates):  99960 cycles/operation,   24 cycles/byte
[   97.845865] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update,   1 updates): 121735 cycles/operation,   29 cycles/byte
[   97.847355] tcrypt: test 17 ( 8192 byte blocks,   16 bytes per update, 512 updates): 387559 cycles/operation,   47 cycles/byte
[   97.851930] tcrypt: test 18 ( 8192 byte blocks,  256 bytes per update,  32 updates): 223662 cycles/operation,   27 cycles/byte
[   97.854617] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update,   8 updates): 226131 cycles/operation,   27 cycles/byte
[   97.857385] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update,   2 updates): 203840 cycles/operation,   24 cycles/byte
[   97.859888] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update,   1 updates): 220232 cycles/operation,   26 cycles/byte

Thanks,
Ben
Herbert Xu Aug. 20, 2020, 8:10 p.m. UTC | #23
On Thu, Aug 20, 2020 at 06:54:58AM -0700, Ben Greear wrote:
>
> Here's a run on an:  Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz
> 
>                testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>
> [  259.397910] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):   8442 cycles/operation,    8 cycles/byte
>
>                testing speed of async cmac(aes-generic) (cmac(aes-generic))
>
> [  294.171530] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):   9022 cycles/operation,    8 cycles/byte
> 
> On my slow apu2 board with processor: AMD GX-412TC SOC
> 
>               testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>
> [   51.751810] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):  18759 cycles/operation,   18 cycles/byte
>
>               testing speed of async cmac(aes-generic) (cmac(aes-generic))
>
> [   97.837497] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):  31365 cycles/operation,   30 cycles/byte

So clearly aes-generic is slower than aes-aesni even with saving and
restoring for each block.  Therefore improving the performance of
the latter per se does not make sense.

Cheers,
Ben Greear Aug. 20, 2020, 10:09 p.m. UTC | #24
On 8/20/20 1:10 PM, Herbert Xu wrote:
> On Thu, Aug 20, 2020 at 06:54:58AM -0700, Ben Greear wrote:
>>
>> Here's a run on an:  Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz
>>
>>                 testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>>
>> [  259.397910] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):   8442 cycles/operation,    8 cycles/byte
>>
>>                 testing speed of async cmac(aes-generic) (cmac(aes-generic))
>>
>> [  294.171530] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):   9022 cycles/operation,    8 cycles/byte
>>
>> On my slow apu2 board with processor: AMD GX-412TC SOC
>>
>>                testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>>
>> [   51.751810] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):  18759 cycles/operation,   18 cycles/byte
>>
>>                testing speed of async cmac(aes-generic) (cmac(aes-generic))
>>
>> [   97.837497] tcrypt: test  8 ( 1024 byte blocks, 1024 bytes per update,   1 updates):  31365 cycles/operation,   30 cycles/byte
> 
> So clearly aes-generic is slower than aes-aesni even with saving and
> restoring for each block.  Therefore improving the performance of
> the latter per se does not make sense.

I have always assumed that I need aesni instructions to have any chance at this performing well,
but there are certainly chips out there that don't have aesni, so possibly it is still worth improving
if it is relatively easy to do so.

I am currently using x86-64 CPUs with aesni, and also some AP platforms running QCA ARM chips.
I am not sure if ARM is using aesni or not...it is certainly not that fast, but maybe for other
reasons.

Thanks,
Ben

> 
> Cheers,
>
Herbert Xu Aug. 20, 2020, 10:12 p.m. UTC | #25
On Thu, Aug 20, 2020 at 03:09:55PM -0700, Ben Greear wrote:
> 
> I have always assumed that I need aesni instructions to have any chance at this performing well,
> but there are certainly chips out there that don't have aesni, so possibly it is still worth improving
> if it is relatively easy to do so.

What we were discussing is the merit of improving aesni only
while still being exposed to aes-generic on the softirq path.

This is clearly not acceptable.

Improving aes-generic obviously would make sense.

Cheers,
Christian Lamparter Aug. 22, 2020, 10:35 p.m. UTC | #26
On 2020-08-19 00:27, Herbert Xu wrote:
> On Wed, Aug 19, 2020 at 08:15:50AM +1000, Herbert Xu wrote:
>> On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
>>>
>>> Is there any easy way to use your work to make shash fast for aesni?  I
>>> basically just want it to perform as well as it used to with my patch.
>>
>> Yes.  We could add a sync version of aesni that simply falls back
>> to aes-generic when simd is unavailable.
> 
> But I think before anyone attempts this we should explore making
> mac80211 async like IPsec.  Is there any fundamental reason why
> that is not possible? Have the wireless people expressed any
> objections to making this async before?

Ohh, is this still about a heavily-updated and rewritten version
of my old initial patch from 2014 for 3.16-wl?
<https://lore.kernel.org/linux-wireless/1518134.xFh23iA8q1@blech/>

Because back in 2016, I asked this on linux-wireless:

| It would be a great if mac80211 would do to the encryption and
| decryption asynchronously. As this would work for other ciphers
| and also allows crypto offload to dedicated crypto hardware.

And the answer back then (same as now) was:
<https://lore.kernel.org/linux-wireless/1477168300.4123.8.camel@sipsolutions.net/>

 >The only problem with that is that we'd essentially need a software
 >queue for *all* frames, and release them from there in RX order after
 >decrypt. That's surely doable, but so far nobody has thought it
 >important enough since mostly HW crypto is used ...

Ben, keep up the good work!

Cheers,
Christian