[0/6] Faster AES-XTS on modern x86_64 CPUs

Message ID 20240326080305.402382-1-ebiggers@kernel.org

Message

Eric Biggers March 26, 2024, 8:02 a.m. UTC
This patchset adds new AES-XTS implementations that accelerate disk and
file encryption on modern x86_64 CPUs.

The largest improvements are seen on CPUs that support the VAES
extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
later.  However, an implementation using plain AESNI + AVX is also added
and provides a small boost on older CPUs too.

To try to handle the mess that is x86 SIMD, the code for all the new
AES-XTS implementations is generated from an assembly macro.  This makes
it so that we e.g. don't have to have entirely different source code
just for different vector lengths (xmm, ymm, zmm).

To avoid downclocking effects, zmm registers aren't used on certain
Intel CPU models such as Ice Lake.  These CPU models default to an
implementation using ymm registers instead.
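
As a rough sketch of how such a model check can look (not necessarily the
exact code in this patchset; the headers, macro, and model list below are
illustrative), x86_match_cpu() can be used to match the affected models:

    #include <asm/cpu_device_id.h>
    #include <asm/intel-family.h>

    /* Illustrative only: Intel CPU models on which zmm use is avoided. */
    static const struct x86_cpu_id zmm_exclusion_list[] = {
            X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
            { /* sentinel */ }
    };

    /* Prefer the ymm-based implementation on the models listed above. */
    static bool prefer_ymm_over_zmm(void)
    {
            return x86_match_cpu(zmm_exclusion_list) != NULL;
    }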

This patchset increases the throughput of AES-256-XTS decryption by the
following amounts on the following CPUs:
                            
                          | 4096-byte messages | 512-byte messages |
    ----------------------+--------------------+-------------------+
    Intel Skylake         |        1%          |       11%         |
    Intel Ice Lake        |        92%         |       59%         |
    Intel Sapphire Rapids |       115%         |       78%         |
    AMD Zen 1             |        25%         |       20%         |
    AMD Zen 2             |        26%         |       20%         |
    AMD Zen 3             |        82%         |       40%         |
    AMD Zen 4             |       118%         |       48%         |

(The results for encryption are very similar to decryption.  I just tend
to measure decryption because decryption performance is more important.)

There's no separate kconfig option for the new AES-XTS implementations,
as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.

To make testing easier, all four new AES-XTS implementations are
registered separately with the crypto API.  They are prioritized
appropriately so that the best one for the CPU is used by default.
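
As a sketch of what that registration looks like (abbreviated, not the
exact glue code; the priority value and the omitted fields are
illustrative), each implementation gets its own cra_driver_name under the
generic name "xts(aes)", and the highest cra_priority wins by default:

    #include <crypto/aes.h>
    #include <crypto/internal/skcipher.h>
    #include <linux/module.h>

    static struct skcipher_alg xts_aesni_avx_alg = {
            .base = {
                    .cra_name        = "xts(aes)",          /* generic name */
                    .cra_driver_name = "xts-aes-aesni-avx", /* this driver  */
                    .cra_priority    = 500,  /* highest registered one wins */
                    .cra_blocksize   = AES_BLOCK_SIZE,
            },
            /* .setkey, .encrypt, .decrypt, key/IV sizes, etc. omitted */
    };

    static int __init xts_example_init(void)
    {
            /* Register each implementation the CPU supports; "xts(aes)"
             * resolves to the highest-priority driver, while each
             * cra_driver_name stays individually selectable for testing. */
            return crypto_register_skcipher(&xts_aesni_avx_alg);
    }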

Open questions:

- Is the policy that I implemented for preferring ymm registers to zmm
  registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
  only Skylake has the bad downclocking.  My current proposal is a bit
  more conservative; it also excludes Ice Lake and Tiger Lake.  Those
  CPUs supposedly still have some downclocking, though not as much.

- Should the policy on the use of zmm registers be in a centralized
  place?  It probably doesn't make sense to have random different
  policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).

- Are there any other known issues with using AVX512 in kernel mode?  It
  seems to work, and technically it's not new because Poly1305 and ARIA
  already use AVX512, including the mask registers and zmm registers up
  to 31.  So if there was a major issue, like the new registers not
  being properly saved and restored, it probably would have already been
  found.  But AES-XTS support would introduce a wider use of it.

Eric Biggers (6):
  x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
  crypto: x86/aes-xts - wire up AESNI + AVX implementation
  crypto: x86/aes-xts - wire up VAES + AVX2 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

 arch/x86/Kconfig.assembler           |  10 +
 arch/x86/crypto/Makefile             |   3 +-
 arch/x86/crypto/aes-xts-avx-x86_64.S | 796 +++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 263 ++++++++-
 4 files changed, 1070 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S


base-commit: 4cece764965020c22cff7665b18a012006359095

Comments

Ard Biesheuvel March 26, 2024, 8:51 a.m. UTC | #1
On Tue, 26 Mar 2024 at 10:06, Eric Biggers <ebiggers@kernel.org> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later.  However, an implementation using plain AESNI + AVX is also added
> and provides a small boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro.  This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake.  These CPU models default to an
> implementation using ymm registers instead.
>
> This patchset increases the throughput of AES-256-XTS decryption by the
> following amounts on the following CPUs:
>
>                           | 4096-byte messages | 512-byte messages |
>     ----------------------+--------------------+-------------------+
>     Intel Skylake         |        1%          |       11%         |
>     Intel Ice Lake        |        92%         |       59%         |
>     Intel Sapphire Rapids |       115%         |       78%         |
>     AMD Zen 1             |        25%         |       20%         |
>     AMD Zen 2             |        26%         |       20%         |
>     AMD Zen 3             |        82%         |       40%         |
>     AMD Zen 4             |       118%         |       48%         |
>
> (The results for encryption are very similar to decryption.  I just tend
> to measure decryption because decryption performance is more important.)
>
> There's no separate kconfig option for the new AES-XTS implementations,
> as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API.  They are prioritized
> appropriately so that the best one for the CPU is used by default.
>

This is very nice work!

I didn't check the performance delta on my system (it's Intel but I
have no idea which uarch), but it supports all flavours that you
implemented here, and all pass the selftests with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, so for the series

Tested-by: Ard Biesheuvel <ardb@kernel.org>

I will try to make time to review the code as well.

> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
>   registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
>   only Skylake has the bad downclocking.  My current proposal is a bit
>   more conservative; it also excludes Ice Lake and Tiger Lake.  Those
>   CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
>   place?  It probably doesn't make sense to have random different
>   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode?  It
>   seems to work, and technically it's not new because Poly1305 and ARIA
>   already use AVX512, including the mask registers and zmm registers up
>   to 31.  So if there was a major issue, like the new registers not
>   being properly saved and restored, it probably would have already been
>   found.  But AES-XTS support would introduce a wider use of it.
>

I don't have much input here, except that I think we should just
disable AVX512 kernel-wide on systems where there is no benefit in
terms of throughput. I suspect this might change with algorithms that
rely more heavily on the masking, but so far, we have been making
quite effective use of simple permute vectors and overlapping loads
and stores to do the same. And as Eric points out, the only relevant
use case in the kernel is blocks of size 2^n where n is at least 9.
Eric Biggers March 26, 2024, 4:47 p.m. UTC | #2
On Tue, Mar 26, 2024 at 10:51:48AM +0200, Ard Biesheuvel wrote:
> > Open questions:
> >
> > - Is the policy that I implemented for preferring ymm registers to zmm
> >   registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
> >   only Skylake has the bad downclocking.  My current proposal is a bit
> >   more conservative; it also excludes Ice Lake and Tiger Lake.  Those
> >   CPUs supposedly still have some downclocking, though not as much.
> >
> > - Should the policy on the use of zmm registers be in a centralized
> >   place?  It probably doesn't make sense to have random different
> >   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
> >
> > - Are there any other known issues with using AVX512 in kernel mode?  It
> >   seems to work, and technically it's not new because Poly1305 and ARIA
> >   already use AVX512, including the mask registers and zmm registers up
> >   to 31.  So if there was a major issue, like the new registers not
> >   being properly saved and restored, it probably would have already been
> >   found.  But AES-XTS support would introduce a wider use of it.
> >
> 
> I don't have much input here, except that I think we should just
> disable AVX512 kernel-wide on systems where there is no benefit in
> terms of throughput. I suspect this might change with algorithms that
> rely more heavily on the masking, but so far, we have been making
> quite effective use of simple permute vectors and overlapping loads
> and stores to do the same. And as Eric points out, the only relevant
> use case in the kernel is blocks of size 2^n where n is at least 9.

There are several benefits to AVX512 besides the 512-bit zmm registers.  In
addition to masking, there are twice as many SIMD registers, which makes it
possible to cache all the AES round keys.  There are also other new
instructions such as vpternlogd, which I've used in AES-XTS to XOR values
together more efficiently.
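
As a standalone illustration (userspace intrinsics, not the assembly in
this patchset): the immediate 0x96 is the truth table of a three-input XOR,
so one vpternlogd replaces the two vpxor instructions that would otherwise
be needed.

    /* Build with something like: gcc -O2 -mavx512f ternlog.c */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
            __m512i a = _mm512_set1_epi32(0x12345678);
            __m512i b = _mm512_set1_epi32(0x0f0f0f0f);
            __m512i c = _mm512_set1_epi32(0x00ff00ff);

            /* x = a ^ b ^ c in one instruction (vpternlogd, imm8 = 0x96) */
            __m512i x = _mm512_ternarylogic_epi32(a, b, c, 0x96);

            unsigned int out[16];
            _mm512_storeu_si512(out, x);
            printf("%08x (expected %08x)\n", out[0],
                   0x12345678u ^ 0x0f0f0f0fu ^ 0x00ff00ffu);
            return 0;
    }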

That's why this patchset adds both xts-aes-vaes-avx10_256 and
xts-aes-vaes-avx10_512.  And I've adopted the new "AVX10" naming, maybe a bit
early, to emphasize that it's not just about 512-bit...

Consider Intel Ice Lake, for example.  These are the AES-256-XTS encryption
speeds on 4096-byte messages, in MB/s, that I'm seeing:

    xts-aes-aesni                  5136
    xts-aes-aesni-avx              5366
    xts-aes-vaes-avx2              9337
    xts-aes-vaes-avx10_256         9876
    xts-aes-vaes-avx10_512         10215

So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
But taking advantage of AVX512 does help a bit more, first from the parts other
than 512-bit registers, then a bit more from 512-bit registers.

I do have Ice Lake on the exclusion list from xts-aes-vaes-avx10_512 anyway,
since the concern with downclocking is not really about the performance of the
code itself but rather the impact on unrelated code running on the CPU.

And I *think* the right policy is to just disable the use of the zmm registers,
as opposed to AVX512 entirely.  As AVX512 was originally presented it did tie
these together, but they don't have to be.  AVX10 (which supposedly future
x86_64 CPUs will have) explicitly moves away from that by repackaging the
existing AVX512 features and making the zmm registers optional.

- Eric
David Laight April 3, 2024, 8:12 a.m. UTC | #3
From: Eric Biggers
> Sent: 26 March 2024 16:48
....
> Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> on 4096-byte messages in MB/s I'm seeing:
> 
>     xts-aes-aesni                  5136
>     xts-aes-aesni-avx              5366
>     xts-aes-vaes-avx2              9337
>     xts-aes-vaes-avx10_256         9876
>     xts-aes-vaes-avx10_512         10215
> 
> So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> But taking advantage of AVX512 does help a bit more, first from the parts other
> than 512-bit registers, then a bit more from 512-bit registers.

How much does the kernel_fpu_begin() cost on real workloads?
(ie when the registers are live and it forces an extra save/restore)

I've not looked at the code but I often see what looks like
excessive inlining in crypto code.
This will speed up benchmarks but can have a negative effect
on real code both because of the time taken to load the
code and the effect of displacing other code.

It might be that this code is a simple loop....

	David

Eric Biggers April 4, 2024, 1:35 a.m. UTC | #4
Hi David,

On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 26 March 2024 16:48
> ....
> > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> > on 4096-byte messages in MB/s I'm seeing:
> > 
> >     xts-aes-aesni                  5136
> >     xts-aes-aesni-avx              5366
> >     xts-aes-vaes-avx2              9337
> >     xts-aes-vaes-avx10_256         9876
> >     xts-aes-vaes-avx10_512         10215
> > 
> > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> > But taking advantage of AVX512 does help a bit more, first from the parts other
> > than 512-bit registers, then a bit more from 512-bit registers.
> 
> How much does the kernel_fpu_begin() cost on real workloads?
> (ie when the registers are live and it forces an extra save/restore)

x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
have a significant cost, as it issues an XSAVE (or equivalent) instruction and
causes an XRSTOR (or equivalent) instruction to be issued when returning to
userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
/ kernel_fpu_end() pairs without returning to userspace have only a small cost,
as they don't cause any more saves or restores of the FPU state to be done.

My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
pair per message (if the message doesn't span any page boundaries, which is
almost always the case).  That's exactly the same as the current xts-aes-aesni.
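
In other words, the pattern is roughly the following (a sketch only;
xts_crypt_asm() is a made-up name standing in for the real assembly entry
points):

    #include <asm/fpu/api.h>
    #include <linux/types.h>

    /* Hypothetical stand-in for the real assembly routine. */
    void xts_crypt_asm(const void *key, const u8 *src, u8 *dst,
                       unsigned int len, u8 iv[16]);

    static void xts_crypt_one_message(const void *key, const u8 *src,
                                      u8 *dst, unsigned int len, u8 iv[16])
    {
            kernel_fpu_begin();  /* SIMD usable (and clobbered) from here... */
            xts_crypt_asm(key, src, dst, len, iv);
            kernel_fpu_end();    /* ...to here: one pair per message */
    }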

I think what you may really be asking is how much the overhead of the XSAVE /
XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
That's much more relevant to this patchset.

I think the answer is that there is no additional overhead.  This is because the
XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
and it operates on the userspace state, not the kernel's.  Some of the newer
variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
they don't save parts of the state that are unmodified since the last XRSTOR;
however, that is unimportant here because the kernel's FPU state is never saved.

(This would change if x86 Linux were to support preemption of kernel-mode FPU
code.  In that case, we may need to take more care to minimize use of AVX and
AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)

This is based on theory, though.  I'll do a test to confirm that there's indeed
no additional overhead.  And also, even if there's no additional overhead, what
the existing overhead actually is.

> I've not looked at the code but I often see what looks like
> excessive inlining in crypto code.
> This will speed up benchmarks but can have a negative effect
> on real code both because of the time taken to load the
> code and the effect of displacing other code.
> 
> It might be that this code is a simple loop....

This is a different topic.  By "inlining" I assume that you also mean things
like loop unrolling.  I totally agree that some of the crypto assembly code goes
way overboard on this, resulting in an unreasonably large machine code size.
The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which was written
by Intel, is the worst offender by far, generating 256011 bytes of machine code.
In OpenSSL, Intel has even taken that to the next level with their VAES
optimized implementation of AES-GCM generating 696040 bytes of machine code.

For my AES-XTS code I've limited the code size to a much more reasonable level
by focusing on the things that make the most difference.  My assembly file
compiles to 14386 bytes of machine code (less than 6% of AES-GCM).  It consists
of encryption and decryption functions for each of the four included
implementations, and also the short function aes_xts_encrypt_iv().  On a
particular CPU model, only one implementation is actually used, resulting in at
most 3500-4000 bytes being actually used at runtime.  However, roughly half of
that is code to handle messages that aren't a multiple of 256 bytes, which
aren't really encountered in practice.  I've placed that code out-of-line to try
to prevent it from polluting the CPU's instruction cache.

On the C side in aesni-intel_glue.c, I have roughly 600 bytes of code per
implementation for the inlined fast path: half for encryption, half for
decryption.  There are ~600 additional bytes for the rarely-executed slow
path of page-spanning messages, shared by all implementations.

So in practice, at runtime just over 2 KB of AES-XTS code will get executed,
half for encryption and half for decryption.  That seems reasonable for
something as performance-critical as disk and file encryption.

There are changes that could be made to make the code smaller, for example
rolling up the AES rounds, making encryption and decryption share more code,
doing 1x-wide instead of 4x-wide, etc.  We could also skip the AVX512
implementations and top out at VAES + AVX2.  There are issues with these changes
though -- either they straight up hurt performance on CPUs that I tested, or
they demand a lot more out of the CPU (e.g. relying much more heavily on the
branch predictor) and I was concerned about issues on non-tested or future CPUs.

So, I think my current proposal is at a reasonable place regarding compiled code
size, especially when it's compared to the monstrosity that is some of the
existing crypto assembly code.  But let me know if there are any specific
choices I've made that you may have a different opinion on.

- Eric
David Laight April 4, 2024, 7:53 a.m. UTC | #5
From: Eric Biggers
> Sent: 04 April 2024 02:35
> 
> Hi David,
> 
> On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote:
> > From: Eric Biggers
> > > Sent: 26 March 2024 16:48
> > ....
> > > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> > > on 4096-byte messages in MB/s I'm seeing:
> > >
> > >     xts-aes-aesni                  5136
> > >     xts-aes-aesni-avx              5366
> > >     xts-aes-vaes-avx2              9337
> > >     xts-aes-vaes-avx10_256         9876
> > >     xts-aes-vaes-avx10_512         10215
> > >
> > > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> > > But taking advantage of AVX512 does help a bit more, first from the parts other
> > > than 512-bit registers, then a bit more from 512-bit registers.
> >
> > How much does the kernel_fpu_begin() cost on real workloads?
> > (ie when the registers are live and it forces an extra save/restore)
> 
> x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
> have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> causes an XRSTOR (or equivalent) instruction to be issued when returning to
> userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
> / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> as they don't cause any more saves or restores of the FPU state to be done.
> 
> My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> pair per message (if the message doesn't span any page boundaries, which is
> almost always the case).  That's exactly the same as the current xts-aes-aesni.

I realised after sending it that the code almost certainly already did
kernel_fpu_begin() - so there probably isn't a difference because all the
fpu state is always saved.
(I'm sure there should be a way of getting access to (say) 2 ymm registers
by providing an on-stack save area to allow wide data copies or special
instructions - but that is a different issue.)

> I think what you may really be asking is how much the overhead of the XSAVE /
> XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> That's much more relevant to this patchset.

It depends on what has to be saved, not on what is used.
Although, since all the x/y/zmm registers are caller-saved I think they could
be 'zapped' on syscall entry (and restored as zero later).
Trouble is I suspect there is a single piece of code somewhere that relies
on them being preserved across an inlined system call.

> I think the answer is that there is no additional overhead.  This is because the
> XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> and it operates on the userspace state, not the kernel's.  Some of the newer
> variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> they don't save parts of the state that are unmodified since the last XRSTOR;
> however, that is unimportant here because the kernel's FPU state is never saved.
> 
> (This would change if x86 Linux were to support preemption of kernel-mode FPU
> code.  In that case, we may need to take more care to minimize use of AVX and
> AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)
> 
> This is based on theory, though.  I'll do a test to confirm that there's indeed
> no additional overhead.  And also, even if there's no additional overhead, what
> the existing overhead actually is.

Yes, I was wondering how it is used for 'real applications'.
If a system call that would normally return immediately (or at least without
a full process switch) hits the aes code it gets the cost of the XSAVE added.
Whereas the benchmark probably doesn't do anywhere near as many.

OTOH this is probably no different.

> 
> > I've not looked at the code but I often see what looks like
> > excessive inlining in crypto code.
> > This will speed up benchmarks but can have a negative effect
> > on real code both because of the time taken to load the
> > code and the effect of displacing other code.
> >
> > It might be that this code is a simple loop....
> 
> This is a different topic.  By "inlining" I assume that you also mean things
> like loop unrolling.  I totally agree that some of the crypto assembly code goes
> way overboard on this, resulting in an unreasonably large machine code size.
> The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which was written
> by Intel, is the worst offender by far, generating 256011 bytes of machine code.
> In OpenSSL, Intel has even taken that to the next level with their VAES
> optimized implementation of AES-GCM generating 696040 bytes of machine code.

That is truly stunning!
I can't believe anything that big is actually 'optimised'.
Just think of all the TLB misses :-)
Unless it is slightly faster if you are encrypting several TB of data.

...
> So, I think my current proposal is at a reasonable place regarding compiled code
> size, especially when it's compared to the monstrosity that is some of the
> existing crypto assembly code.  But let me know if there are any specific
> choices I've made that you may have a different opinion on.

At least you've thought about code size.

	David

Herbert Xu April 5, 2024, 7:58 a.m. UTC | #6
Eric Biggers <ebiggers@kernel.org> wrote:
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
> 
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later.  However, an implementation using plain AESNI + AVX is also added
> and provides a small boost on older CPUs too.
> 
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro.  This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
> 
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake.  These CPU models default to an
> implementation using ymm registers instead.
> 
> This patchset increases the throughput of AES-256-XTS decryption by the
> following amounts on the following CPUs:
>                            
>                          | 4096-byte messages | 512-byte messages |
>    ----------------------+--------------------+-------------------+
>    Intel Skylake         |        1%          |       11%         |
>    Intel Ice Lake        |        92%         |       59%         |
>    Intel Sapphire Rapids |       115%         |       78%         |
>    AMD Zen 1             |        25%         |       20%         |
>    AMD Zen 2             |        26%         |       20%         |
>    AMD Zen 3             |        82%         |       40%         |
>    AMD Zen 4             |       118%         |       48%         |
> 
> (The results for encryption are very similar to decryption.  I just tend
> to measure decryption because decryption performance is more important.)
> 
> There's no separate kconfig option for the new AES-XTS implementations,
> as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
> 
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API.  They are prioritized
> appropriately so that the best one for the CPU is used by default.
> 
> Open questions:
> 
> - Is the policy that I implemented for preferring ymm registers to zmm
>  registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
>  only Skylake has the bad downclocking.  My current proposal is a bit
>  more conservative; it also excludes Ice Lake and Tiger Lake.  Those
>  CPUs supposedly still have some downclocking, though not as much.
> 
> - Should the policy on the use of zmm registers be in a centralized
>  place?  It probably doesn't make sense to have random different
>  policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
> 
> - Are there any other known issues with using AVX512 in kernel mode?  It
>  seems to work, and technically it's not new because Poly1305 and ARIA
>  already use AVX512, including the mask registers and zmm registers up
>  to 31.  So if there was a major issue, like the new registers not
>  being properly saved and restored, it probably would have already been
>  found.  But AES-XTS support would introduce a wider use of it.
> 
> Eric Biggers (6):
>  x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
>  crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
>  crypto: x86/aes-xts - wire up AESNI + AVX implementation
>  crypto: x86/aes-xts - wire up VAES + AVX2 implementation
>  crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
>  crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
> 
> arch/x86/Kconfig.assembler           |  10 +
> arch/x86/crypto/Makefile             |   3 +-
> arch/x86/crypto/aes-xts-avx-x86_64.S | 796 +++++++++++++++++++++++++++
> arch/x86/crypto/aesni-intel_glue.c   | 263 ++++++++-
> 4 files changed, 1070 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S
> 
> 
> base-commit: 4cece764965020c22cff7665b18a012006359095

All applied.  Thanks.
Eric Biggers April 5, 2024, 7:19 p.m. UTC | #7
On Thu, Apr 04, 2024 at 07:53:48AM +0000, David Laight wrote:
> > >
> > > How much does the kernel_fpu_begin() cost on real workloads?
> > > (ie when the registers are live and it forces an extra save/restore)
> > 
> > x86 Linux does lazy restore of the FPU state.  The first kernel_fpu_begin() can
> > have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> > causes an XRSTOR (or equivalent) instruction to be issued when returning to
> > userspace when it otherwise might not be needed.  Additional kernel_fpu_begin()
> > / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> > as they don't cause any more saves or restores of the FPU state to be done.
> > 
> > My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> > pair per message (if the message doesn't span any page boundaries, which is
> > almost always the case).  That's exactly the same as the current xts-aes-aesni.
> 
> I realised after sending it that the code almost certainly already did
> kernel_fpu_begin() - so there probably isn't a difference because all the
> fpu state is always saved.
> (I'm sure there should be a way of getting access to (say) 2 ymm registers
> by providing an on-stack save area to allow wide data copies or special
> instructions - but that is a different issue.)
> 
> > I think what you may really be asking is how much the overhead of the XSAVE /
> > XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> > clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> > That's much more relevant to this patchset.
> 
> It depends on what has to be saved, not on what is used.
> Although, since all the x/y/zmm registers are caller-saved I think they could
> be 'zapped' on syscall entry (and restored as zero later).
> Trouble is I suspect there is a single piece of code somewhere that relies
> on them being preserved across an inlined system call.
> 
> > I think the answer is that there is no additional overhead.  This is because the
> > XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> > and it operates on the userspace state, not the kernel's.  Some of the newer
> > variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> > they don't save parts of the state that are unmodified since the last XRSTOR;
> > however, that is unimportant here because the kernel's FPU state is never saved.
> > 
> > (This would change if x86 Linux were to support preemption of kernel-mode FPU
> > code.  In that case, we may need to take more care to minimize use of AVX and
> > AVX512 state.  That being said, AES-XTS tends to be used for bulk data anyway.)
> > 
> > This is based on theory, though.  I'll do a test to confirm that there's indeed
> > no additional overhead.  And also, even if there's no additional overhead, what
> > the existing overhead actually is.
> 
> Yes, I was wondering how it is used for 'real applications'.
> If a system call that would normally return immediately (or at least without
> a full process switch) hits the aes code it gets the cost of the XSAVE added.
> Whereas the benchmark probably doesn't do anywhere near as many.
> 
> OTOH this is probably no different.

I did some tests on Sapphire Rapids using a system call that I customized to do
nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.

On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
if it used ymm, or 360 ns if it used zmm.  I also tried making the kernel
clobber different registers in the kernel_fpu_begin / kernel_fpu_end section,
and as I expected this did not make any difference.

Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
takes about 2235ns.  With xts-aes-vaes-avx10_512 it takes 75 ns.  (Not a typo --
it really is almost 30 times faster!)  So it seems clear the FPU state save and
restore is worth it even just for a single sector using the traditional 512-byte
sector size, let alone a 4096-byte sector size which is recommended these days.

- Eric
David Laight April 8, 2024, 7:41 a.m. UTC | #8
From: Eric Biggers
> Sent: 05 April 2024 20:19
...
> I did some tests on Sapphire Rapids using a system call that I customized to do
> nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> 
> On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> if it used ymm, or 360 ns if it used zmm...
> 
> Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> takes about 2235ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...

So most of the cost of a single 512-byte sector is the kernel_fpu_begin().
But it is so much slower any other way that it is still faster overall.

	David

Eric Biggers April 8, 2024, 12:31 p.m. UTC | #9
On Mon, Apr 08, 2024 at 07:41:44AM +0000, David Laight wrote:
> From: Eric Biggers
> > Sent: 05 April 2024 20:19
> ...
> > I did some tests on Sapphire Rapids using a system call that I customized to do
> > nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.
> > 
> > On average the bare syscall took 70 ns.  The syscall with the kernel_fpu_begin /
> > kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
> > if it used ymm, or 360 ns if it used zmm...
> > 
> > Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
> > instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
> > On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
> > takes about 2235ns.  With xts-aes-vaes-avx10_512 it takes 75 ns...
> 
> So most of the cost of a single 512-byte sector is the kernel_fpu_begin().
> But it is so much slower any other way it is still faster.
> 

Yes.  To clarify, the 75 ns time I mentioned for a 512-byte sector is the
average for repeated calls, amortizing the XSAVE and XRSTOR.  For a real single
512-byte sector that eats the entire cost of the XSAVE and XRSTOR by itself, if
all state is in-use it should be about 75 + (360 - 70) = 365 ns (based on the
syscall benchmarks I did), with the XSAVE and XRSTOR accounting for 80% of that
time.  But yes, that's still over 6 times faster than the scalar alternative.

- Eric