
[v2,0/6] Faster AES-XTS on modern x86_64 CPUs

Message ID: 20240329080355.2871-1-ebiggers@kernel.org

Message

Eric Biggers March 29, 2024, 8:03 a.m. UTC
This patchset adds new AES-XTS implementations that accelerate disk and
file encryption on modern x86_64 CPUs.

The largest improvements are seen on CPUs that support the VAES
extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
later.  However, an implementation using plain AESNI + AVX is also added
and provides a boost on older CPUs too.

To try to handle the mess that is x86 SIMD, the code for all the new
AES-XTS implementations is generated from an assembly macro.  This
avoids, for example, needing entirely different source code just for
different vector lengths (xmm, ymm, zmm).

To avoid downclocking effects, zmm registers aren't used on certain
Intel CPU models such as Ice Lake.  These CPU models default to an
implementation using ymm registers instead.
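
For illustration, the model-based opt-out looks roughly like the sketch
below.  Only the zmm_exclusion_list[] name comes from the actual
patches (see the v2 changelog); the helper name and the exact model
list here are placeholders:

    #include <asm/cpu_device_id.h>
    #include <asm/intel-family.h>

    /* Sketch: CPU models on which zmm registers are avoided by default. */
    static const struct x86_cpu_id zmm_exclusion_list[] = {
            X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
            {}      /* must be null-terminated (see the v2 changelog) */
    };

    static bool prefer_ymm_to_zmm(void)     /* placeholder name */
    {
            return x86_match_cpu(zmm_exclusion_list) != NULL;
    }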

To make testing easier, all four new AES-XTS implementations are
registered separately with the crypto API.  They are prioritized
appropriately so that the best one for the CPU is used by default.
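
Roughly, that means each implementation gets its own driver name and
priority when registered.  A sketch of what this looks like is below;
the driver names match the benchmarks later in this letter, but the
priority values and struct layout details are illustrative, not the
exact ones from the patches:

    #include <crypto/internal/skcipher.h>

    /* Sketch: two of the four implementations, registered under the same
     * algorithm name but different driver names and priorities.  The
     * crypto core resolves "xts(aes)" to the highest-priority usable one.
     */
    static struct skcipher_alg example_xts_algs[] = { {
            .base.cra_name          = "xts(aes)",
            .base.cra_driver_name   = "xts-aes-aesni-avx",
            .base.cra_priority      = 500,
            /* ... blocksize, key sizes, setkey/encrypt/decrypt, etc. ... */
    }, {
            .base.cra_name          = "xts(aes)",
            .base.cra_driver_name   = "xts-aes-vaes-avx2",
            .base.cra_priority      = 600,
            /* ... */
    } };

    /* Example usage (only when the CPU supports what each variant needs):
     *     err = crypto_register_skciphers(example_xts_algs,
     *                                     ARRAY_SIZE(example_xts_algs));
     */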

There's no separate kconfig option for the new implementations, as they
are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.

This patchset increases the throughput of AES-256-XTS by the following
amounts on the following CPUs:

                          | 4096-byte messages | 512-byte messages |
    ----------------------+--------------------+-------------------+
    Intel Skylake         |        6%          |       31%         |
    Intel Cascade Lake    |        4%          |       26%         |
    Intel Ice Lake        |       127%         |      120%         |
    Intel Sapphire Rapids |       151%         |      112%         |
    AMD Zen 1             |        61%         |       73%         |
    AMD Zen 2             |        36%         |       59%         |
    AMD Zen 3             |       138%         |       99%         |
    AMD Zen 4             |       155%         |      117%         |

To summarize how the XTS implementations perform in general, here are
benchmarks of all of them on AMD Zen 4, with 4096-byte messages.  (Of
course, in practice only the best one for the CPU actually gets used.)

    xts-aes-aesni                  4247 MB/s
    xts-aes-aesni-avx              5669 MB/s
    xts-aes-vaes-avx2              9588 MB/s
    xts-aes-vaes-avx10_256         9631 MB/s
    xts-aes-vaes-avx10_512         10868 MB/s

... and on Intel Sapphire Rapids:

    xts-aes-aesni                  4848 MB/s
    xts-aes-aesni-avx              5287 MB/s
    xts-aes-vaes-avx2              11685 MB/s
    xts-aes-vaes-avx10_256         11938 MB/s
    xts-aes-vaes-avx10_512         12176 MB/s
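
(To confirm which of these drivers the crypto core actually picks for
"xts(aes)" on a given machine, grepping /proc/crypto works, or in code
something like the following illustrative snippet:)

    #include <crypto/skcipher.h>
    #include <linux/printk.h>

    static void report_xts_driver(void)
    {
            struct crypto_skcipher *tfm =
                    crypto_alloc_skcipher("xts(aes)", 0, 0);

            if (IS_ERR(tfm))
                    return;
            pr_info("xts(aes) resolved to %s\n",
                    crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)));
            crypto_free_skcipher(tfm);
    }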

Notes about benchmarking methods:

- All my benchmarks were done using a custom kernel module that invokes
  the crypto_skcipher API.  Note that benchmarking the crypto API from
  userspace using AF_ALG, e.g. as 'cryptsetup benchmark' does, is poorly
  suited to measuring fast algorithms due to the syscall overhead of
  AF_ALG, so I don't recommend that method.  Instead, I measured the
  crypto performance directly, as that's what this patchset focuses on.
  (A minimal sketch of this measurement approach follows these notes.)

- All numbers I give are for decryption.  However, on all the CPUs I
  tested, encryption performs almost identically to decryption.
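
To make the first note concrete, here is a minimal sketch of the kind
of in-kernel measurement described above.  It is illustrative only, not
the module I actually used, but it sticks to the standard
crypto_skcipher API:

    #include <crypto/skcipher.h>
    #include <linux/crypto.h>
    #include <linux/module.h>
    #include <linux/random.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>
    #include <linux/timekeeping.h>

    #define BENCH_LEN       4096
    #define BENCH_ITERS     100000

    static int __init xts_bench_init(void)
    {
            struct crypto_skcipher *tfm;
            struct skcipher_request *req = NULL;
            DECLARE_CRYPTO_WAIT(wait);
            struct scatterlist sg;
            u8 key[64], iv[16] = {};
            u8 *buf = NULL;
            u64 ns;
            int i, err = -ENOMEM;

            tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
            if (IS_ERR(tfm))
                    return PTR_ERR(tfm);

            req = skcipher_request_alloc(tfm, GFP_KERNEL);
            buf = kzalloc(BENCH_LEN, GFP_KERNEL);
            if (!req || !buf)
                    goto out;

            get_random_bytes(key, sizeof(key)); /* 64-byte key: AES-256-XTS */
            err = crypto_skcipher_setkey(tfm, key, sizeof(key));
            if (err)
                    goto out;

            sg_init_one(&sg, buf, BENCH_LEN);
            skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP,
                                          crypto_req_done, &wait);
            skcipher_request_set_crypt(req, &sg, &sg, BENCH_LEN, iv);

            /* Time BENCH_ITERS decryptions of a 4096-byte buffer. */
            ns = ktime_get_ns();
            for (i = 0; i < BENCH_ITERS && !err; i++)
                    err = crypto_wait_req(crypto_skcipher_decrypt(req), &wait);
            ns = ktime_get_ns() - ns;

            pr_info("%s: %llu MB/s\n",
                    crypto_tfm_alg_driver_name(crypto_skcipher_tfm(tfm)),
                    (u64)BENCH_ITERS * BENCH_LEN * 1000 / ns);
    out:
            kfree(buf);
            skcipher_request_free(req);
            crypto_free_skcipher(tfm);
            return err ?: -EAGAIN;  /* always "fail" so the module unloads */
    }
    module_init(xts_bench_init);
    MODULE_LICENSE("GPL");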

Open questions:

- Is the policy that I implemented for preferring ymm registers to zmm
  registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
  only Skylake has the bad downclocking.  My current proposal is a bit
  more conservative; it also excludes Ice Lake and Tiger Lake.  Those
  CPUs supposedly still have some downclocking, though not as much.

- Should the policy on the use of zmm registers be in a centralized
  place?  It probably doesn't make sense to have random different
  policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).

- Are there any other known issues with using AVX512 in kernel mode?  It
  seems to work, and technically it's not new because Poly1305 and ARIA
  already use AVX512, including the mask registers and zmm registers up
  to 31.  So if there was a major issue, like the new registers not
  being properly saved and restored, it probably would have already been
  found.  But AES-XTS support would introduce a wider use of it.

- Should we perhaps not even bother with AVX512 / AVX10 at all for now,
  given that on current CPUs most of the improvement is achieved by
  going to VAES + AVX2?  I.e. should we skip the last two patches?  I'm
  hoping the improvement will be greater on future CPUs, though.

Changed in v2:
  - Additional optimizations:
      - Interleaved the tweak computation with AES en/decryption.  This
        helps significantly on some CPUs, especially those without VAES.
      - Further optimized for single-page sources and destinations.
      - Used fewer instructions to update the tweaks in the VPCLMULQDQ case.
      - Improved handling of "round 0".
      - Eliminated a jump instruction from the main loop.
  - Other
      - Fixed zmm_exclusion_list[] to be null-terminated.
      - Added missing #ifdef to unregister_xts_algs().
      - Added some more comments.
      - Improved cover letter and some commit messages.
      - Now that the next tweak is always computed anyway, it is
        returned unconditionally.
      - Moved the IV encryption to a separate function.

Eric Biggers (6):
  x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
  crypto: x86/aes-xts - wire up AESNI + AVX implementation
  crypto: x86/aes-xts - wire up VAES + AVX2 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

 arch/x86/Kconfig.assembler           |  10 +
 arch/x86/crypto/Makefile             |   3 +-
 arch/x86/crypto/aes-xts-avx-x86_64.S | 838 +++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 270 ++++++++-
 4 files changed, 1118 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S


base-commit: 4cece764965020c22cff7665b18a012006359095

Comments

Ard Biesheuvel March 29, 2024, 9:03 a.m. UTC | #1
On Fri, 29 Mar 2024 at 10:06, Eric Biggers <ebiggers@kernel.org> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> [...]

Retested this v2:

Tested-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>

Hopefully, the AES-KL keylocker implementation can be based on this
template as well. I wouldn't mind retiring the existing xts(aesni)
code entirely, and using the xts() wrapper around ecb-aes-aesni on
32-bit and on non-AVX uarchs with AES-NI.
Eric Biggers March 29, 2024, 9:31 a.m. UTC | #2
On Fri, Mar 29, 2024 at 11:03:07AM +0200, Ard Biesheuvel wrote:
> 
> Retested this v2:
> 
> Tested-by: Ard Biesheuvel <ardb@kernel.org>
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> 
> Hopefully, the AES-KL keylocker implementation can be based on this
> template as well.

As-is, it would be a bit ugly to add keylocker support to my template, because
it always processes 4 registers of AES blocks per iteration of the main loop
(like the existing aes-xts-aesni), whereas the keylocker instructions are
hardcoded to operate on 8 AES blocks at a time in xmm0-xmm7, presumably to
reduce the overhead of unwrapping the key.

I did try an 8-wide version briefly.  There are some older CPUs on which it
helps.  (On newer CPUs, AES latency is lower, and the width increases by moving
to ymm or zmm registers anyway.)  But it didn't seem too attractive to me.  It
causes registers to spill, and it becomes a bit awkward to unroll the AES rounds
when the code size is twice as large, so it may need to be re-rolled.  I should
take a closer look, but I decided to just stay with a 4-wide version for now.

So I *think* AES-KL is best kept separate for now.  I do wonder if the AES-KL
code should adopt the idea of using VEX-coded instructions, though --- surely
it's the case that in practice, any CPU with AES-KL also supports AVX.

> I wouldn't mind retiring the existing xts(aesni)
> code entirely, and using the xts() wrapper around ecb-aes-aesni on
> 32-bit and on non-AVX uarchs with AES-NI.

Yes, it will need to be benchmarked, but that probably makes sense.  If
Wikipedia is to be trusted, on the Intel side only Westmere (from 2010) has
AES-NI but not AVX, and on the AMD side all CPUs with AES-NI have AVX...
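
(For reference, the kind of runtime gate involved is roughly the
following.  This is a sketch with a placeholder function name, not
necessarily the exact checks the patches use:)

    #include <asm/cpufeature.h>
    #include <linux/types.h>

    /* Sketch: keep the plain AES-NI code for CPUs like Westmere, and use
     * the AVX variant only when AVX is present.  Real glue code would
     * also check that the kernel can save YMM state (cpu_has_xfeatures()),
     * omitted here for brevity. */
    static bool can_use_aesni_avx(void)
    {
            return boot_cpu_has(X86_FEATURE_AES) &&
                   boot_cpu_has(X86_FEATURE_AVX);
    }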

- Eric
Eric Biggers April 3, 2024, 12:44 a.m. UTC | #3
On Fri, Mar 29, 2024 at 02:31:30AM -0700, Eric Biggers wrote:
> > I wouldn't mind retiring the existing xts(aesni)
> > code entirely, and using the xts() wrapper around ecb-aes-aesni on
> > 32-bit and on non-AVX uarchs with AES-NI.
> 
> Yes, it will need to be benchmarked, but that probably makes sense.  If
> Wikipedia is to be trusted, on the Intel side only Westmere (from 2010) has
> AES-NI but not AVX, and on the AMD side all CPUs with AES-NI have AVX...

It looks like I missed some low-power CPUs.  Intel's Silvermont (2013), Goldmont
(2016), Goldmont Plus (2017), and Tremont (2020) support AES-NI but not AVX.
Their successor, Gracemont (2021), supports AVX.

I don't have any one of those immediately available to run a test on.  But just
doing a quick benchmark on Zen 1, xts-aes-aesni has 62% higher throughput than
xts(ecb-aes-aesni).  The significant difference seems expected, since there's a
lot of API overhead in the xts template, and it computes all the tweaks twice in
C code.

So I'm thinking we'll need to keep xts-aes-aesni around for now, alongside
xts-aes-aesni-avx.

(And with all the SIMD instructions taking a different number of arguments and
having different names for AVX vs non-AVX, I don't see a clean way to unify them
in assembly.  They could be unified if we used C intrinsics instead of assembly
and compiled a C function with and without the "avx" target.  However,
intrinsics bring their own issues and make it hard to control the generated
code.  I don't really want to rely on intrinsics for this code.)

- Eric