[v3] crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM

From: Eric Biggers <ebiggers@google.com>

Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10.  There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors.  This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.

I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code.  The main loop processes 4 vectors per
iteration, with the AES and GHASH instructions interleaved.  Any
remainder is handled by a simple loop that processes 1 vector at a time,
with masking.
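
As an illustration of the loop structure described above, here is a
minimal C sketch of how a message length could be split between a
4-vector main loop and a masked 1-vector tail loop.  It is not taken
from the patch: the names gcm_loop_plan and plan_gcm_loops are
hypothetical, and the real assembly may organize the remainder handling
differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the loop split; not part of the patch. */
struct gcm_loop_plan {
	size_t main_iters;	/* iterations of the 4-vectors-at-a-time loop */
	size_t tail_iters;	/* iterations of the 1-vector-at-a-time loop */
	uint64_t last_mask;	/* byte mask for the final tail vector; bit i set
				 * means byte i is valid; 0 if there is no tail */
};

/* Assumes vec_bytes is 32 (256-bit vectors) or 64 (512-bit vectors). */
static struct gcm_loop_plan plan_gcm_loops(size_t len, size_t vec_bytes)
{
	struct gcm_loop_plan plan;
	size_t stride = 4 * vec_bytes;	/* bytes consumed per main iteration */
	size_t rem, last;

	assert(vec_bytes == 32 || vec_bytes == 64);

	plan.main_iters = len / stride;
	rem = len % stride;
	/* Round up: a partial final vector still needs one tail iteration. */
	plan.tail_iters = (rem + vec_bytes - 1) / vec_bytes;

	if (rem == 0) {
		plan.last_mask = 0;
		return plan;
	}

	last = rem % vec_bytes;
	if (last == 0)
		last = vec_bytes;	/* final tail vector is completely full */

	/* Avoid the undefined 64-bit shift when all 64 bytes are valid. */
	plan.last_mask = (last == 64) ? UINT64_MAX
				      : ((UINT64_C(1) << last) - 1);
	return plan;
}
```

For example, under these assumptions a 4095-byte message with 512-bit
vectors takes 15 main-loop iterations and 4 tail iterations, with the
last tail vector covering 63 valid bytes.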

Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
as well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
support 256-bit vectors, which makes it unusable on future
AVX10/256-only CPUs and suboptimal on certain Intel CPUs that have
downclocking issues.  So I ended up starting from scratch.  Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.

To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.

Note that although much of the new assembly code is agnostic to vector
length, it wouldn't be easy to make it support CPUs that lack AVX512 or
AVX10.  This was doable for the new AES-XTS code I recently added;
however, AES-GCM relies more heavily on the full set of 32 SIMD
registers, masking, and new instructions provided by AVX512 or AVX10.

The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.
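
To make the percentage convention above concrete, here is a small C
sketch of the arithmetic.  The function names are hypothetical and were
not part of the benchmark module; they only restate the definition given
above.

```c
#include <assert.h>

/*
 * Percentage improvement in throughput, as used in Tables 1 and 2:
 * an increase from 4000 MB/s to 6000 MB/s is a 50% improvement.
 */
static double throughput_improvement_pct(double old_mbps, double new_mbps)
{
	assert(old_mbps > 0.0);
	return (new_mbps - old_mbps) / old_mbps * 100.0;
}

/* Convert a percentage improvement into a speedup factor (new / old). */
static double speedup_factor(double improvement_pct)
{
	return 1.0 + improvement_pct / 100.0;
}
```

Read this way, the best-case 162% improvement in Table 1 corresponds to
the new code running at roughly 2.62 times the throughput of the
existing code.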

The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine.  I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.

I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.

Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |

Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |

So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark.  (This also holds true when doing
the same tests on AMD Zen 4.)  It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate.  The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---

Changed in v3:
- Optimized the finalization code slightly.
- Fixed a minor issue in my userspace benchmark program (guard page
  after key struct made "AVX512_Cloudflare" extra slow on some input
  lengths) and regenerated tables 3-4.  Also upgraded to Emerald Rapids.
- Eliminated an instruction from _aes_gcm_precompute.

Changed in v2:
- Additional assembly optimizations
- Improved some comments
- Aligned key struct to 64 bytes
- Added comparison with Cloudflare's implementation of AES-GCM
- Other cleanups

 arch/x86/crypto/Kconfig                |    1 +
 arch/x86/crypto/Makefile               |    3 +
 arch/x86/crypto/aes-gcm-avx10-x86_64.S | 1223 ++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c     |  550 ++++++++++-
 4 files changed, 1761 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/crypto/aes-gcm-avx10-x86_64.S

base-commit: 0450d2083be6bdcd18c9535ac50c55266499b2df

Message ID: 20240519065219.128027-1-ebiggers@kernel.org
State: Superseded
Delegated to: Herbert Xu
Date: Sat, 18 May 2024 23:52:19 -0700
To: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, x86@kernel.org
