[v4,2/2] crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM

From: Eric Biggers <ebiggers@google.com>

From: Eric Biggers <ebiggers@google.com>

Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations.  This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and
  "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
  of three underlying implementations according to the CPU capabilities
  detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
  aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
  257 KB of binary.  This massive binary size was not really
  appropriate, and depending on the kconfig it could take up over 1% the
  size of the entire vmlinux.  The main loops did 8 blocks per
  iteration.  The AVX code minimized the use of carryless multiplication
  whereas the AVX2 code did not.  The "AVX2" code did not actually use
  AVX2; the check for AVX2 was really a check for Intel Haswell or later
  to detect support for fast carryless multiplication.  The long source
  length was caused by factors such as significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
  of 1501 lines of source and 15 KB of binary.  The main loops did 4
  blocks per iteration and minimized the use of carryless multiplication
  by using Karatsuba multiplication and a multiplication-less reduction.

- The assembly code was contributed in 2010-2013.  Maintenance has been
  sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were
  separate from and were not consistent with the new VAES-AVX10 code I
  recently added.  The older code had several issues such as not
  precomputing the GHASH key powers, which hurt performance.

This rewrite achieves the following goals:

- Much shorter source and binary sizes.  The assembly source shrinks
  from 4300 lines to 1130 lines, and it produces about 9 KB of binary
  instead of 272 KB.  This is achieved via a better designed AES-GCM
  implementation that doesn't excessively unroll the code and instead
  prioritizes the parts that really matter.  Sharing the C glue code
  with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code
  runs, for most (possibly all) message lengths.  Benchmark results are
  given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10
  algorithms.  This fixes some issues with the integration of the
  assembly and results in some significant performance improvements,
  primarily on short messages.  Also, the AVX and non-AVX
  implementations are now registered as separate algorithms with the
  crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont,
  Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
  Since 256-bit vectors cannot be used without VAES anyway, this is made
  feasible by just using the non-VEX coded form of most instructions.

- Use a unified approach where the main loop does 8 blocks per iteration
  and uses Karatsuba multiplication to save one pclmulqdq per block but
  does not use the multiplication-less reduction.  This strikes a good
  balance across the range of CPUs on which this code runs.

- Don't spam the kernel log with an informational message on every boot.

The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/Makefile                 |    5 +-
 arch/x86/crypto/aes-gcm-aesni-x86_64.S   | 1131 +++++++++
 arch/x86/crypto/aesni-intel_asm.S        | 1503 +-----------
 arch/x86/crypto/aesni-intel_avx-x86_64.S | 2804 ----------------------
 arch/x86/crypto/aesni-intel_glue.c       |  764 ++----
 5 files changed, 1391 insertions(+), 4816 deletions(-)
 create mode 100644 arch/x86/crypto/aes-gcm-aesni-x86_64.S
 delete mode 100644 arch/x86/crypto/aesni-intel_avx-x86_64.S

Message ID	20240527075626.142576-3-ebiggers@kernel.org (mailing list archive)
State	Superseded
Delegated to:	Herbert Xu
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0129F3A1DA; Mon, 27 May 2024 07:57:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716796677; cv=none; b=DZOLkZiumC7VUqvvoUimgMGwtcCRYOeNPwTNDgTom+p7xiFweJK9OlB4/f1hImrtPOxDKO06Kd2hqH/2Gsni2UC9dtxdPKYeqkRWQfd4oPpbE7kEtpU9AeyGAbEKCNGLUMycRH9HEWkCyEqCF6MJo+1WcOW2xm886YsupvEg9ME= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716796677; c=relaxed/simple; bh=2j24p6COo4lV/qU0FWW4vr3ApA68WDPpBeoYbaAKdIg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=UpEL0d4rzPgjO6B6kJ0oG0BclFmkehUsvBSlkuzpiBIsPco+ilXq1FJKk0pE77jKfotoZgzgXO+kZpfXd0qBkPFuWLvKtU5nU+6EKxsYdgryWyUo+1ftD1Uzy+C9beo01mFtUlgbPQXpvYRJEfPCrYBC7NBNrdrqROHGKeCo4+s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=rq47BWEO; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="rq47BWEO" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5B47FC4AF09; Mon, 27 May 2024 07:57:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1716796676; bh=2j24p6COo4lV/qU0FWW4vr3ApA68WDPpBeoYbaAKdIg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=rq47BWEOT4CntnHvPOSLhczAT/0eyD4Wn6H+9t2ABMZKBf2/iAz+zpuD8Zrn1rlvw 0DDQou89KVeij5iK5HdfnG+IwxLuHltGfb9UVPT890/mPqwNIafJQwlvajEVA2NgfT nEqGw8VPwkiNOfCjEbgc+dKF8fakUgFwmth4DaJiZG47N8tfWiEN8NyfmPl4IrlaFs fFuz/Vp6OP6dg6zYOsMvQPLqPkCljb8r9l5rupGEDT0XiIekHJJy5Uaq0RIo5sjwgF 7boCAwJLDsuPWSBN2ZlFG5Klgt63RkuHpkKB3zrCQCHtQHGYkdCdNQEZ2/SvW/YH51 6cQvm/+oZLA3Q== From: Eric Biggers <ebiggers@kernel.org> To: linux-crypto@vger.kernel.org Cc: x86@kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH v4 2/2] crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM Date: Mon, 27 May 2024 00:56:26 -0700 Message-ID: <20240527075626.142576-3-ebiggers@kernel.org> X-Mailer: git-send-email 2.45.1 In-Reply-To: <20240527075626.142576-1-ebiggers@kernel.org> References: <20240527075626.142576-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: <linux-crypto.vger.kernel.org> List-Subscribe: <mailto:linux-crypto+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-crypto+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	x86_64 AES-GCM improvements \| expand [v4,0/2] x86_64 AES-GCM improvements [v4,1/2] crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM [v4,2/2] crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM

[v4,2/2] crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM

Commit Message

Patch