[3/6] crypto: aegis - avoid prerotated AES tables

Message ID	20190624073818.29296-4-ard.biesheuvel@linaro.org (mailing list archive)
State	Changes Requested
Delegated to:	Herbert Xu
Headers	Return-Path: <linux-crypto-owner@kernel.org>
	From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
	To: linux-crypto@vger.kernel.org
	Cc: linux-arm-kernel@lists.infradead.org,
	        Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	        Eric Biggers <ebiggers@google.com>,
	        Ondrej Mosnacek <omosnace@redhat.com>,
	        Herbert Xu <herbert@gondor.apana.org.au>,
	        Steve Capper <steve.capper@arm.com>
	Subject: [PATCH 3/6] crypto: aegis - avoid prerotated AES tables
	Date: Mon, 24 Jun 2019 09:38:15 +0200
	Message-Id: <20190624073818.29296-4-ard.biesheuvel@linaro.org>
	In-Reply-To: <20190624073818.29296-1-ard.biesheuvel@linaro.org>
	References: <20190624073818.29296-1-ard.biesheuvel@linaro.org>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Sender: linux-crypto-owner@vger.kernel.org
	Precedence: bulk
Series	crypto: aegis128 - add NEON intrinsics version for ARM/arm64
	[0/6] crypto: aegis128 - add NEON intrinsics version for ARM/arm64
	[1/6] crypto: aegis128 - use unaliged helper in unaligned decrypt path
	[2/6] crypto: aegis - drop empty TFM init/exit routines
	[3/6] crypto: aegis - avoid prerotated AES tables
	[4/6] crypto: aegis128 - add support for SIMD acceleration
	[5/6] crypto: aegis128 - provide a SIMD implementation based on NEON intrinsics
	[6/6] crypto: tcrypt - add a speed test for AEGIS128

Commit Message

Ard Biesheuvel June 24, 2019, 7:38 a.m. UTC

The generic AES code provides four sets of lookup tables, where each
set consists of four tables containing the same 32-bit values, but
rotated by 0, 8, 16 and 24 bits, respectively. This makes sense for
CISC architectures such as x86 which support memory operands, but
for other architectures, the rotates are quite cheap, and using all
four tables needlessly thrashes the D-cache, and actually hurts rather
than helps performance.

Since x86 already has its own implementation of AEGIS based on AES-NI
instructions, let's tweak the generic implementation towards other
architectures, and avoid the prerotated tables, and perform the
rotations inline. On ARM Cortex-A53, this results in a ~8% speedup.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/aegis.h | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)
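
The equivalence this patch relies on can be checked outside the kernel. The following is a minimal userspace sketch, not the kernel code: it rebuilds the first forward table from the AES S-box definition, derives the three prerotated copies from it, and compares the four-table lookup with the single-table lookup plus rotates for each output column. All names here (build_sbox, build_t0, rotated_lookup_matches) are hypothetical helpers, rol32 is a local stand-in for the kernel helper of the same name, and the byte packing of the table entries is assumed to match the layout of crypto_ft_tab[0].

```c
#include <stdint.h>

/* Local stand-in for the kernel's rol32(): rotate a 32-bit word left. */
static inline uint32_t rol32(uint32_t w, unsigned int s)
{
	return (w << (s & 31)) | (w >> ((-s) & 31));
}

static uint8_t rol8(uint8_t b, unsigned int s)
{
	return (uint8_t)((b << s) | (b >> (8 - s)));
}

/* Multiply by 2 in GF(2^8) with the AES reduction polynomial. */
static uint8_t xtime(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0));
}

/* Compute the AES S-box: field inverse followed by the affine map. */
static void build_sbox(uint8_t sbox[256])
{
	uint8_t p = 1, q = 1;

	do {
		/* p walks all non-zero field elements (p *= 3) */
		p = (uint8_t)(p ^ (p << 1) ^ ((p & 0x80) ? 0x1b : 0));
		/* q tracks the inverse of p (q /= 3) */
		q ^= (uint8_t)(q << 1);
		q ^= (uint8_t)(q << 2);
		q ^= (uint8_t)(q << 4);
		if (q & 0x80)
			q ^= 0x09;
		sbox[p] = (uint8_t)(q ^ rol8(q, 1) ^ rol8(q, 2) ^
				    rol8(q, 3) ^ rol8(q, 4) ^ 0x63);
	} while (p != 1);
	sbox[0] = 0x63;	/* zero has no inverse */
}

/* T0[x] packs {2,1,1,3} * S[x], lowest byte first (assumed layout). */
static void build_t0(uint32_t t0[256])
{
	uint8_t sbox[256];
	int i;

	build_sbox(sbox);
	for (i = 0; i < 256; i++) {
		uint8_t s = sbox[i], s2 = xtime(s);

		t0[i] = (uint32_t)s2 | (uint32_t)s << 8 |
			(uint32_t)s << 16 | (uint32_t)(s2 ^ s) << 24;
	}
}

/* Returns 1 if both lookup styles agree on all four output columns. */
int rotated_lookup_matches(void)
{
	static uint32_t tab[4][256];	/* prerotated layout: 4 KiB */
	const uint32_t *t = tab[0];	/* single table: 1 KiB */
	uint8_t s[16];
	int i, k, col;

	build_t0(tab[0]);
	for (k = 1; k < 4; k++)
		for (i = 0; i < 256; i++)
			tab[k][i] = rol32(tab[0][i], 8 * k);

	for (i = 0; i < 16; i++)
		s[i] = (uint8_t)(i * 17 + 3);	/* arbitrary sample state */

	for (col = 0; col < 4; col++) {
		/* same byte selection as d0..d3 in crypto_aegis_aesenc() */
		int a = 4 * col, b = (a + 5) % 16;
		int c = (a + 10) % 16, d = (a + 15) % 16;
		uint32_t four = tab[0][s[a]] ^ tab[1][s[b]] ^
				tab[2][s[c]] ^ tab[3][s[d]];
		uint32_t one = t[s[a]] ^ rol32(t[s[b]], 8) ^
			       rol32(t[s[c]], 16) ^ rol32(t[s[d]], 24);

		if (four != one)
			return 0;
	}
	return 1;
}
```

Calling rotated_lookup_matches() from a small main() should return 1. The sketch only demonstrates that the two forms compute the same words; it says nothing about the performance difference, which depends on the cache behaviour of the target CPU.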

Comments

Ondrej Mosnacek June 24, 2019, 8:13 a.m. UTC | #1

On Mon, Jun 24, 2019 at 9:38 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> The generic AES code provides four sets of lookup tables, where each
> set consists of four tables containing the same 32-bit values, but
> rotated by 0, 8, 16 and 24 bits, respectively. This makes sense for
> CISC architectures such as x86 which support memory operands, but
> for other architectures, the rotates are quite cheap, and using all
> four tables needlessly thrashes the D-cache, and actually hurts rather
> than helps performance.
>
> Since x86 already has its own implementation of AEGIS based on AES-NI
> instructions, let's tweak the generic implementation towards other
> architectures, and avoid the prerotated tables, and perform the
> rotations inline. On ARM Cortex-A53, this results in a ~8% speedup.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

I'm not an expert on low-level performance, but the rationale sounds reasonable.

Acked-by: Ondrej Mosnacek <omosnace@redhat.com>


diff --git a/crypto/aegis.h b/crypto/aegis.h
index 41a3090cda8e..3308066ddde0 100644
--- a/crypto/aegis.h
+++ b/crypto/aegis.h
@@ -10,6 +10,7 @@ 
 #define _CRYPTO_AEGIS_H
 
 #include <crypto/aes.h>
+#include <linux/bitops.h>
 #include <linux/types.h>
 
 #define AEGIS_BLOCK_SIZE 16
@@ -53,16 +54,13 @@  static void crypto_aegis_aesenc(union aegis_block *dst,
 				const union aegis_block *key)
 {
 	const u8  *s  = src->bytes;
-	const u32 *t0 = crypto_ft_tab[0];
-	const u32 *t1 = crypto_ft_tab[1];
-	const u32 *t2 = crypto_ft_tab[2];
-	const u32 *t3 = crypto_ft_tab[3];
+	const u32 *t = crypto_ft_tab[0];
 	u32 d0, d1, d2, d3;
 
-	d0 = t0[s[ 0]] ^ t1[s[ 5]] ^ t2[s[10]] ^ t3[s[15]];
-	d1 = t0[s[ 4]] ^ t1[s[ 9]] ^ t2[s[14]] ^ t3[s[ 3]];
-	d2 = t0[s[ 8]] ^ t1[s[13]] ^ t2[s[ 2]] ^ t3[s[ 7]];
-	d3 = t0[s[12]] ^ t1[s[ 1]] ^ t2[s[ 6]] ^ t3[s[11]];
+	d0 = t[s[ 0]] ^ rol32(t[s[ 5]], 8) ^ rol32(t[s[10]], 16) ^ rol32(t[s[15]], 24);
+	d1 = t[s[ 4]] ^ rol32(t[s[ 9]], 8) ^ rol32(t[s[14]], 16) ^ rol32(t[s[ 3]], 24);
+	d2 = t[s[ 8]] ^ rol32(t[s[13]], 8) ^ rol32(t[s[ 2]], 16) ^ rol32(t[s[ 7]], 24);
+	d3 = t[s[12]] ^ rol32(t[s[ 1]], 8) ^ rol32(t[s[ 6]], 16) ^ rol32(t[s[11]], 24);
 
 	dst->words32[0] = cpu_to_le32(d0) ^ key->words32[0];
 	dst->words32[1] = cpu_to_le32(d1) ^ key->words32[1];
