Patchwork [v2,3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

Submitter Eric Biggers
Date Feb. 13, 2018, 6:57 p.m.
Message ID <20180213185735.GA243817@google.com>
Permalink /patch/10217317/
State Not Applicable
Delegated to: Herbert Xu

Comments

Eric Biggers - Feb. 13, 2018, 6:57 p.m.
Hi Ard,

On Tue, Feb 13, 2018 at 11:34:36AM +0000, Ard Biesheuvel wrote:
> Hi Eric,
> 
> On 12 February 2018 at 23:52, Eric Biggers <ebiggers@google.com> wrote:
> > Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
> > 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
> > Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
> > encrypted/decrypted (doing one cipher round for all the blocks, then the
> > next round, etc.), then goes through XTS postprocessing.
> >
> > The performance depends on the processor but can be about 3 times faster
> > than the generic code.  For example, on an ARMv7 processor we observe
> > the following performance with Speck128/256-XTS:
> >
> >     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
> >     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
> >
> > In comparison to AES-256-XTS without the Cryptography Extensions:
> >
> >     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
> >     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
> >     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
> >
> > Speck64/128-XTS is even faster:
> >
> >     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
> >
> > Note that as with the generic code, only the Speck128 and Speck64
> > variants are supported.  Also, for now only the XTS mode of operation is
> > supported, to target the disk and file encryption use cases.  The NEON
> > code also only handles the portion of the data that is evenly divisible
> > into 128-byte chunks, with any remainder handled by a C fallback.  Of
> > course, other modes of operation could be added later if needed, and/or
> > the NEON code could be updated to handle other buffer sizes.
> >
> > The XTS specification is only defined for AES which has a 128-bit block
> > size, so for the GF(2^64) math needed for Speck64-XTS we use the
> > reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
> > paper.  Of course, when possible users should use Speck128-XTS, but even
> > that may be too slow on some processors; Speck64-XTS can be faster.
> >
> 
> I think this is excellent work. Speck seems an appropriate solution to
> this problem, and I'm glad we are not ending up with a stream cipher
> for block encryption.
> 
> Also, I think an arm64 port would be nice. I may take a stab at this
> if nobody else beats me to it.

We don't really want to encourage people to use Speck over AES with the
Cryptography Extensions, so that's why I didn't include an arm64 port.  That
being said, I suppose we can't stop people from adding an arm64 port if they
really do prefer Speck, or maybe for use on arm64 CPUs that don't have the
Cryptography Extensions (though I thought that almost all do).

> 
> I did run into an issue with this code though: On big-endian, I get
> 
> [    0.272381] alg: skcipher: Test 1 failed (invalid result) on
> encryption for xts-speck64-neon
> [    0.276151] 00000000: 84 af 54 07 19 d4 7c a6 9c 8a ac f6 c2 14 04 d8
> [    0.278541] 00000010: 7f 18 6c 43 56 ed 0b b3 92 21 a2 d9 17 59 e4 3b
> 
> so there may be a byte order corner case you missed in the rewrite (or
> the issue existed before, as I did not test your v1)
> 

To be honest I haven't tested either version on a big endian ARM CPU yet.  I
don't really know how to do that currently; maybe it's possible with QEMU.

But assuming I haven't missed anything, in the assembly code everything is
treated as byte arrays with the exception of the round keys which are 32-bit or
64-bit numbers in CPU endianness.  The byte arrays are loaded and stored with
vld1.8 and vst1.8 while the round keys are loaded with vld1.32 or vld1.64, so
the assembly code *should* work correctly on a big endian CPU.

However, looking over it now, I think there is a bug in the glue code for
Speck64-XTS when it handles buffers not evenly divisible into 128 bytes.
Namely, the tweak is treated as CPU endian when it should be little endian.
Could you try the following patch?
Ard Biesheuvel - Feb. 13, 2018, 7:04 p.m.
On 13 February 2018 at 18:57, Eric Biggers <ebiggers@google.com> wrote:
> Hi Ard,
>
> On Tue, Feb 13, 2018 at 11:34:36AM +0000, Ard Biesheuvel wrote:
>> Hi Eric,
>>
>> On 12 February 2018 at 23:52, Eric Biggers <ebiggers@google.com> wrote:
>> > Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> > 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> > Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> > encrypted/decrypted (doing one cipher round for all the blocks, then the
>> > next round, etc.), then goes through XTS postprocessing.
>> >
>> > The performance depends on the processor but can be about 3 times faster
>> > than the generic code.  For example, on an ARMv7 processor we observe
>> > the following performance with Speck128/256-XTS:
>> >
>> >     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>> >     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>> >
>> > In comparison to AES-256-XTS without the Cryptography Extensions:
>> >
>> >     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>> >     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>> >     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>> >
>> > Speck64/128-XTS is even faster:
>> >
>> >     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>> >
>> > Note that as with the generic code, only the Speck128 and Speck64
>> > variants are supported.  Also, for now only the XTS mode of operation is
>> > supported, to target the disk and file encryption use cases.  The NEON
>> > code also only handles the portion of the data that is evenly divisible
>> > into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> > course, other modes of operation could be added later if needed, and/or
>> > the NEON code could be updated to handle other buffer sizes.
>> >
>> > The XTS specification is only defined for AES which has a 128-bit block
>> > size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> > reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> > paper.  Of course, when possible users should use Speck128-XTS, but even
>> > that may be too slow on some processors; Speck64-XTS can be faster.
>> >
>>
>> I think this is excellent work. Speck seems an appropriate solution to
>> this problem, and I'm glad we are not ending up with a stream cipher
>> for block encryption.
>>
>> Also, I think an arm64 port would be nice. I may take a stab at this
>> if nobody else beats me to it.
>
> We don't really want to encourage people to use Speck over AES with the
> Cryptography Extensions, so that's why I didn't include an arm64 port.  That
> being said, I suppose we can't stop people from adding an arm64 port if they
> really do prefer Speck, or maybe for use on arm64 CPUs that don't have the
> Cryptography Extensions (though I thought that almost all do).
>

Many do, but not all of them. A notable exception is the Raspberry Pi 3.

>>
>> I did run into an issue with this code though: On big-endian, I get
>>
>> [    0.272381] alg: skcipher: Test 1 failed (invalid result) on
>> encryption for xts-speck64-neon
>> [    0.276151] 00000000: 84 af 54 07 19 d4 7c a6 9c 8a ac f6 c2 14 04 d8
>> [    0.278541] 00000010: 7f 18 6c 43 56 ed 0b b3 92 21 a2 d9 17 59 e4 3b
>>
>> so there may be a byte order corner case you missed in the rewrite (or
>> the issue existed before, as I did not test your v1)
>>
>
> To be honest I haven't tested either version on a big endian ARM CPU yet.  I
> don't really know how to do that currently; maybe it's possible with QEMU.
>

I tested this on a big-endian 32-bit VM running under KVM on a 64-bit host.

> But assuming I haven't missed anything, in the assembly code everything is
> treated as byte arrays with the exception of the round keys which are 32-bit or
> 64-bit numbers in CPU endianness.  The byte arrays are loaded and stored with
> vld1.8 and vst1.8 while the round keys are loaded with vld1.32 or vld1.64, so
> the assembly code *should* work correctly on a big endian CPU.
>

Indeed.

> However, looking over it now, I think there is a bug in the glue code for
> Speck64-XTS when it handles buffers not evenly divisible into 128 bytes.
> Namely, the tweak is treated as CPU endian when it should be little endian.
> Could you try the following patch?
>
> diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c
> index 3987dd6e063e..960cc634b36f 100644
> --- a/arch/arm/crypto/speck-neon-glue.c
> +++ b/arch/arm/crypto/speck-neon-glue.c
> @@ -157,7 +157,7 @@ __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
>         struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>         const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
>         struct skcipher_walk walk;
> -       u64 tweak;
> +       __le64 tweak;
>         int err;
>
>         err = skcipher_walk_virt(&walk, req, true);
> @@ -184,16 +184,16 @@ __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
>                 }
>
>                 /* Handle any remainder with generic code */
> -               while (nbytes >= sizeof(u64)) {
> -                       *(u64 *)dst = *(u64 *)src ^ tweak;
> +               while (nbytes >= sizeof(__le64)) {
> +                       *(__le64 *)dst = *(__le64 *)src ^ tweak;
>                         (*crypt_one)(&ctx->main_key, dst, dst);
> -                       *(u64 *)dst ^= tweak;
> -                       tweak = (tweak << 1) ^
> -                               ((tweak & (1ULL << 63)) ? 0x1B : 0);
> -
> -                       dst += sizeof(u64);
> -                       src += sizeof(u64);
> -                       nbytes -= sizeof(u64);
> +                       *(__le64 *)dst ^= tweak;
> +                       tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^
> +                                           ((tweak & cpu_to_le64(1ULL << 63)) ?
> +                                            0x1B : 0));
> +                       dst += sizeof(__le64);
> +                       src += sizeof(__le64);
> +                       nbytes -= sizeof(__le64);
>                 }
>                 err = skcipher_walk_done(&walk, nbytes);
>         }

This fixes it.

Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Patch

diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c
index 3987dd6e063e..960cc634b36f 100644
--- a/arch/arm/crypto/speck-neon-glue.c
+++ b/arch/arm/crypto/speck-neon-glue.c
@@ -157,7 +157,7 @@  __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
-	u64 tweak;
+	__le64 tweak;
 	int err;
 
 	err = skcipher_walk_virt(&walk, req, true);
@@ -184,16 +184,16 @@  __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
 		}
 
 		/* Handle any remainder with generic code */
-		while (nbytes >= sizeof(u64)) {
-			*(u64 *)dst = *(u64 *)src ^ tweak;
+		while (nbytes >= sizeof(__le64)) {
+			*(__le64 *)dst = *(__le64 *)src ^ tweak;
 			(*crypt_one)(&ctx->main_key, dst, dst);
-			*(u64 *)dst ^= tweak;
-			tweak = (tweak << 1) ^
-				((tweak & (1ULL << 63)) ? 0x1B : 0);
-
-			dst += sizeof(u64);
-			src += sizeof(u64);
-			nbytes -= sizeof(u64);
+			*(__le64 *)dst ^= tweak;
+			tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^
+					    ((tweak & cpu_to_le64(1ULL << 63)) ?
+					     0x1B : 0));
+			dst += sizeof(__le64);
+			src += sizeof(__le64);
+			nbytes -= sizeof(__le64);
 		}
 		err = skcipher_walk_done(&walk, nbytes);
 	}