From patchwork Mon Sep 26 09:36:05 2022
X-Patchwork-Submitter: "tianjia.zhang"
X-Patchwork-Id: 12988474
From: Tianjia Zhang
To: Herbert Xu, "David S. Miller", Jussi Kivilinna, Ard Biesheuvel,
    Catalin Marinas, Will Deacon, Maxime Coquelin, Alexandre Torgue,
    Eric Biggers, linux-crypto@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
    linux-stm32@st-md-mailman.stormreply.com
Subject: [PATCH 01/16] crypto: arm64/sm3 - raise the priority of the CE implementation
Date: Mon, 26 Sep 2022 17:36:05 +0800
Message-Id: <20220926093620.99898-2-tianjia.zhang@linux.alibaba.com>
In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com>
References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com>

Raise the priority of the sm3-ce algorithm from 200 to 400 to make room
for the sm3-neon implementation.
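For context, the crypto core resolves the generic "sm3" name to whichever
registered implementation advertises the highest cra_priority, so after this
change sm3-ce (400) is still preferred over the sm3-neon driver added later in
this series (200) and over the generic C implementation. Below is a minimal
sketch, not part of the patch, showing how that selection can be observed from
a throwaway test module; the module name and log text are made up for
illustration only:

        // Illustrative only, not part of this series: allocating the generic
        // "sm3" name binds to the registered driver with the highest
        // cra_priority, which is what the 200 -> 400 bump above influences.
        #include <linux/module.h>
        #include <linux/err.h>
        #include <crypto/hash.h>

        static int __init sm3_prio_demo_init(void)
        {
                struct crypto_shash *tfm;

                tfm = crypto_alloc_shash("sm3", 0, 0);
                if (IS_ERR(tfm))
                        return PTR_ERR(tfm);

                /* Report which implementation the core picked and its priority. */
                pr_info("sm3 resolved to driver: %s (priority %d)\n",
                        crypto_shash_driver_name(tfm),
                        crypto_shash_alg(tfm)->base.cra_priority);

                crypto_free_shash(tfm);
                return 0;
        }

        static void __exit sm3_prio_demo_exit(void)
        {
        }

        module_init(sm3_prio_demo_init);
        module_exit(sm3_prio_demo_exit);
        MODULE_LICENSE("GPL");

On a CPU with the ARMv8.2 SM3 extensions this should report sm3-ce as the
bound driver; without them the NEON or generic fallback is selected instead.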
Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm3-ce-glue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/crypto/sm3-ce-glue.c b/arch/arm64/crypto/sm3-ce-glue.c index ee98954ae8ca..54bf6ebcfffb 100644 --- a/arch/arm64/crypto/sm3-ce-glue.c +++ b/arch/arm64/crypto/sm3-ce-glue.c @@ -84,7 +84,7 @@ static struct shash_alg sm3_alg = { .base.cra_driver_name = "sm3-ce", .base.cra_blocksize = SM3_BLOCK_SIZE, .base.cra_module = THIS_MODULE, - .base.cra_priority = 200, + .base.cra_priority = 400, }; static int __init sm3_ce_mod_init(void) From patchwork Mon Sep 26 09:36:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988477 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 55401C6FA83 for ; Mon, 26 Sep 2022 09:39:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=CQLi6W/QvXX4t/nkIqPpiKXgEHi7J5ysHk5MNKSzxCI=; b=B8LubzVx5Q60vy sLtQfFCfUBCvZDjWLtevkYn8LZQusv3J/yxfjPpfh+cP8iF60usi7Qz3feG8am+NLQt39EY9aAGM8 DShR8XlN6ostLeR4l79r6cRUNusD06Omzo6Qwo1yK2XlStV6T0qXfHafp56e6vVB8k61K/a25/XVC ebZbnukFYFECcmkobX/mJj/O7P31LRKBXP0smNmqnHxuDqItbc/MGYIi7lXm87jZdGYrX+Txn50S3 g6xKdLXWrBtzKlwgd9/PJyKr8ghdcg3Ry1afcfTyoz6dy5YfsMWQ3KiCC3HsjrjQMNb2uxD/52F6W R1gaJbwh7BsSbKFNtR2Q==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockZI-003hcD-71; Mon, 26 Sep 2022 09:38:28 +0000 Received: from out30-133.freemail.mail.aliyun.com ([115.124.30.133]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXS-003gh1-F2 for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:42 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQjskWT_1664184988; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQjskWT_1664184988) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:29 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. 
Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 02/16] crypto: arm64/sm3 - add NEON assembly implementation Date: Mon, 26 Sep 2022 17:36:06 +0800 Message-Id: <20220926093620.99898-3-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023635_057498_377C363D X-CRM114-Status: GOOD ( 15.20 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch adds the NEON acceleration implementation of the SM3 hash algorithm. The main algorithm is based on SM3 NEON accelerated work of the libgcrypt project. Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 326 mode of tcrypt, and compares the performance data of sm3-generic and sm3-ce. The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: update-size | 16 64 256 1024 2048 4096 8192 ---------------+-------------------------------------------------------- sm3-generic | 185.24 221.28 301.26 307.43 300.83 308.82 308.91 sm3-neon | 171.81 220.20 322.94 339.28 334.09 343.61 343.87 sm3-ce | 227.48 333.48 502.62 527.87 520.45 534.91 535.40 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/Kconfig | 11 + arch/arm64/crypto/Makefile | 3 + arch/arm64/crypto/sm3-neon-core.S | 600 ++++++++++++++++++++++++++++++ arch/arm64/crypto/sm3-neon-glue.c | 103 +++++ 4 files changed, 717 insertions(+) create mode 100644 arch/arm64/crypto/sm3-neon-core.S create mode 100644 arch/arm64/crypto/sm3-neon-glue.c diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig index 8bd80508a710..4b121dc0cfba 100644 --- a/arch/arm64/crypto/Kconfig +++ b/arch/arm64/crypto/Kconfig @@ -96,6 +96,17 @@ config CRYPTO_SHA3_ARM64 Architecture: arm64 using: - ARMv8.2 Crypto Extensions +config CRYPTO_SM3_NEON + tristate "Hash functions: SM3 (NEON)" + depends on KERNEL_MODE_NEON + select CRYPTO_HASH + select CRYPTO_SM3 + help + SM3 (ShangMi 3) secure hash function (OSCCA GM/T 0004-2012) + + Architecture: arm64 using: + - NEON (Advanced SIMD) extensions + config CRYPTO_SM3_ARM64_CE tristate "Hash functions: SM3 (ARMv8.2 Crypto Extensions)" depends on KERNEL_MODE_NEON diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile index 24bb0c4610de..087f1625e775 100644 --- a/arch/arm64/crypto/Makefile +++ b/arch/arm64/crypto/Makefile @@ -17,6 +17,9 @@ sha512-ce-y := sha512-ce-glue.o sha512-ce-core.o obj-$(CONFIG_CRYPTO_SHA3_ARM64) += sha3-ce.o sha3-ce-y := sha3-ce-glue.o sha3-ce-core.o +obj-$(CONFIG_CRYPTO_SM3_NEON) += sm3-neon.o +sm3-neon-y := sm3-neon-glue.o sm3-neon-core.o + obj-$(CONFIG_CRYPTO_SM3_ARM64_CE) += sm3-ce.o sm3-ce-y := sm3-ce-glue.o sm3-ce-core.o diff --git a/arch/arm64/crypto/sm3-neon-core.S b/arch/arm64/crypto/sm3-neon-core.S new file mode 100644 index 000000000000..3e3b4e5c736f --- /dev/null +++ b/arch/arm64/crypto/sm3-neon-core.S @@ -0,0 +1,600 @@ +// 
SPDX-License-Identifier: GPL-2.0-or-later +/* + * sm3-neon-core.S - SM3 secure hash using NEON instructions + * + * Linux/arm64 port of the libgcrypt SM3 implementation for AArch64 + * + * Copyright (C) 2021 Jussi Kivilinna + * Copyright (c) 2022 Tianjia Zhang + */ + +#include +#include + +/* Context structure */ + +#define state_h0 0 +#define state_h1 4 +#define state_h2 8 +#define state_h3 12 +#define state_h4 16 +#define state_h5 20 +#define state_h6 24 +#define state_h7 28 + +/* Stack structure */ + +#define STACK_W_SIZE (32 * 2 * 3) + +#define STACK_W (0) +#define STACK_SIZE (STACK_W + STACK_W_SIZE) + +/* Register macros */ + +#define RSTATE x0 +#define RDATA x1 +#define RNBLKS x2 +#define RKPTR x28 +#define RFRAME x29 + +#define ra w3 +#define rb w4 +#define rc w5 +#define rd w6 +#define re w7 +#define rf w8 +#define rg w9 +#define rh w10 + +#define t0 w11 +#define t1 w12 +#define t2 w13 +#define t3 w14 +#define t4 w15 +#define t5 w16 +#define t6 w17 + +#define k_even w19 +#define k_odd w20 + +#define addr0 x21 +#define addr1 x22 + +#define s0 w23 +#define s1 w24 +#define s2 w25 +#define s3 w26 + +#define W0 v0 +#define W1 v1 +#define W2 v2 +#define W3 v3 +#define W4 v4 +#define W5 v5 + +#define XTMP0 v6 +#define XTMP1 v7 +#define XTMP2 v16 +#define XTMP3 v17 +#define XTMP4 v18 +#define XTMP5 v19 +#define XTMP6 v20 + +/* Helper macros. */ + +#define _(...) /*_*/ + +#define clear_vec(x) \ + movi x.8h, #0; + +#define rolw(o, a, n) \ + ror o, a, #(32 - n); + +/* Round function macros. */ + +#define GG1_1(x, y, z, o, t) \ + eor o, x, y; +#define GG1_2(x, y, z, o, t) \ + eor o, o, z; +#define GG1_3(x, y, z, o, t) + +#define FF1_1(x, y, z, o, t) GG1_1(x, y, z, o, t) +#define FF1_2(x, y, z, o, t) +#define FF1_3(x, y, z, o, t) GG1_2(x, y, z, o, t) + +#define GG2_1(x, y, z, o, t) \ + bic o, z, x; +#define GG2_2(x, y, z, o, t) \ + and t, y, x; +#define GG2_3(x, y, z, o, t) \ + eor o, o, t; + +#define FF2_1(x, y, z, o, t) \ + eor o, x, y; +#define FF2_2(x, y, z, o, t) \ + and t, x, y; \ + and o, o, z; +#define FF2_3(x, y, z, o, t) \ + eor o, o, t; + +#define R(i, a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \ + K_LOAD(round); \ + ldr t5, [sp, #(wtype##_W1_ADDR(round, widx))]; \ + rolw(t0, a, 12); /* rol(a, 12) => t0 */ \ + IOP(1, iop_param); \ + FF##i##_1(a, b, c, t1, t2); \ + ldr t6, [sp, #(wtype##_W1W2_ADDR(round, widx))]; \ + add k, k, e; \ + IOP(2, iop_param); \ + GG##i##_1(e, f, g, t3, t4); \ + FF##i##_2(a, b, c, t1, t2); \ + IOP(3, iop_param); \ + add k, k, t0; \ + add h, h, t5; \ + add d, d, t6; /* w1w2 + d => d */ \ + IOP(4, iop_param); \ + rolw(k, k, 7); /* rol (t0 + e + t), 7) => k */ \ + GG##i##_2(e, f, g, t3, t4); \ + add h, h, k; /* h + w1 + k => h */ \ + IOP(5, iop_param); \ + FF##i##_3(a, b, c, t1, t2); \ + eor t0, t0, k; /* k ^ t0 => t0 */ \ + GG##i##_3(e, f, g, t3, t4); \ + add d, d, t1; /* FF(a,b,c) + d => d */ \ + IOP(6, iop_param); \ + add t3, t3, h; /* GG(e,f,g) + h => t3 */ \ + rolw(b, b, 9); /* rol(b, 9) => b */ \ + eor h, t3, t3, ror #(32-9); \ + IOP(7, iop_param); \ + add d, d, t0; /* t0 + d => d */ \ + rolw(f, f, 19); /* rol(f, 19) => f */ \ + IOP(8, iop_param); \ + eor h, h, t3, ror #(32-17); /* P0(t3) => h */ + +#define R1(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \ + R(1, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, widx, wtype, IOP, iop_param) + +#define R2(a, b, c, d, e, f, g, h, k, K_LOAD, round, widx, wtype, IOP, iop_param) \ + R(2, ##a, ##b, ##c, ##d, ##e, ##f, ##g, ##h, ##k, K_LOAD, round, 
widx, wtype, IOP, iop_param) + +#define KL(round) \ + ldp k_even, k_odd, [RKPTR, #(4*(round))]; + +/* Input expansion macros. */ + +/* Byte-swapped input address. */ +#define IW_W_ADDR(round, widx, offs) \ + (STACK_W + ((round) / 4) * 64 + (offs) + ((widx) * 4)) + +/* Expanded input address. */ +#define XW_W_ADDR(round, widx, offs) \ + (STACK_W + ((((round) / 3) - 4) % 2) * 64 + (offs) + ((widx) * 4)) + +/* Rounds 1-12, byte-swapped input block addresses. */ +#define IW_W1_ADDR(round, widx) IW_W_ADDR(round, widx, 32) +#define IW_W1W2_ADDR(round, widx) IW_W_ADDR(round, widx, 48) + +/* Rounds 1-12, expanded input block addresses. */ +#define XW_W1_ADDR(round, widx) XW_W_ADDR(round, widx, 0) +#define XW_W1W2_ADDR(round, widx) XW_W_ADDR(round, widx, 16) + +/* Input block loading. + * Interleaving within round function needed for in-order CPUs. */ +#define LOAD_W_VEC_1_1() \ + add addr0, sp, #IW_W1_ADDR(0, 0); +#define LOAD_W_VEC_1_2() \ + add addr1, sp, #IW_W1_ADDR(4, 0); +#define LOAD_W_VEC_1_3() \ + ld1 {W0.16b}, [RDATA], #16; +#define LOAD_W_VEC_1_4() \ + ld1 {W1.16b}, [RDATA], #16; +#define LOAD_W_VEC_1_5() \ + ld1 {W2.16b}, [RDATA], #16; +#define LOAD_W_VEC_1_6() \ + ld1 {W3.16b}, [RDATA], #16; +#define LOAD_W_VEC_1_7() \ + rev32 XTMP0.16b, W0.16b; +#define LOAD_W_VEC_1_8() \ + rev32 XTMP1.16b, W1.16b; +#define LOAD_W_VEC_2_1() \ + rev32 XTMP2.16b, W2.16b; +#define LOAD_W_VEC_2_2() \ + rev32 XTMP3.16b, W3.16b; +#define LOAD_W_VEC_2_3() \ + eor XTMP4.16b, XTMP1.16b, XTMP0.16b; +#define LOAD_W_VEC_2_4() \ + eor XTMP5.16b, XTMP2.16b, XTMP1.16b; +#define LOAD_W_VEC_2_5() \ + st1 {XTMP0.16b}, [addr0], #16; +#define LOAD_W_VEC_2_6() \ + st1 {XTMP4.16b}, [addr0]; \ + add addr0, sp, #IW_W1_ADDR(8, 0); +#define LOAD_W_VEC_2_7() \ + eor XTMP6.16b, XTMP3.16b, XTMP2.16b; +#define LOAD_W_VEC_2_8() \ + ext W0.16b, XTMP0.16b, XTMP0.16b, #8; /* W0: xx, w0, xx, xx */ +#define LOAD_W_VEC_3_1() \ + mov W2.16b, XTMP1.16b; /* W2: xx, w6, w5, w4 */ +#define LOAD_W_VEC_3_2() \ + st1 {XTMP1.16b}, [addr1], #16; +#define LOAD_W_VEC_3_3() \ + st1 {XTMP5.16b}, [addr1]; \ + ext W1.16b, XTMP0.16b, XTMP0.16b, #4; /* W1: xx, w3, w2, w1 */ +#define LOAD_W_VEC_3_4() \ + ext W3.16b, XTMP1.16b, XTMP2.16b, #12; /* W3: xx, w9, w8, w7 */ +#define LOAD_W_VEC_3_5() \ + ext W4.16b, XTMP2.16b, XTMP3.16b, #8; /* W4: xx, w12, w11, w10 */ +#define LOAD_W_VEC_3_6() \ + st1 {XTMP2.16b}, [addr0], #16; +#define LOAD_W_VEC_3_7() \ + st1 {XTMP6.16b}, [addr0]; +#define LOAD_W_VEC_3_8() \ + ext W5.16b, XTMP3.16b, XTMP3.16b, #4; /* W5: xx, w15, w14, w13 */ + +#define LOAD_W_VEC_1(iop_num, ...) \ + LOAD_W_VEC_1_##iop_num() +#define LOAD_W_VEC_2(iop_num, ...) \ + LOAD_W_VEC_2_##iop_num() +#define LOAD_W_VEC_3(iop_num, ...) \ + LOAD_W_VEC_3_##iop_num() + +/* Message scheduling. Note: 3 words per vector register. + * Interleaving within round function needed for in-order CPUs. 
*/ +#define SCHED_W_1_1(round, w0, w1, w2, w3, w4, w5) \ + /* Load (w[i - 16]) => XTMP0 */ \ + /* Load (w[i - 13]) => XTMP5 */ \ + ext XTMP0.16b, w0.16b, w0.16b, #12; /* XTMP0: w0, xx, xx, xx */ +#define SCHED_W_1_2(round, w0, w1, w2, w3, w4, w5) \ + ext XTMP5.16b, w1.16b, w1.16b, #12; +#define SCHED_W_1_3(round, w0, w1, w2, w3, w4, w5) \ + ext XTMP0.16b, XTMP0.16b, w1.16b, #12; /* XTMP0: xx, w2, w1, w0 */ +#define SCHED_W_1_4(round, w0, w1, w2, w3, w4, w5) \ + ext XTMP5.16b, XTMP5.16b, w2.16b, #12; +#define SCHED_W_1_5(round, w0, w1, w2, w3, w4, w5) \ + /* w[i - 9] == w3 */ \ + /* W3 ^ XTMP0 => XTMP0 */ \ + eor XTMP0.16b, XTMP0.16b, w3.16b; +#define SCHED_W_1_6(round, w0, w1, w2, w3, w4, w5) \ + /* w[i - 3] == w5 */ \ + /* rol(XMM5, 15) ^ XTMP0 => XTMP0 */ \ + /* rol(XTMP5, 7) => XTMP1 */ \ + add addr0, sp, #XW_W1_ADDR((round), 0); \ + shl XTMP2.4s, w5.4s, #15; +#define SCHED_W_1_7(round, w0, w1, w2, w3, w4, w5) \ + shl XTMP1.4s, XTMP5.4s, #7; +#define SCHED_W_1_8(round, w0, w1, w2, w3, w4, w5) \ + sri XTMP2.4s, w5.4s, #(32-15); +#define SCHED_W_2_1(round, w0, w1, w2, w3, w4, w5) \ + sri XTMP1.4s, XTMP5.4s, #(32-7); +#define SCHED_W_2_2(round, w0, w1, w2, w3, w4, w5) \ + eor XTMP0.16b, XTMP0.16b, XTMP2.16b; +#define SCHED_W_2_3(round, w0, w1, w2, w3, w4, w5) \ + /* w[i - 6] == W4 */ \ + /* W4 ^ XTMP1 => XTMP1 */ \ + eor XTMP1.16b, XTMP1.16b, w4.16b; +#define SCHED_W_2_4(round, w0, w1, w2, w3, w4, w5) \ + /* P1(XTMP0) ^ XTMP1 => W0 */ \ + shl XTMP3.4s, XTMP0.4s, #15; +#define SCHED_W_2_5(round, w0, w1, w2, w3, w4, w5) \ + shl XTMP4.4s, XTMP0.4s, #23; +#define SCHED_W_2_6(round, w0, w1, w2, w3, w4, w5) \ + eor w0.16b, XTMP1.16b, XTMP0.16b; +#define SCHED_W_2_7(round, w0, w1, w2, w3, w4, w5) \ + sri XTMP3.4s, XTMP0.4s, #(32-15); +#define SCHED_W_2_8(round, w0, w1, w2, w3, w4, w5) \ + sri XTMP4.4s, XTMP0.4s, #(32-23); +#define SCHED_W_3_1(round, w0, w1, w2, w3, w4, w5) \ + eor w0.16b, w0.16b, XTMP3.16b; +#define SCHED_W_3_2(round, w0, w1, w2, w3, w4, w5) \ + /* Load (w[i - 3]) => XTMP2 */ \ + ext XTMP2.16b, w4.16b, w4.16b, #12; +#define SCHED_W_3_3(round, w0, w1, w2, w3, w4, w5) \ + eor w0.16b, w0.16b, XTMP4.16b; +#define SCHED_W_3_4(round, w0, w1, w2, w3, w4, w5) \ + ext XTMP2.16b, XTMP2.16b, w5.16b, #12; +#define SCHED_W_3_5(round, w0, w1, w2, w3, w4, w5) \ + /* W1 ^ W2 => XTMP3 */ \ + eor XTMP3.16b, XTMP2.16b, w0.16b; +#define SCHED_W_3_6(round, w0, w1, w2, w3, w4, w5) +#define SCHED_W_3_7(round, w0, w1, w2, w3, w4, w5) \ + st1 {XTMP2.16b-XTMP3.16b}, [addr0]; +#define SCHED_W_3_8(round, w0, w1, w2, w3, w4, w5) + +#define SCHED_W_W0W1W2W3W4W5_1(iop_num, round) \ + SCHED_W_1_##iop_num(round, W0, W1, W2, W3, W4, W5) +#define SCHED_W_W0W1W2W3W4W5_2(iop_num, round) \ + SCHED_W_2_##iop_num(round, W0, W1, W2, W3, W4, W5) +#define SCHED_W_W0W1W2W3W4W5_3(iop_num, round) \ + SCHED_W_3_##iop_num(round, W0, W1, W2, W3, W4, W5) + +#define SCHED_W_W1W2W3W4W5W0_1(iop_num, round) \ + SCHED_W_1_##iop_num(round, W1, W2, W3, W4, W5, W0) +#define SCHED_W_W1W2W3W4W5W0_2(iop_num, round) \ + SCHED_W_2_##iop_num(round, W1, W2, W3, W4, W5, W0) +#define SCHED_W_W1W2W3W4W5W0_3(iop_num, round) \ + SCHED_W_3_##iop_num(round, W1, W2, W3, W4, W5, W0) + +#define SCHED_W_W2W3W4W5W0W1_1(iop_num, round) \ + SCHED_W_1_##iop_num(round, W2, W3, W4, W5, W0, W1) +#define SCHED_W_W2W3W4W5W0W1_2(iop_num, round) \ + SCHED_W_2_##iop_num(round, W2, W3, W4, W5, W0, W1) +#define SCHED_W_W2W3W4W5W0W1_3(iop_num, round) \ + SCHED_W_3_##iop_num(round, W2, W3, W4, W5, W0, W1) + +#define SCHED_W_W3W4W5W0W1W2_1(iop_num, round) \ + 
SCHED_W_1_##iop_num(round, W3, W4, W5, W0, W1, W2) +#define SCHED_W_W3W4W5W0W1W2_2(iop_num, round) \ + SCHED_W_2_##iop_num(round, W3, W4, W5, W0, W1, W2) +#define SCHED_W_W3W4W5W0W1W2_3(iop_num, round) \ + SCHED_W_3_##iop_num(round, W3, W4, W5, W0, W1, W2) + +#define SCHED_W_W4W5W0W1W2W3_1(iop_num, round) \ + SCHED_W_1_##iop_num(round, W4, W5, W0, W1, W2, W3) +#define SCHED_W_W4W5W0W1W2W3_2(iop_num, round) \ + SCHED_W_2_##iop_num(round, W4, W5, W0, W1, W2, W3) +#define SCHED_W_W4W5W0W1W2W3_3(iop_num, round) \ + SCHED_W_3_##iop_num(round, W4, W5, W0, W1, W2, W3) + +#define SCHED_W_W5W0W1W2W3W4_1(iop_num, round) \ + SCHED_W_1_##iop_num(round, W5, W0, W1, W2, W3, W4) +#define SCHED_W_W5W0W1W2W3W4_2(iop_num, round) \ + SCHED_W_2_##iop_num(round, W5, W0, W1, W2, W3, W4) +#define SCHED_W_W5W0W1W2W3W4_3(iop_num, round) \ + SCHED_W_3_##iop_num(round, W5, W0, W1, W2, W3, W4) + + + /* + * Transform blocks*64 bytes (blocks*16 32-bit words) at 'src'. + * + * void sm3_neon_transform(struct sm3_state *sst, u8 const *src, + * int blocks) + */ + .text +.align 3 +SYM_FUNC_START(sm3_neon_transform) + ldp ra, rb, [RSTATE, #0] + ldp rc, rd, [RSTATE, #8] + ldp re, rf, [RSTATE, #16] + ldp rg, rh, [RSTATE, #24] + + stp x28, x29, [sp, #-16]! + stp x19, x20, [sp, #-16]! + stp x21, x22, [sp, #-16]! + stp x23, x24, [sp, #-16]! + stp x25, x26, [sp, #-16]! + mov RFRAME, sp + + sub addr0, sp, #STACK_SIZE + adr_l RKPTR, .LKtable + and sp, addr0, #(~63) + + /* Preload first block. */ + LOAD_W_VEC_1(1, 0) + LOAD_W_VEC_1(2, 0) + LOAD_W_VEC_1(3, 0) + LOAD_W_VEC_1(4, 0) + LOAD_W_VEC_1(5, 0) + LOAD_W_VEC_1(6, 0) + LOAD_W_VEC_1(7, 0) + LOAD_W_VEC_1(8, 0) + LOAD_W_VEC_2(1, 0) + LOAD_W_VEC_2(2, 0) + LOAD_W_VEC_2(3, 0) + LOAD_W_VEC_2(4, 0) + LOAD_W_VEC_2(5, 0) + LOAD_W_VEC_2(6, 0) + LOAD_W_VEC_2(7, 0) + LOAD_W_VEC_2(8, 0) + LOAD_W_VEC_3(1, 0) + LOAD_W_VEC_3(2, 0) + LOAD_W_VEC_3(3, 0) + LOAD_W_VEC_3(4, 0) + LOAD_W_VEC_3(5, 0) + LOAD_W_VEC_3(6, 0) + LOAD_W_VEC_3(7, 0) + LOAD_W_VEC_3(8, 0) + +.balign 16 +.Loop: + /* Transform 0-3 */ + R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 0, 0, IW, _, 0) + R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 1, 1, IW, _, 0) + R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 2, 2, IW, _, 0) + R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 3, 3, IW, _, 0) + + /* Transform 4-7 + Precalc 12-14 */ + R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 4, 0, IW, _, 0) + R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 5, 1, IW, _, 0) + R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 6, 2, IW, SCHED_W_W0W1W2W3W4W5_1, 12) + R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 7, 3, IW, SCHED_W_W0W1W2W3W4W5_2, 12) + + /* Transform 8-11 + Precalc 12-17 */ + R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 8, 0, IW, SCHED_W_W0W1W2W3W4W5_3, 12) + R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 9, 1, IW, SCHED_W_W1W2W3W4W5W0_1, 15) + R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 10, 2, IW, SCHED_W_W1W2W3W4W5W0_2, 15) + R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 11, 3, IW, SCHED_W_W1W2W3W4W5W0_3, 15) + + /* Transform 12-14 + Precalc 18-20 */ + R1(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 12, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 18) + R1(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 13, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 18) + R1(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 14, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 18) + + /* Transform 15-17 + Precalc 21-23 */ + R1(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 15, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 21) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 16, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 21) + R2(rd, ra, rb, rc, rh, 
re, rf, rg, k_odd, _, 17, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 21) + + /* Transform 18-20 + Precalc 24-26 */ + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 18, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 24) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 19, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 24) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 20, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 24) + + /* Transform 21-23 + Precalc 27-29 */ + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 21, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 27) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 22, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 27) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 23, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 27) + + /* Transform 24-26 + Precalc 30-32 */ + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 24, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 30) + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 25, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 30) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 26, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 30) + + /* Transform 27-29 + Precalc 33-35 */ + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 27, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 33) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 28, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 33) + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 29, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 33) + + /* Transform 30-32 + Precalc 36-38 */ + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 30, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 36) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 31, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 36) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 32, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 36) + + /* Transform 33-35 + Precalc 39-41 */ + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 33, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 39) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 34, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 39) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 35, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 39) + + /* Transform 36-38 + Precalc 42-44 */ + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 36, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 42) + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 37, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 42) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 38, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 42) + + /* Transform 39-41 + Precalc 45-47 */ + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 39, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 45) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 40, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 45) + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 41, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 45) + + /* Transform 42-44 + Precalc 48-50 */ + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 42, 0, XW, SCHED_W_W0W1W2W3W4W5_1, 48) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 43, 1, XW, SCHED_W_W0W1W2W3W4W5_2, 48) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 44, 2, XW, SCHED_W_W0W1W2W3W4W5_3, 48) + + /* Transform 45-47 + Precalc 51-53 */ + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 45, 0, XW, SCHED_W_W1W2W3W4W5W0_1, 51) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 46, 1, XW, SCHED_W_W1W2W3W4W5W0_2, 51) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 47, 2, XW, SCHED_W_W1W2W3W4W5W0_3, 51) + + /* Transform 48-50 + Precalc 54-56 */ + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 48, 0, XW, SCHED_W_W2W3W4W5W0W1_1, 54) + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 49, 1, XW, SCHED_W_W2W3W4W5W0W1_2, 54) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 50, 2, XW, SCHED_W_W2W3W4W5W0W1_3, 54) + + /* Transform 51-53 + Precalc 57-59 */ + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 51, 0, XW, SCHED_W_W3W4W5W0W1W2_1, 57) + R2(ra, rb, rc, rd, re, rf, 
rg, rh, k_even, KL, 52, 1, XW, SCHED_W_W3W4W5W0W1W2_2, 57) + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 53, 2, XW, SCHED_W_W3W4W5W0W1W2_3, 57) + + /* Transform 54-56 + Precalc 60-62 */ + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 54, 0, XW, SCHED_W_W4W5W0W1W2W3_1, 60) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 55, 1, XW, SCHED_W_W4W5W0W1W2W3_2, 60) + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 56, 2, XW, SCHED_W_W4W5W0W1W2W3_3, 60) + + /* Transform 57-59 + Precalc 63 */ + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 57, 0, XW, SCHED_W_W5W0W1W2W3W4_1, 63) + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 58, 1, XW, SCHED_W_W5W0W1W2W3W4_2, 63) + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 59, 2, XW, SCHED_W_W5W0W1W2W3W4_3, 63) + + /* Transform 60 */ + R2(ra, rb, rc, rd, re, rf, rg, rh, k_even, KL, 60, 0, XW, _, _) + subs RNBLKS, RNBLKS, #1 + b.eq .Lend + + /* Transform 61-63 + Preload next block */ + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 61, 1, XW, LOAD_W_VEC_1, _) + ldp s0, s1, [RSTATE, #0] + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, LOAD_W_VEC_2, _) + ldp s2, s3, [RSTATE, #8] + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 63, 0, XW, LOAD_W_VEC_3, _) + + /* Update the chaining variables. */ + eor ra, ra, s0 + eor rb, rb, s1 + ldp s0, s1, [RSTATE, #16] + eor rc, rc, s2 + ldp k_even, k_odd, [RSTATE, #24] + eor rd, rd, s3 + eor re, re, s0 + stp ra, rb, [RSTATE, #0] + eor rf, rf, s1 + stp rc, rd, [RSTATE, #8] + eor rg, rg, k_even + stp re, rf, [RSTATE, #16] + eor rh, rh, k_odd + stp rg, rh, [RSTATE, #24] + b .Loop + +.Lend: + /* Transform 61-63 */ + R2(rd, ra, rb, rc, rh, re, rf, rg, k_odd, _, 61, 1, XW, _, _) + ldp s0, s1, [RSTATE, #0] + R2(rc, rd, ra, rb, rg, rh, re, rf, k_even, KL, 62, 2, XW, _, _) + ldp s2, s3, [RSTATE, #8] + R2(rb, rc, rd, ra, rf, rg, rh, re, k_odd, _, 63, 0, XW, _, _) + + /* Update the chaining variables. 
*/ + eor ra, ra, s0 + clear_vec(W0) + eor rb, rb, s1 + clear_vec(W1) + ldp s0, s1, [RSTATE, #16] + clear_vec(W2) + eor rc, rc, s2 + clear_vec(W3) + ldp k_even, k_odd, [RSTATE, #24] + clear_vec(W4) + eor rd, rd, s3 + clear_vec(W5) + eor re, re, s0 + clear_vec(XTMP0) + stp ra, rb, [RSTATE, #0] + clear_vec(XTMP1) + eor rf, rf, s1 + clear_vec(XTMP2) + stp rc, rd, [RSTATE, #8] + clear_vec(XTMP3) + eor rg, rg, k_even + clear_vec(XTMP4) + stp re, rf, [RSTATE, #16] + clear_vec(XTMP5) + eor rh, rh, k_odd + clear_vec(XTMP6) + stp rg, rh, [RSTATE, #24] + + /* Clear message expansion area */ + add addr0, sp, #STACK_W + st1 {W0.16b-W3.16b}, [addr0], #64 + st1 {W0.16b-W3.16b}, [addr0], #64 + st1 {W0.16b-W3.16b}, [addr0] + + mov sp, RFRAME + + ldp x25, x26, [sp], #16 + ldp x23, x24, [sp], #16 + ldp x21, x22, [sp], #16 + ldp x19, x20, [sp], #16 + ldp x28, x29, [sp], #16 + + ret +SYM_FUNC_END(sm3_neon_transform) + + + .section ".rodata", "a" + + .align 4 +.LKtable: + .long 0x79cc4519, 0xf3988a32, 0xe7311465, 0xce6228cb + .long 0x9cc45197, 0x3988a32f, 0x7311465e, 0xe6228cbc + .long 0xcc451979, 0x988a32f3, 0x311465e7, 0x6228cbce + .long 0xc451979c, 0x88a32f39, 0x11465e73, 0x228cbce6 + .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c + .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce + .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec + .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5 + .long 0x7a879d8a, 0xf50f3b14, 0xea1e7629, 0xd43cec53 + .long 0xa879d8a7, 0x50f3b14f, 0xa1e7629e, 0x43cec53d + .long 0x879d8a7a, 0x0f3b14f5, 0x1e7629ea, 0x3cec53d4 + .long 0x79d8a7a8, 0xf3b14f50, 0xe7629ea1, 0xcec53d43 + .long 0x9d8a7a87, 0x3b14f50f, 0x7629ea1e, 0xec53d43c + .long 0xd8a7a879, 0xb14f50f3, 0x629ea1e7, 0xc53d43ce + .long 0x8a7a879d, 0x14f50f3b, 0x29ea1e76, 0x53d43cec + .long 0xa7a879d8, 0x4f50f3b1, 0x9ea1e762, 0x3d43cec5 diff --git a/arch/arm64/crypto/sm3-neon-glue.c b/arch/arm64/crypto/sm3-neon-glue.c new file mode 100644 index 000000000000..7182ee683f14 --- /dev/null +++ b/arch/arm64/crypto/sm3-neon-glue.c @@ -0,0 +1,103 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * sm3-neon-glue.c - SM3 secure hash using NEON instructions + * + * Copyright (C) 2022 Tianjia Zhang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +asmlinkage void sm3_neon_transform(struct sm3_state *sst, u8 const *src, + int blocks); + +static int sm3_neon_update(struct shash_desc *desc, const u8 *data, + unsigned int len) +{ + if (!crypto_simd_usable()) { + sm3_update(shash_desc_ctx(desc), data, len); + return 0; + } + + kernel_neon_begin(); + sm3_base_do_update(desc, data, len, sm3_neon_transform); + kernel_neon_end(); + + return 0; +} + +static int sm3_neon_final(struct shash_desc *desc, u8 *out) +{ + if (!crypto_simd_usable()) { + sm3_final(shash_desc_ctx(desc), out); + return 0; + } + + kernel_neon_begin(); + sm3_base_do_finalize(desc, sm3_neon_transform); + kernel_neon_end(); + + return sm3_base_finish(desc, out); +} + +static int sm3_neon_finup(struct shash_desc *desc, const u8 *data, + unsigned int len, u8 *out) +{ + if (!crypto_simd_usable()) { + struct sm3_state *sctx = shash_desc_ctx(desc); + + if (len) + sm3_update(sctx, data, len); + sm3_final(sctx, out); + return 0; + } + + kernel_neon_begin(); + if (len) + sm3_base_do_update(desc, data, len, sm3_neon_transform); + sm3_base_do_finalize(desc, sm3_neon_transform); + kernel_neon_end(); + + return sm3_base_finish(desc, out); +} + +static struct shash_alg sm3_alg = { + .digestsize = SM3_DIGEST_SIZE, + 
.init = sm3_base_init, + .update = sm3_neon_update, + .final = sm3_neon_final, + .finup = sm3_neon_finup, + .descsize = sizeof(struct sm3_state), + .base.cra_name = "sm3", + .base.cra_driver_name = "sm3-neon", + .base.cra_blocksize = SM3_BLOCK_SIZE, + .base.cra_module = THIS_MODULE, + .base.cra_priority = 200, +}; + +static int __init sm3_neon_init(void) +{ + return crypto_register_shash(&sm3_alg); +} + +static void __exit sm3_neon_fini(void) +{ + crypto_unregister_shash(&sm3_alg); +} + +module_init(sm3_neon_init); +module_exit(sm3_neon_fini); + +MODULE_DESCRIPTION("SM3 secure hash using NEON instructions"); +MODULE_AUTHOR("Jussi Kivilinna "); +MODULE_AUTHOR("Tianjia Zhang "); +MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988476 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D97C4C07E9D for ; Mon, 26 Sep 2022 09:39:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=qBU1RRgbVhteWana7alu9hPQbW6ULBNtsQadlQqZUC4=; b=GmcFmX0rsFV8Pq 39Lwb1sH6vdpT9KCV74bpQGYvWW11c9dTnIo/0cCfkABjarL8g9+ZWdkBfk6+h1ODKmHjaoC+yvkS JYEpdGKww/HJmin4FiFTCQZJbWMA9WmO0ZGocBoOY+tSQAF0k+8dx+BY34wvXrqGPbUQuAZ6QNgVV VCvabt+PtJ7RzIRpacex07NVh/9XS/ZwWDn4anPLn0bzRHuJD4Xd5aG5qR0D9UK/Z+I4X16uSsRnQ EtgMKPLSJVEmnJ269XX2q7Y15UBetYLO+0GC/ijGi7BUJZ+/mIsXiK2Y8TJRlInMesFXq0SbpoNma 03CD7xv8Cz36hQx56SBQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockYz-003hSd-BI; Mon, 26 Sep 2022 09:38:09 +0000 Received: from out30-44.freemail.mail.aliyun.com ([115.124.30.44]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXT-003ghg-Du for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:41 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzD3_1664184990; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzD3_1664184990) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:31 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. 
Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 03/16] crypto: arm64/sm4 - refactor and simplify NEON implementation Date: Mon, 26 Sep 2022 17:36:07 +0800 Message-Id: <20220926093620.99898-4-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023635_898025_0692C072 X-CRM114-Status: GOOD ( 11.60 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch does not add new features. The main work is to refactor and simplify the implementation of SM4 NEON, which is reflected in the following aspects: The accelerated implementation supports the arbitrary number of blocks, not just multiples of 8, which simplifies the implementation and brings some optimization acceleration for data that is not aligned by 8 blocks. When loading the input data, use the ld4 instruction to replace the original ld1 instruction as much as possible, which will save the cost of matrix transposition of the input data. Use 8-block parallelism whenever possible to speed up matrix transpose and rotation operations, instead of up to 4-block parallelism. Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-neon-core.S | 630 +++++++++++++++++++----------- arch/arm64/crypto/sm4-neon-glue.c | 172 +++----- 2 files changed, 456 insertions(+), 346 deletions(-) diff --git a/arch/arm64/crypto/sm4-neon-core.S b/arch/arm64/crypto/sm4-neon-core.S index 3d5256b354d2..f295b4b7d70a 100644 --- a/arch/arm64/crypto/sm4-neon-core.S +++ b/arch/arm64/crypto/sm4-neon-core.S @@ -18,6 +18,11 @@ #define RTMP2 v10 #define RTMP3 v11 +#define RTMP4 v12 +#define RTMP5 v13 +#define RTMP6 v14 +#define RTMP7 v15 + #define RX0 v12 #define RX1 v13 #define RKEY v14 @@ -25,7 +30,7 @@ /* Helper macros. 
*/ -#define PREPARE \ +#define SM4_PREPARE() \ adr_l x5, crypto_sm4_sbox; \ ld1 {v16.16b-v19.16b}, [x5], #64; \ ld1 {v20.16b-v23.16b}, [x5], #64; \ @@ -42,7 +47,25 @@ zip1 s2.2d, RTMP2.2d, RTMP3.2d; \ zip2 s3.2d, RTMP2.2d, RTMP3.2d; -#define rotate_clockwise_90(s0, s1, s2, s3) \ +#define transpose_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7) \ + zip1 RTMP0.4s, s0.4s, s1.4s; \ + zip1 RTMP1.4s, s2.4s, s3.4s; \ + zip2 RTMP2.4s, s0.4s, s1.4s; \ + zip2 RTMP3.4s, s2.4s, s3.4s; \ + zip1 RTMP4.4s, s4.4s, s5.4s; \ + zip1 RTMP5.4s, s6.4s, s7.4s; \ + zip2 RTMP6.4s, s4.4s, s5.4s; \ + zip2 RTMP7.4s, s6.4s, s7.4s; \ + zip1 s0.2d, RTMP0.2d, RTMP1.2d; \ + zip2 s1.2d, RTMP0.2d, RTMP1.2d; \ + zip1 s2.2d, RTMP2.2d, RTMP3.2d; \ + zip2 s3.2d, RTMP2.2d, RTMP3.2d; \ + zip1 s4.2d, RTMP4.2d, RTMP5.2d; \ + zip2 s5.2d, RTMP4.2d, RTMP5.2d; \ + zip1 s6.2d, RTMP6.2d, RTMP7.2d; \ + zip2 s7.2d, RTMP6.2d, RTMP7.2d; + +#define rotate_clockwise_4x4(s0, s1, s2, s3) \ zip1 RTMP0.4s, s1.4s, s0.4s; \ zip2 RTMP1.4s, s1.4s, s0.4s; \ zip1 RTMP2.4s, s3.4s, s2.4s; \ @@ -52,6 +75,24 @@ zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ zip2 s3.2d, RTMP3.2d, RTMP1.2d; +#define rotate_clockwise_4x4_2x(s0, s1, s2, s3, s4, s5, s6, s7) \ + zip1 RTMP0.4s, s1.4s, s0.4s; \ + zip1 RTMP2.4s, s3.4s, s2.4s; \ + zip2 RTMP1.4s, s1.4s, s0.4s; \ + zip2 RTMP3.4s, s3.4s, s2.4s; \ + zip1 RTMP4.4s, s5.4s, s4.4s; \ + zip1 RTMP6.4s, s7.4s, s6.4s; \ + zip2 RTMP5.4s, s5.4s, s4.4s; \ + zip2 RTMP7.4s, s7.4s, s6.4s; \ + zip1 s0.2d, RTMP2.2d, RTMP0.2d; \ + zip2 s1.2d, RTMP2.2d, RTMP0.2d; \ + zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ + zip2 s3.2d, RTMP3.2d, RTMP1.2d; \ + zip1 s4.2d, RTMP6.2d, RTMP4.2d; \ + zip2 s5.2d, RTMP6.2d, RTMP4.2d; \ + zip1 s6.2d, RTMP7.2d, RTMP5.2d; \ + zip2 s7.2d, RTMP7.2d, RTMP5.2d; + #define ROUND4(round, s0, s1, s2, s3) \ dup RX0.4s, RKEY.s[round]; \ /* rk ^ s1 ^ s2 ^ s3 */ \ @@ -87,14 +128,7 @@ /* s0 ^= RTMP3 */ \ eor s0.16b, s0.16b, RTMP3.16b; -#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \ - rev32 b0.16b, b0.16b; \ - rev32 b1.16b, b1.16b; \ - rev32 b2.16b, b2.16b; \ - rev32 b3.16b, b3.16b; \ - \ - transpose_4x4(b0, b1, b2, b3); \ - \ +#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3) \ mov x6, 8; \ 4: \ ld1 {RKEY.4s}, [x0], #16; \ @@ -107,15 +141,23 @@ \ bne 4b; \ \ - rotate_clockwise_90(b0, b1, b2, b3); \ rev32 b0.16b, b0.16b; \ rev32 b1.16b, b1.16b; \ rev32 b2.16b, b2.16b; \ rev32 b3.16b, b3.16b; \ \ + rotate_clockwise_4x4(b0, b1, b2, b3); \ + \ /* repoint to rkey */ \ sub x0, x0, #128; +#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + SM4_CRYPT_BLK4_BE(b0, b1, b2, b3); + #define ROUND8(round, s0, s1, s2, s3, t0, t1, t2, t3) \ /* rk ^ s1 ^ s2 ^ s3 */ \ dup RX0.4s, RKEY.s[round]; \ @@ -175,7 +217,7 @@ eor s0.16b, s0.16b, RTMP0.16b; \ eor t0.16b, t0.16b, RTMP1.16b; -#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \ +#define SM4_CRYPT_BLK8_norotate(b0, b1, b2, b3, b4, b5, b6, b7) \ rev32 b0.16b, b0.16b; \ rev32 b1.16b, b1.16b; \ rev32 b2.16b, b2.16b; \ @@ -185,9 +227,6 @@ rev32 b6.16b, b6.16b; \ rev32 b7.16b, b7.16b; \ \ - transpose_4x4(b0, b1, b2, b3); \ - transpose_4x4(b4, b5, b6, b7); \ - \ mov x6, 8; \ 8: \ ld1 {RKEY.4s}, [x0], #16; \ @@ -200,8 +239,6 @@ \ bne 8b; \ \ - rotate_clockwise_90(b0, b1, b2, b3); \ - rotate_clockwise_90(b4, b5, b6, b7); \ rev32 b0.16b, b0.16b; \ rev32 b1.16b, b1.16b; \ rev32 b2.16b, b2.16b; \ @@ -214,274 +251,429 @@ /* repoint to rkey */ \ sub x0, x0, #128; +#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \ + SM4_CRYPT_BLK8_norotate(b0, 
b1, b2, b3, b4, b5, b6, b7); \ + rotate_clockwise_4x4_2x(b0, b1, b2, b3, b4, b5, b6, b7); \ -.align 3 -SYM_FUNC_START_LOCAL(__sm4_neon_crypt_blk1_4) - /* input: - * x0: round key array, CTX - * x1: dst - * x2: src - * w3: num blocks (1..4) - */ - PREPARE; - - ld1 {v0.16b}, [x2], #16; - mov v1.16b, v0.16b; - mov v2.16b, v0.16b; - mov v3.16b, v0.16b; - cmp w3, #2; - blt .Lblk4_load_input_done; - ld1 {v1.16b}, [x2], #16; - beq .Lblk4_load_input_done; - ld1 {v2.16b}, [x2], #16; - cmp w3, #3; - beq .Lblk4_load_input_done; - ld1 {v3.16b}, [x2]; - -.Lblk4_load_input_done: - SM4_CRYPT_BLK4(v0, v1, v2, v3); - - st1 {v0.16b}, [x1], #16; - cmp w3, #2; - blt .Lblk4_store_output_done; - st1 {v1.16b}, [x1], #16; - beq .Lblk4_store_output_done; - st1 {v2.16b}, [x1], #16; - cmp w3, #3; - beq .Lblk4_store_output_done; - st1 {v3.16b}, [x1]; - -.Lblk4_store_output_done: - ret; -SYM_FUNC_END(__sm4_neon_crypt_blk1_4) .align 3 -SYM_FUNC_START(sm4_neon_crypt_blk1_8) +SYM_FUNC_START(sm4_neon_crypt) /* input: * x0: round key array, CTX * x1: dst * x2: src - * w3: num blocks (1..8) + * w3: nblocks */ - cmp w3, #5; - blt __sm4_neon_crypt_blk1_4; - - PREPARE; - - ld1 {v0.16b-v3.16b}, [x2], #64; - ld1 {v4.16b}, [x2], #16; - mov v5.16b, v4.16b; - mov v6.16b, v4.16b; - mov v7.16b, v4.16b; - beq .Lblk8_load_input_done; - ld1 {v5.16b}, [x2], #16; - cmp w3, #7; - blt .Lblk8_load_input_done; - ld1 {v6.16b}, [x2], #16; - beq .Lblk8_load_input_done; - ld1 {v7.16b}, [x2]; - -.Lblk8_load_input_done: - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); - - cmp w3, #6; - st1 {v0.16b-v3.16b}, [x1], #64; - st1 {v4.16b}, [x1], #16; - blt .Lblk8_store_output_done; - st1 {v5.16b}, [x1], #16; - beq .Lblk8_store_output_done; - st1 {v6.16b}, [x1], #16; - cmp w3, #7; - beq .Lblk8_store_output_done; - st1 {v7.16b}, [x1]; - -.Lblk8_store_output_done: - ret; -SYM_FUNC_END(sm4_neon_crypt_blk1_8) + SM4_PREPARE() -.align 3 -SYM_FUNC_START(sm4_neon_crypt_blk8) - /* input: - * x0: round key array, CTX - * x1: dst - * x2: src - * w3: nblocks (multiples of 8) - */ - PREPARE; +.Lcrypt_loop_8x: + sub w3, w3, #8 + tbnz w3, #31, .Lcrypt_4x + + ld4 {v0.4s-v3.4s}, [x2], #64 + ld4 {v4.4s-v7.4s}, [x2], #64 -.Lcrypt_loop_blk: - subs w3, w3, #8; - bmi .Lcrypt_end; + SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7) - ld1 {v0.16b-v3.16b}, [x2], #64; - ld1 {v4.16b-v7.16b}, [x2], #64; + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); + cbz w3, .Lcrypt_end + b .Lcrypt_loop_8x - st1 {v0.16b-v3.16b}, [x1], #64; - st1 {v4.16b-v7.16b}, [x1], #64; +.Lcrypt_4x: + add w3, w3, #8 + cmp w3, #4 + blt .Lcrypt_tail - b .Lcrypt_loop_blk; + sub w3, w3, #4 + + ld4 {v0.4s-v3.4s}, [x2], #64 + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + st1 {v0.16b-v3.16b}, [x1], #64 + + cbz w3, .Lcrypt_end + +.Lcrypt_tail: + cmp w3, #2 + ld1 {v0.16b}, [x2], #16 + blt .Lcrypt_tail_load_done + ld1 {v1.16b}, [x2], #16 + beq .Lcrypt_tail_load_done + ld1 {v2.16b}, [x2], #16 + +.Lcrypt_tail_load_done: + transpose_4x4(v0, v1, v2, v3) + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + cmp w3, #2 + st1 {v0.16b}, [x1], #16 + blt .Lcrypt_end + st1 {v1.16b}, [x1], #16 + beq .Lcrypt_end + st1 {v2.16b}, [x1], #16 .Lcrypt_end: - ret; -SYM_FUNC_END(sm4_neon_crypt_blk8) + ret +SYM_FUNC_END(sm4_neon_crypt) .align 3 -SYM_FUNC_START(sm4_neon_cbc_dec_blk8) +SYM_FUNC_START(sm4_neon_cbc_dec) /* input: * x0: round key array, CTX * x1: dst * x2: src * x3: iv (big endian, 128 bit) - * w4: nblocks (multiples of 8) + * w4: nblocks */ - PREPARE; + SM4_PREPARE() + + ld1 {RIV.16b}, 
[x3] + +.Lcbc_dec_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lcbc_dec_4x + + ld4 {v0.4s-v3.4s}, [x2], #64 + ld4 {v4.4s-v7.4s}, [x2] + + SM4_CRYPT_BLK8_norotate(v0, v1, v2, v3, v4, v5, v6, v7) + + /* Avoid overwriting the RIV register */ + rotate_clockwise_4x4(v0, v1, v2, v3) + rotate_clockwise_4x4(v4, v5, v6, v7) + + sub x2, x2, #64 + + eor v0.16b, v0.16b, RIV.16b - ld1 {RIV.16b}, [x3]; + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64 + ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64 -.Lcbc_loop_blk: - subs w4, w4, #8; - bmi .Lcbc_end; + eor v1.16b, v1.16b, RTMP0.16b + eor v2.16b, v2.16b, RTMP1.16b + eor v3.16b, v3.16b, RTMP2.16b + eor v4.16b, v4.16b, RTMP3.16b + eor v5.16b, v5.16b, RTMP4.16b + eor v6.16b, v6.16b, RTMP5.16b + eor v7.16b, v7.16b, RTMP6.16b - ld1 {v0.16b-v3.16b}, [x2], #64; - ld1 {v4.16b-v7.16b}, [x2]; + mov RIV.16b, RTMP7.16b - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 - sub x2, x2, #64; - eor v0.16b, v0.16b, RIV.16b; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v1.16b, v1.16b, RTMP0.16b; - eor v2.16b, v2.16b, RTMP1.16b; - eor v3.16b, v3.16b, RTMP2.16b; - st1 {v0.16b-v3.16b}, [x1], #64; + cbz w4, .Lcbc_dec_end + b .Lcbc_dec_loop_8x - eor v4.16b, v4.16b, RTMP3.16b; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v5.16b, v5.16b, RTMP0.16b; - eor v6.16b, v6.16b, RTMP1.16b; - eor v7.16b, v7.16b, RTMP2.16b; +.Lcbc_dec_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lcbc_dec_tail - mov RIV.16b, RTMP3.16b; - st1 {v4.16b-v7.16b}, [x1], #64; + sub w4, w4, #4 - b .Lcbc_loop_blk; + ld1 {v0.16b-v3.16b}, [x2], #64 -.Lcbc_end: + rev32 v4.16b, v0.16b + rev32 v5.16b, v1.16b + rev32 v6.16b, v2.16b + rev32 v7.16b, v3.16b + + transpose_4x4(v4, v5, v6, v7) + + SM4_CRYPT_BLK4_BE(v4, v5, v6, v7) + + eor v4.16b, v4.16b, RIV.16b + eor v5.16b, v5.16b, v0.16b + eor v6.16b, v6.16b, v1.16b + eor v7.16b, v7.16b, v2.16b + + mov RIV.16b, v3.16b + + st1 {v4.16b-v7.16b}, [x1], #64 + + cbz w4, .Lcbc_dec_end + +.Lcbc_dec_tail: + cmp w4, #2 + ld1 {v0.16b}, [x2], #16 + blt .Lcbc_dec_tail_load_done + ld1 {v1.16b}, [x2], #16 + beq .Lcbc_dec_tail_load_done + ld1 {v2.16b}, [x2], #16 + +.Lcbc_dec_tail_load_done: + rev32 v4.16b, v0.16b + rev32 v5.16b, v1.16b + rev32 v6.16b, v2.16b + + transpose_4x4(v4, v5, v6, v7) + + SM4_CRYPT_BLK4_BE(v4, v5, v6, v7) + + cmp w4, #2 + eor v4.16b, v4.16b, RIV.16b + mov RIV.16b, v0.16b + st1 {v4.16b}, [x1], #16 + blt .Lcbc_dec_end + + eor v5.16b, v5.16b, v0.16b + mov RIV.16b, v1.16b + st1 {v5.16b}, [x1], #16 + beq .Lcbc_dec_end + + eor v6.16b, v6.16b, v1.16b + mov RIV.16b, v2.16b + st1 {v6.16b}, [x1], #16 + +.Lcbc_dec_end: /* store new IV */ - st1 {RIV.16b}, [x3]; + st1 {RIV.16b}, [x3] - ret; -SYM_FUNC_END(sm4_neon_cbc_dec_blk8) + ret +SYM_FUNC_END(sm4_neon_cbc_dec) .align 3 -SYM_FUNC_START(sm4_neon_cfb_dec_blk8) +SYM_FUNC_START(sm4_neon_cfb_dec) /* input: * x0: round key array, CTX * x1: dst * x2: src * x3: iv (big endian, 128 bit) - * w4: nblocks (multiples of 8) + * w4: nblocks */ - PREPARE; + SM4_PREPARE() + + ld1 {v0.16b}, [x3] + +.Lcfb_dec_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lcfb_dec_4x + + ld1 {v1.16b-v3.16b}, [x2], #48 + ld4 {v4.4s-v7.4s}, [x2] + + transpose_4x4(v0, v1, v2, v3) + + SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7) + + sub x2, x2, #48 + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64 + ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64 + + eor v0.16b, v0.16b, RTMP0.16b + eor v1.16b, v1.16b, RTMP1.16b + eor v2.16b, v2.16b, RTMP2.16b + eor v3.16b, v3.16b, RTMP3.16b + eor v4.16b, v4.16b, RTMP4.16b + eor v5.16b, v5.16b, RTMP5.16b + eor v6.16b, 
v6.16b, RTMP6.16b + eor v7.16b, v7.16b, RTMP7.16b + + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 + + mov v0.16b, RTMP7.16b + + cbz w4, .Lcfb_dec_end + b .Lcfb_dec_loop_8x + +.Lcfb_dec_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lcfb_dec_tail + + sub w4, w4, #4 + + ld1 {v4.16b-v7.16b}, [x2], #64 + + rev32 v0.16b, v0.16b /* v0 is IV register */ + rev32 v1.16b, v4.16b + rev32 v2.16b, v5.16b + rev32 v3.16b, v6.16b + + transpose_4x4(v0, v1, v2, v3) + + SM4_CRYPT_BLK4_BE(v0, v1, v2, v3) - ld1 {v0.16b}, [x3]; + eor v0.16b, v0.16b, v4.16b + eor v1.16b, v1.16b, v5.16b + eor v2.16b, v2.16b, v6.16b + eor v3.16b, v3.16b, v7.16b -.Lcfb_loop_blk: - subs w4, w4, #8; - bmi .Lcfb_end; + st1 {v0.16b-v3.16b}, [x1], #64 - ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; - ld1 {v4.16b-v7.16b}, [x2]; + mov v0.16b, v7.16b - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); + cbz w4, .Lcfb_dec_end - sub x2, x2, #48; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v0.16b, v0.16b, RTMP0.16b; - eor v1.16b, v1.16b, RTMP1.16b; - eor v2.16b, v2.16b, RTMP2.16b; - eor v3.16b, v3.16b, RTMP3.16b; - st1 {v0.16b-v3.16b}, [x1], #64; +.Lcfb_dec_tail: + cmp w4, #2 + ld1 {v4.16b}, [x2], #16 + blt .Lcfb_dec_tail_load_done + ld1 {v5.16b}, [x2], #16 + beq .Lcfb_dec_tail_load_done + ld1 {v6.16b}, [x2], #16 - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v4.16b, v4.16b, RTMP0.16b; - eor v5.16b, v5.16b, RTMP1.16b; - eor v6.16b, v6.16b, RTMP2.16b; - eor v7.16b, v7.16b, RTMP3.16b; - st1 {v4.16b-v7.16b}, [x1], #64; +.Lcfb_dec_tail_load_done: + rev32 v0.16b, v0.16b /* v0 is IV register */ + rev32 v1.16b, v4.16b + rev32 v2.16b, v5.16b - mov v0.16b, RTMP3.16b; + transpose_4x4(v0, v1, v2, v3) - b .Lcfb_loop_blk; + SM4_CRYPT_BLK4_BE(v0, v1, v2, v3) -.Lcfb_end: + cmp w4, #2 + eor v0.16b, v0.16b, v4.16b + st1 {v0.16b}, [x1], #16 + mov v0.16b, v4.16b + blt .Lcfb_dec_end + + eor v1.16b, v1.16b, v5.16b + st1 {v1.16b}, [x1], #16 + mov v0.16b, v5.16b + beq .Lcfb_dec_end + + eor v2.16b, v2.16b, v6.16b + st1 {v2.16b}, [x1], #16 + mov v0.16b, v6.16b + +.Lcfb_dec_end: /* store new IV */ - st1 {v0.16b}, [x3]; + st1 {v0.16b}, [x3] - ret; -SYM_FUNC_END(sm4_neon_cfb_dec_blk8) + ret +SYM_FUNC_END(sm4_neon_cfb_dec) .align 3 -SYM_FUNC_START(sm4_neon_ctr_enc_blk8) +SYM_FUNC_START(sm4_neon_ctr_crypt) /* input: * x0: round key array, CTX * x1: dst * x2: src * x3: ctr (big endian, 128 bit) - * w4: nblocks (multiples of 8) + * w4: nblocks */ - PREPARE; + SM4_PREPARE() - ldp x7, x8, [x3]; - rev x7, x7; - rev x8, x8; + ldp x7, x8, [x3] + rev x7, x7 + rev x8, x8 -.Lctr_loop_blk: - subs w4, w4, #8; - bmi .Lctr_end; +.Lctr_crypt_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lctr_crypt_4x -#define inc_le128(vctr) \ - mov vctr.d[1], x8; \ - mov vctr.d[0], x7; \ - adds x8, x8, #1; \ - adc x7, x7, xzr; \ - rev64 vctr.16b, vctr.16b; +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + rev64 vctr.16b, vctr.16b; \ + adc x7, x7, xzr; /* construct CTRs */ - inc_le128(v0); /* +0 */ - inc_le128(v1); /* +1 */ - inc_le128(v2); /* +2 */ - inc_le128(v3); /* +3 */ - inc_le128(v4); /* +4 */ - inc_le128(v5); /* +5 */ - inc_le128(v6); /* +6 */ - inc_le128(v7); /* +7 */ - - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); - - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v0.16b, v0.16b, RTMP0.16b; - eor v1.16b, v1.16b, RTMP1.16b; - eor v2.16b, v2.16b, RTMP2.16b; - eor v3.16b, v3.16b, RTMP3.16b; - st1 {v0.16b-v3.16b}, [x1], #64; - - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v4.16b, v4.16b, RTMP0.16b; - eor v5.16b, v5.16b, RTMP1.16b; - eor v6.16b, 
v6.16b, RTMP2.16b; - eor v7.16b, v7.16b, RTMP3.16b; - st1 {v4.16b-v7.16b}, [x1], #64; - - b .Lctr_loop_blk; - -.Lctr_end: + inc_le128(v0) /* +0 */ + inc_le128(v1) /* +1 */ + inc_le128(v2) /* +2 */ + inc_le128(v3) /* +3 */ + inc_le128(v4) /* +4 */ + inc_le128(v5) /* +5 */ + inc_le128(v6) /* +6 */ + inc_le128(v7) /* +7 */ + + transpose_4x4_2x(v0, v1, v2, v3, v4, v5, v6, v7) + + SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7) + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64 + ld1 {RTMP4.16b-RTMP7.16b}, [x2], #64 + + eor v0.16b, v0.16b, RTMP0.16b + eor v1.16b, v1.16b, RTMP1.16b + eor v2.16b, v2.16b, RTMP2.16b + eor v3.16b, v3.16b, RTMP3.16b + eor v4.16b, v4.16b, RTMP4.16b + eor v5.16b, v5.16b, RTMP5.16b + eor v6.16b, v6.16b, RTMP6.16b + eor v7.16b, v7.16b, RTMP7.16b + + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 + + cbz w4, .Lctr_crypt_end + b .Lctr_crypt_loop_8x + +.Lctr_crypt_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lctr_crypt_tail + + sub w4, w4, #4 + + /* construct CTRs */ + inc_le128(v0) /* +0 */ + inc_le128(v1) /* +1 */ + inc_le128(v2) /* +2 */ + inc_le128(v3) /* +3 */ + + ld1 {v4.16b-v7.16b}, [x2], #64 + + transpose_4x4(v0, v1, v2, v3) + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + eor v0.16b, v0.16b, v4.16b + eor v1.16b, v1.16b, v5.16b + eor v2.16b, v2.16b, v6.16b + eor v3.16b, v3.16b, v7.16b + + st1 {v0.16b-v3.16b}, [x1], #64 + + cbz w4, .Lctr_crypt_end + +.Lctr_crypt_tail: + /* inc_le128 will change the sign bit */ + ld1 {v4.16b}, [x2], #16 + inc_le128(v0) + cmp w4, #2 + blt .Lctr_crypt_tail_load_done + + ld1 {v5.16b}, [x2], #16 + inc_le128(v1) + cmp w4, #2 + beq .Lctr_crypt_tail_load_done + + ld1 {v6.16b}, [x2], #16 + inc_le128(v2) + +.Lctr_crypt_tail_load_done: + transpose_4x4(v0, v1, v2, v3) + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + cmp w4, #2 + + eor v0.16b, v0.16b, v4.16b + st1 {v0.16b}, [x1], #16 + blt .Lctr_crypt_end + + eor v1.16b, v1.16b, v5.16b + st1 {v1.16b}, [x1], #16 + beq .Lctr_crypt_end + + eor v2.16b, v2.16b, v6.16b + st1 {v2.16b}, [x1], #16 + +.Lctr_crypt_end: /* store new CTR */ - rev x7, x7; - rev x8, x8; - stp x7, x8, [x3]; + rev x7, x7 + rev x8, x8 + stp x7, x8, [x3] - ret; -SYM_FUNC_END(sm4_neon_ctr_enc_blk8) + ret +SYM_FUNC_END(sm4_neon_ctr_crypt) diff --git a/arch/arm64/crypto/sm4-neon-glue.c b/arch/arm64/crypto/sm4-neon-glue.c index 03a6a6866a31..7b19accf5c03 100644 --- a/arch/arm64/crypto/sm4-neon-glue.c +++ b/arch/arm64/crypto/sm4-neon-glue.c @@ -18,19 +18,14 @@ #include #include -#define BYTES2BLKS(nbytes) ((nbytes) >> 4) -#define BYTES2BLK8(nbytes) (((nbytes) >> 4) & ~(8 - 1)) - -asmlinkage void sm4_neon_crypt_blk1_8(const u32 *rkey, u8 *dst, const u8 *src, - unsigned int nblks); -asmlinkage void sm4_neon_crypt_blk8(const u32 *rkey, u8 *dst, const u8 *src, - unsigned int nblks); -asmlinkage void sm4_neon_cbc_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src, - u8 *iv, unsigned int nblks); -asmlinkage void sm4_neon_cfb_dec_blk8(const u32 *rkey, u8 *dst, const u8 *src, - u8 *iv, unsigned int nblks); -asmlinkage void sm4_neon_ctr_enc_blk8(const u32 *rkey, u8 *dst, const u8 *src, - u8 *iv, unsigned int nblks); +asmlinkage void sm4_neon_crypt(const u32 *rkey, u8 *dst, const u8 *src, + unsigned int nblocks); +asmlinkage void sm4_neon_cbc_dec(const u32 *rkey_dec, u8 *dst, const u8 *src, + u8 *iv, unsigned int nblocks); +asmlinkage void sm4_neon_cfb_dec(const u32 *rkey_enc, u8 *dst, const u8 *src, + u8 *iv, unsigned int nblocks); +asmlinkage void sm4_neon_ctr_crypt(const u32 *rkey_enc, u8 *dst, const u8 *src, + u8 *iv, unsigned int nblocks); static int 
sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, unsigned int key_len) @@ -51,27 +46,18 @@ static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey) while ((nbytes = walk.nbytes) > 0) { const u8 *src = walk.src.virt.addr; u8 *dst = walk.dst.virt.addr; - unsigned int nblks; + unsigned int nblocks; - kernel_neon_begin(); + nblocks = nbytes / SM4_BLOCK_SIZE; + if (nblocks) { + kernel_neon_begin(); - nblks = BYTES2BLK8(nbytes); - if (nblks) { - sm4_neon_crypt_blk8(rkey, dst, src, nblks); - dst += nblks * SM4_BLOCK_SIZE; - src += nblks * SM4_BLOCK_SIZE; - nbytes -= nblks * SM4_BLOCK_SIZE; - } + sm4_neon_crypt(rkey, dst, src, nblocks); - nblks = BYTES2BLKS(nbytes); - if (nblks) { - sm4_neon_crypt_blk1_8(rkey, dst, src, nblks); - nbytes -= nblks * SM4_BLOCK_SIZE; + kernel_neon_end(); } - kernel_neon_end(); - - err = skcipher_walk_done(&walk, nbytes); + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE); } return err; @@ -138,48 +124,19 @@ static int sm4_cbc_decrypt(struct skcipher_request *req) while ((nbytes = walk.nbytes) > 0) { const u8 *src = walk.src.virt.addr; u8 *dst = walk.dst.virt.addr; - unsigned int nblks; + unsigned int nblocks; - kernel_neon_begin(); + nblocks = nbytes / SM4_BLOCK_SIZE; + if (nblocks) { + kernel_neon_begin(); - nblks = BYTES2BLK8(nbytes); - if (nblks) { - sm4_neon_cbc_dec_blk8(ctx->rkey_dec, dst, src, - walk.iv, nblks); - dst += nblks * SM4_BLOCK_SIZE; - src += nblks * SM4_BLOCK_SIZE; - nbytes -= nblks * SM4_BLOCK_SIZE; - } + sm4_neon_cbc_dec(ctx->rkey_dec, dst, src, + walk.iv, nblocks); - nblks = BYTES2BLKS(nbytes); - if (nblks) { - u8 keystream[SM4_BLOCK_SIZE * 8]; - u8 iv[SM4_BLOCK_SIZE]; - int i; - - sm4_neon_crypt_blk1_8(ctx->rkey_dec, keystream, - src, nblks); - - src += ((int)nblks - 2) * SM4_BLOCK_SIZE; - dst += (nblks - 1) * SM4_BLOCK_SIZE; - memcpy(iv, src + SM4_BLOCK_SIZE, SM4_BLOCK_SIZE); - - for (i = nblks - 1; i > 0; i--) { - crypto_xor_cpy(dst, src, - &keystream[i * SM4_BLOCK_SIZE], - SM4_BLOCK_SIZE); - src -= SM4_BLOCK_SIZE; - dst -= SM4_BLOCK_SIZE; - } - crypto_xor_cpy(dst, walk.iv, - keystream, SM4_BLOCK_SIZE); - memcpy(walk.iv, iv, SM4_BLOCK_SIZE); - nbytes -= nblks * SM4_BLOCK_SIZE; + kernel_neon_end(); } - kernel_neon_end(); - - err = skcipher_walk_done(&walk, nbytes); + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE); } return err; @@ -238,41 +195,21 @@ static int sm4_cfb_decrypt(struct skcipher_request *req) while ((nbytes = walk.nbytes) > 0) { const u8 *src = walk.src.virt.addr; u8 *dst = walk.dst.virt.addr; - unsigned int nblks; + unsigned int nblocks; - kernel_neon_begin(); + nblocks = nbytes / SM4_BLOCK_SIZE; + if (nblocks) { + kernel_neon_begin(); - nblks = BYTES2BLK8(nbytes); - if (nblks) { - sm4_neon_cfb_dec_blk8(ctx->rkey_enc, dst, src, - walk.iv, nblks); - dst += nblks * SM4_BLOCK_SIZE; - src += nblks * SM4_BLOCK_SIZE; - nbytes -= nblks * SM4_BLOCK_SIZE; - } + sm4_neon_cfb_dec(ctx->rkey_enc, dst, src, + walk.iv, nblocks); - nblks = BYTES2BLKS(nbytes); - if (nblks) { - u8 keystream[SM4_BLOCK_SIZE * 8]; - - memcpy(keystream, walk.iv, SM4_BLOCK_SIZE); - if (nblks > 1) - memcpy(&keystream[SM4_BLOCK_SIZE], src, - (nblks - 1) * SM4_BLOCK_SIZE); - memcpy(walk.iv, src + (nblks - 1) * SM4_BLOCK_SIZE, - SM4_BLOCK_SIZE); - - sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream, - keystream, nblks); - - crypto_xor_cpy(dst, src, keystream, - nblks * SM4_BLOCK_SIZE); - dst += nblks * SM4_BLOCK_SIZE; - src += nblks * SM4_BLOCK_SIZE; - nbytes -= nblks * SM4_BLOCK_SIZE; - } + kernel_neon_end(); - kernel_neon_end(); + dst 
+= nblocks * SM4_BLOCK_SIZE; + src += nblocks * SM4_BLOCK_SIZE; + nbytes -= nblocks * SM4_BLOCK_SIZE; + } /* tail */ if (walk.nbytes == walk.total && nbytes > 0) { @@ -302,40 +239,21 @@ static int sm4_ctr_crypt(struct skcipher_request *req) while ((nbytes = walk.nbytes) > 0) { const u8 *src = walk.src.virt.addr; u8 *dst = walk.dst.virt.addr; - unsigned int nblks; + unsigned int nblocks; - kernel_neon_begin(); + nblocks = nbytes / SM4_BLOCK_SIZE; + if (nblocks) { + kernel_neon_begin(); - nblks = BYTES2BLK8(nbytes); - if (nblks) { - sm4_neon_ctr_enc_blk8(ctx->rkey_enc, dst, src, - walk.iv, nblks); - dst += nblks * SM4_BLOCK_SIZE; - src += nblks * SM4_BLOCK_SIZE; - nbytes -= nblks * SM4_BLOCK_SIZE; - } + sm4_neon_ctr_crypt(ctx->rkey_enc, dst, src, + walk.iv, nblocks); - nblks = BYTES2BLKS(nbytes); - if (nblks) { - u8 keystream[SM4_BLOCK_SIZE * 8]; - int i; - - for (i = 0; i < nblks; i++) { - memcpy(&keystream[i * SM4_BLOCK_SIZE], - walk.iv, SM4_BLOCK_SIZE); - crypto_inc(walk.iv, SM4_BLOCK_SIZE); - } - sm4_neon_crypt_blk1_8(ctx->rkey_enc, keystream, - keystream, nblks); - - crypto_xor_cpy(dst, src, keystream, - nblks * SM4_BLOCK_SIZE); - dst += nblks * SM4_BLOCK_SIZE; - src += nblks * SM4_BLOCK_SIZE; - nbytes -= nblks * SM4_BLOCK_SIZE; - } + kernel_neon_end(); - kernel_neon_end(); + dst += nblocks * SM4_BLOCK_SIZE; + src += nblocks * SM4_BLOCK_SIZE; + nbytes -= nblocks * SM4_BLOCK_SIZE; + } /* tail */ if (walk.nbytes == walk.total && nbytes > 0) { From patchwork Mon Sep 26 09:36:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988489 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3B8ADC32771 for ; Mon, 26 Sep 2022 09:42:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=08XawmwFC4oFprYMUhdvmultFS+/p/cfH/sOb1wx0FA=; b=TWnVc4bOrSHPGz NCCgmmH7ETBGICuwkuyvWmKuxk9kysKd6K+mZ1XJ9cKqRFx+n87GA0R6OOWxtnhg+SqsiFvA++ofS CqAdraJuznC91m3DUsYW4pQbKsChKJ9SA8O7H2BOvJUd1Jp1e+6P4868OGmx5QASNqV+2YV1Zi2kq OFxxDTJQbCMZezP/Gh4YjsRkwmYL7XHboLeET6zt3+Dt51iynMWA7x7mb1af5aWorR1jyW1DBp39Q fGDYcTcGTCeAOb78a91Vti1Myaw7Q2lCvaxD2KzPZBYzmquml8ERr2Y9J9arwBkj8i4rH7EK8KA5d 6yWvnWdugTSml1kFBbWQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockbe-003irb-AU; Mon, 26 Sep 2022 09:40:56 +0000 Received: from out30-45.freemail.mail.aliyun.com ([115.124.30.45]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXW-003gkB-Ky for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:54 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R211e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046051;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQjskXh_1664184991; Received: 
from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQjskXh_1664184991) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:32 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 04/16] crypto: testmgr - add SM4 cts-cbc/essiv/xts/xcbc test vectors Date: Mon, 26 Sep 2022 17:36:08 +0800 Message-Id: <20220926093620.99898-5-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023639_314182_B2A19F93 X-CRM114-Status: UNSURE ( 9.26 ) X-CRM114-Notice: Please train this message. X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch newly adds the test vectors of CTS-CBC/ESSIV/XTS/XCBC modes of the SM4 algorithm, and also added some test vectors for SM4 GCM/CCM. Signed-off-by: Tianjia Zhang --- crypto/testmgr.c | 25 + crypto/testmgr.h | 1161 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 1186 insertions(+) diff --git a/crypto/testmgr.c b/crypto/testmgr.c index e4bb03b8b924..cce101c7e8f9 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -4712,6 +4712,12 @@ static const struct alg_test_desc alg_test_descs[] = { .alg = "cts(cbc(paes))", .test = alg_test_null, .fips_allowed = 1, + }, { + .alg = "cts(cbc(sm4))", + .test = alg_test_skcipher, + .suite = { + .cipher = __VECS(sm4_cts_tv_template) + } }, { .alg = "curve25519", .test = alg_test_kpp, @@ -5059,6 +5065,12 @@ static const struct alg_test_desc alg_test_descs[] = { .cipher = __VECS(essiv_aes_cbc_tv_template) } }, { + .alg = "essiv(cbc(sm4),sm3)", + .test = alg_test_skcipher, + .suite = { + .cipher = __VECS(essiv_sm4_cbc_tv_template) + } + }, { #if IS_ENABLED(CONFIG_CRYPTO_DH_RFC7919_GROUPS) .alg = "ffdhe2048(dh)", .test = alg_test_kpp, @@ -5586,6 +5598,12 @@ static const struct alg_test_desc alg_test_descs[] = { .suite = { .hash = __VECS(aes_xcbc128_tv_template) } + }, { + .alg = "xcbc(sm4)", + .test = alg_test_hash, + .suite = { + .hash = __VECS(sm4_xcbc128_tv_template) + } }, { .alg = "xchacha12", .test = alg_test_skcipher, @@ -5640,6 +5658,13 @@ static const struct alg_test_desc alg_test_descs[] = { .suite = { .cipher = __VECS(serpent_xts_tv_template) } + }, { + .alg = "xts(sm4)", + .generic_driver = "xts(ecb(sm4-generic))", + .test = alg_test_skcipher, + .suite = { + .cipher = __VECS(sm4_xts_tv_template) + } }, { .alg = "xts(twofish)", .generic_driver = "xts(ecb(twofish-generic))", diff --git a/crypto/testmgr.h b/crypto/testmgr.h index d6088e26f326..ced48e4dad0c 100644 --- a/crypto/testmgr.h +++ b/crypto/testmgr.h @@ -14882,6 +14882,537 @@ static const struct cipher_testvec sm4_cfb_tv_template[] = { } }; +static const struct cipher_testvec sm4_cts_tv_template[] = { + /* Generated from AES-CTS test vectors */ + { + .klen = 16, + .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20" + 
"\x74\x65\x72\x69\x79\x61\x6b\x69", + .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20" + "\x6c\x69\x6b\x65\x20\x74\x68\x65" + "\x20", + .len = 17, + .ctext = "\x05\xfe\x23\xee\x17\xa2\x89\x98" + "\xbc\x97\x0a\x0b\x54\x67\xca\xd7" + "\xd6", + }, { + .klen = 16, + .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20" + "\x74\x65\x72\x69\x79\x61\x6b\x69", + .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20" + "\x6c\x69\x6b\x65\x20\x74\x68\x65" + "\x20\x47\x65\x6e\x65\x72\x61\x6c" + "\x20\x47\x61\x75\x27\x73\x20", + .len = 31, + .ctext = "\x15\x46\xe4\x95\xa4\xec\xf0\xb8" + "\x49\xd6\x6a\x9d\x89\xc7\xfd\x70" + "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66" + "\x93\xf7\x70\xbb\xa8\x3f\xa3", + }, { + .klen = 16, + .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20" + "\x74\x65\x72\x69\x79\x61\x6b\x69", + .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20" + "\x6c\x69\x6b\x65\x20\x74\x68\x65" + "\x20\x47\x65\x6e\x65\x72\x61\x6c" + "\x20\x47\x61\x75\x27\x73\x20\x43", + .len = 32, + .ctext = "\x89\xc7\x99\x3f\x87\x69\x5c\xd3" + "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3" + "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66" + "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf", + }, { + .klen = 16, + .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20" + "\x74\x65\x72\x69\x79\x61\x6b\x69", + .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20" + "\x6c\x69\x6b\x65\x20\x74\x68\x65" + "\x20\x47\x65\x6e\x65\x72\x61\x6c" + "\x20\x47\x61\x75\x27\x73\x20\x43" + "\x68\x69\x63\x6b\x65\x6e\x2c\x20" + "\x70\x6c\x65\x61\x73\x65\x2c", + .len = 47, + .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66" + "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf" + "\xd3\xe1\xdc\xeb\xfa\x04\x11\x99" + "\xde\xcf\x6f\x4d\x7b\x09\x92\x7f" + "\x89\xc7\x99\x3f\x87\x69\x5c\xd3" + "\x01\x6a\xbf\xd4\x3f\x79\x02", + }, { + .klen = 16, + .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20" + "\x74\x65\x72\x69\x79\x61\x6b\x69", + .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20" + "\x6c\x69\x6b\x65\x20\x74\x68\x65" + "\x20\x47\x65\x6e\x65\x72\x61\x6c" + "\x20\x47\x61\x75\x27\x73\x20\x43" + "\x68\x69\x63\x6b\x65\x6e\x2c\x20" + "\x70\x6c\x65\x61\x73\x65\x2c\x20", + .len = 48, + .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66" + "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf" + "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f" + "\xbd\x99\x21\x0c\x5e\x4d\xed\x20" + "\x89\xc7\x99\x3f\x87\x69\x5c\xd3" + "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3", + }, { + .klen = 16, + .key = "\x63\x68\x69\x63\x6b\x65\x6e\x20" + "\x74\x65\x72\x69\x79\x61\x6b\x69", + .ptext = "\x49\x20\x77\x6f\x75\x6c\x64\x20" + "\x6c\x69\x6b\x65\x20\x74\x68\x65" + "\x20\x47\x65\x6e\x65\x72\x61\x6c" + "\x20\x47\x61\x75\x27\x73\x20\x43" + "\x68\x69\x63\x6b\x65\x6e\x2c\x20" + "\x70\x6c\x65\x61\x73\x65\x2c\x20" + "\x61\x6e\x64\x20\x77\x6f\x6e\x74" + "\x6f\x6e\x20\x73\x6f\x75\x70\x2e", + .len = 64, + .ctext = "\xd6\x71\xc8\xc0\x4d\x52\x7c\x66" + "\x93\xf7\x70\xbb\xa8\x3f\xa3\xcf" + "\x89\xc7\x99\x3f\x87\x69\x5c\xd3" + "\x01\x6a\xbf\xd4\x3f\x79\x02\xa3" + "\x58\x19\xa4\x8f\xa9\x68\x5e\x6b" + "\x2c\x0f\x81\x60\x15\x98\x27\x4f" + "\x9a\xbd\x7b\xfe\x82\xab\xcc\x7f" + "\xbd\x99\x21\x0c\x5e\x4d\xed\x20", + } +}; + +static const struct cipher_testvec essiv_sm4_cbc_tv_template[] = { + /* Generated from AES-ESSIV-CBC test vectors */ + { + .key = "\x06\xa9\x21\x40\x36\xb8\xa1\x5b" + "\x51\x2e\x03\xd5\x34\x12\x00\x06", + .klen = 16, + .iv = "\x3d\xaf\xba\x42\x9d\x9e\xb4\x30" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "Single block msg", + .ctext = "\x83\xa0\x79\x71\x18\xed\xb2\x0f" + "\xa8\x71\x94\x22\x8e\x1f\xc1\xbb", + .len = 16, + }, { + .key = "\xc2\x86\x69\x6d\x88\x7c\x9a\xa0" + "\x61\x1b\xbb\x3e\x20\x25\xa4\x5a", + .klen = 16, + .iv = "\x56\x2e\x17\x99\x6d\x09\x3d\x28" 
+ "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f", + .ctext = "\x48\x38\xba\xa0\x09\xa2\xe1\x61" + "\x94\xe5\xd2\x63\xe5\x04\x6c\x62" + "\x93\x21\x95\xfb\x8c\xf4\x25\x19" + "\xe0\x0f\x9c\xfa\x51\xfe\xe7\x32", + .len = 32, + }, { + .key = "\x1f\x35\x2c\x07\x3b\x61\x08\xd7" + "\x2d\x98\x10\xa3\x09\x14\xdf\xf4", + .klen = 16, + .iv = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96" + "\xe9\x3d\x7e\x11\x73\x93\x17\x2a" + "\xae\x2d\x8a\x57\x1e\x03\xac\x9c" + "\x9e\xb7\x6f\xac\x45\xaf\x8e\x51" + "\x30\xc8\x1c\x46\xa3\x5c\xe4\x11" + "\xe5\xfb\xc1\x19\x1a\x0a\x52\xef" + "\xf6\x9f\x24\x45\xdf\x4f\x9b\x17" + "\xad\x2b\x41\x7b\xe6\x6c\x37\x10", + .ctext = "\xa5\x1d\x64\x91\x28\x1f\xbe\x9e" + "\x15\x39\x5f\xe4\xe1\x5a\x8c\x38" + "\x80\x7f\xc7\x7d\x00\x4c\x4b\xff" + "\x75\x3a\x03\xfe\x41\x75\x26\x9e" + "\x3f\xf1\x36\xaf\x7b\x37\x73\x1a" + "\xaf\x9b\x91\xec\x1e\xf0\x05\x9d" + "\x87\xda\x7b\xa3\xaa\xe6\x5b\x98" + "\x41\x73\xd5\x3c\x8c\x8b\xb5\x88", + .len = 64, + }, { + .key = "\xBE\xE1\x04\x27\xE1\x04\x27\x4A" + "\x6D\x90\x4A\x6D\x90\xB3\xD6\xF9", + .klen = 16, + .iv = "\xE7\x82\x1D\xB8\x53\x11\xAC\x47" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x50\xB9\x22\xAE\x17\x80\x0C\x75" + "\xDE\x47\xD3\x3C\xA5\x0E\x9A\x03" + "\x6C\xF8\x61\xCA\x33\xBF\x28\x91" + "\x1D\x86\xEF\x58\xE4\x4D\xB6\x1F" + "\xAB\x14\x7D\x09\x72\xDB\x44\xD0" + "\x39\xA2\x0B\x97\x00\x69\xF5\x5E" + "\xC7\x30\xBC\x25\x8E\x1A\x83\xEC" + "\x55\xE1\x4A\xB3\x1C\xA8\x11\x7A" + "\x06\x6F\xD8\x41\xCD\x36\x9F\x08" + "\x94\xFD\x66\xF2\x5B\xC4\x2D\xB9" + "\x22\x8B\x17\x80\xE9\x52\xDE\x47" + "\xB0\x19\xA5\x0E\x77\x03\x6C\xD5" + "\x3E\xCA\x33\x9C\x05\x91\xFA\x63" + "\xEF\x58\xC1\x2A\xB6\x1F\x88\x14" + "\x7D\xE6\x4F\xDB\x44\xAD\x16\xA2" + "\x0B\x74\x00\x69\xD2\x3B\xC7\x30" + "\x99\x02\x8E\xF7\x60\xEC\x55\xBE" + "\x27\xB3\x1C\x85\x11\x7A\xE3\x4C" + "\xD8\x41\xAA\x13\x9F\x08\x71\xFD" + "\x66\xCF\x38\xC4\x2D\x96\x22\x8B" + "\xF4\x5D\xE9\x52\xBB\x24\xB0\x19" + "\x82\x0E\x77\xE0\x49\xD5\x3E\xA7" + "\x10\x9C\x05\x6E\xFA\x63\xCC\x35" + "\xC1\x2A\x93\x1F\x88\xF1\x5A\xE6" + "\x4F\xB8\x21\xAD\x16\x7F\x0B\x74" + "\xDD\x46\xD2\x3B\xA4\x0D\x99\x02" + "\x6B\xF7\x60\xC9\x32\xBE\x27\x90" + "\x1C\x85\xEE\x57\xE3\x4C\xB5\x1E" + "\xAA\x13\x7C\x08\x71\xDA\x43\xCF" + "\x38\xA1\x0A\x96\xFF\x68\xF4\x5D" + "\xC6\x2F\xBB\x24\x8D\x19\x82\xEB" + "\x54\xE0\x49\xB2\x1B\xA7\x10\x79" + "\x05\x6E\xD7\x40\xCC\x35\x9E\x07" + "\x93\xFC\x65\xF1\x5A\xC3\x2C\xB8" + "\x21\x8A\x16\x7F\xE8\x51\xDD\x46" + "\xAF\x18\xA4\x0D\x76\x02\x6B\xD4" + "\x3D\xC9\x32\x9B\x04\x90\xF9\x62" + "\xEE\x57\xC0\x29\xB5\x1E\x87\x13" + "\x7C\xE5\x4E\xDA\x43\xAC\x15\xA1" + "\x0A\x73\xFF\x68\xD1\x3A\xC6\x2F" + "\x98\x01\x8D\xF6\x5F\xEB\x54\xBD" + "\x26\xB2\x1B\x84\x10\x79\xE2\x4B" + "\xD7\x40\xA9\x12\x9E\x07\x70\xFC" + "\x65\xCE\x37\xC3\x2C\x95\x21\x8A" + "\xF3\x5C\xE8\x51\xBA\x23\xAF\x18" + "\x81\x0D\x76\xDF\x48\xD4\x3D\xA6" + "\x0F\x9B\x04\x6D\xF9\x62\xCB\x34" + "\xC0\x29\x92\x1E\x87\xF0\x59\xE5" + "\x4E\xB7\x20\xAC\x15\x7E\x0A\x73" + "\xDC\x45\xD1\x3A\xA3\x0C\x98\x01" + "\x6A\xF6\x5F\xC8\x31\xBD\x26\x8F" + "\x1B\x84\xED\x56\xE2\x4B\xB4\x1D" + "\xA9\x12\x7B\x07\x70\xD9\x42\xCE" + "\x37\xA0\x09\x95\xFE\x67\xF3\x5C" + "\xC5\x2E\xBA\x23\x8C\x18\x81\xEA" + "\x53\xDF\x48\xB1\x1A\xA6\x0F\x78" + "\x04\x6D\xD6\x3F\xCB\x34\x9D\x06" + "\x92\xFB\x64\xF0\x59\xC2\x2B\xB7" + "\x20\x89\x15\x7E\xE7\x50\xDC\x45" + 
"\xAE\x17\xA3\x0C\x75\x01\x6A\xD3" + "\x3C\xC8\x31\x9A\x03\x8F\xF8\x61" + "\xED\x56\xBF\x28\xB4\x1D\x86\x12", + .ctext = "\xad\x68\x40\x68\xb2\xf9\x77\x55" + "\xd5\x1c\x17\x46\xc1\xfa\x05\xdd" + "\x94\x5c\xb7\x99\x82\xba\x05\x48" + "\xac\x5d\x14\x30\x2e\xc8\x0e\x2f" + "\x5a\xd7\x39\x43\x95\x4d\x93\xff" + "\x6b\xe3\xb7\x71\xc1\x39\x43\x8d" + "\x10\xd7\xd9\xa8\xe7\x65\xb7\x0a" + "\x27\x98\x5b\x90\xc3\x80\x1f\xd9" + "\x65\x82\x88\x0a\xc3\x16\x3f\xae" + "\x1f\xad\x88\xe9\xfb\x9e\xd4\xc8" + "\x81\x36\x50\x37\x1f\x11\x83\xe2" + "\xc5\x1a\x48\xdb\xc3\x18\x07\x5d" + "\xee\x4b\xea\x40\xd3\xd9\x8c\x59" + "\x29\xe1\x0b\x79\x3b\x28\xac\x75" + "\xda\x82\x99\x86\xd4\xbe\xd8\x81" + "\xe0\xc4\x58\x78\xe4\x33\xc1\xf1" + "\xbe\x96\xd3\x4c\x42\x6b\xaf\x24" + "\x69\xb4\x25\x88\x37\x9e\xb2\xfb" + "\x5c\x93\x22\x89\x2f\x81\x85\x06" + "\x12\x74\x3b\x6c\x99\x81\xfb\xbe" + "\x0f\xc4\xa5\xb6\xf8\x79\x5f\x72" + "\xf8\x46\x94\x3f\x1f\x9f\x15\xa2" + "\xc8\xc0\xbf\xeb\xa3\x9e\x59\xe1" + "\xbd\x1a\xe1\xe3\x6b\x33\x96\x54" + "\x1b\xc4\x25\x74\x06\xcf\x8a\x75" + "\x6c\xfc\x76\x7f\x9e\x7b\x00\xce" + "\xa8\x1e\x6a\x0f\x5a\xa6\xcb\x77" + "\x5f\x90\x39\xcb\xfe\x0e\x16\x53" + "\x8e\x21\x0f\x7e\x51\xcc\x92\xb8" + "\x4f\x65\x76\x20\x3d\x56\xb4\xcc" + "\x8b\x8e\x8e\x68\xc3\x82\x53\x5c" + "\x1c\x82\x13\x32\x3b\x97\xff\x48" + "\x98\xda\x4a\x7c\xc8\x21\x83\xfd" + "\xe2\xf1\x30\xe1\x11\xe9\xe8\x97" + "\x97\x24\x06\x73\xf2\x52\xbb\xab" + "\x9d\x5f\x0b\xa8\x2f\xab\x0b\x7d" + "\xe8\x20\x7b\x67\x2e\x93\xb5\x11" + "\x6c\x16\xea\xdd\x1a\x9d\xf2\xdc" + "\x79\x57\xc4\x04\xcb\x7f\x36\xa0" + "\x2e\xa7\x89\xab\xaa\x56\x59\x9e" + "\xec\x38\xea\x1a\xe9\xa7\x58\x58" + "\xb5\xb7\x8f\x8c\x5c\xd6\x86\x67" + "\x65\x0f\x93\x47\xf7\x3e\x19\x19" + "\x9b\x22\xd1\xc6\xc2\xba\x32\x5c" + "\x2c\x7a\xa2\xbb\xa5\x22\xde\xe5" + "\x1e\x78\x2c\xd3\x40\x6d\xfa\x79" + "\x4c\x9e\x1c\x36\x34\xaf\x95\x2e" + "\x68\x2e\x69\x7d\xe4\x7d\x0c\x74" + "\xaf\x73\x5b\x48\x62\x90\x5e\x19" + "\x0f\x12\xb3\xdb\x77\xbb\xe2\xac" + "\xaf\xfe\xd9\xa1\x80\x09\xc6\xd4" + "\xf4\x21\x3f\xa4\x0f\x16\x7b\x36" + "\x29\x6d\x10\xa2\xba\xaf\xf5\xa3" + "\x51\xca\x0a\x25\x74\x9a\xb7\x02" + "\xb8\xf8\x6b\xda\xb8\x1c\x9f\x62" + "\xf5\x61\x62\x9f\x4b\x71\x24\x45" + "\xfb\x0f\xdf\xa8\x47\x6f\x2f\x05" + "\x2f\xf4\xfd\xb8\xd1\x8c\x29\x9d" + "\x9d\xe8\x6f\x10\x89\xef\x08\x59" + "\xa0\x24\x1f\xdb\xea\xbc\x97\x44" + "\x23\x74\xbf\xaa\x87\x10\x5c\x58" + "\x2a\xe6\xe2\x19\xc5\x7e\x21\xe2", + .len = 496, + }, +}; + +static const struct cipher_testvec sm4_xts_tv_template[] = { + /* Generated from AES-XTS test vectors */ + { + .key = "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .klen = 32, + .iv = "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ctext = "\xd9\xb4\x21\xf7\x31\xc8\x94\xfd" + "\xc3\x5b\x77\x29\x1f\xe4\xe3\xb0" + "\x2a\x1f\xb7\x66\x98\xd5\x9f\x0e" + "\x51\x37\x6c\x4a\xda\x5b\xc7\x5d", + .len = 32, + }, { + .key = "\x11\x11\x11\x11\x11\x11\x11\x11" + "\x11\x11\x11\x11\x11\x11\x11\x11" + "\x22\x22\x22\x22\x22\x22\x22\x22" + "\x22\x22\x22\x22\x22\x22\x22\x22", + .klen = 32, + .iv = "\x33\x33\x33\x33\x33\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x44\x44\x44\x44\x44\x44\x44\x44" + "\x44\x44\x44\x44\x44\x44\x44\x44" + "\x44\x44\x44\x44\x44\x44\x44\x44" + "\x44\x44\x44\x44\x44\x44\x44\x44", + .ctext 
= "\xa7\x4d\x72\x6c\x11\x19\x6a\x32" + "\xbe\x04\xe0\x01\xff\x29\xd0\xc7" + "\x93\x2f\x9f\x3e\xc2\x9b\xfc\xb6" + "\x4d\xd1\x7f\x63\xcb\xd3\xea\x31", + .len = 32, + }, { + .key = "\xff\xfe\xfd\xfc\xfb\xfa\xf9\xf8" + "\xf7\xf6\xf5\xf4\xf3\xf2\xf1\xf0" + "\x22\x22\x22\x22\x22\x22\x22\x22" + "\x22\x22\x22\x22\x22\x22\x22\x22", + .klen = 32, + .iv = "\x33\x33\x33\x33\x33\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x44\x44\x44\x44\x44\x44\x44\x44" + "\x44\x44\x44\x44\x44\x44\x44\x44" + "\x44\x44\x44\x44\x44\x44\x44\x44" + "\x44\x44\x44\x44\x44\x44\x44\x44", + .ctext = "\x7f\x76\x08\x8e\xff\xad\xf7\x0c" + "\x02\xea\x9f\x95\xda\x06\x28\xd3" + "\x51\xbf\xcb\x9e\xac\x05\x63\xbc" + "\xf1\x7b\x71\x0d\xab\x0a\x98\x26", + .len = 32, + }, { + .key = "\x27\x18\x28\x18\x28\x45\x90\x45" + "\x23\x53\x60\x28\x74\x71\x35\x26" + "\x31\x41\x59\x26\x53\x58\x97\x93" + "\x23\x84\x62\x64\x33\x83\x27\x95", + .klen = 32, + .iv = "\x00\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" + "\x20\x21\x22\x23\x24\x25\x26\x27" + "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f" + "\x30\x31\x32\x33\x34\x35\x36\x37" + "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f" + "\x40\x41\x42\x43\x44\x45\x46\x47" + "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f" + "\x50\x51\x52\x53\x54\x55\x56\x57" + "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f" + "\x60\x61\x62\x63\x64\x65\x66\x67" + "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" + "\x70\x71\x72\x73\x74\x75\x76\x77" + "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" + "\x80\x81\x82\x83\x84\x85\x86\x87" + "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" + "\x90\x91\x92\x93\x94\x95\x96\x97" + "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" + "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7" + "\xa8\xa9\xaa\xab\xac\xad\xae\xaf" + "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7" + "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf" + "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7" + "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf" + "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7" + "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf" + "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7" + "\xe8\xe9\xea\xeb\xec\xed\xee\xef" + "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7" + "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff" + "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" + "\x20\x21\x22\x23\x24\x25\x26\x27" + "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f" + "\x30\x31\x32\x33\x34\x35\x36\x37" + "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f" + "\x40\x41\x42\x43\x44\x45\x46\x47" + "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f" + "\x50\x51\x52\x53\x54\x55\x56\x57" + "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f" + "\x60\x61\x62\x63\x64\x65\x66\x67" + "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" + "\x70\x71\x72\x73\x74\x75\x76\x77" + "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" + "\x80\x81\x82\x83\x84\x85\x86\x87" + "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" + "\x90\x91\x92\x93\x94\x95\x96\x97" + "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" + "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7" + "\xa8\xa9\xaa\xab\xac\xad\xae\xaf" + "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7" + "\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf" + "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7" + "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf" + "\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7" + "\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf" + "\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7" + "\xe8\xe9\xea\xeb\xec\xed\xee\xef" + "\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7" + "\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff", + .ctext = "\x54\xdd\x65\xb6\x32\x6f\xae\xa8" + "\xfa\xd1\xa8\x3c\x63\x61\x4a\xf3" + "\x9f\x72\x1d\x8d\xfe\x17\x7a\x30" + "\xb6\x6a\xbf\x6a\x44\x99\x80\xe1" + 
"\xcd\xbe\x06\xaf\xb7\x33\x36\xf3" + "\x7a\x4d\x39\xde\x96\x4a\x30\xd7" + "\xd0\x4a\x37\x99\x16\x9c\x60\x25" + "\x8f\x6b\x74\x8a\x61\x86\x1a\xa5" + "\xec\x92\xa2\xc1\x5b\x2b\x7c\x61" + "\x5a\x42\xab\xa4\x99\xbb\xd6\xb7" + "\x1d\xb9\xc7\x89\xb2\x18\x20\x89" + "\xa2\x5d\xd3\xdf\x80\x0e\xd1\x86" + "\x4d\x19\xf7\xed\x45\xfd\x17\xa9" + "\x48\x0b\x0f\xb8\x2d\x9b\x7f\xc3" + "\xed\x57\xe9\xa1\x14\x0e\xaa\x77" + "\x8d\xd2\xdd\x67\x9e\x3e\xdc\x3d" + "\xc4\xd5\x5c\x95\x0e\xbc\x53\x1d" + "\x95\x92\xf7\xc4\x63\x82\x56\xd5" + "\x65\x18\x29\x2a\x20\xaf\x98\xfd" + "\xd3\xa6\x36\x00\x35\x0a\x70\xab" + "\x5a\x40\xf4\xc2\x85\x03\x7c\xa0" + "\x1f\x25\x1f\x19\xec\xae\x03\x29" + "\xff\x77\xad\x88\xcd\x5a\x4c\xde" + "\xa2\xae\xab\xc2\x21\x48\xff\xbd" + "\x23\x9b\xd1\x05\x15\xbd\xe1\x13" + "\x1d\xec\x84\x04\xe4\x43\xdc\x76" + "\x31\x40\xd5\xf2\x2b\xf3\x3e\x0c" + "\x68\x72\xd6\xb8\x1d\x63\x0f\x6f" + "\x00\xcd\xd0\x58\xfe\x80\xf9\xcb" + "\xfb\x77\x70\x7f\x93\xce\xe2\xca" + "\x92\xb9\x15\xb8\x30\x40\x27\xc1" + "\x90\xa8\x4e\x2d\x65\xe0\x18\xcc" + "\x6a\x38\x7d\x37\x66\xac\xdb\x28" + "\x25\x32\x84\xe8\xdb\x9a\xcf\x8f" + "\x52\x28\x0d\xdc\x6d\x00\x33\xd2" + "\xcc\xaa\xa4\xf9\xae\xff\x12\x36" + "\x69\xbc\x02\x4f\xd6\x76\x8e\xdf" + "\x8b\xc1\xf8\xd6\x22\xc1\x9c\x60" + "\x9e\xf9\x7f\x60\x91\x90\xcd\x11" + "\x02\x41\xe7\xfb\x08\x4e\xd8\x94" + "\x2d\xa1\xf9\xb9\xcf\x1b\x51\x4b" + "\x61\xa3\x88\xb3\x0e\xa6\x1a\x4a" + "\x74\x5b\x38\x1e\xe7\xad\x6c\x4d" + "\xb1\x27\x54\x53\xb8\x41\x3f\x98" + "\xdf\x6e\x4a\x40\x98\x6e\xe4\xb5" + "\x9a\xf5\xdf\xae\xcd\x30\x12\x65" + "\x17\x90\x67\xa0\x0d\x7c\xa3\x5a" + "\xb9\x5a\xbd\x61\x7a\xde\xa2\x8e" + "\xc1\xc2\x6a\x97\xde\x28\xb8\xbf" + "\xe3\x01\x20\xd6\xae\xfb\xd2\x58" + "\xc5\x9e\x42\xd1\x61\xe8\x06\x5a" + "\x78\x10\x6b\xdc\xa5\xcd\x90\xfb" + "\x3a\xac\x4e\x93\x86\x6c\x8a\x7f" + "\x96\x76\x86\x0a\x79\x14\x5b\xd9" + "\x2e\x02\xe8\x19\xa9\x0b\xe0\xb9" + "\x7c\xc5\x22\xb3\x21\x06\x85\x6f" + "\xdf\x0e\x54\xd8\x8e\x46\x24\x15" + "\x5a\x2f\x1c\x14\xea\xea\xa1\x63" + "\xf8\x58\xe9\x9a\x80\x6e\x79\x1a" + "\xcd\x82\xf1\xb0\xe2\x9f\x00\x28" + "\xa4\xc3\x8e\x97\x6f\x57\x1a\x93" + "\xf4\xfd\x57\xd7\x87\xc2\x4d\xb0" + "\xe0\x1c\xa3\x04\xe5\xa5\xc4\xdd" + "\x50\xcf\x8b\xdb\xf4\x91\xe5\x7c", + .len = 512, + }, { + .key = "\x62\x49\x77\x57\x24\x70\x93\x69" + "\x99\x59\x57\x49\x66\x96\x76\x27" + "\x02\x88\x41\x97\x16\x93\x99\x37" + "\x51\x05\x82\x09\x74\x94\x45\x92", + .klen = 32, + .iv = "\xff\x00\x00\x00\x00\x00\x00\x00" + "\x00\x00\x00\x00\x00\x00\x00\x00", + .ptext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" + "\x20\x21\x22\x23\x24\x25\x26\x27" + "\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f" + "\x30\x31\x32\x33\x34\x35\x36\x37" + "\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f" + "\x40\x41\x42\x43\x44\x45\x46\x47" + "\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f" + "\x50\x51\x52\x53\x54\x55\x56\x57" + "\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f" + "\x60\x61\x62\x63\x64\x65\x66\x67" + "\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f" + "\x70\x71\x72\x73\x74\x75\x76\x77" + "\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f" + "\x80\x81\x82\x83\x84\x85\x86\x87" + "\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f" + "\x90\x91\x92\x93\x94\x95\x96\x97" + "\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f" + "\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7" + "\xa8\xa9\xaa\xab\xac\xad\xae\xaf" + "\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7" + "\xf8\xf9\xfa\xfb\xfc", + .ctext = "\xa2\x9f\x9e\x4e\x71\xdb\x28\x3c" + "\x80\x0e\xf6\xb7\x8e\x57\x1c\xba" + "\x90\xda\x3b\x6c\x22\x00\x68\x30" + "\x1d\x63\x0d\x9e\x6a\xad\x37\x55" + 
"\xbc\x77\x1e\xc9\xad\x83\x30\xd5" + "\x27\xb2\x66\x77\x18\x3c\xa6\x39" + "\x9c\x0a\xaa\x1f\x02\xe1\xd5\x65" + "\x9b\x8d\xc5\x97\x3d\xc5\x04\x53" + "\x78\x00\xe3\xb0\x1a\x43\x4e\xb7" + "\xc4\x9f\x38\xc5\x7b\xa4\x70\x64" + "\x78\xe6\x32\xd9\x65\x44\xc5\x64" + "\xb8\x42\x35\x99\xff\x66\x75\xb0" + "\x22\xd3\x9b\x6e\x8d\xcf\x6a\x24" + "\xfd\x92\xb7\x1b\x04\x28\x2a\x61" + "\xdc\x96\x2a\x20\x7a\x2c\xf1\xf9" + "\x12\x15\xf0\x4d\xcf\x2b\xde\x33" + "\x41\xbc\xe7\x85\x87\x22\xb7\x16" + "\x02\x1c\xd8\xa2\x0f\x1f\xa3\xe9" + "\xd8\x45\x48\xe7\xbe\x08\x4e\x4e" + "\x23\x79\x84\xdb\x40\x76\xf5\x13" + "\x78\x92\x4a\x2f\xf9\x1b\xf2\x80" + "\x25\x74\x51\x45\x9a\x77\x78\x97" + "\xd3\xe0\xc7\xc4\x35\x67\x2a\xe6" + "\xb3\x0d\x62\x9f\x8b", + .len = 189, + }, +}; + static const struct aead_testvec sm4_gcm_tv_template[] = { { /* From https://datatracker.ietf.org/doc/html/rfc8998#appendix-A.1 */ .key = "\x01\x23\x45\x67\x89\xAB\xCD\xEF" @@ -14913,6 +15444,298 @@ static const struct aead_testvec sm4_gcm_tv_template[] = { "\x83\xDE\x35\x41\xE4\xC2\xB5\x81" "\x77\xE0\x65\xA9\xBF\x7B\x62\xEC", .clen = 80, + }, { /* Generated from AES-GCM test vectors */ + .key = zeroed_string, + .klen = 16, + .ctext = "\x23\x2f\x0c\xfe\x30\x8b\x49\xea" + "\x6f\xc8\x82\x29\xb5\xdc\x85\x8d", + .clen = 16, + }, { + .key = zeroed_string, + .klen = 16, + .ptext = zeroed_string, + .plen = 16, + .ctext = "\x7d\xe2\xaa\x7f\x11\x10\x18\x82" + "\x18\x06\x3b\xe1\xbf\xeb\x6d\x89" + "\xb8\x51\xb5\xf3\x94\x93\x75\x2b" + "\xe5\x08\xf1\xbb\x44\x82\xc5\x57", + .clen = 32, + }, { + .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c" + "\x6d\x6a\x8f\x94\x67\x30\x83\x08", + .klen = 16, + .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad" + "\xde\xca\xf8\x88", + .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5" + "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a" + "\x86\xa7\xa9\x53\x15\x34\xf7\xda" + "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72" + "\x1c\x3c\x0c\x95\x95\x68\x09\x53" + "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25" + "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57" + "\xba\x63\x7b\x39\x1a\xaf\xd2\x55", + .plen = 64, + .ctext = "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6" + "\x76\x21\x6a\x33\x83\x10\x41\xeb" + "\x09\x58\x00\x11\x7b\xdc\x3f\x75" + "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb" + "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07" + "\x1a\xe5\x48\x3f\xed\xde\x98\x5d" + "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf" + "\xe3\x63\x36\x83\x23\xf7\x5b\x80" + "\x7d\xfe\x77\xef\x71\xb1\x5e\xc9" + "\x52\x6b\x09\xab\x84\x28\x4b\x8a", + .clen = 80, + }, { + .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c" + "\x6d\x6a\x8f\x94\x67\x30\x83\x08", + .klen = 16, + .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad" + "\xde\xca\xf8\x88", + .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5" + "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a" + "\x86\xa7\xa9\x53\x15\x34\xf7\xda" + "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72" + "\x1c\x3c\x0c\x95\x95\x68\x09\x53" + "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25" + "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57" + "\xba\x63\x7b\x39", + .plen = 60, + .assoc = "\xfe\xed\xfa\xce\xde\xad\xbe\xef" + "\xfe\xed\xfa\xce\xde\xad\xbe\xef" + "\xab\xad\xda\xd2", + .alen = 20, + .ctext = "\xe4\x11\x0f\xf1\xc1\x41\x97\xe6" + "\x76\x21\x6a\x33\x83\x10\x41\xeb" + "\x09\x58\x00\x11\x7b\xdc\x3f\x75" + "\x1a\x49\x6e\xfc\xf2\xbb\xdf\xdb" + "\x3a\x2e\x13\xfd\xc5\xc1\x9d\x07" + "\x1a\xe5\x48\x3f\xed\xde\x98\x5d" + "\x3f\x2d\x5b\x4e\xee\x0b\xb6\xdf" + "\xe3\x63\x36\x83" + "\x89\xf6\xba\x35\xb8\x18\xd3\xcc" + "\x38\x6c\x05\xb3\x8a\xcb\xc9\xde", + .clen = 76, + }, { + .key = "\xfe\xff\xe9\x92\x86\x65\x73\x1c" + "\xfe\xff\xe9\x92\x86\x65\x73\x1c", + .klen = 16, + .iv = "\xca\xfe\xba\xbe\xfa\xce\xdb\xad" + "\xde\xca\xf8\x88", 
+ .ptext = "\xd9\x31\x32\x25\xf8\x84\x06\xe5" + "\xa5\x59\x09\xc5\xaf\xf5\x26\x9a" + "\x86\xa7\xa9\x53\x15\x34\xf7\xda" + "\x2e\x4c\x30\x3d\x8a\x31\x8a\x72" + "\x1c\x3c\x0c\x95\x95\x68\x09\x53" + "\x2f\xcf\x0e\x24\x49\xa6\xb5\x25" + "\xb1\x6a\xed\xf5\xaa\x0d\xe6\x57" + "\xba\x63\x7b\x39", + .plen = 60, + .assoc = "\xfe\xed\xfa\xce\xde\xad\xbe\xef" + "\xfe\xed\xfa\xce\xde\xad\xbe\xef" + "\xab\xad\xda\xd2", + .alen = 20, + .ctext = "\xc1\x11\x44\x51\xd9\x25\x87\x5b" + "\x0f\xd9\x06\xf3\x33\x44\xbb\x87" + "\x8b\xa3\x77\xd2\x0c\x60\xfa\xcc" + "\x85\x50\x6f\x96\x0c\x54\x54\xc1" + "\x58\x04\x88\x6e\xf4\x26\x35\x7e" + "\x94\x80\x48\x6c\xf2\xf4\x88\x1f" + "\x19\x63\xea\xae\xba\x81\x1a\x5d" + "\x0e\x6f\x59\x08" + "\x33\xac\x5b\xa8\x19\x60\xdb\x1d" + "\xdd\x2e\x22\x2e\xe0\x87\x51\x5d", + .clen = 76, + }, { + .key = "\x8b\x32\xcf\xe7\x44\xed\x13\x59" + "\x04\x38\x77\xb0\xb9\xad\xb4\x38", + .klen = 16, + .iv = "\x00\xff\xff\xff\xff\x00\x00\xff" + "\xff\xff\x00\xff", + .ptext = "\x42\xc1\xcc\x08\x48\x6f\x41\x3f" + "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0" + "\x58\x83\xf0\xc3\x70\x14\xc0\x5b" + "\x3f\xec\x1d\x25\x3c\x51\xd2\x03" + "\xcf\x59\x74\x1f\xb2\x85\xb4\x07" + "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb" + "\xaf\x08\x44\xbd\x6f\x91\x15\xe1" + "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50" + "\x59\xa9\x97\xab\xbb\x0e\x74\x5c" + "\x00\xa4\x43\x54\x04\x54\x9b\x3b" + "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08" + "\xae\xe6\x10\x3f\x32\x65\xd1\xfc" + "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3" + "\x35\x23\xf4\x20\x41\xd4\xad\x82" + "\x8b\xa4\xad\x96\x1c\x20\x53\xbe" + "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72" + "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7" + "\xad\x49\x3a\xae\x98\xce\xa6\x66" + "\x10\x30\x90\x8c\x55\x83\xd7\x7c" + "\x8b\xe6\x53\xde\xd2\x6e\x18\x21" + "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73" + "\x57\xcc\x89\x09\x75\x9b\x78\x70" + "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5" + "\xfa\x70\x04\x70\xc6\x96\x1c\x7d" + "\x54\x41\x77\xa8\xe3\xb0\x7e\x96" + "\x82\xd9\xec\xa2\x87\x68\x55\xf9" + "\x8f\x9e\x73\x43\x47\x6a\x08\x36" + "\x93\x67\xa8\x2d\xde\xac\x41\xa9" + "\x5c\x4d\x73\x97\x0f\x70\x68\xfa" + "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9" + "\x78\x1f\x51\x07\xe3\x9a\x13\x4e" + "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7" + "\xab\x19\x37\xd9\xba\x76\x5e\xd2" + "\xf2\x53\x15\x17\x4c\x6b\x16\x9f" + "\x02\x66\x49\xca\x7c\x91\x05\xf2" + "\x45\x36\x1e\xf5\x77\xad\x1f\x46" + "\xa8\x13\xfb\x63\xb6\x08\x99\x63" + "\x82\xa2\xed\xb3\xac\xdf\x43\x19" + "\x45\xea\x78\x73\xd9\xb7\x39\x11" + "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81" + "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79" + "\xa4\x47\x7d\x80\x20\x26\xfd\x63" + "\x0a\xc7\x7e\x6d\x75\x47\xff\x76" + "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b" + "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1" + "\x54\x03\xa4\x09\x0c\x37\x7a\x15" + "\x23\x27\x5b\x8b\x4b\xa5\x64\x97" + "\xae\x4a\x50\x73\x1f\x66\x1c\x5c" + "\x03\x25\x3c\x8d\x48\x58\x71\x34" + "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5" + "\xb6\x19\x2b\x84\x2a\x20\xd1\xea" + "\x80\x6f\x96\x0e\x05\x62\xc7\x78" + "\x87\x79\x60\x38\x46\xb4\x25\x57" + "\x6e\x16\x63\xf8\xad\x6e\xd7\x42" + "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a" + "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22" + "\x86\x5c\x74\x3a\xeb\x24\x26\xc7" + "\x09\xfc\x91\x96\x47\x87\x4f\x1a" + "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24" + "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a" + "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5" + "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb" + "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe" + "\x0b\x63\xde\x87\x42\x79\x8a\x68" + "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f" + "\x9d\xd1\xc7\x45\x90\x08\xc9\x83" + "\xe9\x83\x84\xcb\x28\x69\x09\x69" + "\xce\x99\x46\x00\x54\xcb\xd8\x38" + "\xf9\x53\x4a\xbf\x31\xce\x57\x15" + 
"\x33\xfa\x96\x04\x33\x42\xe3\xc0" + "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6" + "\x19\x95\xd0\x0e\x82\x07\x63\xf9" + "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9" + "\xb5\x9f\x23\x28\x60\xe7\x20\x51" + "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2" + "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb" + "\x78\xc6\x91\x22\x40\x91\x80\xbe" + "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9" + "\x67\x10\xa4\x83\x98\x79\x23\xe7" + "\x92\xda\xa9\x22\x16\xb1\xe7\x78" + "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37" + "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9" + "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d" + "\x48\x11\x06\xbb\x2d\xf2\x63\x88" + "\x3f\x73\x09\xe2\x45\x56\x31\x51" + "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9" + "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66" + "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23" + "\x59\xfa\xfa\xaa\x44\x04\x01\xa7" + "\xa4\x78\xdb\x74\x3d\x8b\xb5", + .plen = 719, + .ctext = "\xdc\xb1\x0f\x2a\xe8\x2d\x1c\x57" + "\xc4\x82\xfa\xd6\x87\xe6\x2f\x50" + "\xbd\x9e\x0a\x42\x31\xf2\xc7\xbb" + "\x21\x63\xa7\x05\x43\x33\xef\x33" + "\x5c\xd3\x47\x55\xce\x5c\xe4\xd4" + "\xe5\x07\x62\x22\xac\x01\xa8\x35" + "\x9c\x59\x34\x30\x8e\xff\x9f\xb4" + "\xd2\x4e\x74\x90\x64\xf2\x78\x5e" + "\x63\xb7\xc5\x08\x1b\x37\xa5\x9e" + "\xc0\xde\xff\xa9\x7f\x0b\xd3\x02" + "\x83\x6e\x33\xfa\x43\x11\xd3\xda" + "\x02\xcf\xcd\x4a\xc0\x78\x1f\x39" + "\x62\xcb\xa3\x95\x7e\x13\x92\x28" + "\xb2\xc4\x7a\xba\xd1\xc6\xf6\x1f" + "\xda\x0b\xf1\xd1\x99\x54\xd8\x3b" + "\x16\xf8\xe6\x97\x1e\xa7\xcf\x49" + "\x69\x84\x01\x4c\xdc\x7a\x34\xff" + "\x01\x08\xa3\x0b\x39\xac\x21\x37" + "\xd8\xb4\x04\x19\x8b\x7a\x7d\x17" + "\x44\xd1\x18\xaf\x1f\xa9\x29\xfe" + "\xfa\x77\xe0\x40\x42\x0c\x79\xb7" + "\xc3\x15\x1b\xd9\x0c\x82\xfc\x16" + "\x70\xd6\x2a\xe9\x94\x72\xc5\xa5" + "\x8a\x58\xbc\xfa\xe0\x88\x39\x4a" + "\x80\xe8\xec\xaf\x60\xac\xe7\xf8" + "\x9c\xf0\xfc\x61\x39\x07\x98\x6b" + "\x88\xe3\x98\x22\x28\x18\x4a\x2d" + "\x25\xef\x10\xe3\x83\x66\x3f\xfd" + "\xc7\x0b\xa3\xfd\x97\xa9\xf4\xbd" + "\xd8\x2a\xee\x4a\x50\xad\xcc\xb5" + "\xc7\xab\xb8\x79\x9c\xd1\xf1\x27" + "\x08\xf5\xf5\xe8\x1b\x66\xce\x41" + "\x56\x60\x94\x86\xf0\x78\xc2\xfa" + "\x5b\x63\x40\xb1\xd1\x1a\x38\x69" + "\x0b\x8c\xb2\xf5\xa2\xbe\x90\x9d" + "\x46\x23\x79\x8b\x3b\x4a\xf4\xbb" + "\x55\xf7\x58\x9d\xaf\x59\xff\x74" + "\xf3\xb9\xc4\x26\xb1\xf8\xe1\x28" + "\x8b\x5e\x8f\x6d\x64\xe7\xe8\x63" + "\xd2\x9e\xcb\xee\xae\x19\x04\x1d" + "\x05\xf0\x9d\x99\x7b\x33\x33\xae" + "\x6e\xe5\x09\xdd\x67\x51\xc4\xc8" + "\x6a\xc7\x36\x35\xc9\x93\x76\xa1" + "\xa8\x1c\xfa\x75\x92\x34\x0e\x7d" + "\x3d\x1d\xef\x00\xfd\xa5\x25\x12" + "\x7c\x91\x21\x41\xcc\x50\x47\xa9" + "\x22\x50\x24\x96\x34\x79\x3d\xe8" + "\x3f\xa0\x56\xaf\x98\x53\x55\xc3" + "\x46\x1b\x17\x54\xb8\xb0\xb7\xe0" + "\xe0\xab\x47\x6f\x06\xda\xcc\x75" + "\xa7\x96\xb7\x92\xf3\xa0\x5f\xe6" + "\xba\x97\xe3\x2f\x97\x05\xb2\x99" + "\xa0\x09\x10\x98\x9c\xd3\x2e\xd1" + "\x7e\x2a\x30\x54\x3c\xb9\x33\xe3" + "\xf2\xaf\xd3\xa5\xee\xd0\x0b\x8a" + "\x19\x54\x0f\x02\x51\x1f\x91\xdf" + "\x71\x9c\xad\x77\x35\x28\x55\x6d" + "\xcd\x7a\xd9\xa3\x41\x98\x6b\x37" + "\x19\x0f\xbe\xae\x69\xb2\x25\x01" + "\xee\x0e\x51\x4b\x53\xea\x0f\x5f" + "\x85\x74\x79\x36\x32\x0a\x2a\x40" + "\xad\x6b\x78\x41\x54\x99\xe9\xc1" + "\x2b\x6c\x9b\x42\x21\xef\xe2\x50" + "\x56\x8d\x78\xdf\x58\xbe\x0a\x0f" + "\xfc\xfc\x0d\x2e\xd0\xcb\xa6\x0a" + "\xa8\xd9\x1e\xa9\xd4\x7c\x99\x88" + "\xcf\x11\xad\x1c\xd3\x04\x63\x55" + "\xef\x85\x0b\x69\xa1\x40\xf1\x75" + "\x24\xf4\xe5\x2c\xd4\x7a\x24\x50" + "\x8f\xa2\x71\xc9\x92\x20\xcd\xcf" + "\xda\x40\xbe\xf6\xfe\x1a\xca\xc7" + "\x4a\x80\x45\x55\xcb\xdd\xb7\x01" + "\xb0\x8d\xcb\xd2\xae\xbd\xa4\xd0" + "\x5c\x10\x05\x66\x7b\xd4\xff\xd9" + 
"\xc4\x23\x9d\x8d\x6b\x24\xf8\x3f" + "\x73\x4d\x5c\x2b\x33\x4c\x5e\x63" + "\x74\x6d\x03\xa1\x7a\x35\x65\x17" + "\x38\x7f\x3b\xc1\x69\xcf\x61\x34" + "\x30\x21\xaf\x97\x47\x12\x3f\xa1" + "\xa7\x50\xc5\x87\xfb\x3f\x70\x32" + "\x86\x17\x5f\x25\xe4\x74\xc6\xd0" + "\x9b\x39\xe6\xe1\x5a\xec\x8f\x40" + "\xce\xcc\x37\x3b\xd8\x72\x1c\x31" + "\x75\xa4\xa6\x89\x8c\xdd\xd6\xd2" + "\x32\x3d\xe8\xc3\x54\xab\x1f\x35" + "\x52\xb4\x94\x81\xb0\x37\x3a\x03" + "\xbb\xb1\x99\x30\xa5\xf8\x21\xcd" + "\x93\x5d\xa7\x13\xed\xc7\x49\x09" + "\x70\xda\x08\x39\xaa\x15\x9e\x45" + "\x35\x2b\x0f\x5c\x8c\x8b\xc9" + "\xa8\xb8\x9f\xfd\x37\x36\x31\x7e" + "\x34\x4f\xc1\xc0\xca\x8a\x22\xfd", + .clen = 735, } }; @@ -14947,6 +15770,282 @@ static const struct aead_testvec sm4_ccm_tv_template[] = { "\x16\x84\x2D\x4F\xA1\x86\xF5\x6A" "\xB3\x32\x56\x97\x1F\xA1\x10\xF4", .clen = 80, + }, { /* Generated from AES-CCM test vectors */ + .key = "\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7" + "\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf", + .klen = 16, + .iv = "\x01\x00\x00\x00\x03\x02\x01\x00" + "\xa0\xa1\xa2\xa3\xa4\xa5\x00\x00", + .assoc = "\x00\x01\x02\x03\x04\x05\x06\x07", + .alen = 8, + .ptext = "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e", + .plen = 23, + .ctext = "\x7b\xff\x4a\x15\xf5\x73\xce\x82" + "\x6e\xc2\x31\x1d\xe2\x53\x02\xac" + "\xa4\x48\xf9\xe4\xf5\x1f\x81\x70" + "\x18\xbc\xb6\x84\x01\xb8\xae", + .clen = 31, + }, { + .key = "\xf4\x6b\xc2\x75\x62\xfe\xb4\xe1" + "\x53\x14\x73\x66\x8d\x88\xf6\x80", + .klen = 16, + .iv = "\x03\xa0\x20\x35\x26\xf2\x21\x8d" + "\x50\x20\xda\xe2\x00\x00\x00\x00", + .assoc = "\x5b\x9e\x13\x67\x02\x5e\xef\xc1" + "\x6c\xf9\xd7\x1e\x52\x8f\x7a\x47" + "\xe9\xd4\xcf\x20\x14\x6e\xf0\x2d" + "\xd8\x9e\x2b\x56\x10\x23\x56\xe7", + .alen = 32, + .ctext = "\x23\x58\xce\xdc\x40\xb1\xcd\x92" + "\x47\x96\x59\xfc\x8a\x26\x4f\xcf", + .clen = 16, + }, { + .key = "\xab\x2f\x8a\x74\xb7\x1c\xd2\xb1" + "\xff\x80\x2e\x48\x7d\x82\xf8\xb9", + .klen = 16, + .iv = "\x03\xaf\x94\x87\x78\x35\x82\x81" + "\x7f\x88\x94\x68\x00\x00\x00\x00", + .alen = 0, + .ptext = "\x00", + .plen = 0, + .ctext = "\x72\x7e\xf5\xd6\x39\x7a\x2b\x43", + .clen = 8, + }, { + .key = "\x39\xbb\xa7\xbe\x59\x97\x9e\x73" + "\xa4\x48\x93\x39\x26\x71\x4a\xc6", + .klen = 16, + .iv = "\x03\xee\x49\x83\xe9\xa9\xff\xe9" + "\x57\xba\xfd\x9e\x00\x00\x00\x00", + .assoc = "\x44\xa6\x2c\x05\xe9\xe1\x43\xb1" + "\x58\x7c\xf2\x5c\x6d\x39\x0a\x64" + "\xa4\xf0\x13\x05\xd1\x77\x99\x67" + "\x11\xc4\xc6\xdb\x00\x56\x36\x61", + .alen = 32, + .ptext = "\x00", + .plen = 0, + .ctext = "\xb0\x9d\xc6\xfb\x7d\xb5\xa1\x0e", + .clen = 8, + }, { + .key = "\x58\x5d\xa0\x96\x65\x1a\x04\xd7" + "\x0d\x1a\x53\x3b\xb5\xe3\xf8\x8b", + .klen = 16, + .iv = "\x03\xcf\x76\x3f\xd9\x95\x75\x8f" + "\x44\x89\x40\x7b\x00\x00\x00\x00", + .assoc = "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88" + "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b" + "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b" + "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe", + .alen = 32, + .ptext = "\xc2\x54\xc8\xde\x78\x87\x77\x40" + "\x49\x71\xe4\xb7\xe7\xcb\x76\x61" + "\x0a\x41\xb9\xe9\xc0\x76\x54\xab" + "\x04\x49\x3b\x19\x93\x57\x25\x5d", + .plen = 32, + .ctext = "\xc9\xae\xef\x1d\xf3\x2c\xd3\x38" + "\xc9\x7f\x7e\x28\xe8\xaa\xb3\x60" + "\x49\xdc\x66\xca\x7b\x3d\xe0\x3c" + "\xcb\x45\x9c\x1b\xb2\xbe\x07\x90" + "\x87\xa6\x6b\x89\x0d\x0f\x90\xaa" + "\x7d\xf6\x5a\x9a\x68\x2b\x81\x92", + .clen = 48, + }, { + .key = "\x8b\x32\xcf\xe7\x44\xed\x13\x59" + "\x04\x38\x77\xb0\xb9\xad\xb4\x38", + .klen = 16, + .iv = "\x02\xff\xff\xff\xff\x00\x00\xff" + 
"\xff\xff\x00\xff\xff\x00\x00\x00", + .assoc = "\x8f\x86\x6c\x4d\x1d\xc5\x39\x88" + "\xc8\xf3\x5c\x52\x10\x63\x6f\x2b" + "\x8a\x2a\xc5\x6f\x30\x23\x58\x7b" + "\xfb\x36\x03\x11\xb4\xd9\xf2\xfe" + "\xc8\xf3\x5c\x52\x10\x63", + .alen = 38, + .ptext = "\x42\xc1\xcc\x08\x48\x6f\x41\x3f" + "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0" + "\x58\x83\xf0\xc3\x70\x14\xc0\x5b" + "\x3f\xec\x1d\x25\x3c\x51\xd2\x03" + "\xcf\x59\x74\x1f\xb2\x85\xb4\x07" + "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb" + "\xaf\x08\x44\xbd\x6f\x91\x15\xe1" + "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50" + "\x59\xa9\x97\xab\xbb\x0e\x74\x5c" + "\x00\xa4\x43\x54\x04\x54\x9b\x3b" + "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08" + "\xae\xe6\x10\x3f\x32\x65\xd1\xfc" + "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3" + "\x35\x23\xf4\x20\x41\xd4\xad\x82" + "\x8b\xa4\xad\x96\x1c\x20\x53\xbe" + "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72" + "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7" + "\xad\x49\x3a\xae\x98\xce\xa6\x66" + "\x10\x30\x90\x8c\x55\x83\xd7\x7c" + "\x8b\xe6\x53\xde\xd2\x6e\x18\x21" + "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73" + "\x57\xcc\x89\x09\x75\x9b\x78\x70" + "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5" + "\xfa\x70\x04\x70\xc6\x96\x1c\x7d" + "\x54\x41\x77\xa8\xe3\xb0\x7e\x96" + "\x82\xd9\xec\xa2\x87\x68\x55\xf9" + "\x8f\x9e\x73\x43\x47\x6a\x08\x36" + "\x93\x67\xa8\x2d\xde\xac\x41\xa9" + "\x5c\x4d\x73\x97\x0f\x70\x68\xfa" + "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9" + "\x78\x1f\x51\x07\xe3\x9a\x13\x4e" + "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7" + "\xab\x19\x37\xd9\xba\x76\x5e\xd2" + "\xf2\x53\x15\x17\x4c\x6b\x16\x9f" + "\x02\x66\x49\xca\x7c\x91\x05\xf2" + "\x45\x36\x1e\xf5\x77\xad\x1f\x46" + "\xa8\x13\xfb\x63\xb6\x08\x99\x63" + "\x82\xa2\xed\xb3\xac\xdf\x43\x19" + "\x45\xea\x78\x73\xd9\xb7\x39\x11" + "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81" + "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79" + "\xa4\x47\x7d\x80\x20\x26\xfd\x63" + "\x0a\xc7\x7e\x6d\x75\x47\xff\x76" + "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b" + "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1" + "\x54\x03\xa4\x09\x0c\x37\x7a\x15" + "\x23\x27\x5b\x8b\x4b\xa5\x64\x97" + "\xae\x4a\x50\x73\x1f\x66\x1c\x5c" + "\x03\x25\x3c\x8d\x48\x58\x71\x34" + "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5" + "\xb6\x19\x2b\x84\x2a\x20\xd1\xea" + "\x80\x6f\x96\x0e\x05\x62\xc7\x78" + "\x87\x79\x60\x38\x46\xb4\x25\x57" + "\x6e\x16\x63\xf8\xad\x6e\xd7\x42" + "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a" + "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22" + "\x86\x5c\x74\x3a\xeb\x24\x26\xc7" + "\x09\xfc\x91\x96\x47\x87\x4f\x1a" + "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24" + "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a" + "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5" + "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb" + "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe" + "\x0b\x63\xde\x87\x42\x79\x8a\x68" + "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f" + "\x9d\xd1\xc7\x45\x90\x08\xc9\x83" + "\xe9\x83\x84\xcb\x28\x69\x09\x69" + "\xce\x99\x46\x00\x54\xcb\xd8\x38" + "\xf9\x53\x4a\xbf\x31\xce\x57\x15" + "\x33\xfa\x96\x04\x33\x42\xe3\xc0" + "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6" + "\x19\x95\xd0\x0e\x82\x07\x63\xf9" + "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9" + "\xb5\x9f\x23\x28\x60\xe7\x20\x51" + "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2" + "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb" + "\x78\xc6\x91\x22\x40\x91\x80\xbe" + "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9" + "\x67\x10\xa4\x83\x98\x79\x23\xe7" + "\x92\xda\xa9\x22\x16\xb1\xe7\x78" + "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37" + "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9" + "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d" + "\x48\x11\x06\xbb\x2d\xf2\x63\x88" + "\x3f\x73\x09\xe2\x45\x56\x31\x51" + "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9" + "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66" + "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23" + "\x59\xfa\xfa\xaa\x44\x04\x01\xa7" + 
"\xa4\x78\xdb\x74\x3d\x8b\xb5", + .plen = 719, + .ctext = "\xc5\x50\x85\x02\x72\xa8\xb3\x62" + "\xf9\xcd\x77\x7b\x43\xa5\x04\x70" + "\x68\x40\x57\x21\x1c\xfe\xef\x05" + "\x4d\xb8\x44\xba\x59\xea\x62\x32" + "\xcb\x6b\x6a\x39\x9b\xf3\xe5\xa4" + "\x36\x38\xde\x7d\xcf\xb6\xcd\xe3" + "\x89\xbf\x37\xc9\x96\x3c\x70\x10" + "\x92\x47\xcc\xac\x6f\xf8\x55\x9a" + "\x26\x43\x34\xb4\x92\x7d\x68\xfc" + "\x60\x37\x74\x2a\x55\xba\xc7\xd7" + "\x98\x69\xb7\xcf\x42\xfd\xb2\x10" + "\xa0\x59\xe1\x2c\x73\x66\x12\x97" + "\x85\x8b\x28\xcc\x29\x02\x15\x89" + "\x23\xd3\x32\x92\x87\x57\x09\x13" + "\x04\x7e\x8b\x6c\x3a\xc1\x4e\x6c" + "\xe1\x9f\xc8\xcc\x47\x9c\xd8\x10" + "\xf4\xb7\x5c\x30\x7a\x8b\x0f\x01" + "\x52\x38\x02\x92\x99\xac\x03\x90" + "\x18\x32\x2d\x21\x6a\x0a\x2a\xe7" + "\xc2\xcc\x15\x84\x4e\x2b\x0b\x3a" + "\x4c\xdc\xb0\x6b\x10\xd1\x27\x10" + "\xf0\x4a\x5c\x43\xa0\x34\x34\x59" + "\x47\x43\x48\xcb\x69\xa7\xff\x52" + "\xb8\xca\x23\x09\x07\xd7\xc5\xe4" + "\x2a\x4f\x99\xd5\x83\x36\x2a\x2d" + "\x59\xd0\xca\xb0\xfa\x40\x8c\xab" + "\xdf\x69\x08\xd9\x79\x1d\xde\xa8" + "\x0b\x34\x74\x4d\xf5\xa0\x4c\x81" + "\x7f\x93\x06\x40\x24\xfe\x7d\xcd" + "\xe4\xfe\xf8\xf8\x30\xce\xd0\x5d" + "\x70\xfd\x0d\x5a\x78\x85\x74\x2d" + "\xe4\xb5\x40\x18\x99\x11\xe4\x6a" + "\xdf\xfa\x4f\x25\x2c\xde\x15\xb7" + "\x12\xd8\xc6\x90\x0d\x0f\xc9\xfb" + "\x21\xf1\xed\xfe\x98\xe1\x03\xe2" + "\x5c\xef\xb6\xc7\x87\x77\x0e\xcd" + "\xff\x78\x94\xc9\xbe\xd3\x47\xf7" + "\x8d\x37\x48\x01\x42\xe2\x17\x96" + "\xfc\xc0\xcb\x7b\x7b\x57\xaf\x3b" + "\xc9\xd0\x94\xce\x5e\x1b\xa9\x47" + "\x02\x4d\x74\xcc\x45\x1d\xd3\x2d" + "\x5f\x4f\x7f\xf2\x4b\xf9\x59\xee" + "\x9e\x9e\xb9\x95\x29\x19\xd1\x5f" + "\x72\xab\x8d\xf1\x28\xd1\x1c\xae" + "\xc2\xba\xf7\x22\x84\x2c\x83\x51" + "\x03\xad\xa3\xef\x81\xa7\xdc\xf1" + "\x44\x51\x50\x96\x70\xd1\xe5\x47" + "\x57\xf9\x30\x90\xe4\xbf\xfc\x75" + "\x14\xaa\x4d\xb7\xb1\xe7\x79\x33" + "\x43\xc2\x5c\xc1\xbc\x09\x92\x0f" + "\xa7\xaf\x68\x51\x51\xec\x0b\xc3" + "\x3d\x2b\x94\x30\x45\x29\x1b\x9e" + "\x70\x56\xf8\xd6\x67\x2d\x39\x3b" + "\x3c\xd2\xd0\xd3\xdc\x7d\x84\xe9" + "\x06\x31\x98\xa6\x5c\xbf\x10\x58" + "\xce\xbb\xa7\xe1\x65\x7e\x51\x87" + "\x70\x46\xb4\x7f\xf9\xec\x92\x1c" + "\x9b\x24\x49\xc1\x04\xbe\x1c\x5f" + "\xcc\xb3\x33\x8c\xad\xe7\xdc\x32" + "\x54\xa2\x0d\x83\x0f\x3c\x12\x5d" + "\x71\xe3\x9c\xae\x71\xa3\x2a\x10" + "\xc5\x91\xb4\x73\x96\x60\xdb\x5d" + "\x1f\xd5\x9a\xd2\x69\xc3\xd7\x4b" + "\xa2\x66\x81\x96\x4a\xaa\x02\xd6" + "\xd5\x44\x9b\x42\x3a\x15\x5f\xe7" + "\x4d\x7c\xf6\x71\x4a\xea\xe8\x43" + "\xd7\x68\xe4\xbc\x05\x87\x49\x05" + "\x3b\x47\xb2\x6d\x5f\xd1\x11\xa6" + "\x58\xd4\xa2\x45\xec\xb5\x54\x55" + "\xd3\xd6\xd2\x6a\x8b\x21\x9e\x2c" + "\xf1\x27\x4b\x5b\xe3\xff\xe0\xfd" + "\x4b\xf1\xe7\xe2\x84\xf2\x17\x37" + "\x11\x68\xc4\x92\x4b\x6b\xef\x8e" + "\x75\xf5\xc2\x7d\x5c\xe9\x7c\xfc" + "\x2b\x00\x33\x0e\x7d\x69\xd8\xd4" + "\x9b\xa8\x38\x54\x7e\x6d\x23\x51" + "\x2c\xd6\xc4\x58\x23\x1c\x22\x2a" + "\x59\xc5\x9b\xec\x9d\xbf\x03\x0f" + "\xb3\xdd\xba\x02\x22\xa0\x34\x37" + "\x19\x56\xc2\x5b\x32\x1d\x1e\x66" + "\x68\xf4\x47\x05\x04\x18\xa7\x28" + "\x80\xf2\xc7\x99\xed\x1e\x72\x48" + "\x8f\x97\x5d\xb3\x74\x42\xfd\x0c" + "\x0f\x5f\x29\x0c\xf1\x35\x22\x90" + "\xd6\x7c\xb8\xa3\x2a\x89\x38\x71" + "\xe9\x7a\x55\x3c\x3b\xf2\x6e\x1a" + "\x22\x8f\x07\x81\xc1\xe1\xf1\x76" + "\x2a\x75\xab\x86\xc4\xcc\x52\x59" + "\x83\x19\x5e\xb3\x53\xe2\x81\xdf" + "\xe6\x15\xb3\xba\x0c\x0e\xba" + "\xa9\x2c\xed\x51\xd5\x06\xc8\xc6" + "\x4b\x9f\x5d\x1b\x61\x31\xad\xf4", + .clen = 735, } }; @@ -15030,6 +16129,68 @@ static const struct hash_testvec 
sm4_cmac128_tv_template[] = { } }; +static const struct hash_testvec sm4_xcbc128_tv_template[] = { + { /* Generated from AES-XCBC128 test vectors */ + .key = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .plaintext = zeroed_string, + .digest = "\xa9\x9a\x5c\x44\xe2\x34\xee\x2c" + "\x9b\xe4\x9d\xca\x64\xb0\xa5\xc4", + .psize = 0, + .ksize = 16, + }, { + .key = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .plaintext = "\x00\x01\x02", + .digest = "\x17\x27\x62\xf3\x8b\x88\x1d\xc0" + "\x97\x35\x9c\x3e\x9f\x27\xb7\x83", + .psize = 3, + .ksize = 16, + } , { + .key = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .digest = "\xda\x45\xd1\xac\xec\x4d\xab\x46" + "\xdd\x59\xe0\x44\xff\x59\xd5\xfc", + .psize = 16, + .ksize = 16, + }, { + .key = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13", + .digest = "\xbe\x24\x5d\x81\x8c\x8a\x10\xa4" + "\x8e\xc2\x16\xfa\xa4\x83\xc9\x2a", + .psize = 20, + .ksize = 16, + }, { + .key = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f", + .digest = "\x91\x82\x31\x56\xd5\x77\xa4\xc5" + "\x88\x2d\xce\x3a\x87\x5e\xbd\xba", + .psize = 32, + .ksize = 16, + }, { + .key = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + .plaintext = "\x00\x01\x02\x03\x04\x05\x06\x07" + "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" + "\x10\x11\x12\x13\x14\x15\x16\x17" + "\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" + "\x20\x21", + .digest = "\x2a\xae\xa5\x24\x0c\x12\x9f\x5f" + "\x55\xfb\xae\x35\x13\x0d\x22\x2d", + .psize = 34, + .ksize = 16, + } +}; + /* Cast6 test vectors from RFC 2612 */ static const struct cipher_testvec cast6_tv_template[] = { { From patchwork Mon Sep 26 09:36:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988478 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 84426C32771 for ; Mon, 26 Sep 2022 09:40:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=ZtDqgeQbfXJR6RR6sC30l51UECPzI4iwXwerglb59/g=; b=hcJl3PdRMGHuRo 5+7s94nBTTlD0CQVRY4oviMGcLASN35zhdkJa3Wsv7FkRnjdlfs3HZE8I2MnYEJBmq63NA/VrOq51 TYHuN0V3XYjMQlEBU/H/gksdO96wxAI4CFgcD4aGtXC4qhptNLu0Rogkv5czSryk2Yg+Sc0gCZDsb Co1xPD6Mcbn2j09327qn7M/E/9V3NxnB+l26+XHWqS0YEIazISWHF7dtCPLSS13eZz8B5P0sDm4hL Q34sUuMY9fV6aptZkCy1sp7cloaXZFRkuhhAj+HhzJOPbK+eO+lBQJmD5/tTQqHY5w/yZFyKSU3Ps 6at/+JN0Mx23FP4PezJw==; Received: from localhost ([::1] 
helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockZv-003hvO-86; Mon, 26 Sep 2022 09:39:07 +0000 Received: from out30-56.freemail.mail.aliyun.com ([115.124.30.56]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXW-003gkC-Bc for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:43 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R561e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046051;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkQP7a_1664184993; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkQP7a_1664184993) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:34 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 05/16] crypto: tcrypt - add SM4 cts-cbc/essiv/xts/xcbc test Date: Mon, 26 Sep 2022 17:36:09 +0800 Message-Id: <20220926093620.99898-6-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023638_598651_0055E729 X-CRM114-Status: UNSURE ( 7.59 ) X-CRM114-Notice: Please train this message. X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Add CTS-CBC/ESSIV/XTS/XCBC tests for the SM4 algorithm, as well as the corresponding speed tests, in order to exercise the performance-optimized implementations of these modes.
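
These tcrypt/testmgr hooks exercise the algorithms from inside the kernel; the same skcipher instantiations can also be driven from userspace through AF_ALG. The snippet below is only an illustrative sketch (all-zero key and IV, not a vector from this series) and assumes a kernel built with CONFIG_CRYPTO_USER_API_SKCIPHER and an "xts(sm4)" implementation registered:

/* Minimal AF_ALG sketch: encrypt 32 bytes with xts(sm4) using an all-zero
 * key and IV.  Error handling is trimmed for brevity. */
#include <linux/if_alg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

int main(void)
{
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "skcipher",
        .salg_name   = "xts(sm4)",
    };
    unsigned char key[32] = { 0 };    /* XTS uses two 128-bit SM4 keys */
    unsigned char pt[32] = { 0 }, ct[32];
    char cbuf[CMSG_SPACE(sizeof(int)) +
              CMSG_SPACE(offsetof(struct af_alg_iv, iv) + 16)] = { 0 };
    struct iovec iov = { .iov_base = pt, .iov_len = sizeof(pt) };
    struct msghdr msg = { 0 };
    struct af_alg_iv *alg_iv;
    struct cmsghdr *cmsg;
    int tfm, op;

    tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
    setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
    op = accept(tfm, NULL, 0);

    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    /* First control message: request encryption. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_OP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    *(int *)CMSG_DATA(cmsg) = ALG_OP_ENCRYPT;

    /* Second control message: 16-byte IV (the XTS tweak). */
    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_IV;
    cmsg->cmsg_len = CMSG_LEN(offsetof(struct af_alg_iv, iv) + 16);
    alg_iv = (void *)CMSG_DATA(cmsg);
    alg_iv->ivlen = 16;
    memset(alg_iv->iv, 0, 16);

    sendmsg(op, &msg, 0);
    read(op, ct, sizeof(ct));

    for (size_t i = 0; i < sizeof(ct); i++)
        printf("%02x", ct[i]);
    printf("\n");

    close(op);
    close(tfm);
    return 0;
}

Swapping salg_name for "cts(cbc(sm4))" or "essiv(cbc(sm4),sm3)" exercises the other new skcipher templates in the same way; all of them keep a 16-byte IV.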
Signed-off-by: Tianjia Zhang --- crypto/tcrypt.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c index a82679b576bb..b870b2fe716d 100644 --- a/crypto/tcrypt.c +++ b/crypto/tcrypt.c @@ -1711,6 +1711,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) ret += tcrypt_test("gcm(aria)"); break; + case 59: + ret += tcrypt_test("cts(cbc(sm4))"); + break; + case 100: ret += tcrypt_test("hmac(md5)"); break; @@ -1811,6 +1815,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) ret += tcrypt_test("cmac(sm4)"); break; + case 160: + ret += tcrypt_test("xcbc(sm4)"); + break; + case 181: ret += tcrypt_test("authenc(hmac(sha1),cbc(des))"); break; @@ -1846,6 +1854,7 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) ret += tcrypt_test("cbc(sm4)"); ret += tcrypt_test("cfb(sm4)"); ret += tcrypt_test("ctr(sm4)"); + ret += tcrypt_test("xts(sm4)"); break; case 192: ret += tcrypt_test("ecb(aria)"); @@ -2109,6 +2118,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) speed_template_16); test_cipher_speed("cbc(sm4)", DECRYPT, sec, NULL, 0, speed_template_16); + test_cipher_speed("cts(cbc(sm4))", ENCRYPT, sec, NULL, 0, + speed_template_16); + test_cipher_speed("cts(cbc(sm4))", DECRYPT, sec, NULL, 0, + speed_template_16); test_cipher_speed("cfb(sm4)", ENCRYPT, sec, NULL, 0, speed_template_16); test_cipher_speed("cfb(sm4)", DECRYPT, sec, NULL, 0, @@ -2117,6 +2130,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) speed_template_16); test_cipher_speed("ctr(sm4)", DECRYPT, sec, NULL, 0, speed_template_16); + test_cipher_speed("xts(sm4)", ENCRYPT, sec, NULL, 0, + speed_template_32); + test_cipher_speed("xts(sm4)", DECRYPT, sec, NULL, 0, + speed_template_32); break; case 219: @@ -2212,6 +2229,13 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) speed_template_16, num_mb); break; + case 230: + test_acipher_speed("essiv(cbc(sm4),sm3)", ENCRYPT, sec, + NULL, 0, speed_template_16); + test_acipher_speed("essiv(cbc(sm4),sm3)", DECRYPT, sec, + NULL, 0, speed_template_16); + break; + case 300: if (alg) { test_hash_speed(alg, sec, generic_hash_speed_template); @@ -2630,6 +2654,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb) speed_template_16); test_acipher_speed("ctr(sm4)", DECRYPT, sec, NULL, 0, speed_template_16); + test_acipher_speed("xts(sm4)", ENCRYPT, sec, NULL, 0, + speed_template_32); + test_acipher_speed("xts(sm4)", DECRYPT, sec, NULL, 0, + speed_template_32); break; case 519: From patchwork Mon Sep 26 09:36:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988481 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id BDC11C07E9D for ; Mon, 26 Sep 2022 09:41:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: 
Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=mDZUZrQu0QqRJlYh//cHi7+GlUoVS9rPq/NQGS6oejk=; b=pJrq3CjH2d19DP esigTyz02UUpAxlf0nROSKU5PyFi3g4IbEKiFOWP6gNi871L+eDwGtD3B53wnkpDzlEDMhRTfnsoy x2UUJ6WvrYQy/JOD0SJ4PrO8wFrkwpRHpNMpgF/wxxtfTe2yhMbFwK08cDhKclWtHHIu+X+ZVjt/8 panhMujoNSd8Ux23NllL3GY/tCDW09dWhPdp4pZP7Mh5xJB/sVhqKk7Zvtd+3brbFTkWnan4+sMXb yZ9gnCFUqnnxAf2eZZ/UKE0SzNBb93hYJppyeChFKPqF0aYFWO1bAMgdP59BV7VJd1yPwrK4MkD59 oXKdv/C3HgO0TmdI5O+Q==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockav-003iTa-E3; Mon, 26 Sep 2022 09:40:09 +0000 Received: from out30-45.freemail.mail.aliyun.com ([115.124.30.45]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXY-003gmJ-SD for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:52 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R141e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzEH_1664184995; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzEH_1664184995) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:36 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 06/16] crypto: arm64/sm4 - refactor and simplify CE implementation Date: Mon, 26 Sep 2022 17:36:10 +0800 Message-Id: <20220926093620.99898-7-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023641_345444_9FE996C9 X-CRM114-Status: GOOD ( 13.08 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch does not add new features, but only refactors and simplifies the implementation of the Crypto Extension acceleration of the SM4 algorithm: the macros optimized with the SM4 Crypto Extension are extracted into a shared header so that they can be reused by the subsequent optimization of the CCM/GCM modes. Encryption in CBC and CFB modes processes four blocks at a time instead of one, allowing the ld1 instruction to load 64 bytes of data at a time, which reduces unnecessary memory accesses. The CBC/CFB/CTR code makes full use of the free registers to reduce redundant memory accesses, and rearranges some instructions to improve out-of-order execution.
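
As a rough schematic of the four-blocks-at-a-time CBC encryption described above, the plain C model below shows only the data-movement idea; it is not the kernel code, and sm4_block_fn is a hypothetical stand-in for a single-block SM4 encryption:

/*
 * Schematic model: the block cipher still runs serially because each
 * block is chained through the IV, but plaintext is fetched and
 * ciphertext written back in 64-byte batches, matching one ld1/st1 of
 * four 128-bit registers per group of blocks in the assembly.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SM4_BLOCK_SIZE 16

/* Hypothetical single-block SM4 encryption supplied by the caller. */
typedef void (*sm4_block_fn)(const uint32_t *rkey, uint8_t *out, const uint8_t *in);

void cbc_enc_4blk_model(sm4_block_fn enc, const uint32_t *rkey,
                        uint8_t *dst, const uint8_t *src,
                        uint8_t iv[SM4_BLOCK_SIZE], size_t nblocks)
{
    uint8_t buf[4 * SM4_BLOCK_SIZE];

    while (nblocks >= 4) {
        memcpy(buf, src, sizeof(buf));          /* one wide load in the asm */

        for (int i = 0; i < 4; i++) {
            uint8_t *blk = &buf[i * SM4_BLOCK_SIZE];

            for (int j = 0; j < SM4_BLOCK_SIZE; j++)
                blk[j] ^= iv[j];                /* CBC chaining */
            enc(rkey, blk, blk);
            memcpy(iv, blk, SM4_BLOCK_SIZE);
        }

        memcpy(dst, buf, sizeof(buf));          /* one wide store in the asm */
        src += sizeof(buf);
        dst += sizeof(buf);
        nblocks -= 4;
    }

    /* The remaining 1..3 blocks are handled one at a time (omitted). */
}

The cipher calls stay serial, so the gain is purely in batching the loads and stores of each group of four blocks.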
Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-ce-asm.h | 209 +++++++++++ arch/arm64/crypto/sm4-ce-core.S | 646 ++++++++++++++------------------ arch/arm64/crypto/sm4-ce-glue.c | 64 ++-- 3 files changed, 519 insertions(+), 400 deletions(-) create mode 100644 arch/arm64/crypto/sm4-ce-asm.h diff --git a/arch/arm64/crypto/sm4-ce-asm.h b/arch/arm64/crypto/sm4-ce-asm.h new file mode 100644 index 000000000000..7ea98e42e779 --- /dev/null +++ b/arch/arm64/crypto/sm4-ce-asm.h @@ -0,0 +1,209 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4 helper macros for Crypto Extensions + * Copyright (C) 2022 Tianjia Zhang + */ + +#define SM4_PREPARE(ptr) \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; + +#define SM4_CRYPT_BLK_BE(b0) \ + sm4e b0.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + rev32 b0.16b, b0.16b; + +#define SM4_CRYPT_BLK(b0) \ + rev32 b0.16b, b0.16b; \ + SM4_CRYPT_BLK_BE(b0); + +#define SM4_CRYPT_BLK2_BE(b0, b1) \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + +#define SM4_CRYPT_BLK2(b0, b1) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + SM4_CRYPT_BLK2_BE(b0, b1); + +#define SM4_CRYPT_BLK4_BE(b0, b1, b2, b3) \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + sm4e b3.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + sm4e b3.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + sm4e b3.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + sm4e b3.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + sm4e b3.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + sm4e b3.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + sm4e b3.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + sm4e b3.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; + +#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + SM4_CRYPT_BLK4_BE(b0, b1, b2, b3); + +#define SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7) \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + sm4e b3.4s, v24.4s; \ + sm4e b4.4s, v24.4s; \ + sm4e b5.4s, v24.4s; \ + sm4e b6.4s, v24.4s; \ + sm4e b7.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + sm4e b3.4s, v25.4s; \ + sm4e b4.4s, 
v25.4s; \ + sm4e b5.4s, v25.4s; \ + sm4e b6.4s, v25.4s; \ + sm4e b7.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + sm4e b3.4s, v26.4s; \ + sm4e b4.4s, v26.4s; \ + sm4e b5.4s, v26.4s; \ + sm4e b6.4s, v26.4s; \ + sm4e b7.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + sm4e b3.4s, v27.4s; \ + sm4e b4.4s, v27.4s; \ + sm4e b5.4s, v27.4s; \ + sm4e b6.4s, v27.4s; \ + sm4e b7.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + sm4e b3.4s, v28.4s; \ + sm4e b4.4s, v28.4s; \ + sm4e b5.4s, v28.4s; \ + sm4e b6.4s, v28.4s; \ + sm4e b7.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + sm4e b3.4s, v29.4s; \ + sm4e b4.4s, v29.4s; \ + sm4e b5.4s, v29.4s; \ + sm4e b6.4s, v29.4s; \ + sm4e b7.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + sm4e b3.4s, v30.4s; \ + sm4e b4.4s, v30.4s; \ + sm4e b5.4s, v30.4s; \ + sm4e b6.4s, v30.4s; \ + sm4e b7.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + sm4e b3.4s, v31.4s; \ + sm4e b4.4s, v31.4s; \ + sm4e b5.4s, v31.4s; \ + sm4e b6.4s, v31.4s; \ + sm4e b7.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + rev64 b3.4s, b3.4s; \ + rev64 b4.4s, b4.4s; \ + rev64 b5.4s, b5.4s; \ + rev64 b6.4s, b6.4s; \ + rev64 b7.4s, b7.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + ext b3.16b, b3.16b, b3.16b, #8; \ + ext b4.16b, b4.16b, b4.16b, #8; \ + ext b5.16b, b5.16b, b5.16b, #8; \ + ext b6.16b, b6.16b, b6.16b, #8; \ + ext b7.16b, b7.16b, b7.16b, #8; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; + +#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + rev32 b3.16b, b3.16b; \ + rev32 b4.16b, b4.16b; \ + rev32 b5.16b, b5.16b; \ + rev32 b6.16b, b6.16b; \ + rev32 b7.16b, b7.16b; \ + SM4_CRYPT_BLK8_BE(b0, b1, b2, b3, b4, b5, b6, b7); diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S index 934e0f093279..41fc745a8528 100644 --- a/arch/arm64/crypto/sm4-ce-core.S +++ b/arch/arm64/crypto/sm4-ce-core.S @@ -10,10 +10,12 @@ #include #include +#include "sm4-ce-asm.h" .arch armv8-a+crypto -.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 16, 20, 24, 25, 26, 27, 28, 29, 30, 31 +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \ + 20, 24, 25, 26, 27, 28, 29, 30, 31 .set .Lv\b\().4s, \b .endr @@ -34,174 +36,6 @@ #define RIV v20 -/* Helper macros. 
*/ - -#define PREPARE \ - ld1 {v24.16b-v27.16b}, [x0], #64; \ - ld1 {v28.16b-v31.16b}, [x0]; - -#define SM4_CRYPT_BLK(b0) \ - rev32 b0.16b, b0.16b; \ - sm4e b0.4s, v24.4s; \ - sm4e b0.4s, v25.4s; \ - sm4e b0.4s, v26.4s; \ - sm4e b0.4s, v27.4s; \ - sm4e b0.4s, v28.4s; \ - sm4e b0.4s, v29.4s; \ - sm4e b0.4s, v30.4s; \ - sm4e b0.4s, v31.4s; \ - rev64 b0.4s, b0.4s; \ - ext b0.16b, b0.16b, b0.16b, #8; \ - rev32 b0.16b, b0.16b; - -#define SM4_CRYPT_BLK4(b0, b1, b2, b3) \ - rev32 b0.16b, b0.16b; \ - rev32 b1.16b, b1.16b; \ - rev32 b2.16b, b2.16b; \ - rev32 b3.16b, b3.16b; \ - sm4e b0.4s, v24.4s; \ - sm4e b1.4s, v24.4s; \ - sm4e b2.4s, v24.4s; \ - sm4e b3.4s, v24.4s; \ - sm4e b0.4s, v25.4s; \ - sm4e b1.4s, v25.4s; \ - sm4e b2.4s, v25.4s; \ - sm4e b3.4s, v25.4s; \ - sm4e b0.4s, v26.4s; \ - sm4e b1.4s, v26.4s; \ - sm4e b2.4s, v26.4s; \ - sm4e b3.4s, v26.4s; \ - sm4e b0.4s, v27.4s; \ - sm4e b1.4s, v27.4s; \ - sm4e b2.4s, v27.4s; \ - sm4e b3.4s, v27.4s; \ - sm4e b0.4s, v28.4s; \ - sm4e b1.4s, v28.4s; \ - sm4e b2.4s, v28.4s; \ - sm4e b3.4s, v28.4s; \ - sm4e b0.4s, v29.4s; \ - sm4e b1.4s, v29.4s; \ - sm4e b2.4s, v29.4s; \ - sm4e b3.4s, v29.4s; \ - sm4e b0.4s, v30.4s; \ - sm4e b1.4s, v30.4s; \ - sm4e b2.4s, v30.4s; \ - sm4e b3.4s, v30.4s; \ - sm4e b0.4s, v31.4s; \ - sm4e b1.4s, v31.4s; \ - sm4e b2.4s, v31.4s; \ - sm4e b3.4s, v31.4s; \ - rev64 b0.4s, b0.4s; \ - rev64 b1.4s, b1.4s; \ - rev64 b2.4s, b2.4s; \ - rev64 b3.4s, b3.4s; \ - ext b0.16b, b0.16b, b0.16b, #8; \ - ext b1.16b, b1.16b, b1.16b, #8; \ - ext b2.16b, b2.16b, b2.16b, #8; \ - ext b3.16b, b3.16b, b3.16b, #8; \ - rev32 b0.16b, b0.16b; \ - rev32 b1.16b, b1.16b; \ - rev32 b2.16b, b2.16b; \ - rev32 b3.16b, b3.16b; - -#define SM4_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \ - rev32 b0.16b, b0.16b; \ - rev32 b1.16b, b1.16b; \ - rev32 b2.16b, b2.16b; \ - rev32 b3.16b, b3.16b; \ - rev32 b4.16b, b4.16b; \ - rev32 b5.16b, b5.16b; \ - rev32 b6.16b, b6.16b; \ - rev32 b7.16b, b7.16b; \ - sm4e b0.4s, v24.4s; \ - sm4e b1.4s, v24.4s; \ - sm4e b2.4s, v24.4s; \ - sm4e b3.4s, v24.4s; \ - sm4e b4.4s, v24.4s; \ - sm4e b5.4s, v24.4s; \ - sm4e b6.4s, v24.4s; \ - sm4e b7.4s, v24.4s; \ - sm4e b0.4s, v25.4s; \ - sm4e b1.4s, v25.4s; \ - sm4e b2.4s, v25.4s; \ - sm4e b3.4s, v25.4s; \ - sm4e b4.4s, v25.4s; \ - sm4e b5.4s, v25.4s; \ - sm4e b6.4s, v25.4s; \ - sm4e b7.4s, v25.4s; \ - sm4e b0.4s, v26.4s; \ - sm4e b1.4s, v26.4s; \ - sm4e b2.4s, v26.4s; \ - sm4e b3.4s, v26.4s; \ - sm4e b4.4s, v26.4s; \ - sm4e b5.4s, v26.4s; \ - sm4e b6.4s, v26.4s; \ - sm4e b7.4s, v26.4s; \ - sm4e b0.4s, v27.4s; \ - sm4e b1.4s, v27.4s; \ - sm4e b2.4s, v27.4s; \ - sm4e b3.4s, v27.4s; \ - sm4e b4.4s, v27.4s; \ - sm4e b5.4s, v27.4s; \ - sm4e b6.4s, v27.4s; \ - sm4e b7.4s, v27.4s; \ - sm4e b0.4s, v28.4s; \ - sm4e b1.4s, v28.4s; \ - sm4e b2.4s, v28.4s; \ - sm4e b3.4s, v28.4s; \ - sm4e b4.4s, v28.4s; \ - sm4e b5.4s, v28.4s; \ - sm4e b6.4s, v28.4s; \ - sm4e b7.4s, v28.4s; \ - sm4e b0.4s, v29.4s; \ - sm4e b1.4s, v29.4s; \ - sm4e b2.4s, v29.4s; \ - sm4e b3.4s, v29.4s; \ - sm4e b4.4s, v29.4s; \ - sm4e b5.4s, v29.4s; \ - sm4e b6.4s, v29.4s; \ - sm4e b7.4s, v29.4s; \ - sm4e b0.4s, v30.4s; \ - sm4e b1.4s, v30.4s; \ - sm4e b2.4s, v30.4s; \ - sm4e b3.4s, v30.4s; \ - sm4e b4.4s, v30.4s; \ - sm4e b5.4s, v30.4s; \ - sm4e b6.4s, v30.4s; \ - sm4e b7.4s, v30.4s; \ - sm4e b0.4s, v31.4s; \ - sm4e b1.4s, v31.4s; \ - sm4e b2.4s, v31.4s; \ - sm4e b3.4s, v31.4s; \ - sm4e b4.4s, v31.4s; \ - sm4e b5.4s, v31.4s; \ - sm4e b6.4s, v31.4s; \ - sm4e b7.4s, v31.4s; \ - rev64 b0.4s, b0.4s; \ - rev64 b1.4s, b1.4s; \ - rev64 b2.4s, 
b2.4s; \ - rev64 b3.4s, b3.4s; \ - rev64 b4.4s, b4.4s; \ - rev64 b5.4s, b5.4s; \ - rev64 b6.4s, b6.4s; \ - rev64 b7.4s, b7.4s; \ - ext b0.16b, b0.16b, b0.16b, #8; \ - ext b1.16b, b1.16b, b1.16b, #8; \ - ext b2.16b, b2.16b, b2.16b, #8; \ - ext b3.16b, b3.16b, b3.16b, #8; \ - ext b4.16b, b4.16b, b4.16b, #8; \ - ext b5.16b, b5.16b, b5.16b, #8; \ - ext b6.16b, b6.16b, b6.16b, #8; \ - ext b7.16b, b7.16b, b7.16b, #8; \ - rev32 b0.16b, b0.16b; \ - rev32 b1.16b, b1.16b; \ - rev32 b2.16b, b2.16b; \ - rev32 b3.16b, b3.16b; \ - rev32 b4.16b, b4.16b; \ - rev32 b5.16b, b5.16b; \ - rev32 b6.16b, b6.16b; \ - rev32 b7.16b, b7.16b; - .align 3 SYM_FUNC_START(sm4_ce_expand_key) @@ -268,7 +102,7 @@ SYM_FUNC_START(sm4_ce_crypt_block) * x1: dst * x2: src */ - PREPARE; + SM4_PREPARE(x0) ld1 {v0.16b}, [x2]; SM4_CRYPT_BLK(v0); @@ -285,7 +119,7 @@ SYM_FUNC_START(sm4_ce_crypt) * x2: src * w3: nblocks */ - PREPARE; + SM4_PREPARE(x0) .Lcrypt_loop_blk: sub w3, w3, #8; @@ -337,26 +171,50 @@ SYM_FUNC_START(sm4_ce_cbc_enc) * x3: iv (big endian, 128 bit) * w4: nblocks */ - PREPARE; + SM4_PREPARE(x0) + + ld1 {RIV.16b}, [x3] + +.Lcbc_enc_loop_4x: + cmp w4, #4 + blt .Lcbc_enc_loop_1x + + sub w4, w4, #4 - ld1 {RIV.16b}, [x3]; + ld1 {v0.16b-v3.16b}, [x2], #64 -.Lcbc_enc_loop: - sub w4, w4, #1; + eor v0.16b, v0.16b, RIV.16b + SM4_CRYPT_BLK(v0) + eor v1.16b, v1.16b, v0.16b + SM4_CRYPT_BLK(v1) + eor v2.16b, v2.16b, v1.16b + SM4_CRYPT_BLK(v2) + eor v3.16b, v3.16b, v2.16b + SM4_CRYPT_BLK(v3) - ld1 {RTMP0.16b}, [x2], #16; - eor RIV.16b, RIV.16b, RTMP0.16b; + st1 {v0.16b-v3.16b}, [x1], #64 + mov RIV.16b, v3.16b - SM4_CRYPT_BLK(RIV); + cbz w4, .Lcbc_enc_end + b .Lcbc_enc_loop_4x - st1 {RIV.16b}, [x1], #16; +.Lcbc_enc_loop_1x: + sub w4, w4, #1 - cbnz w4, .Lcbc_enc_loop; + ld1 {v0.16b}, [x2], #16 + eor RIV.16b, RIV.16b, v0.16b + SM4_CRYPT_BLK(RIV) + + st1 {RIV.16b}, [x1], #16 + + cbnz w4, .Lcbc_enc_loop_1x + +.Lcbc_enc_end: /* store new IV */ - st1 {RIV.16b}, [x3]; + st1 {RIV.16b}, [x3] - ret; + ret SYM_FUNC_END(sm4_ce_cbc_enc) .align 3 @@ -368,79 +226,93 @@ SYM_FUNC_START(sm4_ce_cbc_dec) * x3: iv (big endian, 128 bit) * w4: nblocks */ - PREPARE; + SM4_PREPARE(x0) - ld1 {RIV.16b}, [x3]; + ld1 {RIV.16b}, [x3] -.Lcbc_loop_blk: - sub w4, w4, #8; - tbnz w4, #31, .Lcbc_tail8; +.Lcbc_dec_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lcbc_dec_4x - ld1 {v0.16b-v3.16b}, [x2], #64; - ld1 {v4.16b-v7.16b}, [x2]; + ld1 {v0.16b-v3.16b}, [x2], #64 + ld1 {v4.16b-v7.16b}, [x2], #64 - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); + rev32 v8.16b, v0.16b + rev32 v9.16b, v1.16b + rev32 v10.16b, v2.16b + rev32 v11.16b, v3.16b + rev32 v12.16b, v4.16b + rev32 v13.16b, v5.16b + rev32 v14.16b, v6.16b + rev32 v15.16b, v7.16b - sub x2, x2, #64; - eor v0.16b, v0.16b, RIV.16b; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v1.16b, v1.16b, RTMP0.16b; - eor v2.16b, v2.16b, RTMP1.16b; - eor v3.16b, v3.16b, RTMP2.16b; - st1 {v0.16b-v3.16b}, [x1], #64; + SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15) - eor v4.16b, v4.16b, RTMP3.16b; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v5.16b, v5.16b, RTMP0.16b; - eor v6.16b, v6.16b, RTMP1.16b; - eor v7.16b, v7.16b, RTMP2.16b; + eor v8.16b, v8.16b, RIV.16b + eor v9.16b, v9.16b, v0.16b + eor v10.16b, v10.16b, v1.16b + eor v11.16b, v11.16b, v2.16b + eor v12.16b, v12.16b, v3.16b + eor v13.16b, v13.16b, v4.16b + eor v14.16b, v14.16b, v5.16b + eor v15.16b, v15.16b, v6.16b - mov RIV.16b, RTMP3.16b; - st1 {v4.16b-v7.16b}, [x1], #64; + st1 {v8.16b-v11.16b}, [x1], #64 + st1 {v12.16b-v15.16b}, [x1], #64 - cbz w4, .Lcbc_end; - b 
.Lcbc_loop_blk; + mov RIV.16b, v7.16b -.Lcbc_tail8: - add w4, w4, #8; - cmp w4, #4; - blt .Lcbc_tail4; + cbz w4, .Lcbc_dec_end + b .Lcbc_dec_loop_8x - sub w4, w4, #4; +.Lcbc_dec_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lcbc_dec_loop_1x - ld1 {v0.16b-v3.16b}, [x2]; + sub w4, w4, #4 - SM4_CRYPT_BLK4(v0, v1, v2, v3); + ld1 {v0.16b-v3.16b}, [x2], #64 - eor v0.16b, v0.16b, RIV.16b; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v1.16b, v1.16b, RTMP0.16b; - eor v2.16b, v2.16b, RTMP1.16b; - eor v3.16b, v3.16b, RTMP2.16b; + rev32 v8.16b, v0.16b + rev32 v9.16b, v1.16b + rev32 v10.16b, v2.16b + rev32 v11.16b, v3.16b - mov RIV.16b, RTMP3.16b; - st1 {v0.16b-v3.16b}, [x1], #64; + SM4_CRYPT_BLK4_BE(v8, v9, v10, v11) - cbz w4, .Lcbc_end; + eor v8.16b, v8.16b, RIV.16b + eor v9.16b, v9.16b, v0.16b + eor v10.16b, v10.16b, v1.16b + eor v11.16b, v11.16b, v2.16b -.Lcbc_tail4: - sub w4, w4, #1; + st1 {v8.16b-v11.16b}, [x1], #64 - ld1 {v0.16b}, [x2]; + mov RIV.16b, v3.16b - SM4_CRYPT_BLK(v0); + cbz w4, .Lcbc_dec_end - eor v0.16b, v0.16b, RIV.16b; - ld1 {RIV.16b}, [x2], #16; - st1 {v0.16b}, [x1], #16; +.Lcbc_dec_loop_1x: + sub w4, w4, #1 + + ld1 {v0.16b}, [x2], #16 + + rev32 v8.16b, v0.16b + + SM4_CRYPT_BLK_BE(v8) - cbnz w4, .Lcbc_tail4; + eor v8.16b, v8.16b, RIV.16b + st1 {v8.16b}, [x1], #16 -.Lcbc_end: + mov RIV.16b, v0.16b + + cbnz w4, .Lcbc_dec_loop_1x + +.Lcbc_dec_end: /* store new IV */ - st1 {RIV.16b}, [x3]; + st1 {RIV.16b}, [x3] - ret; + ret SYM_FUNC_END(sm4_ce_cbc_dec) .align 3 @@ -452,25 +324,57 @@ SYM_FUNC_START(sm4_ce_cfb_enc) * x3: iv (big endian, 128 bit) * w4: nblocks */ - PREPARE; + SM4_PREPARE(x0) + + ld1 {RIV.16b}, [x3] + +.Lcfb_enc_loop_4x: + cmp w4, #4 + blt .Lcfb_enc_loop_1x + + sub w4, w4, #4 + + ld1 {v0.16b-v3.16b}, [x2], #64 + + rev32 v8.16b, RIV.16b + SM4_CRYPT_BLK_BE(v8) + eor v0.16b, v0.16b, v8.16b + + rev32 v8.16b, v0.16b + SM4_CRYPT_BLK_BE(v8) + eor v1.16b, v1.16b, v8.16b + + rev32 v8.16b, v1.16b + SM4_CRYPT_BLK_BE(v8) + eor v2.16b, v2.16b, v8.16b + + rev32 v8.16b, v2.16b + SM4_CRYPT_BLK_BE(v8) + eor v3.16b, v3.16b, v8.16b - ld1 {RIV.16b}, [x3]; + st1 {v0.16b-v3.16b}, [x1], #64 + mov RIV.16b, v3.16b -.Lcfb_enc_loop: - sub w4, w4, #1; + cbz w4, .Lcfb_enc_end + b .Lcfb_enc_loop_4x - SM4_CRYPT_BLK(RIV); +.Lcfb_enc_loop_1x: + sub w4, w4, #1 - ld1 {RTMP0.16b}, [x2], #16; - eor RIV.16b, RIV.16b, RTMP0.16b; - st1 {RIV.16b}, [x1], #16; + ld1 {v0.16b}, [x2], #16 - cbnz w4, .Lcfb_enc_loop; + SM4_CRYPT_BLK(RIV) + eor RIV.16b, RIV.16b, v0.16b + st1 {RIV.16b}, [x1], #16 + + cbnz w4, .Lcfb_enc_loop_1x + +.Lcfb_enc_end: /* store new IV */ - st1 {RIV.16b}, [x3]; + st1 {RIV.16b}, [x3] - ret; + ret SYM_FUNC_END(sm4_ce_cfb_enc) .align 3 @@ -482,79 +386,91 @@ SYM_FUNC_START(sm4_ce_cfb_dec) * x3: iv (big endian, 128 bit) * w4: nblocks */ - PREPARE; + SM4_PREPARE(x0) - ld1 {v0.16b}, [x3]; + ld1 {RIV.16b}, [x3] -.Lcfb_loop_blk: - sub w4, w4, #8; - tbnz w4, #31, .Lcfb_tail8; +.Lcfb_dec_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lcfb_dec_4x - ld1 {v1.16b, v2.16b, v3.16b}, [x2], #48; - ld1 {v4.16b-v7.16b}, [x2]; + ld1 {v0.16b-v3.16b}, [x2], #64 + ld1 {v4.16b-v7.16b}, [x2], #64 - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); + rev32 v8.16b, RIV.16b + rev32 v9.16b, v0.16b + rev32 v10.16b, v1.16b + rev32 v11.16b, v2.16b + rev32 v12.16b, v3.16b + rev32 v13.16b, v4.16b + rev32 v14.16b, v5.16b + rev32 v15.16b, v6.16b - sub x2, x2, #48; - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v0.16b, v0.16b, RTMP0.16b; - eor v1.16b, v1.16b, RTMP1.16b; - eor v2.16b, v2.16b, RTMP2.16b; - eor v3.16b, v3.16b, RTMP3.16b; - st1 
{v0.16b-v3.16b}, [x1], #64; + SM4_CRYPT_BLK8_BE(v8, v9, v10, v11, v12, v13, v14, v15) - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v4.16b, v4.16b, RTMP0.16b; - eor v5.16b, v5.16b, RTMP1.16b; - eor v6.16b, v6.16b, RTMP2.16b; - eor v7.16b, v7.16b, RTMP3.16b; - st1 {v4.16b-v7.16b}, [x1], #64; + mov RIV.16b, v7.16b - mov v0.16b, RTMP3.16b; + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + eor v4.16b, v4.16b, v12.16b + eor v5.16b, v5.16b, v13.16b + eor v6.16b, v6.16b, v14.16b + eor v7.16b, v7.16b, v15.16b - cbz w4, .Lcfb_end; - b .Lcfb_loop_blk; + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 -.Lcfb_tail8: - add w4, w4, #8; - cmp w4, #4; - blt .Lcfb_tail4; + cbz w4, .Lcfb_dec_end + b .Lcfb_dec_loop_8x - sub w4, w4, #4; +.Lcfb_dec_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lcfb_dec_loop_1x - ld1 {v1.16b, v2.16b, v3.16b}, [x2]; + sub w4, w4, #4 - SM4_CRYPT_BLK4(v0, v1, v2, v3); + ld1 {v0.16b-v3.16b}, [x2], #64 - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v0.16b, v0.16b, RTMP0.16b; - eor v1.16b, v1.16b, RTMP1.16b; - eor v2.16b, v2.16b, RTMP2.16b; - eor v3.16b, v3.16b, RTMP3.16b; - st1 {v0.16b-v3.16b}, [x1], #64; + rev32 v8.16b, RIV.16b + rev32 v9.16b, v0.16b + rev32 v10.16b, v1.16b + rev32 v11.16b, v2.16b - mov v0.16b, RTMP3.16b; + SM4_CRYPT_BLK4_BE(v8, v9, v10, v11) - cbz w4, .Lcfb_end; + mov RIV.16b, v3.16b -.Lcfb_tail4: - sub w4, w4, #1; + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b - SM4_CRYPT_BLK(v0); + st1 {v0.16b-v3.16b}, [x1], #64 - ld1 {RTMP0.16b}, [x2], #16; - eor v0.16b, v0.16b, RTMP0.16b; - st1 {v0.16b}, [x1], #16; + cbz w4, .Lcfb_dec_end + +.Lcfb_dec_loop_1x: + sub w4, w4, #1 + + ld1 {v0.16b}, [x2], #16 - mov v0.16b, RTMP0.16b; + SM4_CRYPT_BLK(RIV) - cbnz w4, .Lcfb_tail4; + eor RIV.16b, RIV.16b, v0.16b + st1 {RIV.16b}, [x1], #16 -.Lcfb_end: + mov RIV.16b, v0.16b + + cbnz w4, .Lcfb_dec_loop_1x + +.Lcfb_dec_end: /* store new IV */ - st1 {v0.16b}, [x3]; + st1 {RIV.16b}, [x3] - ret; + ret SYM_FUNC_END(sm4_ce_cfb_dec) .align 3 @@ -566,95 +482,99 @@ SYM_FUNC_START(sm4_ce_ctr_enc) * x3: ctr (big endian, 128 bit) * w4: nblocks */ - PREPARE; + SM4_PREPARE(x0) - ldp x7, x8, [x3]; - rev x7, x7; - rev x8, x8; + ldp x7, x8, [x3] + rev x7, x7 + rev x8, x8 -.Lctr_loop_blk: - sub w4, w4, #8; - tbnz w4, #31, .Lctr_tail8; +.Lctr_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lctr_4x -#define inc_le128(vctr) \ - mov vctr.d[1], x8; \ - mov vctr.d[0], x7; \ - adds x8, x8, #1; \ - adc x7, x7, xzr; \ - rev64 vctr.16b, vctr.16b; +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + rev64 vctr.16b, vctr.16b; \ + adc x7, x7, xzr; /* construct CTRs */ - inc_le128(v0); /* +0 */ - inc_le128(v1); /* +1 */ - inc_le128(v2); /* +2 */ - inc_le128(v3); /* +3 */ - inc_le128(v4); /* +4 */ - inc_le128(v5); /* +5 */ - inc_le128(v6); /* +6 */ - inc_le128(v7); /* +7 */ + inc_le128(v0) /* +0 */ + inc_le128(v1) /* +1 */ + inc_le128(v2) /* +2 */ + inc_le128(v3) /* +3 */ + inc_le128(v4) /* +4 */ + inc_le128(v5) /* +5 */ + inc_le128(v6) /* +6 */ + inc_le128(v7) /* +7 */ + + ld1 {v8.16b-v11.16b}, [x2], #64 + ld1 {v12.16b-v15.16b}, [x2], #64 + + SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7) + + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + eor v4.16b, v4.16b, v12.16b + eor v5.16b, v5.16b, v13.16b + eor v6.16b, v6.16b, v14.16b + eor v7.16b, v7.16b, v15.16b + + st1 
{v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 + + cbz w4, .Lctr_end + b .Lctr_loop_8x + +.Lctr_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lctr_loop_1x + + sub w4, w4, #4 - SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7); - - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v0.16b, v0.16b, RTMP0.16b; - eor v1.16b, v1.16b, RTMP1.16b; - eor v2.16b, v2.16b, RTMP2.16b; - eor v3.16b, v3.16b, RTMP3.16b; - st1 {v0.16b-v3.16b}, [x1], #64; + /* construct CTRs */ + inc_le128(v0) /* +0 */ + inc_le128(v1) /* +1 */ + inc_le128(v2) /* +2 */ + inc_le128(v3) /* +3 */ - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v4.16b, v4.16b, RTMP0.16b; - eor v5.16b, v5.16b, RTMP1.16b; - eor v6.16b, v6.16b, RTMP2.16b; - eor v7.16b, v7.16b, RTMP3.16b; - st1 {v4.16b-v7.16b}, [x1], #64; + ld1 {v8.16b-v11.16b}, [x2], #64 - cbz w4, .Lctr_end; - b .Lctr_loop_blk; + SM4_CRYPT_BLK4(v0, v1, v2, v3) -.Lctr_tail8: - add w4, w4, #8; - cmp w4, #4; - blt .Lctr_tail4; + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b - sub w4, w4, #4; + st1 {v0.16b-v3.16b}, [x1], #64 - /* construct CTRs */ - inc_le128(v0); /* +0 */ - inc_le128(v1); /* +1 */ - inc_le128(v2); /* +2 */ - inc_le128(v3); /* +3 */ + cbz w4, .Lctr_end - SM4_CRYPT_BLK4(v0, v1, v2, v3); - - ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64; - eor v0.16b, v0.16b, RTMP0.16b; - eor v1.16b, v1.16b, RTMP1.16b; - eor v2.16b, v2.16b, RTMP2.16b; - eor v3.16b, v3.16b, RTMP3.16b; - st1 {v0.16b-v3.16b}, [x1], #64; - - cbz w4, .Lctr_end; - -.Lctr_tail4: - sub w4, w4, #1; +.Lctr_loop_1x: + sub w4, w4, #1 /* construct CTRs */ - inc_le128(v0); + inc_le128(v0) - SM4_CRYPT_BLK(v0); + ld1 {v8.16b}, [x2], #16 - ld1 {RTMP0.16b}, [x2], #16; - eor v0.16b, v0.16b, RTMP0.16b; - st1 {v0.16b}, [x1], #16; + SM4_CRYPT_BLK(v0) + + eor v0.16b, v0.16b, v8.16b + st1 {v0.16b}, [x1], #16 - cbnz w4, .Lctr_tail4; + cbnz w4, .Lctr_loop_1x .Lctr_end: /* store new CTR */ - rev x7, x7; - rev x8, x8; - stp x7, x8, [x3]; + rev x7, x7 + rev x8, x8 + stp x7, x8, [x3] - ret; + ret SYM_FUNC_END(sm4_ce_ctr_enc) diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index 496d55c0d01a..e56e81b1f35f 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -26,9 +26,9 @@ asmlinkage void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src); asmlinkage void sm4_ce_crypt(const u32 *rkey, u8 *dst, const u8 *src, unsigned int nblks); asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src, - u8 *iv, unsigned int nblks); + u8 *iv, unsigned int nblocks); asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src, - u8 *iv, unsigned int nblks); + u8 *iv, unsigned int nblocks); asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblks); asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src, @@ -94,66 +94,56 @@ static int sm4_ecb_decrypt(struct skcipher_request *req) return sm4_ecb_do_crypt(req, ctx->rkey_dec); } -static int sm4_cbc_encrypt(struct skcipher_request *req) +static int sm4_cbc_crypt(struct skcipher_request *req, + struct sm4_ctx *ctx, bool encrypt) { - struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); - struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm); struct skcipher_walk walk; unsigned int nbytes; int err; err = skcipher_walk_virt(&walk, req, false); + if (err) + return err; while ((nbytes = walk.nbytes) > 0) { const u8 *src = walk.src.virt.addr; u8 *dst = walk.dst.virt.addr; - unsigned int nblks; + unsigned int 
nblocks; - kernel_neon_begin(); + nblocks = nbytes / SM4_BLOCK_SIZE; + if (nblocks) { + kernel_neon_begin(); - nblks = BYTES2BLKS(nbytes); - if (nblks) { - sm4_ce_cbc_enc(ctx->rkey_enc, dst, src, walk.iv, nblks); - nbytes -= nblks * SM4_BLOCK_SIZE; - } + if (encrypt) + sm4_ce_cbc_enc(ctx->rkey_enc, dst, src, + walk.iv, nblocks); + else + sm4_ce_cbc_dec(ctx->rkey_dec, dst, src, + walk.iv, nblocks); - kernel_neon_end(); + kernel_neon_end(); + } - err = skcipher_walk_done(&walk, nbytes); + err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE); } return err; } -static int sm4_cbc_decrypt(struct skcipher_request *req) +static int sm4_cbc_encrypt(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm); - struct skcipher_walk walk; - unsigned int nbytes; - int err; - - err = skcipher_walk_virt(&walk, req, false); - while ((nbytes = walk.nbytes) > 0) { - const u8 *src = walk.src.virt.addr; - u8 *dst = walk.dst.virt.addr; - unsigned int nblks; - - kernel_neon_begin(); - - nblks = BYTES2BLKS(nbytes); - if (nblks) { - sm4_ce_cbc_dec(ctx->rkey_dec, dst, src, walk.iv, nblks); - nbytes -= nblks * SM4_BLOCK_SIZE; - } - - kernel_neon_end(); + return sm4_cbc_crypt(req, ctx, true); +} - err = skcipher_walk_done(&walk, nbytes); - } +static int sm4_cbc_decrypt(struct skcipher_request *req) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm); - return err; + return sm4_cbc_crypt(req, ctx, false); } static int sm4_cfb_encrypt(struct skcipher_request *req) From patchwork Mon Sep 26 09:36:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988479 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2E12DC32771 for ; Mon, 26 Sep 2022 09:40:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=l1N/kebWyWulHD8EWnSfkvD5Jx8E7XwYdrBJwwJC59g=; b=dbxetRjhC5oz0K yWhsjc3+DBSVB1xyeF3O4I+qmygeX9WhZ29peZh+zNvYLJpbZJVI1SMgWRnDiTUznWFCcPXG10Rlj qVIOcHHGtQCjw9tFzy37LHBbn8glKwBdv7a2qOmRp0A8ieUQS75I2mm8m3v9iaihB2LQwBLhak/To ZT9QJCWygQhrTVAtvVSrxVckxz/Jgkq4uZ+M3ppymvi41SCKvC/cVQL78IRgps1T7NNPNXXPjZb2Z u2Qj0kqONDrtvD4nUmwVS0mfw/xwsJAo3Cm2Z7leKEhDoFSqj2Jq67VEPZf8E/IkCdX7h6fbAbeH5 1bDxBV06CfJicwHremCQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockaD-003i5w-Pn; Mon, 26 Sep 2022 09:39:26 +0000 Received: from out30-42.freemail.mail.aliyun.com ([115.124.30.42]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXb-003gnd-54 for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:48 +0000 X-Alimail-AntiSpam: 
AC=PASS;BC=-1|-1;BR=01201311R651e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQjskZc_1664184997; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQjskZc_1664184997) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:38 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 07/16] crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation Date: Mon, 26 Sep 2022 17:36:11 +0800 Message-Id: <20220926093620.99898-8-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023643_449333_ED9D9A73 X-CRM114-Status: UNSURE ( 8.33 ) X-CRM114-Notice: Please train this message. X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Use a 128-bit swap mask and the tbl instruction to simplify generation of the SM4 decryption round keys (rkey_dec). Also fix an issue where the call to sm4_ce_expand_key() was not wrapped in kernel_neon_begin()/kernel_neon_end().
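For reference, a minimal C sketch (an illustration of the result only, not the kernel code) of what the tbl-based sequence computes: SM4 decryption uses the encryption round keys in reverse order, the .Lbswap128_mask permutation reverses the four 32-bit words inside each 128-bit vector, and the permuted vectors are stored last-to-first, so the net effect is:

#include <stdint.h>

/*
 * rkey_dec is rkey_enc with its 32 round-key words reversed.  Each mask
 * byte selects a source byte, so {0x0c..0x0f, 0x08..0x0b, 0x04..0x07,
 * 0x00..0x03} swaps the word order within a vector while keeping the
 * byte order inside each word.
 */
void sm4_make_dec_rkeys(const uint32_t rkey_enc[32], uint32_t rkey_dec[32])
{
	for (int i = 0; i < 32; i++)
		rkey_dec[i] = rkey_enc[31 - i];
}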
Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-ce-core.S | 46 ++++++++++++++++----------------- arch/arm64/crypto/sm4-ce-glue.c | 2 ++ 2 files changed, 24 insertions(+), 24 deletions(-) diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S index 41fc745a8528..9e4b4f01cdf3 100644 --- a/arch/arm64/crypto/sm4-ce-core.S +++ b/arch/arm64/crypto/sm4-ce-core.S @@ -65,32 +65,23 @@ SYM_FUNC_START(sm4_ce_expand_key) sm4ekey v6.4s, v5.4s, v30.4s; sm4ekey v7.4s, v6.4s, v31.4s; + adr_l x5, .Lbswap128_mask + ld1 {v24.16b}, [x5] + st1 {v0.16b-v3.16b}, [x1], #64; st1 {v4.16b-v7.16b}, [x1]; - rev64 v7.4s, v7.4s; - rev64 v6.4s, v6.4s; - rev64 v5.4s, v5.4s; - rev64 v4.4s, v4.4s; - rev64 v3.4s, v3.4s; - rev64 v2.4s, v2.4s; - rev64 v1.4s, v1.4s; - rev64 v0.4s, v0.4s; - ext v7.16b, v7.16b, v7.16b, #8; - ext v6.16b, v6.16b, v6.16b, #8; - ext v5.16b, v5.16b, v5.16b, #8; - ext v4.16b, v4.16b, v4.16b, #8; - ext v3.16b, v3.16b, v3.16b, #8; - ext v2.16b, v2.16b, v2.16b, #8; - ext v1.16b, v1.16b, v1.16b, #8; - ext v0.16b, v0.16b, v0.16b, #8; - st1 {v7.16b}, [x2], #16; - st1 {v6.16b}, [x2], #16; - st1 {v5.16b}, [x2], #16; - st1 {v4.16b}, [x2], #16; - st1 {v3.16b}, [x2], #16; - st1 {v2.16b}, [x2], #16; - st1 {v1.16b}, [x2], #16; - st1 {v0.16b}, [x2]; + + tbl v16.16b, {v7.16b}, v24.16b + tbl v17.16b, {v6.16b}, v24.16b + tbl v18.16b, {v5.16b}, v24.16b + tbl v19.16b, {v4.16b}, v24.16b + tbl v20.16b, {v3.16b}, v24.16b + tbl v21.16b, {v2.16b}, v24.16b + tbl v22.16b, {v1.16b}, v24.16b + tbl v23.16b, {v0.16b}, v24.16b + + st1 {v16.16b-v19.16b}, [x2], #64 + st1 {v20.16b-v23.16b}, [x2] ret; SYM_FUNC_END(sm4_ce_expand_key) @@ -578,3 +569,10 @@ SYM_FUNC_START(sm4_ce_ctr_enc) ret SYM_FUNC_END(sm4_ce_ctr_enc) + + + .section ".rodata", "a" + .align 4 +.Lbswap128_mask: + .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b + .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03 diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index e56e81b1f35f..ff2d8442d473 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -44,8 +44,10 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, if (key_len != SM4_KEY_SIZE) return -EINVAL; + kernel_neon_begin(); sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec, crypto_sm4_fk, crypto_sm4_ck); + kernel_neon_end(); return 0; } From patchwork Mon Sep 26 09:36:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988480 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 20EBEC6FA86 for ; Mon, 26 Sep 2022 09:40:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=LZLz0B+LG5zeY+P1Nta2FXWvPmFo5LbrFBo+rMN+Mkg=; b=Bdo6leuaq4mle0 Gdw4oDbV10GO9zhRCoIGBsVBvHNKgA6en++EsNqieaNIUlTSvU/zKXQuVu3V9VmaUNc713KmLz58t 
L/TtYwraRR6DECvCYmBuRBHDCUqVMJMS65Mob5jKa3QEfKthkF60ImuDgOiJBbEK8N6R/k4r4Hu3G bWlc5EcD8J9GUB1mDWl9iX2gp1ZqdkWfFpsG4OO/fLfP9na2ueJiY8L5vzWBdWIX4bD8GkltdNKMi /Pn7x4pDhleuk/yxmNxKXikGN4f4dFGsyMedTXO5vaaZLv952iJuD8dL4RqjO181AI8rUtWc47KBq OLBmrbYyMCbURjgZJe3g==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockaZ-003iHm-54; Mon, 26 Sep 2022 09:39:47 +0000 Received: from out30-45.freemail.mail.aliyun.com ([115.124.30.45]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXb-003gnt-VG for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:50 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R121e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046049;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzFp_1664184999; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzFp_1664184999) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:40 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 08/16] crypto: arm64/sm4 - export reusable CE acceleration functions Date: Mon, 26 Sep 2022 17:36:12 +0800 Message-Id: <20220926093620.99898-9-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023644_226202_16CA6D0A X-CRM114-Status: GOOD ( 11.93 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org In the accelerated implementation of the SM4 algorithm using the Crypto Extension instructions, there are some functions that can be reused in the upcoming accelerated implementations of the GCM/CCM modes, and the CBC/CFB encryption routines are also reused by the optimized SVE SM4 implementation. Export these functions so that other modules can use them.
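A usage sketch under stated assumptions (the consumer function below is hypothetical; the prototypes are the ones declared in the new sm4-ce.h): another arm64 crypto module can now call the exported helpers directly, provided the calls are wrapped in kernel_neon_begin()/kernel_neon_end().

/* hypothetical consumer, e.g. a future CCM/GCM glue file */
#include <linux/kernel.h>
#include <crypto/sm4.h>
#include <asm/neon.h>

#include "sm4-ce.h"

static void example_encrypt_one_block(struct sm4_ctx *ctx,
				      u8 out[SM4_BLOCK_SIZE],
				      const u8 in[SM4_BLOCK_SIZE])
{
	/* the sm4e/sm4ekey instructions use NEON/FP state */
	kernel_neon_begin();
	sm4_ce_crypt_block(ctx->rkey_enc, out, in);
	kernel_neon_end();
}

The same pattern applies to sm4_ce_expand_key(), sm4_ce_cbc_enc() and sm4_ce_cfb_enc(), which the GCM/CCM and SVE SM4 work mentioned above is expected to reuse.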
Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-ce-glue.c | 5 +++++ arch/arm64/crypto/sm4-ce.h | 16 ++++++++++++++++ 2 files changed, 21 insertions(+) create mode 100644 arch/arm64/crypto/sm4-ce.h diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index ff2d8442d473..63abcadc684b 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -36,6 +36,11 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src, asmlinkage void sm4_ce_ctr_enc(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblks); +EXPORT_SYMBOL(sm4_ce_expand_key); +EXPORT_SYMBOL(sm4_ce_crypt_block); +EXPORT_SYMBOL(sm4_ce_cbc_enc); +EXPORT_SYMBOL(sm4_ce_cfb_enc); + static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, unsigned int key_len) { diff --git a/arch/arm64/crypto/sm4-ce.h b/arch/arm64/crypto/sm4-ce.h new file mode 100644 index 000000000000..109c21b37590 --- /dev/null +++ b/arch/arm64/crypto/sm4-ce.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4 common functions for Crypto Extensions + * Copyright (C) 2022 Tianjia Zhang + */ + +void sm4_ce_expand_key(const u8 *key, u32 *rkey_enc, u32 *rkey_dec, + const u32 *fk, const u32 *ck); + +void sm4_ce_crypt_block(const u32 *rkey, u8 *dst, const u8 *src); + +void sm4_ce_cbc_enc(const u32 *rkey_enc, u8 *dst, const u8 *src, + u8 *iv, unsigned int nblocks); + +void sm4_ce_cfb_enc(const u32 *rkey_enc, u8 *dst, const u8 *src, + u8 *iv, unsigned int nblocks); From patchwork Mon Sep 26 09:36:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988490 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id DB49EC32771 for ; Mon, 26 Sep 2022 09:42:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=ZPNh8T7oT8md8p62DRLPWJ48zqBazdmknsktrRly6Ks=; b=bv/jZweuA+HGw4 xQIegScxiGTJHr9Im4eLsJkdy/6qaYSrblPOtLDz9hD2zf1VD7/L7pw17dnb9s3LfDWmfAZRxSAcy MwR1nJ555XrRPIRBUJiuhR3yjo9BNxHDNA8E+Z5MCP9jiWt8ebDMJifK9xBuZGQyCuQWOjOTJmUEi Lc9sPo7dWueZvMtE+mBN1Se+fxLESQFxkJ0Bzkf2MhDtUddFLyKUIOV/zztY1rkzk2ZrjcY/zVFGV 4yww4g/t8ySPc35DfwwAyvcQEzxX2cmWRWl/9yzw7k8jT0/N4f1oUwpbf60X5H5BSZqn/vyS+uXpj qhgd6K0ZwiANLvd+zZGw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockcA-003jDQ-Vo; Mon, 26 Sep 2022 09:41:27 +0000 Received: from out30-54.freemail.mail.aliyun.com ([115.124.30.54]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXf-003gpJ-62 for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:36:55 +0000 X-Alimail-AntiSpam: 
AC=PASS;BC=-1|-1;BR=01201311R911e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046060;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzGX_1664185001; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzGX_1664185001) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:42 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 09/16] crypto: arm64/sm4 - add CE implementation for CTS-CBC mode Date: Mon, 26 Sep 2022 17:36:13 +0800 Message-Id: <20220926093620.99898-10-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023647_448142_856456F4 X-CRM114-Status: GOOD ( 16.95 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch adds a CE-optimized assembly implementation of CTS-CBC mode. The benchmark was run on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 218, comparing performance before and after this patch (the driver used before this patch is cts(cbc-sm4-ce)). The columns are the tested data lengths in bytes.
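Before the numbers, a short refresher on the ciphertext-stealing tail, as a plain C sketch under stated assumptions (sm4_encrypt_one() is a hypothetical stand-in for one SM4 block encryption; the new sm4_ce_cbc_cts_enc code implements the same CS3 convention as the generic cts template, but with tbl/tbx permutations and overlapping loads/stores instead of memcpy/memset):

#include <stdint.h>
#include <string.h>

#define SM4_BLOCK_SIZE 16

/* placeholder for one SM4 block encryption (no real crypto here) */
static void sm4_encrypt_one(const uint32_t *rkey, uint8_t blk[SM4_BLOCK_SIZE])
{
	(void)rkey;
	(void)blk;
}

/*
 * Encrypt the final 16 + len bytes of a CTS-CBC request (1 <= len <= 16).
 * iv is the chaining value left by the preceding CBC blocks (or the
 * original IV if there are none).
 */
void cbc_cs3_enc_tail(const uint32_t *rkey, uint8_t *dst, const uint8_t *src,
		      const uint8_t iv[SM4_BLOCK_SIZE], unsigned int len)
{
	uint8_t e[SM4_BLOCK_SIZE];	/* E(P_{n-1} ^ iv) */
	uint8_t last[SM4_BLOCK_SIZE];	/* P_n zero-padded to a full block */

	for (unsigned int i = 0; i < SM4_BLOCK_SIZE; i++)
		e[i] = src[i] ^ iv[i];
	sm4_encrypt_one(rkey, e);

	memset(last, 0, sizeof(last));
	memcpy(last, src + SM4_BLOCK_SIZE, len);
	for (unsigned int i = 0; i < SM4_BLOCK_SIZE; i++)
		last[i] ^= e[i];
	sm4_encrypt_one(rkey, last);

	/* CS3: the full final block comes first, the stolen part after it */
	memcpy(dst, last, SM4_BLOCK_SIZE);		/* C_{n-1} */
	memcpy(dst + SM4_BLOCK_SIZE, e, len);		/* C_n: first len bytes of e */
}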
The data is tabulated and the unit is Mb/s: Before: cts(cbc-sm4-ce) | 16 64 128 256 1024 1420 4096 ----------------+-------------------------------------------------------------- CTS-CBC enc | 286.09 297.17 457.97 627.75 868.58 900.80 957.69 CTS-CBC dec | 286.67 285.63 538.35 947.08 2241.03 2577.32 3391.14 After: cts-cbc-sm4-ce | 16 64 128 256 1024 1420 4096 ----------------+-------------------------------------------------------------- CTS-CBC enc | 288.19 428.80 593.57 741.04 911.73 931.80 950.00 CTS-CBC dec | 292.22 468.99 838.23 1380.76 2741.17 3036.42 3409.62 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-ce-core.S | 102 ++++++++++++++++++++++++++++++++ arch/arm64/crypto/sm4-ce-glue.c | 94 +++++++++++++++++++++++++++++ 2 files changed, 196 insertions(+) diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S index 9e4b4f01cdf3..414d29f8110b 100644 --- a/arch/arm64/crypto/sm4-ce-core.S +++ b/arch/arm64/crypto/sm4-ce-core.S @@ -306,6 +306,100 @@ SYM_FUNC_START(sm4_ce_cbc_dec) ret SYM_FUNC_END(sm4_ce_cbc_dec) +.align 3 +SYM_FUNC_START(sm4_ce_cbc_cts_enc) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * w4: nbytes + */ + SM4_PREPARE(x0) + + sub w5, w4, #16 + uxtw x5, w5 + + ld1 {RIV.16b}, [x3] + + ld1 {v0.16b}, [x2] + eor RIV.16b, RIV.16b, v0.16b + SM4_CRYPT_BLK(RIV) + + /* load permute table */ + adr_l x6, .Lcts_permute_table + add x7, x6, #32 + add x6, x6, x5 + sub x7, x7, x5 + ld1 {v3.16b}, [x6] + ld1 {v4.16b}, [x7] + + /* overlapping loads */ + add x2, x2, x5 + ld1 {v1.16b}, [x2] + + /* create Cn from En-1 */ + tbl v0.16b, {RIV.16b}, v3.16b + /* padding Pn with zeros */ + tbl v1.16b, {v1.16b}, v4.16b + + eor v1.16b, v1.16b, RIV.16b + SM4_CRYPT_BLK(v1) + + /* overlapping stores */ + add x5, x1, x5 + st1 {v0.16b}, [x5] + st1 {v1.16b}, [x1] + + ret +SYM_FUNC_END(sm4_ce_cbc_cts_enc) + +.align 3 +SYM_FUNC_START(sm4_ce_cbc_cts_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * w4: nbytes + */ + SM4_PREPARE(x0) + + sub w5, w4, #16 + uxtw x5, w5 + + ld1 {RIV.16b}, [x3] + + /* load permute table */ + adr_l x6, .Lcts_permute_table + add x7, x6, #32 + add x6, x6, x5 + sub x7, x7, x5 + ld1 {v3.16b}, [x6] + ld1 {v4.16b}, [x7] + + /* overlapping loads */ + ld1 {v0.16b}, [x2], x5 + ld1 {v1.16b}, [x2] + + SM4_CRYPT_BLK(v0) + /* select the first Ln bytes of Xn to create Pn */ + tbl v2.16b, {v0.16b}, v3.16b + eor v2.16b, v2.16b, v1.16b + + /* overwrite the first Ln bytes with Cn to create En-1 */ + tbx v0.16b, {v1.16b}, v4.16b + SM4_CRYPT_BLK(v0) + eor v0.16b, v0.16b, RIV.16b + + /* overlapping stores */ + add x5, x1, x5 + st1 {v2.16b}, [x5] + st1 {v0.16b}, [x1] + + ret +SYM_FUNC_END(sm4_ce_cbc_cts_dec) + .align 3 SYM_FUNC_START(sm4_ce_cfb_enc) /* input: @@ -576,3 +670,11 @@ SYM_FUNC_END(sm4_ce_ctr_enc) .Lbswap128_mask: .byte 0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b .byte 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03 + +.Lcts_permute_table: + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7 + .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index 63abcadc684b..4d4072c7bfa2 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -16,6 +16,7 @@ 
#include #include #include +#include #include #define BYTES2BLKS(nbytes) ((nbytes) >> 4) @@ -29,6 +30,10 @@ asmlinkage void sm4_ce_cbc_enc(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblocks); asmlinkage void sm4_ce_cbc_dec(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblocks); +asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src, + u8 *iv, unsigned int nbytes); +asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src, + u8 *iv, unsigned int nbytes); asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblks); asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src, @@ -153,6 +158,78 @@ static int sm4_cbc_decrypt(struct skcipher_request *req) return sm4_cbc_crypt(req, ctx, false); } +static int sm4_cbc_cts_crypt(struct skcipher_request *req, bool encrypt) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm); + struct scatterlist *src = req->src; + struct scatterlist *dst = req->dst; + struct scatterlist sg_src[2], sg_dst[2]; + struct skcipher_request subreq; + struct skcipher_walk walk; + int cbc_blocks; + int err; + + if (req->cryptlen < SM4_BLOCK_SIZE) + return -EINVAL; + + if (req->cryptlen == SM4_BLOCK_SIZE) + return sm4_cbc_crypt(req, ctx, encrypt); + + skcipher_request_set_tfm(&subreq, tfm); + skcipher_request_set_callback(&subreq, skcipher_request_flags(req), + NULL, NULL); + + /* handle the CBC cryption part */ + cbc_blocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2; + if (cbc_blocks) { + skcipher_request_set_crypt(&subreq, src, dst, + cbc_blocks * SM4_BLOCK_SIZE, + req->iv); + + err = sm4_cbc_crypt(&subreq, ctx, encrypt); + if (err) + return err; + + dst = src = scatterwalk_ffwd(sg_src, src, subreq.cryptlen); + if (req->dst != req->src) + dst = scatterwalk_ffwd(sg_dst, req->dst, + subreq.cryptlen); + } + + /* handle ciphertext stealing */ + skcipher_request_set_crypt(&subreq, src, dst, + req->cryptlen - cbc_blocks * SM4_BLOCK_SIZE, + req->iv); + + err = skcipher_walk_virt(&walk, &subreq, false); + if (err) + return err; + + kernel_neon_begin(); + + if (encrypt) + sm4_ce_cbc_cts_enc(ctx->rkey_enc, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, walk.nbytes); + else + sm4_ce_cbc_cts_dec(ctx->rkey_dec, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, walk.nbytes); + + kernel_neon_end(); + + return skcipher_walk_done(&walk, 0); +} + +static int sm4_cbc_cts_encrypt(struct skcipher_request *req) +{ + return sm4_cbc_cts_crypt(req, true); +} + +static int sm4_cbc_cts_decrypt(struct skcipher_request *req) +{ + return sm4_cbc_cts_crypt(req, false); +} + static int sm4_cfb_encrypt(struct skcipher_request *req) { struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); @@ -342,6 +419,22 @@ static struct skcipher_alg sm4_algs[] = { .setkey = sm4_setkey, .encrypt = sm4_ctr_crypt, .decrypt = sm4_ctr_crypt, + }, { + .base = { + .cra_name = "cts(cbc(sm4))", + .cra_driver_name = "cts-cbc-sm4-ce", + .cra_priority = 400, + .cra_blocksize = SM4_BLOCK_SIZE, + .cra_ctxsize = sizeof(struct sm4_ctx), + .cra_module = THIS_MODULE, + }, + .min_keysize = SM4_KEY_SIZE, + .max_keysize = SM4_KEY_SIZE, + .ivsize = SM4_BLOCK_SIZE, + .walksize = SM4_BLOCK_SIZE * 2, + .setkey = sm4_setkey, + .encrypt = sm4_cbc_cts_encrypt, + .decrypt = sm4_cbc_cts_decrypt, } }; @@ -365,5 +458,6 @@ MODULE_ALIAS_CRYPTO("ecb(sm4)"); MODULE_ALIAS_CRYPTO("cbc(sm4)"); MODULE_ALIAS_CRYPTO("cfb(sm4)"); MODULE_ALIAS_CRYPTO("ctr(sm4)"); 
+MODULE_ALIAS_CRYPTO("cts(cbc(sm4))"); MODULE_AUTHOR("Tianjia Zhang "); MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988493 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2CB62C32771 for ; Mon, 26 Sep 2022 09:45:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=KTvXbuWLLJNlIgPSZk8lPoeZlSP0+JxA07+WeQvLIRY=; b=K5QIo+ivkyZT05 ZqEXvosJJdurNcaIsSywIC9buPZxGYIODBVBJgXJe5I3wTureCLTh4leEcSB4D8WjOA9ieuRTnQHm oVDVQvGOq8ih6WTpfllRdPDp9X6GLFnhONvBc3eBRrk28zfkZ3iXkV9UqDOv2Ld/PZmT/yKqmJ3fX YxJHjq3tPt5mcKALyRRHd2tqpFHcxtvERqqFp7fVU7pHtiAse0VF5s8vZFmrXxa1vZK/MvgMS1XzU QQcj2brDh5b9A5Ug6LHXEScokMHxuLgJGm55eFsrs590R9MG1HtYTPhn4TCy9cZmzD0fy9HVqC83B N5123o3gkefjAaIwEfjQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockeH-003kDy-2c; Mon, 26 Sep 2022 09:43:38 +0000 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXh-003gq9-S2 for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:37:01 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R891e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045176;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQjskbL_1664185003; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQjskbL_1664185003) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:44 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. 
Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 10/16] crypto: arm64/sm4 - add CE implementation for XTS mode Date: Mon, 26 Sep 2022 17:36:14 +0800 Message-Id: <20220926093620.99898-11-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023650_327938_1C5FC49F X-CRM114-Status: GOOD ( 19.72 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch is a CE-optimized assembly implementation for XTS mode. Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 218 mode of tcrypt, and compared the performance before and after this patch (the driver used before this patch is xts(ecb-sm4-ce)). The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: Before: xts(ecb-sm4-ce) | 16 64 128 256 1024 1420 4096 ----------------+-------------------------------------------------------------- XTS enc | 117.17 430.56 732.92 1134.98 2007.03 2136.23 2347.20 XTS dec | 116.89 429.02 733.40 1132.96 2006.13 2130.50 2347.92 After: xts-sm4-ce | 16 64 128 256 1024 1420 4096 ----------------+-------------------------------------------------------------- XTS enc | 224.68 798.91 1248.08 1714.60 2413.73 2467.84 2612.62 XTS dec | 229.85 791.34 1237.79 1720.00 2413.30 2473.84 2611.95 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/Kconfig | 4 +- arch/arm64/crypto/sm4-ce-core.S | 343 ++++++++++++++++++++++++++++++++ arch/arm64/crypto/sm4-ce-glue.c | 159 ++++++++++++++- 3 files changed, 504 insertions(+), 2 deletions(-) diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig index 4b121dc0cfba..8939f5ae9214 100644 --- a/arch/arm64/crypto/Kconfig +++ b/arch/arm64/crypto/Kconfig @@ -231,7 +231,7 @@ config CRYPTO_SM4_ARM64_CE - NEON (Advanced SIMD) extensions config CRYPTO_SM4_ARM64_CE_BLK - tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv8 Crypto Extensions)" + tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR/XTS (ARMv8 Crypto Extensions)" depends on KERNEL_MODE_NEON select CRYPTO_SKCIPHER select CRYPTO_SM4 @@ -242,6 +242,8 @@ config CRYPTO_SM4_ARM64_CE_BLK - CBC (Cipher Block Chaining) mode (NIST SP800-38A) - CFB (Cipher Feedback) mode (NIST SP800-38A) - CTR (Counter) mode (NIST SP800-38A) + - XTS (XOR Encrypt XOR with ciphertext stealing) mode (NIST SP800-38E + and IEEE 1619) Architecture: arm64 using: - ARMv8 Crypto Extensions diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S index 414d29f8110b..ddd15ec09d38 100644 --- a/arch/arm64/crypto/sm4-ce-core.S +++ b/arch/arm64/crypto/sm4-ce-core.S @@ -35,6 +35,7 @@ #define RTMP3 v19 #define RIV v20 +#define RMASK v21 .align 3 @@ -665,6 +666,348 @@ SYM_FUNC_START(sm4_ce_ctr_enc) SYM_FUNC_END(sm4_ce_ctr_enc) +#define tweak_next(vt, vin, RTMP) \ + sshr RTMP.2d, vin.2d, #63; \ + and RTMP.16b, RTMP.16b, RMASK.16b; \ + add vt.2d, vin.2d, 
vin.2d; \ + ext RTMP.16b, RTMP.16b, RTMP.16b, #8; \ + eor vt.16b, vt.16b, RTMP.16b; + +.align 3 +SYM_FUNC_START(sm4_ce_xts_enc) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: tweak (big endian, 128 bit) + * w4: nbytes + * x5: round key array for IV + */ + ld1 {v8.16b}, [x3] + + cbz x5, .Lxts_enc_nofirst + + SM4_PREPARE(x5) + + /* Generate first tweak */ + SM4_CRYPT_BLK(v8) + +.Lxts_enc_nofirst: + SM4_PREPARE(x0) + + ands w5, w4, #15 + lsr w4, w4, #4 + sub w6, w4, #1 + csel w4, w4, w6, eq + uxtw x5, w5 + + movi RMASK.2s, #0x1 + movi RTMP0.2s, #0x87 + uzp1 RMASK.4s, RMASK.4s, RTMP0.4s + + cbz w4, .Lxts_enc_cts + +.Lxts_enc_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lxts_enc_4x + + tweak_next( v9, v8, RTMP0) + tweak_next(v10, v9, RTMP1) + tweak_next(v11, v10, RTMP2) + tweak_next(v12, v11, RTMP3) + tweak_next(v13, v12, RTMP0) + tweak_next(v14, v13, RTMP1) + tweak_next(v15, v14, RTMP2) + + ld1 {v0.16b-v3.16b}, [x2], #64 + ld1 {v4.16b-v7.16b}, [x2], #64 + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + eor v4.16b, v4.16b, v12.16b + eor v5.16b, v5.16b, v13.16b + eor v6.16b, v6.16b, v14.16b + eor v7.16b, v7.16b, v15.16b + + SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7) + + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + eor v4.16b, v4.16b, v12.16b + eor v5.16b, v5.16b, v13.16b + eor v6.16b, v6.16b, v14.16b + eor v7.16b, v7.16b, v15.16b + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 + + tweak_next(v8, v15, RTMP3) + + cbz w4, .Lxts_enc_cts + b .Lxts_enc_loop_8x + +.Lxts_enc_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lxts_enc_loop_1x + + sub w4, w4, #4 + + tweak_next( v9, v8, RTMP0) + tweak_next(v10, v9, RTMP1) + tweak_next(v11, v10, RTMP2) + + ld1 {v0.16b-v3.16b}, [x2], #64 + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + st1 {v0.16b-v3.16b}, [x1], #64 + + tweak_next(v8, v11, RTMP3) + + cbz w4, .Lxts_enc_cts + +.Lxts_enc_loop_1x: + sub w4, w4, #1 + + ld1 {v0.16b}, [x2], #16 + eor v0.16b, v0.16b, v8.16b + + SM4_CRYPT_BLK(v0) + + eor v0.16b, v0.16b, v8.16b + st1 {v0.16b}, [x1], #16 + + tweak_next(v8, v8, RTMP0) + + cbnz w4, .Lxts_enc_loop_1x + +.Lxts_enc_cts: + cbz x5, .Lxts_enc_end + + /* cipher text stealing */ + + tweak_next(v9, v8, RTMP0) + ld1 {v0.16b}, [x2] + eor v0.16b, v0.16b, v8.16b + SM4_CRYPT_BLK(v0) + eor v0.16b, v0.16b, v8.16b + + /* load permute table */ + adr_l x6, .Lcts_permute_table + add x7, x6, #32 + add x6, x6, x5 + sub x7, x7, x5 + ld1 {v3.16b}, [x6] + ld1 {v4.16b}, [x7] + + /* overlapping loads */ + add x2, x2, x5 + ld1 {v1.16b}, [x2] + + /* create Cn from En-1 */ + tbl v2.16b, {v0.16b}, v3.16b + /* padding Pn with En-1 at the end */ + tbx v0.16b, {v1.16b}, v4.16b + + eor v0.16b, v0.16b, v9.16b + SM4_CRYPT_BLK(v0) + eor v0.16b, v0.16b, v9.16b + + + /* overlapping stores */ + add x5, x1, x5 + st1 {v2.16b}, [x5] + st1 {v0.16b}, [x1] + + b .Lxts_enc_ret + +.Lxts_enc_end: + /* store new tweak */ + st1 {v8.16b}, [x3] + +.Lxts_enc_ret: + ret +SYM_FUNC_END(sm4_ce_xts_enc) + +.align 3 +SYM_FUNC_START(sm4_ce_xts_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: tweak (big endian, 128 bit) + * w4: nbytes + * x5: round key array for IV + */ + ld1 {v8.16b}, [x3] 
+ + cbz x5, .Lxts_dec_nofirst + + SM4_PREPARE(x5) + + /* Generate first tweak */ + SM4_CRYPT_BLK(v8) + +.Lxts_dec_nofirst: + SM4_PREPARE(x0) + + ands w5, w4, #15 + lsr w4, w4, #4 + sub w6, w4, #1 + csel w4, w4, w6, eq + uxtw x5, w5 + + movi RMASK.2s, #0x1 + movi RTMP0.2s, #0x87 + uzp1 RMASK.4s, RMASK.4s, RTMP0.4s + + cbz w4, .Lxts_dec_cts + +.Lxts_dec_loop_8x: + sub w4, w4, #8 + tbnz w4, #31, .Lxts_dec_4x + + tweak_next( v9, v8, RTMP0) + tweak_next(v10, v9, RTMP1) + tweak_next(v11, v10, RTMP2) + tweak_next(v12, v11, RTMP3) + tweak_next(v13, v12, RTMP0) + tweak_next(v14, v13, RTMP1) + tweak_next(v15, v14, RTMP2) + + ld1 {v0.16b-v3.16b}, [x2], #64 + ld1 {v4.16b-v7.16b}, [x2], #64 + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + eor v4.16b, v4.16b, v12.16b + eor v5.16b, v5.16b, v13.16b + eor v6.16b, v6.16b, v14.16b + eor v7.16b, v7.16b, v15.16b + + SM4_CRYPT_BLK8(v0, v1, v2, v3, v4, v5, v6, v7) + + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + eor v4.16b, v4.16b, v12.16b + eor v5.16b, v5.16b, v13.16b + eor v6.16b, v6.16b, v14.16b + eor v7.16b, v7.16b, v15.16b + st1 {v0.16b-v3.16b}, [x1], #64 + st1 {v4.16b-v7.16b}, [x1], #64 + + tweak_next(v8, v15, RTMP3) + + cbz w4, .Lxts_dec_cts + b .Lxts_dec_loop_8x + +.Lxts_dec_4x: + add w4, w4, #8 + cmp w4, #4 + blt .Lxts_dec_loop_1x + + sub w4, w4, #4 + + tweak_next( v9, v8, RTMP0) + tweak_next(v10, v9, RTMP1) + tweak_next(v11, v10, RTMP2) + + ld1 {v0.16b-v3.16b}, [x2], #64 + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + eor v0.16b, v0.16b, v8.16b + eor v1.16b, v1.16b, v9.16b + eor v2.16b, v2.16b, v10.16b + eor v3.16b, v3.16b, v11.16b + st1 {v0.16b-v3.16b}, [x1], #64 + + tweak_next(v8, v11, RTMP3) + + cbz w4, .Lxts_dec_cts + +.Lxts_dec_loop_1x: + sub w4, w4, #1 + + ld1 {v0.16b}, [x2], #16 + eor v0.16b, v0.16b, v8.16b + + SM4_CRYPT_BLK(v0) + + eor v0.16b, v0.16b, v8.16b + st1 {v0.16b}, [x1], #16 + + tweak_next(v8, v8, RTMP0) + + cbnz w4, .Lxts_dec_loop_1x + +.Lxts_dec_cts: + cbz x5, .Lxts_dec_end + + /* cipher text stealing */ + + tweak_next(v9, v8, RTMP0) + ld1 {v0.16b}, [x2] + eor v0.16b, v0.16b, v9.16b + SM4_CRYPT_BLK(v0) + eor v0.16b, v0.16b, v9.16b + + /* load permute table */ + adr_l x6, .Lcts_permute_table + add x7, x6, #32 + add x6, x6, x5 + sub x7, x7, x5 + ld1 {v3.16b}, [x6] + ld1 {v4.16b}, [x7] + + /* overlapping loads */ + add x2, x2, x5 + ld1 {v1.16b}, [x2] + + /* create Cn from En-1 */ + tbl v2.16b, {v0.16b}, v3.16b + /* padding Pn with En-1 at the end */ + tbx v0.16b, {v1.16b}, v4.16b + + eor v0.16b, v0.16b, v8.16b + SM4_CRYPT_BLK(v0) + eor v0.16b, v0.16b, v8.16b + + + /* overlapping stores */ + add x5, x1, x5 + st1 {v2.16b}, [x5] + st1 {v0.16b}, [x1] + + b .Lxts_dec_ret + +.Lxts_dec_end: + /* store new tweak */ + st1 {v8.16b}, [x3] + +.Lxts_dec_ret: + ret +SYM_FUNC_END(sm4_ce_xts_dec) + + .section ".rodata", "a" .align 4 .Lbswap128_mask: diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index 4d4072c7bfa2..8222766f712a 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #define BYTES2BLKS(nbytes) ((nbytes) >> 4) @@ -40,12 +41,23 @@ asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblks); asmlinkage void sm4_ce_ctr_enc(const u32 
*rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblks); +asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src, + u8 *tweak, unsigned int nbytes, + const u32 *rkey2_enc); +asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src, + u8 *tweak, unsigned int nbytes, + const u32 *rkey2_enc); EXPORT_SYMBOL(sm4_ce_expand_key); EXPORT_SYMBOL(sm4_ce_crypt_block); EXPORT_SYMBOL(sm4_ce_cbc_enc); EXPORT_SYMBOL(sm4_ce_cfb_enc); +struct sm4_xts_ctx { + struct sm4_ctx key1; + struct sm4_ctx key2; +}; + static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, unsigned int key_len) { @@ -61,6 +73,29 @@ static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, return 0; } +static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm); + int ret; + + if (key_len != SM4_KEY_SIZE * 2) + return -EINVAL; + + ret = xts_verify_key(tfm, key, key_len); + if (ret) + return ret; + + kernel_neon_begin(); + sm4_ce_expand_key(key, ctx->key1.rkey_enc, + ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck); + sm4_ce_expand_key(&key[SM4_KEY_SIZE], ctx->key2.rkey_enc, + ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck); + kernel_neon_end(); + + return 0; +} + static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey) { struct skcipher_walk walk; @@ -357,6 +392,111 @@ static int sm4_ctr_crypt(struct skcipher_request *req) return err; } +static int sm4_xts_crypt(struct skcipher_request *req, bool encrypt) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + struct sm4_xts_ctx *ctx = crypto_skcipher_ctx(tfm); + int tail = req->cryptlen % SM4_BLOCK_SIZE; + const u32 *rkey2_enc = ctx->key2.rkey_enc; + struct scatterlist sg_src[2], sg_dst[2]; + struct skcipher_request subreq; + struct scatterlist *src, *dst; + struct skcipher_walk walk; + unsigned int nbytes; + int err; + + if (req->cryptlen < SM4_BLOCK_SIZE) + return -EINVAL; + + err = skcipher_walk_virt(&walk, req, false); + if (err) + return err; + + if (unlikely(tail > 0 && walk.nbytes < walk.total)) { + int nblocks = DIV_ROUND_UP(req->cryptlen, SM4_BLOCK_SIZE) - 2; + + skcipher_walk_abort(&walk); + + skcipher_request_set_tfm(&subreq, tfm); + skcipher_request_set_callback(&subreq, + skcipher_request_flags(req), + NULL, NULL); + skcipher_request_set_crypt(&subreq, req->src, req->dst, + nblocks * SM4_BLOCK_SIZE, req->iv); + + err = skcipher_walk_virt(&walk, &subreq, false); + if (err) + return err; + } else { + tail = 0; + } + + while ((nbytes = walk.nbytes) >= SM4_BLOCK_SIZE) { + if (nbytes < walk.total) + nbytes &= ~(SM4_BLOCK_SIZE - 1); + + kernel_neon_begin(); + + if (encrypt) + sm4_ce_xts_enc(ctx->key1.rkey_enc, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, nbytes, + rkey2_enc); + else + sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, nbytes, + rkey2_enc); + + kernel_neon_end(); + + rkey2_enc = NULL; + + err = skcipher_walk_done(&walk, walk.nbytes - nbytes); + if (err) + return err; + } + + if (likely(tail == 0)) + return 0; + + /* handle ciphertext stealing */ + + dst = src = scatterwalk_ffwd(sg_src, req->src, subreq.cryptlen); + if (req->dst != req->src) + dst = scatterwalk_ffwd(sg_dst, req->dst, subreq.cryptlen); + + skcipher_request_set_crypt(&subreq, src, dst, SM4_BLOCK_SIZE + tail, + req->iv); + + err = skcipher_walk_virt(&walk, &subreq, false); + if (err) + return err; + + kernel_neon_begin(); + + if (encrypt) + sm4_ce_xts_enc(ctx->key1.rkey_enc, 
walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, walk.nbytes, + rkey2_enc); + else + sm4_ce_xts_dec(ctx->key1.rkey_dec, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, walk.nbytes, + rkey2_enc); + + kernel_neon_end(); + + return skcipher_walk_done(&walk, 0); +} + +static int sm4_xts_encrypt(struct skcipher_request *req) +{ + return sm4_xts_crypt(req, true); +} + +static int sm4_xts_decrypt(struct skcipher_request *req) +{ + return sm4_xts_crypt(req, false); +} + static struct skcipher_alg sm4_algs[] = { { .base = { @@ -435,6 +575,22 @@ static struct skcipher_alg sm4_algs[] = { .setkey = sm4_setkey, .encrypt = sm4_cbc_cts_encrypt, .decrypt = sm4_cbc_cts_decrypt, + }, { + .base = { + .cra_name = "xts(sm4)", + .cra_driver_name = "xts-sm4-ce", + .cra_priority = 400, + .cra_blocksize = SM4_BLOCK_SIZE, + .cra_ctxsize = sizeof(struct sm4_xts_ctx), + .cra_module = THIS_MODULE, + }, + .min_keysize = SM4_KEY_SIZE * 2, + .max_keysize = SM4_KEY_SIZE * 2, + .ivsize = SM4_BLOCK_SIZE, + .walksize = SM4_BLOCK_SIZE * 2, + .setkey = sm4_xts_setkey, + .encrypt = sm4_xts_encrypt, + .decrypt = sm4_xts_decrypt, } }; @@ -451,7 +607,7 @@ static void __exit sm4_exit(void) module_cpu_feature_match(SM4, sm4_init); module_exit(sm4_exit); -MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv8 Crypto Extensions"); +MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR/XTS using ARMv8 Crypto Extensions"); MODULE_ALIAS_CRYPTO("sm4-ce"); MODULE_ALIAS_CRYPTO("sm4"); MODULE_ALIAS_CRYPTO("ecb(sm4)"); @@ -459,5 +615,6 @@ MODULE_ALIAS_CRYPTO("cbc(sm4)"); MODULE_ALIAS_CRYPTO("cfb(sm4)"); MODULE_ALIAS_CRYPTO("ctr(sm4)"); MODULE_ALIAS_CRYPTO("cts(cbc(sm4))"); +MODULE_ALIAS_CRYPTO("xts(sm4)"); MODULE_AUTHOR("Tianjia Zhang "); MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988491 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 945A0C07E9D for ; Mon, 26 Sep 2022 09:43:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=s/Y/MGWpDr8ISFu3YwKJKFD7qN4XJs8agpoPUdPG8CI=; b=vZ20mPq7vbCNLY dKlf5rTGw4Ytuo2ka0L9jDQMjI7Qy7NOWLIVKsujETbiQHZai42hxZVP/vOoiS+Iyw40DY9pztFir h167v1yzVeA7UJMq/wHP3aAQ/VOOWX+YKtMHjILZfcbV5NJrunT+k4z9Hb8OFZdXu1V6/nwtD8U61 83NVd6Lp0TBIudiV/667Rxx3GXVt/c2hSSop+SKKzduWCHs843tIf/Ig+iw5XFHlJfKrIXJaDqXYr ucqWvqLRHvpaPZwqLkxWpprgvUbSJlITLLmB0bKQQ0GVIFEP5ndQT0BxSOa9QwNUm0t2htxrpeCuP xEKBMr3de3jGbQarQNsQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockcv-003jZP-1K; Mon, 26 Sep 2022 09:42:13 +0000 Received: from out30-44.freemail.mail.aliyun.com ([115.124.30.44]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXh-003gqO-2B for linux-arm-kernel@lists.infradead.org; 
Mon, 26 Sep 2022 09:36:56 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R141e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046049;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzHP_1664185004; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzHP_1664185004) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:45 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 11/16] crypto: essiv - allow digestsize to be greater than keysize Date: Mon, 26 Sep 2022 17:36:15 +0800 Message-Id: <20220926093620.99898-12-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023649_341963_73FCEB0F X-CRM114-Status: GOOD ( 11.17 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org In essiv mode, the digest of the hash algorithm is used as the key to encrypt the IV. The current implementation requires the digest size of the hash algorithm to be equal to the key size, which excludes combinations that do not satisfy this, such as essiv(cbc(sm4),sm3): the SM3 digest is fixed at 256 bits, while the SM4 key size is fixed at 128 bits, so this combination cannot use essiv mode. This patch allows algorithms whose digest size is greater than the key size to use essiv mode by truncating the digest.
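As a reference for the rule this change introduces, here is a standalone sketch only (not the kernel code; the crypto API calls are reduced to plain size parameters):

#include <stddef.h>

/* Sketch: the salt passed to the ESSIV cipher is the hash digest,
 * truncated to the cipher's maximum key size when the digest is longer. */
static size_t essiv_salt_len(size_t digestsize, size_t max_keysize)
{
	return digestsize < max_keysize ? digestsize : max_keysize;
}

/* e.g. essiv(cbc(sm4),sm3): essiv_salt_len(32, 16) == 16, so the 256-bit
 * SM3 digest is truncated to a 128-bit SM4 key. */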
Signed-off-by: Tianjia Zhang --- crypto/essiv.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/crypto/essiv.c b/crypto/essiv.c index e33369df9034..6ee5a61bcae4 100644 --- a/crypto/essiv.c +++ b/crypto/essiv.c @@ -68,6 +68,7 @@ static int essiv_skcipher_setkey(struct crypto_skcipher *tfm, { struct essiv_tfm_ctx *tctx = crypto_skcipher_ctx(tfm); u8 salt[HASH_MAX_DIGESTSIZE]; + unsigned int saltlen; int err; crypto_skcipher_clear_flags(tctx->u.skcipher, CRYPTO_TFM_REQ_MASK); @@ -86,8 +87,11 @@ static int essiv_skcipher_setkey(struct crypto_skcipher *tfm, crypto_cipher_set_flags(tctx->essiv_cipher, crypto_skcipher_get_flags(tfm) & CRYPTO_TFM_REQ_MASK); - return crypto_cipher_setkey(tctx->essiv_cipher, salt, - crypto_shash_digestsize(tctx->hash)); + + saltlen = min(crypto_shash_digestsize(tctx->hash), + crypto_skcipher_max_keysize(tctx->u.skcipher)); + + return crypto_cipher_setkey(tctx->essiv_cipher, salt, saltlen); } static int essiv_aead_setkey(struct crypto_aead *tfm, const u8 *key, @@ -418,8 +422,7 @@ static bool essiv_supported_algorithms(const char *essiv_cipher_name, if (IS_ERR(alg)) return false; - if (hash_alg->digestsize < alg->cra_cipher.cia_min_keysize || - hash_alg->digestsize > alg->cra_cipher.cia_max_keysize) + if (hash_alg->digestsize < alg->cra_cipher.cia_min_keysize) goto out; if (ivsize != alg->cra_blocksize) From patchwork Mon Sep 26 09:36:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988492 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 6BF7EC07E9D for ; Mon, 26 Sep 2022 09:44:15 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=53gH8OvJpycDEhk+E4oFgeeVUtFxRkGul5l76PUnNGM=; b=lW5a2tjlU0yXLr AJ6QkBXYhPGV11bybkn1iPHtMUrejv4HRW4469GTcP1BiGM/bypP6w4CaO0DDCTuZGwOFzY04doHM CwkRP5kqTgLewlj2AXeiSHKnEPAJ7wsOt4oRhimCG0m9pcFDL5DY3uhQMZv2L4tZRfCJd2TgXsyqg cd9DKh22oUyDhtvcymWw/EagJJ4G0G2blddWlqSHJYPPgm+wuFgIO2kUVfumgMoKu7JuHczuN67eF YlkilqKQbzQanMpu6Sz4OvaU0RVE/svYOR1TKZmeqpnbCPkRuOJI96Y4mNJWom5pUrGhm1y9AHV3w 9WWsjHDe4QTNOuBkxe2A==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockdd-003jwM-Tn; Mon, 26 Sep 2022 09:42:58 +0000 Received: from out30-42.freemail.mail.aliyun.com ([115.124.30.42]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXi-003grG-TZ for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:37:00 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046050;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzHu_1664185006; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzHu_1664185006) by 
smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:47 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 12/16] crypto: arm64/sm4 - add CE implementation for ESSIV mode Date: Mon, 26 Sep 2022 17:36:16 +0800 Message-Id: <20220926093620.99898-13-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023651_191053_6EA80812 X-CRM114-Status: GOOD ( 14.11 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch is a CE-optimized assembly implementation for ESSIV mode. The assembly part is realized by reusing the CBC mode. Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-ce-core.S | 42 +++++++++++ arch/arm64/crypto/sm4-ce-glue.c | 128 ++++++++++++++++++++++++++++++++ 2 files changed, 170 insertions(+) diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S index ddd15ec09d38..6b923c3209a0 100644 --- a/arch/arm64/crypto/sm4-ce-core.S +++ b/arch/arm64/crypto/sm4-ce-core.S @@ -154,6 +154,26 @@ SYM_FUNC_START(sm4_ce_crypt) ret; SYM_FUNC_END(sm4_ce_crypt) +.align 3 +SYM_FUNC_START(sm4_ce_essiv_cbc_enc) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * w4: nblocks + * x5: round key array for IV + */ + ld1 {RIV.16b}, [x3] + + SM4_PREPARE(x5) + + SM4_CRYPT_BLK(RIV) + + SM4_PREPARE(x0) + + b .Lcbc_enc_loop_4x + .align 3 SYM_FUNC_START(sm4_ce_cbc_enc) /* input: @@ -208,6 +228,27 @@ SYM_FUNC_START(sm4_ce_cbc_enc) ret SYM_FUNC_END(sm4_ce_cbc_enc) +SYM_FUNC_END(sm4_ce_essiv_cbc_enc) + +.align 3 +SYM_FUNC_START(sm4_ce_essiv_cbc_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * w4: nblocks + * x5: round key array for IV + */ + ld1 {RIV.16b}, [x3] + + SM4_PREPARE(x5) + + SM4_CRYPT_BLK(RIV) + + SM4_PREPARE(x0) + + b .Lcbc_dec_loop_8x .align 3 SYM_FUNC_START(sm4_ce_cbc_dec) @@ -306,6 +347,7 @@ SYM_FUNC_START(sm4_ce_cbc_dec) ret SYM_FUNC_END(sm4_ce_cbc_dec) +SYM_FUNC_END(sm4_ce_essiv_cbc_dec) .align 3 SYM_FUNC_START(sm4_ce_cbc_cts_enc) diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index 8222766f712a..6267ec1cfac0 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -19,6 +19,8 @@ #include #include #include +#include +#include #define BYTES2BLKS(nbytes) ((nbytes) >> 4) @@ -35,6 +37,12 @@ asmlinkage void sm4_ce_cbc_cts_enc(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nbytes); asmlinkage void sm4_ce_cbc_cts_dec(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nbytes); +asmlinkage void sm4_ce_essiv_cbc_enc(const u32 *rkey1, u8 *dst, const u8 *src, + u8 *iv, unsigned int nblocks, + const u32 *rkey2_enc); +asmlinkage void sm4_ce_essiv_cbc_dec(const u32 *rkey1, u8 *dst, const u8 *src, + 
u8 *iv, unsigned int nblocks, + const u32 *rkey2_enc); asmlinkage void sm4_ce_cfb_enc(const u32 *rkey, u8 *dst, const u8 *src, u8 *iv, unsigned int nblks); asmlinkage void sm4_ce_cfb_dec(const u32 *rkey, u8 *dst, const u8 *src, @@ -58,6 +66,12 @@ struct sm4_xts_ctx { struct sm4_ctx key2; }; +struct sm4_essiv_cbc_ctx { + struct sm4_ctx key1; + struct sm4_ctx key2; + struct crypto_shash *hash; +}; + static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, unsigned int key_len) { @@ -96,6 +110,27 @@ static int sm4_xts_setkey(struct crypto_skcipher *tfm, const u8 *key, return 0; } +static int sm4_essiv_cbc_setkey(struct crypto_skcipher *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm); + u8 __aligned(8) digest[SM3_DIGEST_SIZE]; + + if (key_len != SM4_KEY_SIZE) + return -EINVAL; + + crypto_shash_tfm_digest(ctx->hash, key, key_len, digest); + + kernel_neon_begin(); + sm4_ce_expand_key(key, ctx->key1.rkey_enc, + ctx->key1.rkey_dec, crypto_sm4_fk, crypto_sm4_ck); + sm4_ce_expand_key(digest, ctx->key2.rkey_enc, + ctx->key2.rkey_dec, crypto_sm4_fk, crypto_sm4_ck); + kernel_neon_end(); + + return 0; +} + static int sm4_ecb_do_crypt(struct skcipher_request *req, const u32 *rkey) { struct skcipher_walk walk; @@ -497,6 +532,81 @@ static int sm4_xts_decrypt(struct skcipher_request *req) return sm4_xts_crypt(req, false); } +static int sm4_essiv_cbc_init_tfm(struct crypto_skcipher *tfm) +{ + struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm); + + ctx->hash = crypto_alloc_shash("sm3", 0, 0); + + return PTR_ERR_OR_ZERO(ctx->hash); +} + +static void sm4_essiv_cbc_exit_tfm(struct crypto_skcipher *tfm) +{ + struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm); + + crypto_free_shash(ctx->hash); +} + +static int sm4_essiv_cbc_crypt(struct skcipher_request *req, bool encrypt) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + struct sm4_essiv_cbc_ctx *ctx = crypto_skcipher_ctx(tfm); + struct skcipher_walk walk; + unsigned int nblocks; + int err; + + err = skcipher_walk_virt(&walk, req, false); + + if ((nblocks = walk.nbytes / SM4_BLOCK_SIZE) > 0) { + kernel_neon_begin(); + + if (encrypt) + sm4_ce_essiv_cbc_enc(ctx->key1.rkey_enc, + walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, + nblocks, ctx->key2.rkey_enc); + else + sm4_ce_essiv_cbc_dec(ctx->key1.rkey_dec, + walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, + nblocks, ctx->key2.rkey_enc); + + kernel_neon_end(); + + err = skcipher_walk_done(&walk, walk.nbytes % SM4_BLOCK_SIZE); + if (err) + return err; + } + + while ((nblocks = walk.nbytes / SM4_BLOCK_SIZE) > 0) { + kernel_neon_begin(); + + if (encrypt) + sm4_ce_cbc_enc(ctx->key1.rkey_enc, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, nblocks); + else + sm4_ce_cbc_dec(ctx->key1.rkey_dec, walk.dst.virt.addr, + walk.src.virt.addr, walk.iv, nblocks); + + kernel_neon_end(); + + err = skcipher_walk_done(&walk, walk.nbytes % SM4_BLOCK_SIZE); + } + + return err; +} + +static int sm4_essiv_cbc_encrypt(struct skcipher_request *req) +{ + return sm4_essiv_cbc_crypt(req, true); +} + +static int sm4_essiv_cbc_decrypt(struct skcipher_request *req) +{ + return sm4_essiv_cbc_crypt(req, false); +} + static struct skcipher_alg sm4_algs[] = { { .base = { @@ -591,6 +701,23 @@ static struct skcipher_alg sm4_algs[] = { .setkey = sm4_xts_setkey, .encrypt = sm4_xts_encrypt, .decrypt = sm4_xts_decrypt, + }, { + .base = { + .cra_name = "essiv(cbc(sm4),sm3)", + .cra_driver_name = "essiv-cbc-sm4-sm3-ce", + .cra_priority = 400 + 1, + 
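+			/* one above the 400 used by the other sm4-ce
+			 * algorithms, so this native ESSIV implementation
+			 * is preferred over an essiv template instance
+			 * built around cbc-sm4-ce */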
.cra_blocksize = SM4_BLOCK_SIZE, + .cra_ctxsize = sizeof(struct sm4_essiv_cbc_ctx), + .cra_module = THIS_MODULE, + }, + .min_keysize = SM4_KEY_SIZE, + .max_keysize = SM4_KEY_SIZE, + .ivsize = SM4_BLOCK_SIZE, + .setkey = sm4_essiv_cbc_setkey, + .encrypt = sm4_essiv_cbc_encrypt, + .decrypt = sm4_essiv_cbc_decrypt, + .init = sm4_essiv_cbc_init_tfm, + .exit = sm4_essiv_cbc_exit_tfm, } }; @@ -616,5 +743,6 @@ MODULE_ALIAS_CRYPTO("cfb(sm4)"); MODULE_ALIAS_CRYPTO("ctr(sm4)"); MODULE_ALIAS_CRYPTO("cts(cbc(sm4))"); MODULE_ALIAS_CRYPTO("xts(sm4)"); +MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)"); MODULE_AUTHOR("Tianjia Zhang "); MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988494 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id E5ED4C07E9D for ; Mon, 26 Sep 2022 09:45:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=RRsYBP/FgOn7CNFIlLr0W80MTtmvrxzhC3WS3nA0GMQ=; b=q2IJ/OPNdi4qZQ JT1HumElAZT91IPAYFqOZ8ZoyDv/mzlMhS75MZdz1M2TE5UdGHAfryTdfuwLkZweBrmQhX+RoBSU8 0DSn/4zJo7+mulb1xMFupZ/mkifiOsFaXQnAEX07N+0uTzJsipxIVTU4pJVJPiAvkuMtadPVZJh9x TrJyimqVo9cepVPGoEQi5OtfpZBb/6A3o8Wz/R1x+zM/HcrA9XyTtbpuqdMc+8paZRuLN9zC+PfaV QOhkxmVM1+dQEqwlQYFJV0EiHgt55erDQGwA0vnSChXEMZqzxUJLPAUfz4NTFzap3dNhAr7FJ/YWf JSsQnZpTIW4QWG9Yf6Pg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockf9-003keR-SO; Mon, 26 Sep 2022 09:44:32 +0000 Received: from out30-42.freemail.mail.aliyun.com ([115.124.30.42]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXm-003gtE-4G for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:37:04 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R471e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkQPCm_1664185008; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkQPCm_1664185008) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:49 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. 
Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 13/16] crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac Date: Mon, 26 Sep 2022 17:36:17 +0800 Message-Id: <20220926093620.99898-14-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023654_404304_E0821EF2 X-CRM114-Status: GOOD ( 21.41 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch is a CE-optimized assembly implementation for cmac/xcbc/cbcmac. Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 300 mode of tcrypt, and compared the performance before and after this patch (the driver used before this patch is XXXmac(sm4-ce)). The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: Before: update-size | 16 64 256 1024 2048 4096 8192 ---------------+-------------------------------------------------------- cmac(sm4-ce) | 293.33 403.69 503.76 527.78 531.10 535.46 535.81 xcbc(sm4-ce) | 292.83 402.50 504.02 529.08 529.87 536.55 538.24 cbcmac(sm4-ce) | 318.42 415.79 497.12 515.05 523.15 521.19 523.01 After: update-size | 16 64 256 1024 2048 4096 8192 ---------------+-------------------------------------------------------- cmac-sm4-ce | 371.99 675.28 903.56 971.65 980.57 990.40 991.04 xcbc-sm4-ce | 372.11 674.55 903.47 971.61 980.96 990.42 991.10 cbcmac-sm4-ce | 371.63 675.33 903.23 972.07 981.42 990.93 991.45 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/sm4-ce-core.S | 70 +++++++++ arch/arm64/crypto/sm4-ce-glue.c | 267 +++++++++++++++++++++++++++++++- 2 files changed, 336 insertions(+), 1 deletion(-) diff --git a/arch/arm64/crypto/sm4-ce-core.S b/arch/arm64/crypto/sm4-ce-core.S index 6b923c3209a0..69fe3b90b7ad 100644 --- a/arch/arm64/crypto/sm4-ce-core.S +++ b/arch/arm64/crypto/sm4-ce-core.S @@ -35,6 +35,7 @@ #define RTMP3 v19 #define RIV v20 +#define RMAC v20 #define RMASK v21 @@ -1049,6 +1050,75 @@ SYM_FUNC_START(sm4_ce_xts_dec) ret SYM_FUNC_END(sm4_ce_xts_dec) +.align 3 +SYM_FUNC_START(sm4_ce_mac_update) + /* input: + * x0: round key array, CTX + * x1: digest + * x2: src + * w3: nblocks + * w4: enc_before + * w5: enc_after + */ + SM4_PREPARE(x0) + + ld1 {RMAC.16b}, [x1] + + cbz w4, .Lmac_update + + SM4_CRYPT_BLK(RMAC) + +.Lmac_update: + cbz w3, .Lmac_ret + + sub w6, w3, #1 + cmp w5, wzr + csel w3, w3, w6, ne + + cbz w3, .Lmac_end + +.Lmac_loop_4x: + cmp w3, #4 + blt .Lmac_loop_1x + + sub w3, w3, #4 + + ld1 {v0.16b-v3.16b}, [x2], #64 + + eor RMAC.16b, RMAC.16b, v0.16b + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v1.16b + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v2.16b + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v3.16b + SM4_CRYPT_BLK(RMAC) + + cbz w3, .Lmac_end + b .Lmac_loop_4x + +.Lmac_loop_1x: + sub w3, w3, #1 + + ld1 {v0.16b}, [x2], #16 + + eor RMAC.16b, RMAC.16b, v0.16b + 
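+	/* CBC-MAC step: the block just XORed into RMAC is now run
+	 * through SM4, i.e. MAC = E(MAC ^ M) */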
SM4_CRYPT_BLK(RMAC) + + cbnz w3, .Lmac_loop_1x + + +.Lmac_end: + cbnz w5, .Lmac_ret + + ld1 {v0.16b}, [x2], #16 + eor RMAC.16b, RMAC.16b, v0.16b + +.Lmac_ret: + st1 {RMAC.16b}, [x1] + ret +SYM_FUNC_END(sm4_ce_mac_update) + .section ".rodata", "a" .align 4 diff --git a/arch/arm64/crypto/sm4-ce-glue.c b/arch/arm64/crypto/sm4-ce-glue.c index 6267ec1cfac0..c2d10b8e92b2 100644 --- a/arch/arm64/crypto/sm4-ce-glue.c +++ b/arch/arm64/crypto/sm4-ce-glue.c @@ -14,8 +14,10 @@ #include #include #include +#include #include #include +#include #include #include #include @@ -55,6 +57,9 @@ asmlinkage void sm4_ce_xts_enc(const u32 *rkey1, u8 *dst, const u8 *src, asmlinkage void sm4_ce_xts_dec(const u32 *rkey1, u8 *dst, const u8 *src, u8 *tweak, unsigned int nbytes, const u32 *rkey2_enc); +asmlinkage void sm4_ce_mac_update(const u32 *rkey_enc, u8 *digest, + const u8 *src, unsigned int nblocks, + bool enc_before, bool enc_after); EXPORT_SYMBOL(sm4_ce_expand_key); EXPORT_SYMBOL(sm4_ce_crypt_block); @@ -72,6 +77,16 @@ struct sm4_essiv_cbc_ctx { struct crypto_shash *hash; }; +struct sm4_mac_tfm_ctx { + struct sm4_ctx key; + u8 __aligned(8) consts[]; +}; + +struct sm4_mac_desc_ctx { + unsigned int len; + u8 digest[SM4_BLOCK_SIZE]; +}; + static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key, unsigned int key_len) { @@ -721,13 +736,260 @@ static struct skcipher_alg sm4_algs[] = { } }; +static int sm4_cbcmac_setkey(struct crypto_shash *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm); + + if (key_len != SM4_KEY_SIZE) + return -EINVAL; + + kernel_neon_begin(); + sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec, + crypto_sm4_fk, crypto_sm4_ck); + kernel_neon_end(); + + return 0; +} + +static int sm4_cmac_setkey(struct crypto_shash *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm); + be128 *consts = (be128 *)ctx->consts; + u64 a, b; + + if (key_len != SM4_KEY_SIZE) + return -EINVAL; + + memset(consts, 0, SM4_BLOCK_SIZE); + + kernel_neon_begin(); + + sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec, + crypto_sm4_fk, crypto_sm4_ck); + + /* encrypt the zero block */ + sm4_ce_crypt_block(ctx->key.rkey_enc, (u8 *)consts, (const u8 *)consts); + + kernel_neon_end(); + + /* gf(2^128) multiply zero-ciphertext with u and u^2 */ + a = be64_to_cpu(consts[0].a); + b = be64_to_cpu(consts[0].b); + consts[0].a = cpu_to_be64((a << 1) | (b >> 63)); + consts[0].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0)); + + a = be64_to_cpu(consts[0].a); + b = be64_to_cpu(consts[0].b); + consts[1].a = cpu_to_be64((a << 1) | (b >> 63)); + consts[1].b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0)); + + return 0; +} + +static int sm4_xcbc_setkey(struct crypto_shash *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_mac_tfm_ctx *ctx = crypto_shash_ctx(tfm); + u8 __aligned(8) key2[SM4_BLOCK_SIZE]; + static u8 const ks[3][SM4_BLOCK_SIZE] = { + { [0 ... SM4_BLOCK_SIZE - 1] = 0x1}, + { [0 ... SM4_BLOCK_SIZE - 1] = 0x2}, + { [0 ... 
SM4_BLOCK_SIZE - 1] = 0x3}, + }; + + if (key_len != SM4_KEY_SIZE) + return -EINVAL; + + kernel_neon_begin(); + + sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec, + crypto_sm4_fk, crypto_sm4_ck); + + sm4_ce_crypt_block(ctx->key.rkey_enc, key2, ks[0]); + sm4_ce_crypt(ctx->key.rkey_enc, ctx->consts, ks[1], 2); + + sm4_ce_expand_key(key2, ctx->key.rkey_enc, ctx->key.rkey_dec, + crypto_sm4_fk, crypto_sm4_ck); + + kernel_neon_end(); + + return 0; +} + +static int sm4_mac_init(struct shash_desc *desc) +{ + struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc); + + memset(ctx->digest, 0, SM4_BLOCK_SIZE); + ctx->len = 0; + + return 0; +} + +static int sm4_mac_update(struct shash_desc *desc, const u8 *p, + unsigned int len) +{ + struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm); + struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc); + unsigned int l, nblocks; + + if (len == 0) + return 0; + + if (ctx->len || ctx->len + len < SM4_BLOCK_SIZE) { + l = min(len, SM4_BLOCK_SIZE - ctx->len); + + crypto_xor(ctx->digest + ctx->len, p, l); + ctx->len += l; + len -= l; + p += l; + } + + if (len && (ctx->len % SM4_BLOCK_SIZE) == 0) { + kernel_neon_begin(); + + if (len < SM4_BLOCK_SIZE && ctx->len == SM4_BLOCK_SIZE) { + sm4_ce_crypt_block(tctx->key.rkey_enc, + ctx->digest, ctx->digest); + ctx->len = 0; + } else { + nblocks = len / SM4_BLOCK_SIZE; + len %= SM4_BLOCK_SIZE; + + sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, p, + nblocks, (ctx->len == SM4_BLOCK_SIZE), + (len != 0)); + + p += nblocks * SM4_BLOCK_SIZE; + + if (len == 0) + ctx->len = SM4_BLOCK_SIZE; + } + + kernel_neon_end(); + + if (len) { + crypto_xor(ctx->digest, p, len); + ctx->len = len; + } + } + + return 0; +} + +static int sm4_cmac_final(struct shash_desc *desc, u8 *out) +{ + struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm); + struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc); + const u8 *consts = tctx->consts; + + if (ctx->len != SM4_BLOCK_SIZE) { + ctx->digest[ctx->len] ^= 0x80; + consts += SM4_BLOCK_SIZE; + } + + kernel_neon_begin(); + sm4_ce_mac_update(tctx->key.rkey_enc, ctx->digest, consts, 1, + false, true); + kernel_neon_end(); + + memcpy(out, ctx->digest, SM4_BLOCK_SIZE); + + return 0; +} + +static int sm4_cbcmac_final(struct shash_desc *desc, u8 *out) +{ + struct sm4_mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm); + struct sm4_mac_desc_ctx *ctx = shash_desc_ctx(desc); + + if (ctx->len) { + kernel_neon_begin(); + sm4_ce_crypt_block(tctx->key.rkey_enc, ctx->digest, + ctx->digest); + kernel_neon_end(); + } + + memcpy(out, ctx->digest, SM4_BLOCK_SIZE); + + return 0; +} + +static struct shash_alg sm4_mac_algs[] = { + { + .base = { + .cra_name = "cmac(sm4)", + .cra_driver_name = "cmac-sm4-ce", + .cra_priority = 400, + .cra_blocksize = SM4_BLOCK_SIZE, + .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx) + + SM4_BLOCK_SIZE * 2, + .cra_module = THIS_MODULE, + }, + .digestsize = SM4_BLOCK_SIZE, + .init = sm4_mac_init, + .update = sm4_mac_update, + .final = sm4_cmac_final, + .setkey = sm4_cmac_setkey, + .descsize = sizeof(struct sm4_mac_desc_ctx), + }, { + .base = { + .cra_name = "xcbc(sm4)", + .cra_driver_name = "xcbc-sm4-ce", + .cra_priority = 400, + .cra_blocksize = SM4_BLOCK_SIZE, + .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx) + + SM4_BLOCK_SIZE * 2, + .cra_module = THIS_MODULE, + }, + .digestsize = SM4_BLOCK_SIZE, + .init = sm4_mac_init, + .update = sm4_mac_update, + .final = sm4_cmac_final, + .setkey = sm4_xcbc_setkey, + .descsize = sizeof(struct sm4_mac_desc_ctx), + }, { + .base = { + .cra_name = 
"cbcmac(sm4)", + .cra_driver_name = "cbcmac-sm4-ce", + .cra_priority = 400, + .cra_blocksize = 1, + .cra_ctxsize = sizeof(struct sm4_mac_tfm_ctx), + .cra_module = THIS_MODULE, + }, + .digestsize = SM4_BLOCK_SIZE, + .init = sm4_mac_init, + .update = sm4_mac_update, + .final = sm4_cbcmac_final, + .setkey = sm4_cbcmac_setkey, + .descsize = sizeof(struct sm4_mac_desc_ctx), + } +}; + static int __init sm4_init(void) { - return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs)); + int err; + + err = crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs)); + if (err) + return err; + + err = crypto_register_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs)); + if (err) + goto out_err; + + return 0; + +out_err: + crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs)); + return err; } static void __exit sm4_exit(void) { + crypto_unregister_shashes(sm4_mac_algs, ARRAY_SIZE(sm4_mac_algs)); crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs)); } @@ -744,5 +1006,8 @@ MODULE_ALIAS_CRYPTO("ctr(sm4)"); MODULE_ALIAS_CRYPTO("cts(cbc(sm4))"); MODULE_ALIAS_CRYPTO("xts(sm4)"); MODULE_ALIAS_CRYPTO("essiv(cbc(sm4),sm3)"); +MODULE_ALIAS_CRYPTO("cmac(sm4)"); +MODULE_ALIAS_CRYPTO("xcbc(sm4)"); +MODULE_ALIAS_CRYPTO("cbcmac(sm4)"); MODULE_AUTHOR("Tianjia Zhang "); MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988501 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 64FE8C32771 for ; Mon, 26 Sep 2022 09:46:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=ziOLSooJcHhtiA1/asRikh2Col/fL1iSakAJDK7986Y=; b=2FZJyVDq5mRBKD upbhWftzUGpy34XEyH1C3Lh+SAS9xyvjoR/QkEr/w44EsJ2II9zgvSC+u9XBBUbXNILwWYLIyDhMs iFr3SUJSdAybDdsjkDXICzkS6gbo+xGNt5Q1LvyU4OvHDYdwBWto9rWxUzDnNDdrEO7r7LFEjGAVB 2wAowcWavmOY3ur/zIHvtDh38idNVn1ps9tZ0vaGVEpX8697etcNmdeeoKuV/KKMilL0jiB89Ty4C ytplkUyc1HcLTXXRSISzYZkvz7TahUHzyIMcbWNP9ssq33MtX95YWKxbbzZFn4lCXA0NX85iEAwgZ N1n/oE0vII6m5d2pK2GA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockg8-003l81-IW; Mon, 26 Sep 2022 09:45:32 +0000 Received: from out30-54.freemail.mail.aliyun.com ([115.124.30.54]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXm-003gtS-S4 for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:37:08 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045176;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzJF_1664185010; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzJF_1664185010) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:52 +0800 From: Tianjia Zhang To: 
Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 14/16] crypto: arm64/sm4 - add CE implementation for CCM mode Date: Mon, 26 Sep 2022 17:36:18 +0800 Message-Id: <20220926093620.99898-15-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023655_298669_30CE53A6 X-CRM114-Status: GOOD ( 21.78 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch is a CE-optimized assembly implementation for CCM mode. Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 223 and 225 modes of tcrypt, and compared the performance before and after this patch (the driver used before this patch is ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)). The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))): ccm(sm4) | 16 64 256 512 1024 1420 4096 8192 -------------+--------------------------------------------------------------- CCM enc | 35.07 125.40 336.47 468.17 581.97 619.18 712.56 736.01 CCM dec | 34.87 124.40 335.08 466.75 581.04 618.81 712.25 735.89 CCM mb enc | 34.71 123.96 333.92 465.39 579.91 617.49 711.45 734.92 CCM mb dec | 34.42 122.80 331.02 462.81 578.28 616.42 709.88 734.19 After (rfc4309(ccm-sm4-ce)): ccm-sm4-ce | 16 64 256 512 1024 1420 4096 8192 -------------+--------------------------------------------------------------- CCM enc | 77.12 249.82 569.94 725.17 839.27 867.71 952.87 969.89 CCM dec | 75.90 247.26 566.29 722.12 836.90 865.95 951.74 968.57 CCM mb enc | 75.98 245.25 562.91 718.99 834.76 864.70 950.17 967.90 CCM mb dec | 75.06 243.78 560.58 717.13 833.68 862.70 949.35 967.11 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/Kconfig | 16 ++ arch/arm64/crypto/Makefile | 3 + arch/arm64/crypto/sm4-ce-ccm-core.S | 328 ++++++++++++++++++++++++++++ arch/arm64/crypto/sm4-ce-ccm-glue.c | 303 +++++++++++++++++++++++++ 4 files changed, 650 insertions(+) create mode 100644 arch/arm64/crypto/sm4-ce-ccm-core.S create mode 100644 arch/arm64/crypto/sm4-ce-ccm-glue.c diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig index 8939f5ae9214..2611036a3e3f 100644 --- a/arch/arm64/crypto/Kconfig +++ b/arch/arm64/crypto/Kconfig @@ -281,6 +281,22 @@ config CRYPTO_AES_ARM64_CE_CCM - ARMv8 Crypto Extensions - NEON (Advanced SIMD) extensions +config CRYPTO_SM4_ARM64_CE_CCM + tristate "AEAD cipher: SM4 in CCM mode (ARMv8 Crypto Extensions)" + depends on KERNEL_MODE_NEON + select CRYPTO_ALGAPI + select CRYPTO_AEAD + select CRYPTO_SM4 + select CRYPTO_SM4_ARM64_CE_BLK + help + AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with + CCM (Counter with Cipher Block Chaining-Message Authentication Code) + authenticated encryption mode (NIST SP800-38C) + + Architecture: arm64 using: + - ARMv8 Crypto Extensions + - NEON (Advanced 
SIMD) extensions + config CRYPTO_CRCT10DIF_ARM64_CE tristate "CRCT10DIF (PMULL)" depends on KERNEL_MODE_NEON && CRC_T10DIF diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile index 087f1625e775..843ea5266965 100644 --- a/arch/arm64/crypto/Makefile +++ b/arch/arm64/crypto/Makefile @@ -29,6 +29,9 @@ sm4-ce-cipher-y := sm4-ce-cipher-glue.o sm4-ce-cipher-core.o obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_BLK) += sm4-ce.o sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o +obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o +sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o + obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o diff --git a/arch/arm64/crypto/sm4-ce-ccm-core.S b/arch/arm64/crypto/sm4-ce-ccm-core.S new file mode 100644 index 000000000000..028207c4afd0 --- /dev/null +++ b/arch/arm64/crypto/sm4-ce-ccm-core.S @@ -0,0 +1,328 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions + * as specified in rfc8998 + * https://datatracker.ietf.org/doc/html/rfc8998 + * + * Copyright (C) 2022 Tianjia Zhang + */ + +#include +#include +#include "sm4-ce-asm.h" + +.arch armv8-a+crypto + +.irp b, 0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31 + .set .Lv\b\().4s, \b +.endr + +.macro sm4e, vd, vn + .inst 0xcec08400 | (.L\vn << 5) | .L\vd +.endm + +/* Register macros */ + +#define RMAC v16 + +/* Helper macros. */ + +#define inc_le128(vctr) \ + mov vctr.d[1], x8; \ + mov vctr.d[0], x7; \ + adds x8, x8, #1; \ + rev64 vctr.16b, vctr.16b; \ + adc x7, x7, xzr; + + +.align 3 +SYM_FUNC_START(sm4_ce_cbcmac_update) + /* input: + * x0: round key array, CTX + * x1: mac + * x2: src + * w3: nblocks + */ + SM4_PREPARE(x0) + + ld1 {RMAC.16b}, [x1] + +.Lcbcmac_loop_4x: + cmp w3, #4 + blt .Lcbcmac_loop_1x + + sub w3, w3, #4 + + ld1 {v0.16b-v3.16b}, [x2], #64 + + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v0.16b + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v1.16b + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v2.16b + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v3.16b + + cbz w3, .Lcbcmac_end + b .Lcbcmac_loop_4x + +.Lcbcmac_loop_1x: + sub w3, w3, #1 + + ld1 {v0.16b}, [x2], #16 + + SM4_CRYPT_BLK(RMAC) + eor RMAC.16b, RMAC.16b, v0.16b + + cbnz w3, .Lcbcmac_loop_1x + +.Lcbcmac_end: + st1 {RMAC.16b}, [x1] + ret +SYM_FUNC_END(sm4_ce_cbcmac_update) + +.align 3 +SYM_FUNC_START(sm4_ce_ccm_final) + /* input: + * x0: round key array, CTX + * x1: ctr0 (big endian, 128 bit) + * x2: mac + */ + SM4_PREPARE(x0) + + ld1 {RMAC.16b}, [x2] + ld1 {v0.16b}, [x1] + + SM4_CRYPT_BLK2(RMAC, v0) + + /* en-/decrypt the mac with ctr0 */ + eor RMAC.16b, RMAC.16b, v0.16b + st1 {RMAC.16b}, [x2] + + ret +SYM_FUNC_END(sm4_ce_ccm_final) + +.align 3 +SYM_FUNC_START(sm4_ce_ccm_enc) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * w4: nbytes + * x5: mac + */ + SM4_PREPARE(x0) + + ldp x7, x8, [x3] + rev x7, x7 + rev x8, x8 + + ld1 {RMAC.16b}, [x5] + +.Lccm_enc_loop_4x: + cmp w4, #(4 * 16) + blt .Lccm_enc_loop_1x + + sub w4, w4, #(4 * 16) + + /* construct CTRs */ + inc_le128(v8) /* +0 */ + inc_le128(v9) /* +1 */ + inc_le128(v10) /* +2 */ + inc_le128(v11) /* +3 */ + + ld1 {v0.16b-v3.16b}, [x2], #64 + + SM4_CRYPT_BLK2(v8, RMAC) + eor v8.16b, v8.16b, v0.16b + eor RMAC.16b, RMAC.16b, v0.16b + SM4_CRYPT_BLK2(v9, RMAC) + eor v9.16b, v9.16b, v1.16b + eor RMAC.16b, RMAC.16b, v1.16b + SM4_CRYPT_BLK2(v10, RMAC) + eor v10.16b, v10.16b, v2.16b + eor RMAC.16b, RMAC.16b, v2.16b + 
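+	/* CTR and CBC-MAC are interleaved: each SM4_CRYPT_BLK2 encrypts
+	 * one counter block together with the running MAC, and each
+	 * plaintext block is then XORed into both (producing ciphertext
+	 * and absorbing the plaintext into the MAC) */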
SM4_CRYPT_BLK2(v11, RMAC) + eor v11.16b, v11.16b, v3.16b + eor RMAC.16b, RMAC.16b, v3.16b + + st1 {v8.16b-v11.16b}, [x1], #64 + + cbz w4, .Lccm_enc_end + b .Lccm_enc_loop_4x + +.Lccm_enc_loop_1x: + cmp w4, #16 + blt .Lccm_enc_tail + + sub w4, w4, #16 + + /* construct CTRs */ + inc_le128(v8) + + ld1 {v0.16b}, [x2], #16 + + SM4_CRYPT_BLK2(v8, RMAC) + eor v8.16b, v8.16b, v0.16b + eor RMAC.16b, RMAC.16b, v0.16b + + st1 {v8.16b}, [x1], #16 + + cbz w4, .Lccm_enc_end + b .Lccm_enc_loop_1x + +.Lccm_enc_tail: + /* construct CTRs */ + inc_le128(v8) + + SM4_CRYPT_BLK2(RMAC, v8) + + /* store new MAC */ + st1 {RMAC.16b}, [x5] + +.Lccm_enc_tail_loop: + ldrb w0, [x2], #1 /* get 1 byte from input */ + umov w9, v8.b[0] /* get top crypted CTR byte */ + umov w6, RMAC.b[0] /* get top MAC byte */ + + eor w9, w9, w0 /* w9 = CTR ^ input */ + eor w6, w6, w0 /* w6 = MAC ^ input */ + + strb w9, [x1], #1 /* store out byte */ + strb w6, [x5], #1 /* store MAC byte */ + + subs w4, w4, #1 + beq .Lccm_enc_ret + + /* shift out one byte */ + ext RMAC.16b, RMAC.16b, RMAC.16b, #1 + ext v8.16b, v8.16b, v8.16b, #1 + + b .Lccm_enc_tail_loop + +.Lccm_enc_end: + /* store new MAC */ + st1 {RMAC.16b}, [x5] + + /* store new CTR */ + rev x7, x7 + rev x8, x8 + stp x7, x8, [x3] + +.Lccm_enc_ret: + ret +SYM_FUNC_END(sm4_ce_ccm_enc) + +.align 3 +SYM_FUNC_START(sm4_ce_ccm_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * w4: nbytes + * x5: mac + */ + SM4_PREPARE(x0) + + ldp x7, x8, [x3] + rev x7, x7 + rev x8, x8 + + ld1 {RMAC.16b}, [x5] + +.Lccm_dec_loop_4x: + cmp w4, #(4 * 16) + blt .Lccm_dec_loop_1x + + sub w4, w4, #(4 * 16) + + /* construct CTRs */ + inc_le128(v8) /* +0 */ + inc_le128(v9) /* +1 */ + inc_le128(v10) /* +2 */ + inc_le128(v11) /* +3 */ + + ld1 {v0.16b-v3.16b}, [x2], #64 + + SM4_CRYPT_BLK2(v8, RMAC) + eor v8.16b, v8.16b, v0.16b + eor RMAC.16b, RMAC.16b, v8.16b + SM4_CRYPT_BLK2(v9, RMAC) + eor v9.16b, v9.16b, v1.16b + eor RMAC.16b, RMAC.16b, v9.16b + SM4_CRYPT_BLK2(v10, RMAC) + eor v10.16b, v10.16b, v2.16b + eor RMAC.16b, RMAC.16b, v10.16b + SM4_CRYPT_BLK2(v11, RMAC) + eor v11.16b, v11.16b, v3.16b + eor RMAC.16b, RMAC.16b, v11.16b + + st1 {v8.16b-v11.16b}, [x1], #64 + + cbz w4, .Lccm_dec_end + b .Lccm_dec_loop_4x + +.Lccm_dec_loop_1x: + cmp w4, #16 + blt .Lccm_dec_tail + + sub w4, w4, #16 + + /* construct CTRs */ + inc_le128(v8) + + ld1 {v0.16b}, [x2], #16 + + SM4_CRYPT_BLK2(v8, RMAC) + eor v8.16b, v8.16b, v0.16b + eor RMAC.16b, RMAC.16b, v8.16b + + st1 {v8.16b}, [x1], #16 + + cbz w4, .Lccm_dec_end + b .Lccm_dec_loop_1x + +.Lccm_dec_tail: + /* construct CTRs */ + inc_le128(v8) + + SM4_CRYPT_BLK2(RMAC, v8) + + /* store new MAC */ + st1 {RMAC.16b}, [x5] + +.Lccm_dec_tail_loop: + ldrb w0, [x2], #1 /* get 1 byte from input */ + umov w9, v8.b[0] /* get top crypted CTR byte */ + umov w6, RMAC.b[0] /* get top MAC byte */ + + eor w9, w9, w0 /* w9 = CTR ^ input */ + eor w6, w6, w9 /* w6 = MAC ^ output */ + + strb w9, [x1], #1 /* store out byte */ + strb w6, [x5], #1 /* store MAC byte */ + + subs w4, w4, #1 + beq .Lccm_dec_ret + + /* shift out one byte */ + ext RMAC.16b, RMAC.16b, RMAC.16b, #1 + ext v8.16b, v8.16b, v8.16b, #1 + + b .Lccm_dec_tail_loop + +.Lccm_dec_end: + /* store new MAC */ + st1 {RMAC.16b}, [x5] + + /* store new CTR */ + rev x7, x7 + rev x8, x8 + stp x7, x8, [x3] + +.Lccm_dec_ret: + ret +SYM_FUNC_END(sm4_ce_ccm_dec) diff --git a/arch/arm64/crypto/sm4-ce-ccm-glue.c b/arch/arm64/crypto/sm4-ce-ccm-glue.c new file mode 100644 index 000000000000..f2cec7b52efc --- 
/dev/null +++ b/arch/arm64/crypto/sm4-ce-ccm-glue.c @@ -0,0 +1,303 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4-CCM AEAD Algorithm using ARMv8 Crypto Extensions + * as specified in rfc8998 + * https://datatracker.ietf.org/doc/html/rfc8998 + * + * Copyright (C) 2022 Tianjia Zhang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "sm4-ce.h" + +asmlinkage void sm4_ce_cbcmac_update(const u32 *rkey_enc, u8 *mac, + const u8 *src, unsigned int nblocks); +asmlinkage void sm4_ce_ccm_enc(const u32 *rkey_enc, u8 *dst, const u8 *src, + u8 *iv, unsigned int nbytes, u8 *mac); +asmlinkage void sm4_ce_ccm_dec(const u32 *rkey_enc, u8 *dst, const u8 *src, + u8 *iv, unsigned int nbytes, u8 *mac); +asmlinkage void sm4_ce_ccm_final(const u32 *rkey_enc, u8 *iv, u8 *mac); + + +static int ccm_setkey(struct crypto_aead *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_ctx *ctx = crypto_aead_ctx(tfm); + + if (key_len != SM4_KEY_SIZE) + return -EINVAL; + + kernel_neon_begin(); + sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec, + crypto_sm4_fk, crypto_sm4_ck); + kernel_neon_end(); + + return 0; +} + +static int ccm_setauthsize(struct crypto_aead *tfm, unsigned int authsize) +{ + if ((authsize & 1) || authsize < 4) + return -EINVAL; + return 0; +} + +static int ccm_format_input(u8 info[], struct aead_request *req, + unsigned int msglen) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + unsigned int l = req->iv[0] + 1; + unsigned int m; + __be32 len; + + /* verify that CCM dimension 'L': 2 <= L <= 8 */ + if (l < 2 || l > 8) + return -EINVAL; + if (l < 4 && msglen >> (8 * l)) + return -EOVERFLOW; + + memset(&req->iv[SM4_BLOCK_SIZE - l], 0, l); + + memcpy(info, req->iv, SM4_BLOCK_SIZE); + + m = crypto_aead_authsize(aead); + + /* format flags field per RFC 3610/NIST 800-38C */ + *info |= ((m - 2) / 2) << 3; + if (req->assoclen) + *info |= (1 << 6); + + /* + * format message length field, + * Linux uses a u32 type to represent msglen + */ + if (l >= 4) + l = 4; + + len = cpu_to_be32(msglen); + memcpy(&info[SM4_BLOCK_SIZE - l], (u8 *)&len + 4 - l, l); + + return 0; +} + +static void ccm_calculate_auth_mac(struct aead_request *req, u8 mac[]) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + struct sm4_ctx *ctx = crypto_aead_ctx(aead); + struct __packed { __be16 l; __be32 h; } aadlen; + u32 assoclen = req->assoclen; + struct scatter_walk walk; + unsigned int len; + + if (assoclen < 0xff00) { + aadlen.l = cpu_to_be16(assoclen); + len = 2; + } else { + aadlen.l = cpu_to_be16(0xfffe); + put_unaligned_be32(assoclen, &aadlen.h); + len = 6; + } + + sm4_ce_crypt_block(ctx->rkey_enc, mac, mac); + crypto_xor(mac, (const u8 *)&aadlen, len); + + scatterwalk_start(&walk, req->src); + + do { + u32 n = scatterwalk_clamp(&walk, assoclen); + u8 *p, *ptr; + + if (!n) { + scatterwalk_start(&walk, sg_next(walk.sg)); + n = scatterwalk_clamp(&walk, assoclen); + } + + p = ptr = scatterwalk_map(&walk); + assoclen -= n; + scatterwalk_advance(&walk, n); + + while (n > 0) { + unsigned int l, nblocks; + + if (len == SM4_BLOCK_SIZE) { + if (n < SM4_BLOCK_SIZE) { + sm4_ce_crypt_block(ctx->rkey_enc, + mac, mac); + + len = 0; + } else { + nblocks = n / SM4_BLOCK_SIZE; + sm4_ce_cbcmac_update(ctx->rkey_enc, + mac, ptr, nblocks); + + ptr += nblocks * SM4_BLOCK_SIZE; + n %= SM4_BLOCK_SIZE; + + continue; + } + } + + l = min(n, SM4_BLOCK_SIZE - len); + if (l) { + crypto_xor(mac + len, ptr, l); + len += l; + ptr += l; + n -= l; + } + } + + 
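+		/* this scatterlist segment has been fully absorbed into
+		 * the MAC; unmap it and continue with the rest of the
+		 * associated data */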
scatterwalk_unmap(p); + scatterwalk_done(&walk, 0, assoclen); + } while (assoclen); +} + +static int ccm_crypt(struct aead_request *req, struct skcipher_walk *walk, + u32 *rkey_enc, u8 mac[], + void (*sm4_ce_ccm_crypt)(const u32 *rkey_enc, u8 *dst, + const u8 *src, u8 *iv, + unsigned int nbytes, u8 *mac)) +{ + u8 __aligned(8) ctr0[SM4_BLOCK_SIZE]; + int err; + + /* preserve the initial ctr0 for the TAG */ + memcpy(ctr0, walk->iv, SM4_BLOCK_SIZE); + crypto_inc(walk->iv, SM4_BLOCK_SIZE); + + kernel_neon_begin(); + + if (req->assoclen) + ccm_calculate_auth_mac(req, mac); + + do { + unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE; + const u8 *src = walk->src.virt.addr; + u8 *dst = walk->dst.virt.addr; + + if (walk->nbytes == walk->total) + tail = 0; + + if (walk->nbytes - tail) + sm4_ce_ccm_crypt(rkey_enc, dst, src, walk->iv, + walk->nbytes - tail, mac); + + if (walk->nbytes == walk->total) + sm4_ce_ccm_final(rkey_enc, ctr0, mac); + + kernel_neon_end(); + + if (walk->nbytes) { + err = skcipher_walk_done(walk, tail); + if (err) + return err; + if (walk->nbytes) + kernel_neon_begin(); + } + } while (walk->nbytes > 0); + + return 0; +} + +static int ccm_encrypt(struct aead_request *req) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + struct sm4_ctx *ctx = crypto_aead_ctx(aead); + u8 __aligned(8) mac[SM4_BLOCK_SIZE]; + struct skcipher_walk walk; + int err; + + err = ccm_format_input(mac, req, req->cryptlen); + if (err) + return err; + + err = skcipher_walk_aead_encrypt(&walk, req, false); + if (err) + return err; + + err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_enc); + if (err) + return err; + + /* copy authtag to end of dst */ + scatterwalk_map_and_copy(mac, req->dst, req->assoclen + req->cryptlen, + crypto_aead_authsize(aead), 1); + + return 0; +} + +static int ccm_decrypt(struct aead_request *req) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + unsigned int authsize = crypto_aead_authsize(aead); + struct sm4_ctx *ctx = crypto_aead_ctx(aead); + u8 __aligned(8) mac[SM4_BLOCK_SIZE]; + u8 authtag[SM4_BLOCK_SIZE]; + struct skcipher_walk walk; + int err; + + err = ccm_format_input(mac, req, req->cryptlen - authsize); + if (err) + return err; + + err = skcipher_walk_aead_decrypt(&walk, req, false); + if (err) + return err; + + err = ccm_crypt(req, &walk, ctx->rkey_enc, mac, sm4_ce_ccm_dec); + if (err) + return err; + + /* compare calculated auth tag with the stored one */ + scatterwalk_map_and_copy(authtag, req->src, + req->assoclen + req->cryptlen - authsize, + authsize, 0); + + if (crypto_memneq(authtag, mac, authsize)) + return -EBADMSG; + + return 0; +} + +static struct aead_alg sm4_ccm_alg = { + .base = { + .cra_name = "ccm(sm4)", + .cra_driver_name = "ccm-sm4-ce", + .cra_priority = 400, + .cra_blocksize = 1, + .cra_ctxsize = sizeof(struct sm4_ctx), + .cra_module = THIS_MODULE, + }, + .ivsize = SM4_BLOCK_SIZE, + .chunksize = SM4_BLOCK_SIZE, + .maxauthsize = SM4_BLOCK_SIZE, + .setkey = ccm_setkey, + .setauthsize = ccm_setauthsize, + .encrypt = ccm_encrypt, + .decrypt = ccm_decrypt, +}; + +static int __init sm4_ce_ccm_init(void) +{ + return crypto_register_aead(&sm4_ccm_alg); +} + +static void __exit sm4_ce_ccm_exit(void) +{ + crypto_unregister_aead(&sm4_ccm_alg); +} + +module_cpu_feature_match(SM4, sm4_ce_ccm_init); +module_exit(sm4_ce_ccm_exit); + +MODULE_DESCRIPTION("Synchronous SM4 in CCM mode using ARMv8 Crypto Extensions"); +MODULE_ALIAS_CRYPTO("ccm(sm4)"); +MODULE_AUTHOR("Tianjia Zhang "); +MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:19 
2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988502 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 33C46C32771 for ; Mon, 26 Sep 2022 09:47:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=rpUkT5m1mbBKyNQhmFTRL+xo4OjF0PWqzYff3pt3Qdg=; b=khs+VTRAON6/hJ vfVE2Wyu3EZ8BzykZcVIQXH9z5XBzqJQcPvvbIy3QZPjxexqr4O49tpbWeruJVrN/ZusG1S/2CeCa 3bLvEu6EefRddwUrumBfofv8GSqfbaWgbbI0h/Ldd2hcxQNN2diueNrkVaHtdJWSA5zbn8JOjDonf Uuj21g6D6Awc5pgfE47mLv/+yFxkaLF+r6watCvG4K5S7M6xJaBNI49r8sG4Dz8RxbVyqKEUFaGWb D19p4zXeZ/MgDjBaJpBXW8llBFSv2Pa5jzY8MXW15MB6ukljoJda9hmAdcdxeK7pudAA49bdsHUCZ +HR8UceoKC2A+PNI+a3g==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockh2-003lXW-3o; Mon, 26 Sep 2022 09:46:29 +0000 Received: from out30-56.freemail.mail.aliyun.com ([115.124.30.56]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXp-003gv2-5I for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:37:13 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQkJzJn_1664185012; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQkJzJn_1664185012) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:53 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 15/16] crypto: arm64/sm4 - add CE implementation for GCM mode Date: Mon, 26 Sep 2022 17:36:19 +0800 Message-Id: <20220926093620.99898-16-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023657_667506_324C449C X-CRM114-Status: GOOD ( 16.22 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org This patch is a CE-optimized assembly implementation for GCM mode. 
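As a usage illustration (not part of this patch), the sketch below shows how a kernel-side caller could exercise the "gcm(sm4)" AEAD that this driver provides through the standard kernel crypto AEAD API. The helper name, the in-place buffer layout (AAD, then plaintext, then 16 bytes of tag space), the fixed 16-byte tag, and the error handling are assumptions made for the example only; the API calls themselves are the ordinary crypto_aead interface.

#include <crypto/aead.h>
#include <linux/scatterlist.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Illustrative sketch: encrypt ptlen bytes in place with gcm(sm4).
 * buf layout (kernel AEAD convention):
 *   [ assoclen bytes of AAD | ptlen bytes of plaintext | 16 bytes tag space ]
 */
static int sm4_gcm_encrypt_example(u8 *buf, unsigned int assoclen,
				   unsigned int ptlen,
				   const u8 key[16], const u8 iv[12])
{
	struct crypto_aead *tfm;
	struct aead_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	u8 ivbuf[12];
	int err;

	/* Resolves to the highest-priority provider of gcm(sm4),
	 * e.g. gcm-sm4-ce (priority 400) on CPUs with SM4 CE + PMULL. */
	tfm = crypto_alloc_aead("gcm(sm4)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_aead_setkey(tfm, key, 16);
	if (err)
		goto out_free_tfm;
	err = crypto_aead_setauthsize(tfm, 16);
	if (err)
		goto out_free_tfm;

	req = aead_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out_free_tfm;
	}

	memcpy(ivbuf, iv, sizeof(ivbuf));
	sg_init_one(&sg, buf, assoclen + ptlen + 16);

	aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
				  CRYPTO_TFM_REQ_MAY_SLEEP,
				  crypto_req_done, &wait);
	aead_request_set_ad(req, assoclen);
	/* cryptlen counts only the plaintext; the tag is appended to dst */
	aead_request_set_crypt(req, &sg, &sg, ptlen, ivbuf);

	err = crypto_wait_req(crypto_aead_encrypt(req), &wait);

	aead_request_free(req);
out_free_tfm:
	crypto_free_aead(tfm);
	return err;
}

Once this module is loaded, "gcm-sm4-ce" should appear in /proc/crypto as the driver backing "gcm(sm4)", replacing the generic gcm_base(ctr-sm4-ce,ghash-generic) composite used below as the baseline.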
Benchmark on T-Head Yitian-710 2.75 GHz, the data comes from the 224 and 224 modes of tcrypt, and compared the performance before and after this patch (the driver used before this patch is gcm_base(ctr-sm4-ce,ghash-generic)). The abscissas are blocks of different lengths. The data is tabulated and the unit is Mb/s: Before (gcm_base(ctr-sm4-ce,ghash-generic)): gcm(sm4) | 16 64 256 512 1024 1420 4096 8192 -------------+--------------------------------------------------------------------- GCM enc | 25.24 64.65 104.66 116.69 123.81 125.12 129.67 130.62 GCM dec | 25.40 64.80 104.74 116.70 123.81 125.21 129.68 130.59 GCM mb enc | 24.95 64.06 104.20 116.38 123.55 124.97 129.63 130.61 GCM mb dec | 24.92 64.00 104.13 116.34 123.55 124.98 129.56 130.48 After: gcm-sm4-ce | 16 64 256 512 1024 1420 4096 8192 -------------+--------------------------------------------------------------------- GCM enc | 108.62 397.18 971.60 1283.92 1522.77 1513.39 1777.00 1806.96 GCM dec | 116.36 398.14 1004.27 1319.11 1624.21 1635.43 1932.54 1974.20 GCM mb enc | 107.13 391.79 962.05 1274.94 1514.76 1508.57 1769.07 1801.58 GCM mb dec | 113.40 389.36 988.51 1307.68 1619.10 1631.55 1931.70 1970.86 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/Kconfig | 16 + arch/arm64/crypto/Makefile | 3 + arch/arm64/crypto/sm4-ce-gcm-core.S | 741 ++++++++++++++++++++++++++++ arch/arm64/crypto/sm4-ce-gcm-glue.c | 286 +++++++++++ 4 files changed, 1046 insertions(+) create mode 100644 arch/arm64/crypto/sm4-ce-gcm-core.S create mode 100644 arch/arm64/crypto/sm4-ce-gcm-glue.c diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig index 2611036a3e3f..6793d5bc3ee5 100644 --- a/arch/arm64/crypto/Kconfig +++ b/arch/arm64/crypto/Kconfig @@ -297,6 +297,22 @@ config CRYPTO_SM4_ARM64_CE_CCM - ARMv8 Crypto Extensions - NEON (Advanced SIMD) extensions +config CRYPTO_SM4_ARM64_CE_GCM + tristate "AEAD cipher: SM4 in GCM mode (ARMv8 Crypto Extensions)" + depends on KERNEL_MODE_NEON + select CRYPTO_ALGAPI + select CRYPTO_AEAD + select CRYPTO_SM4 + select CRYPTO_SM4_ARM64_CE_BLK + help + AEAD cipher: SM4 cipher algorithms (OSCCA GB/T 32907-2016) with + GCM (Galois/Counter Mode) authenticated encryption mode (NIST SP800-38D) + + Architecture: arm64 using: + - ARMv8 Crypto Extensions + - PMULL (Polynomial Multiply Long) instructions + - NEON (Advanced SIMD) extensions + config CRYPTO_CRCT10DIF_ARM64_CE tristate "CRCT10DIF (PMULL)" depends on KERNEL_MODE_NEON && CRC_T10DIF diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile index 843ea5266965..4818e204c2ac 100644 --- a/arch/arm64/crypto/Makefile +++ b/arch/arm64/crypto/Makefile @@ -32,6 +32,9 @@ sm4-ce-y := sm4-ce-glue.o sm4-ce-core.o obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_CCM) += sm4-ce-ccm.o sm4-ce-ccm-y := sm4-ce-ccm-glue.o sm4-ce-ccm-core.o +obj-$(CONFIG_CRYPTO_SM4_ARM64_CE_GCM) += sm4-ce-gcm.o +sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o + obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o diff --git a/arch/arm64/crypto/sm4-ce-gcm-core.S b/arch/arm64/crypto/sm4-ce-gcm-core.S new file mode 100644 index 000000000000..7aa3ec18a289 --- /dev/null +++ b/arch/arm64/crypto/sm4-ce-gcm-core.S @@ -0,0 +1,741 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions + * as specified in rfc8998 + * https://datatracker.ietf.org/doc/html/rfc8998 + * + * Copyright (C) 2016 Jussi Kivilinna + * Copyright (C) 2022 Tianjia Zhang + */ + +#include +#include +#include "sm4-ce-asm.h" + +.arch 
armv8-a+crypto + +.irp b, 0, 1, 2, 3, 24, 25, 26, 27, 28, 29, 30, 31 + .set .Lv\b\().4s, \b +.endr + +.macro sm4e, vd, vn + .inst 0xcec08400 | (.L\vn << 5) | .L\vd +.endm + +/* Register macros */ + +/* Used for both encryption and decryption */ +#define RHASH v21 +#define RRCONST v22 +#define RZERO v23 + +/* Helper macros. */ + +/* + * input: m0, m1 + * output: r0:r1 (low 128-bits in r0, high in r1) + */ +#define PMUL_128x128(r0, r1, m0, m1, T0, T1) \ + ext T0.16b, m1.16b, m1.16b, #8; \ + pmull r0.1q, m0.1d, m1.1d; \ + pmull T1.1q, m0.1d, T0.1d; \ + pmull2 T0.1q, m0.2d, T0.2d; \ + pmull2 r1.1q, m0.2d, m1.2d; \ + eor T0.16b, T0.16b, T1.16b; \ + ext T1.16b, RZERO.16b, T0.16b, #8; \ + ext T0.16b, T0.16b, RZERO.16b, #8; \ + eor r0.16b, r0.16b, T1.16b; \ + eor r1.16b, r1.16b, T0.16b; + +#define PMUL_128x128_4x(r0, r1, m0, m1, T0, T1, \ + r2, r3, m2, m3, T2, T3, \ + r4, r5, m4, m5, T4, T5, \ + r6, r7, m6, m7, T6, T7) \ + ext T0.16b, m1.16b, m1.16b, #8; \ + ext T2.16b, m3.16b, m3.16b, #8; \ + ext T4.16b, m5.16b, m5.16b, #8; \ + ext T6.16b, m7.16b, m7.16b, #8; \ + pmull r0.1q, m0.1d, m1.1d; \ + pmull r2.1q, m2.1d, m3.1d; \ + pmull r4.1q, m4.1d, m5.1d; \ + pmull r6.1q, m6.1d, m7.1d; \ + pmull T1.1q, m0.1d, T0.1d; \ + pmull T3.1q, m2.1d, T2.1d; \ + pmull T5.1q, m4.1d, T4.1d; \ + pmull T7.1q, m6.1d, T6.1d; \ + pmull2 T0.1q, m0.2d, T0.2d; \ + pmull2 T2.1q, m2.2d, T2.2d; \ + pmull2 T4.1q, m4.2d, T4.2d; \ + pmull2 T6.1q, m6.2d, T6.2d; \ + pmull2 r1.1q, m0.2d, m1.2d; \ + pmull2 r3.1q, m2.2d, m3.2d; \ + pmull2 r5.1q, m4.2d, m5.2d; \ + pmull2 r7.1q, m6.2d, m7.2d; \ + eor T0.16b, T0.16b, T1.16b; \ + eor T2.16b, T2.16b, T3.16b; \ + eor T4.16b, T4.16b, T5.16b; \ + eor T6.16b, T6.16b, T7.16b; \ + ext T1.16b, RZERO.16b, T0.16b, #8; \ + ext T3.16b, RZERO.16b, T2.16b, #8; \ + ext T5.16b, RZERO.16b, T4.16b, #8; \ + ext T7.16b, RZERO.16b, T6.16b, #8; \ + ext T0.16b, T0.16b, RZERO.16b, #8; \ + ext T2.16b, T2.16b, RZERO.16b, #8; \ + ext T4.16b, T4.16b, RZERO.16b, #8; \ + ext T6.16b, T6.16b, RZERO.16b, #8; \ + eor r0.16b, r0.16b, T1.16b; \ + eor r2.16b, r2.16b, T3.16b; \ + eor r4.16b, r4.16b, T5.16b; \ + eor r6.16b, r6.16b, T7.16b; \ + eor r1.16b, r1.16b, T0.16b; \ + eor r3.16b, r3.16b, T2.16b; \ + eor r5.16b, r5.16b, T4.16b; \ + eor r7.16b, r7.16b, T6.16b; + +/* + * input: r0:r1 (low 128-bits in r0, high in r1) + * output: a + */ +#define REDUCTION(a, r0, r1, rconst, T0, T1) \ + pmull2 T0.1q, r1.2d, rconst.2d; \ + ext T1.16b, T0.16b, RZERO.16b, #8; \ + ext T0.16b, RZERO.16b, T0.16b, #8; \ + eor r1.16b, r1.16b, T1.16b; \ + eor r0.16b, r0.16b, T0.16b; \ + pmull T0.1q, r1.1d, rconst.1d; \ + eor a.16b, r0.16b, T0.16b; + +#define SM4_CRYPT_PMUL_128x128_BLK(b0, r0, r1, m0, m1, T0, T1) \ + rev32 b0.16b, b0.16b; \ + ext T0.16b, m1.16b, m1.16b, #8; \ + sm4e b0.4s, v24.4s; \ + pmull r0.1q, m0.1d, m1.1d; \ + sm4e b0.4s, v25.4s; \ + pmull T1.1q, m0.1d, T0.1d; \ + sm4e b0.4s, v26.4s; \ + pmull2 T0.1q, m0.2d, T0.2d; \ + sm4e b0.4s, v27.4s; \ + pmull2 r1.1q, m0.2d, m1.2d; \ + sm4e b0.4s, v28.4s; \ + eor T0.16b, T0.16b, T1.16b; \ + sm4e b0.4s, v29.4s; \ + ext T1.16b, RZERO.16b, T0.16b, #8; \ + sm4e b0.4s, v30.4s; \ + ext T0.16b, T0.16b, RZERO.16b, #8; \ + sm4e b0.4s, v31.4s; \ + eor r0.16b, r0.16b, T1.16b; \ + rev64 b0.4s, b0.4s; \ + eor r1.16b, r1.16b, T0.16b; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + rev32 b0.16b, b0.16b; + +#define SM4_CRYPT_PMUL_128x128_BLK3(b0, b1, b2, \ + r0, r1, m0, m1, T0, T1, \ + r2, r3, m2, m3, T2, T3, \ + r4, r5, m4, m5, T4, T5) \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ 
+ ext T0.16b, m1.16b, m1.16b, #8; \ + ext T2.16b, m3.16b, m3.16b, #8; \ + ext T4.16b, m5.16b, m5.16b, #8; \ + sm4e b0.4s, v24.4s; \ + sm4e b1.4s, v24.4s; \ + sm4e b2.4s, v24.4s; \ + pmull r0.1q, m0.1d, m1.1d; \ + pmull r2.1q, m2.1d, m3.1d; \ + pmull r4.1q, m4.1d, m5.1d; \ + sm4e b0.4s, v25.4s; \ + sm4e b1.4s, v25.4s; \ + sm4e b2.4s, v25.4s; \ + pmull T1.1q, m0.1d, T0.1d; \ + pmull T3.1q, m2.1d, T2.1d; \ + pmull T5.1q, m4.1d, T4.1d; \ + sm4e b0.4s, v26.4s; \ + sm4e b1.4s, v26.4s; \ + sm4e b2.4s, v26.4s; \ + pmull2 T0.1q, m0.2d, T0.2d; \ + pmull2 T2.1q, m2.2d, T2.2d; \ + pmull2 T4.1q, m4.2d, T4.2d; \ + sm4e b0.4s, v27.4s; \ + sm4e b1.4s, v27.4s; \ + sm4e b2.4s, v27.4s; \ + pmull2 r1.1q, m0.2d, m1.2d; \ + pmull2 r3.1q, m2.2d, m3.2d; \ + pmull2 r5.1q, m4.2d, m5.2d; \ + sm4e b0.4s, v28.4s; \ + sm4e b1.4s, v28.4s; \ + sm4e b2.4s, v28.4s; \ + eor T0.16b, T0.16b, T1.16b; \ + eor T2.16b, T2.16b, T3.16b; \ + eor T4.16b, T4.16b, T5.16b; \ + sm4e b0.4s, v29.4s; \ + sm4e b1.4s, v29.4s; \ + sm4e b2.4s, v29.4s; \ + ext T1.16b, RZERO.16b, T0.16b, #8; \ + ext T3.16b, RZERO.16b, T2.16b, #8; \ + ext T5.16b, RZERO.16b, T4.16b, #8; \ + sm4e b0.4s, v30.4s; \ + sm4e b1.4s, v30.4s; \ + sm4e b2.4s, v30.4s; \ + ext T0.16b, T0.16b, RZERO.16b, #8; \ + ext T2.16b, T2.16b, RZERO.16b, #8; \ + ext T4.16b, T4.16b, RZERO.16b, #8; \ + sm4e b0.4s, v31.4s; \ + sm4e b1.4s, v31.4s; \ + sm4e b2.4s, v31.4s; \ + eor r0.16b, r0.16b, T1.16b; \ + eor r2.16b, r2.16b, T3.16b; \ + eor r4.16b, r4.16b, T5.16b; \ + rev64 b0.4s, b0.4s; \ + rev64 b1.4s, b1.4s; \ + rev64 b2.4s, b2.4s; \ + eor r1.16b, r1.16b, T0.16b; \ + eor r3.16b, r3.16b, T2.16b; \ + eor r5.16b, r5.16b, T4.16b; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + ext b1.16b, b1.16b, b1.16b, #8; \ + ext b2.16b, b2.16b, b2.16b, #8; \ + eor r0.16b, r0.16b, r2.16b; \ + eor r1.16b, r1.16b, r3.16b; \ + rev32 b0.16b, b0.16b; \ + rev32 b1.16b, b1.16b; \ + rev32 b2.16b, b2.16b; \ + eor r0.16b, r0.16b, r4.16b; \ + eor r1.16b, r1.16b, r5.16b; + +#define inc32_le128(vctr) \ + mov vctr.d[1], x9; \ + add w6, w9, #1; \ + mov vctr.d[0], x8; \ + bfi x9, x6, #0, #32; \ + rev64 vctr.16b, vctr.16b; + +#define GTAG_HASH_LENGTHS(vctr0, vlen) \ + ld1 {vlen.16b}, [x7]; \ + /* construct CTR0 */ \ + /* the lower 32-bits of initial IV is always be32(1) */ \ + mov x6, #0x1; \ + bfi x9, x6, #0, #32; \ + mov vctr0.d[0], x8; \ + mov vctr0.d[1], x9; \ + rbit vlen.16b, vlen.16b; \ + rev64 vctr0.16b, vctr0.16b; \ + /* authtag = GCTR(CTR0, GHASH) */ \ + eor RHASH.16b, RHASH.16b, vlen.16b; \ + SM4_CRYPT_PMUL_128x128_BLK(vctr0, RR0, RR1, RHASH, RH1, \ + RTMP0, RTMP1); \ + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3); \ + rbit RHASH.16b, RHASH.16b; \ + eor RHASH.16b, RHASH.16b, vctr0.16b; + + +/* Register macros for encrypt and ghash */ + +/* can be the same as input v0-v3 */ +#define RR1 v0 +#define RR3 v1 +#define RR5 v2 +#define RR7 v3 + +#define RR0 v4 +#define RR2 v5 +#define RR4 v6 +#define RR6 v7 + +#define RTMP0 v8 +#define RTMP1 v9 +#define RTMP2 v10 +#define RTMP3 v11 +#define RTMP4 v12 +#define RTMP5 v13 +#define RTMP6 v14 +#define RTMP7 v15 + +#define RH1 v16 +#define RH2 v17 +#define RH3 v18 +#define RH4 v19 + +.align 3 +SYM_FUNC_START(sm4_ce_pmull_ghash_setup) + /* input: + * x0: round key array, CTX + * x1: ghash table + */ + SM4_PREPARE(x0) + + adr_l x2, .Lghash_rconst + ld1r {RRCONST.2d}, [x2] + + eor RZERO.16b, RZERO.16b, RZERO.16b + + /* H = E(K, 0^128) */ + rev32 v0.16b, RZERO.16b + SM4_CRYPT_BLK_BE(v0) + + /* H ^ 1 */ + rbit RH1.16b, v0.16b + + /* H ^ 2 */ + PMUL_128x128(RR0, RR1, RH1, RH1, RTMP0, 
RTMP1) + REDUCTION(RH2, RR0, RR1, RRCONST, RTMP2, RTMP3) + + /* H ^ 3 */ + PMUL_128x128(RR0, RR1, RH2, RH1, RTMP0, RTMP1) + REDUCTION(RH3, RR0, RR1, RRCONST, RTMP2, RTMP3) + + /* H ^ 4 */ + PMUL_128x128(RR0, RR1, RH2, RH2, RTMP0, RTMP1) + REDUCTION(RH4, RR0, RR1, RRCONST, RTMP2, RTMP3) + + st1 {RH1.16b-RH4.16b}, [x1] + + ret +SYM_FUNC_END(sm4_ce_pmull_ghash_setup) + +.align 3 +SYM_FUNC_START(pmull_ghash_update) + /* input: + * x0: ghash table + * x1: ghash result + * x2: src + * w3: nblocks + */ + ld1 {RH1.16b-RH4.16b}, [x0] + + ld1 {RHASH.16b}, [x1] + rbit RHASH.16b, RHASH.16b + + adr_l x4, .Lghash_rconst + ld1r {RRCONST.2d}, [x4] + + eor RZERO.16b, RZERO.16b, RZERO.16b + +.Lghash_loop_4x: + cmp w3, #4 + blt .Lghash_loop_1x + + sub w3, w3, #4 + + ld1 {v0.16b-v3.16b}, [x2], #64 + + rbit v0.16b, v0.16b + rbit v1.16b, v1.16b + rbit v2.16b, v2.16b + rbit v3.16b, v3.16b + + /* + * (in0 ^ HASH) * H^4 => rr0:rr1 + * (in1) * H^3 => rr2:rr3 + * (in2) * H^2 => rr4:rr5 + * (in3) * H^1 => rr6:rr7 + */ + eor RHASH.16b, RHASH.16b, v0.16b + + PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1, + RR2, RR3, v1, RH3, RTMP2, RTMP3, + RR4, RR5, v2, RH2, RTMP4, RTMP5, + RR6, RR7, v3, RH1, RTMP6, RTMP7) + + eor RR0.16b, RR0.16b, RR2.16b + eor RR1.16b, RR1.16b, RR3.16b + eor RR0.16b, RR0.16b, RR4.16b + eor RR1.16b, RR1.16b, RR5.16b + eor RR0.16b, RR0.16b, RR6.16b + eor RR1.16b, RR1.16b, RR7.16b + + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1) + + cbz w3, .Lghash_end + b .Lghash_loop_4x + +.Lghash_loop_1x: + sub w3, w3, #1 + + ld1 {v0.16b}, [x2], #16 + rbit v0.16b, v0.16b + eor RHASH.16b, RHASH.16b, v0.16b + + PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1) + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3) + + cbnz w3, .Lghash_loop_1x + +.Lghash_end: + rbit RHASH.16b, RHASH.16b + st1 {RHASH.2d}, [x1] + + ret +SYM_FUNC_END(pmull_ghash_update) + +.align 3 +SYM_FUNC_START(sm4_ce_pmull_gcm_enc) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * w4: nbytes + * x5: ghash result + * x6: ghash table + * x7: lengths (only for last block) + */ + SM4_PREPARE(x0) + + ldp x8, x9, [x3] + rev x8, x8 + rev x9, x9 + + ld1 {RH1.16b-RH4.16b}, [x6] + + ld1 {RHASH.16b}, [x5] + rbit RHASH.16b, RHASH.16b + + adr_l x6, .Lghash_rconst + ld1r {RRCONST.2d}, [x6] + + eor RZERO.16b, RZERO.16b, RZERO.16b + + cbz w4, .Lgcm_enc_hash_len + +.Lgcm_enc_loop_4x: + cmp w4, #(4 * 16) + blt .Lgcm_enc_loop_1x + + sub w4, w4, #(4 * 16) + + /* construct CTRs */ + inc32_le128(v0) /* +0 */ + inc32_le128(v1) /* +1 */ + inc32_le128(v2) /* +2 */ + inc32_le128(v3) /* +3 */ + + ld1 {RTMP0.16b-RTMP3.16b}, [x2], #64 + + SM4_CRYPT_BLK4(v0, v1, v2, v3) + + eor v0.16b, v0.16b, RTMP0.16b + eor v1.16b, v1.16b, RTMP1.16b + eor v2.16b, v2.16b, RTMP2.16b + eor v3.16b, v3.16b, RTMP3.16b + st1 {v0.16b-v3.16b}, [x1], #64 + + /* ghash update */ + + rbit v0.16b, v0.16b + rbit v1.16b, v1.16b + rbit v2.16b, v2.16b + rbit v3.16b, v3.16b + + /* + * (in0 ^ HASH) * H^4 => rr0:rr1 + * (in1) * H^3 => rr2:rr3 + * (in2) * H^2 => rr4:rr5 + * (in3) * H^1 => rr6:rr7 + */ + eor RHASH.16b, RHASH.16b, v0.16b + + PMUL_128x128_4x(RR0, RR1, RHASH, RH4, RTMP0, RTMP1, + RR2, RR3, v1, RH3, RTMP2, RTMP3, + RR4, RR5, v2, RH2, RTMP4, RTMP5, + RR6, RR7, v3, RH1, RTMP6, RTMP7) + + eor RR0.16b, RR0.16b, RR2.16b + eor RR1.16b, RR1.16b, RR3.16b + eor RR0.16b, RR0.16b, RR4.16b + eor RR1.16b, RR1.16b, RR5.16b + eor RR0.16b, RR0.16b, RR6.16b + eor RR1.16b, RR1.16b, RR7.16b + + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1) + + cbz w4, 
.Lgcm_enc_hash_len + b .Lgcm_enc_loop_4x + +.Lgcm_enc_loop_1x: + cmp w4, #16 + blt .Lgcm_enc_tail + + sub w4, w4, #16 + + /* construct CTRs */ + inc32_le128(v0) + + ld1 {RTMP0.16b}, [x2], #16 + + SM4_CRYPT_BLK(v0) + + eor v0.16b, v0.16b, RTMP0.16b + st1 {v0.16b}, [x1], #16 + + /* ghash update */ + rbit v0.16b, v0.16b + eor RHASH.16b, RHASH.16b, v0.16b + PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1) + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3) + + cbz w4, .Lgcm_enc_hash_len + b .Lgcm_enc_loop_1x + +.Lgcm_enc_tail: + /* construct CTRs */ + inc32_le128(v0) + SM4_CRYPT_BLK(v0) + + /* load permute table */ + adr_l x0, .Lcts_permute_table + add x0, x0, #32 + sub x0, x0, w4, uxtw + ld1 {v3.16b}, [x0] + +.Lgcm_enc_tail_loop: + /* do encrypt */ + ldrb w0, [x2], #1 /* get 1 byte from input */ + umov w6, v0.b[0] /* get top crypted byte */ + eor w6, w6, w0 /* w6 = CTR ^ input */ + strb w6, [x1], #1 /* store out byte */ + + /* shift right out one byte */ + ext v0.16b, v0.16b, v0.16b, #1 + /* the last ciphertext is placed in high bytes */ + ins v0.b[15], w6 + + subs w4, w4, #1 + bne .Lgcm_enc_tail_loop + + /* padding last block with zeros */ + tbl v0.16b, {v0.16b}, v3.16b + + /* ghash update */ + rbit v0.16b, v0.16b + eor RHASH.16b, RHASH.16b, v0.16b + PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1) + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3) + +.Lgcm_enc_hash_len: + cbz x7, .Lgcm_enc_end + + GTAG_HASH_LENGTHS(v1, v3) + + b .Lgcm_enc_ret + +.Lgcm_enc_end: + /* store new CTR */ + rev x8, x8 + rev x9, x9 + stp x8, x9, [x3] + + rbit RHASH.16b, RHASH.16b + +.Lgcm_enc_ret: + /* store new MAC */ + st1 {RHASH.2d}, [x5] + + ret +SYM_FUNC_END(sm4_ce_pmull_gcm_enc) + +#undef RR1 +#undef RR3 +#undef RR5 +#undef RR7 +#undef RR0 +#undef RR2 +#undef RR4 +#undef RR6 +#undef RTMP0 +#undef RTMP1 +#undef RTMP2 +#undef RTMP3 +#undef RTMP4 +#undef RTMP5 +#undef RTMP6 +#undef RTMP7 +#undef RH1 +#undef RH2 +#undef RH3 +#undef RH4 + + +/* Register macros for decrypt */ + +/* v0-v2 for building CTRs, v3-v5 for saving inputs */ + +#define RR1 v6 +#define RR3 v7 +#define RR5 v8 + +#define RR0 v9 +#define RR2 v10 +#define RR4 v11 + +#define RTMP0 v12 +#define RTMP1 v13 +#define RTMP2 v14 +#define RTMP3 v15 +#define RTMP4 v16 +#define RTMP5 v17 + +#define RH1 v18 +#define RH2 v19 +#define RH3 v20 + +.align 3 +SYM_FUNC_START(sm4_ce_pmull_gcm_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: ctr (big endian, 128 bit) + * w4: nbytes + * x5: ghash result + * x6: ghash table + * x7: lengths (only for last block) + */ + SM4_PREPARE(x0) + + ldp x8, x9, [x3] + rev x8, x8 + rev x9, x9 + + ld1 {RH1.16b-RH3.16b}, [x6] + + ld1 {RHASH.16b}, [x5] + rbit RHASH.16b, RHASH.16b + + adr_l x6, .Lghash_rconst + ld1r {RRCONST.2d}, [x6] + + eor RZERO.16b, RZERO.16b, RZERO.16b + + cbz w4, .Lgcm_dec_hash_len + +.Lgcm_dec_loop_3x: + cmp w4, #(3 * 16) + blt .Lgcm_dec_loop_1x + + sub w4, w4, #(3 * 16) + + ld1 {v3.16b-v5.16b}, [x2], #(3 * 16) + + /* construct CTRs */ + inc32_le128(v0) /* +0 */ + rbit v6.16b, v3.16b + inc32_le128(v1) /* +1 */ + rbit v7.16b, v4.16b + inc32_le128(v2) /* +2 */ + rbit v8.16b, v5.16b + + eor RHASH.16b, RHASH.16b, v6.16b + + /* decrypt & ghash update */ + SM4_CRYPT_PMUL_128x128_BLK3(v0, v1, v2, + RR0, RR1, RHASH, RH3, RTMP0, RTMP1, + RR2, RR3, v7, RH2, RTMP2, RTMP3, + RR4, RR5, v8, RH1, RTMP4, RTMP5) + + eor v0.16b, v0.16b, v3.16b + eor v1.16b, v1.16b, v4.16b + eor v2.16b, v2.16b, v5.16b + + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP0, RTMP1) + + st1 {v0.16b-v2.16b}, [x1], #(3 * 16) + + 
cbz w4, .Lgcm_dec_hash_len + b .Lgcm_dec_loop_3x + +.Lgcm_dec_loop_1x: + cmp w4, #16 + blt .Lgcm_dec_tail + + sub w4, w4, #16 + + ld1 {v3.16b}, [x2], #16 + + /* construct CTRs */ + inc32_le128(v0) + rbit v6.16b, v3.16b + + eor RHASH.16b, RHASH.16b, v6.16b + + SM4_CRYPT_PMUL_128x128_BLK(v0, RR0, RR1, RHASH, RH1, RTMP0, RTMP1) + + eor v0.16b, v0.16b, v3.16b + + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3) + + st1 {v0.16b}, [x1], #16 + + cbz w4, .Lgcm_dec_hash_len + b .Lgcm_dec_loop_1x + +.Lgcm_dec_tail: + /* construct CTRs */ + inc32_le128(v0) + SM4_CRYPT_BLK(v0) + + /* load permute table */ + adr_l x0, .Lcts_permute_table + add x0, x0, #32 + sub x0, x0, w4, uxtw + ld1 {v3.16b}, [x0] + +.Lgcm_dec_tail_loop: + /* do decrypt */ + ldrb w0, [x2], #1 /* get 1 byte from input */ + umov w6, v0.b[0] /* get top crypted byte */ + eor w6, w6, w0 /* w6 = CTR ^ input */ + strb w6, [x1], #1 /* store out byte */ + + /* shift right out one byte */ + ext v0.16b, v0.16b, v0.16b, #1 + /* the last ciphertext is placed in high bytes */ + ins v0.b[15], w0 + + subs w4, w4, #1 + bne .Lgcm_dec_tail_loop + + /* padding last block with zeros */ + tbl v0.16b, {v0.16b}, v3.16b + + /* ghash update */ + rbit v0.16b, v0.16b + eor RHASH.16b, RHASH.16b, v0.16b + PMUL_128x128(RR0, RR1, RHASH, RH1, RTMP0, RTMP1) + REDUCTION(RHASH, RR0, RR1, RRCONST, RTMP2, RTMP3) + +.Lgcm_dec_hash_len: + cbz x7, .Lgcm_dec_end + + GTAG_HASH_LENGTHS(v1, v3) + + b .Lgcm_dec_ret + +.Lgcm_dec_end: + /* store new CTR */ + rev x8, x8 + rev x9, x9 + stp x8, x9, [x3] + + rbit RHASH.16b, RHASH.16b + +.Lgcm_dec_ret: + /* store new MAC */ + st1 {RHASH.2d}, [x5] + + ret +SYM_FUNC_END(sm4_ce_pmull_gcm_dec) + + .section ".rodata", "a" + .align 4 +.Lcts_permute_table: + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + .byte 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7 + .byte 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + .byte 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff + +.Lghash_rconst: + .quad 0x87 diff --git a/arch/arm64/crypto/sm4-ce-gcm-glue.c b/arch/arm64/crypto/sm4-ce-gcm-glue.c new file mode 100644 index 000000000000..e90ea0f17beb --- /dev/null +++ b/arch/arm64/crypto/sm4-ce-gcm-glue.c @@ -0,0 +1,286 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4-GCM AEAD Algorithm using ARMv8 Crypto Extensions + * as specified in rfc8998 + * https://datatracker.ietf.org/doc/html/rfc8998 + * + * Copyright (C) 2022 Tianjia Zhang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "sm4-ce.h" + +asmlinkage void sm4_ce_pmull_ghash_setup(const u32 *rkey_enc, u8 *ghash_table); +asmlinkage void pmull_ghash_update(const u8 *ghash_table, u8 *ghash, + const u8 *src, unsigned int nblocks); +asmlinkage void sm4_ce_pmull_gcm_enc(const u32 *rkey_enc, u8 *dst, + const u8 *src, u8 *iv, + unsigned int nbytes, u8 *ghash, + const u8 *ghash_table, const u8 *lengths); +asmlinkage void sm4_ce_pmull_gcm_dec(const u32 *rkey_enc, u8 *dst, + const u8 *src, u8 *iv, + unsigned int nbytes, u8 *ghash, + const u8 *ghash_table, const u8 *lengths); + +#define GHASH_BLOCK_SIZE 16 +#define GCM_IV_SIZE 12 + +struct sm4_gcm_ctx { + struct sm4_ctx key; + u8 ghash_table[16 * 4]; +}; + + +static int gcm_setkey(struct crypto_aead *tfm, const u8 *key, + unsigned int key_len) +{ + struct sm4_gcm_ctx *ctx = crypto_aead_ctx(tfm); + + if (key_len != SM4_KEY_SIZE) + return -EINVAL; + + kernel_neon_begin(); + + 
sm4_ce_expand_key(key, ctx->key.rkey_enc, ctx->key.rkey_dec, + crypto_sm4_fk, crypto_sm4_ck); + sm4_ce_pmull_ghash_setup(ctx->key.rkey_enc, ctx->ghash_table); + + kernel_neon_end(); + return 0; +} + +static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize) +{ + switch (authsize) { + case 4: + case 8: + case 12 ... 16: + return 0; + default: + return -EINVAL; + } +} + +static void gcm_calculate_auth_mac(struct aead_request *req, u8 ghash[]) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead); + u8 __aligned(8) buffer[GHASH_BLOCK_SIZE]; + u32 assoclen = req->assoclen; + struct scatter_walk walk; + unsigned int buflen = 0; + + scatterwalk_start(&walk, req->src); + + do { + u32 n = scatterwalk_clamp(&walk, assoclen); + u8 *p, *ptr; + + if (!n) { + scatterwalk_start(&walk, sg_next(walk.sg)); + n = scatterwalk_clamp(&walk, assoclen); + } + + p = ptr = scatterwalk_map(&walk); + assoclen -= n; + scatterwalk_advance(&walk, n); + + if (n + buflen < GHASH_BLOCK_SIZE) { + memcpy(&buffer[buflen], ptr, n); + buflen += n; + } else { + unsigned int nblocks; + + if (buflen) { + unsigned int l = GHASH_BLOCK_SIZE - buflen; + + memcpy(&buffer[buflen], ptr, l); + ptr += l; + n -= l; + + pmull_ghash_update(ctx->ghash_table, ghash, + buffer, 1); + } + + nblocks = n / GHASH_BLOCK_SIZE; + if (nblocks) { + pmull_ghash_update(ctx->ghash_table, ghash, + ptr, nblocks); + ptr += nblocks * GHASH_BLOCK_SIZE; + } + + buflen = n % GHASH_BLOCK_SIZE; + if (buflen) + memcpy(&buffer[0], ptr, buflen); + } + + scatterwalk_unmap(p); + scatterwalk_done(&walk, 0, assoclen); + } while (assoclen); + + /* padding with '0' */ + if (buflen) { + memset(&buffer[buflen], 0, GHASH_BLOCK_SIZE - buflen); + pmull_ghash_update(ctx->ghash_table, ghash, buffer, 1); + } +} + +static int gcm_crypt(struct aead_request *req, struct skcipher_walk *walk, + struct sm4_gcm_ctx *ctx, u8 ghash[], + void (*sm4_ce_pmull_gcm_crypt)(const u32 *rkey_enc, + u8 *dst, const u8 *src, u8 *iv, + unsigned int nbytes, u8 *ghash, + const u8 *ghash_table, const u8 *lengths)) +{ + u8 __aligned(8) iv[SM4_BLOCK_SIZE]; + be128 __aligned(8) lengths; + int err; + + memset(ghash, 0, SM4_BLOCK_SIZE); + + lengths.a = cpu_to_be64(req->assoclen * 8); + lengths.b = cpu_to_be64(walk->total * 8); + + memcpy(iv, walk->iv, GCM_IV_SIZE); + put_unaligned_be32(2, iv + GCM_IV_SIZE); + + kernel_neon_begin(); + + if (req->assoclen) + gcm_calculate_auth_mac(req, ghash); + + do { + unsigned int tail = walk->nbytes % SM4_BLOCK_SIZE; + const u8 *src = walk->src.virt.addr; + u8 *dst = walk->dst.virt.addr; + + if (walk->nbytes == walk->total) { + tail = 0; + + sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv, + walk->nbytes, ghash, + ctx->ghash_table, + (const u8 *)&lengths); + } else if (walk->nbytes - tail) { + sm4_ce_pmull_gcm_crypt(ctx->key.rkey_enc, dst, src, iv, + walk->nbytes - tail, ghash, + ctx->ghash_table, NULL); + } + + kernel_neon_end(); + + err = skcipher_walk_done(walk, tail); + if (err) + return err; + if (walk->nbytes) + kernel_neon_begin(); + } while (walk->nbytes > 0); + + return 0; +} + +static int gcm_encrypt(struct aead_request *req) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead); + u8 __aligned(8) ghash[SM4_BLOCK_SIZE]; + struct skcipher_walk walk; + int err; + + err = skcipher_walk_aead_encrypt(&walk, req, false); + if (err) + return err; + + err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_enc); + if (err) + return err; + + /* 
copy authtag to end of dst */ + scatterwalk_map_and_copy(ghash, req->dst, req->assoclen + req->cryptlen, + crypto_aead_authsize(aead), 1); + + return 0; +} + +static int gcm_decrypt(struct aead_request *req) +{ + struct crypto_aead *aead = crypto_aead_reqtfm(req); + unsigned int authsize = crypto_aead_authsize(aead); + struct sm4_gcm_ctx *ctx = crypto_aead_ctx(aead); + u8 __aligned(8) ghash[SM4_BLOCK_SIZE]; + u8 authtag[SM4_BLOCK_SIZE]; + struct skcipher_walk walk; + int err; + + err = skcipher_walk_aead_decrypt(&walk, req, false); + if (err) + return err; + + err = gcm_crypt(req, &walk, ctx, ghash, sm4_ce_pmull_gcm_dec); + if (err) + return err; + + /* compare calculated auth tag with the stored one */ + scatterwalk_map_and_copy(authtag, req->src, + req->assoclen + req->cryptlen - authsize, + authsize, 0); + + if (crypto_memneq(authtag, ghash, authsize)) + return -EBADMSG; + + return 0; +} + +static struct aead_alg sm4_gcm_alg = { + .base = { + .cra_name = "gcm(sm4)", + .cra_driver_name = "gcm-sm4-ce", + .cra_priority = 400, + .cra_blocksize = 1, + .cra_ctxsize = sizeof(struct sm4_gcm_ctx), + .cra_module = THIS_MODULE, + }, + .ivsize = GCM_IV_SIZE, + .chunksize = SM4_BLOCK_SIZE, + .maxauthsize = SM4_BLOCK_SIZE, + .setkey = gcm_setkey, + .setauthsize = gcm_setauthsize, + .encrypt = gcm_encrypt, + .decrypt = gcm_decrypt, +}; + +static int __init sm4_ce_gcm_init(void) +{ + if (!cpu_have_named_feature(PMULL)) + return -ENODEV; + + return crypto_register_aead(&sm4_gcm_alg); +} + +static void __exit sm4_ce_gcm_exit(void) +{ + crypto_unregister_aead(&sm4_gcm_alg); +} + +static const struct cpu_feature sm4_ce_gcm_cpu_feature[] = { + { cpu_feature(PMULL) }, + {} +}; +MODULE_DEVICE_TABLE(cpu, sm4_ce_gcm_cpu_feature); + +module_cpu_feature_match(SM4, sm4_ce_gcm_init); +module_exit(sm4_ce_gcm_exit); + +MODULE_DESCRIPTION("Synchronous SM4 in GCM mode using ARMv8 Crypto Extensions"); +MODULE_ALIAS_CRYPTO("gcm(sm4)"); +MODULE_AUTHOR("Tianjia Zhang "); +MODULE_LICENSE("GPL v2"); From patchwork Mon Sep 26 09:36:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "tianjia.zhang" X-Patchwork-Id: 12988503 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4CD5BC07E9D for ; Mon, 26 Sep 2022 09:49:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:To:From:Reply-To:Cc:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=9NcMf1fwHT5CAcsnaRB+deSposESCt4czMU0P646FlI=; b=XgGqHL2Y4WcVdq Qx0F4ac1rr2HUfdwxHl1cj65k5fINceB4zILCo68bo7XY6l//xZx1vK5LG7d5x8J2mUc8qArSPf5k vhU1Z4JQxlVsfdjHRaXxj8L5v2ggovd/GKMUCl44YPvjhUfGf5iS4StF4cybshe8UAcc5mZw+H/kD KspI1PVxFKk5NvpT7BpR+wOFMuyR5zh3iyvhxgU5eJz3mwoc2JjdT/T+McPHFtTykQ8Af2VNFGxyr GdHY256MZ4TwEX9ybt4ZVrPp7J8LljtrhXhFMdrmSfmivEdWsDrVr7erzp1NeNj6DLp1WUkKBg/q7 lxUG4gIt2KGLgB+NBJBA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 
(Red Hat Linux)) id 1ockiB-003m2c-AL; Mon, 26 Sep 2022 09:47:40 +0000 Received: from out30-54.freemail.mail.aliyun.com ([115.124.30.54]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ockXq-003gvi-WE for linux-arm-kernel@lists.infradead.org; Mon, 26 Sep 2022 09:37:21 +0000 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R211e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046049;MF=tianjia.zhang@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0VQjskeo_1664185014; Received: from localhost(mailfrom:tianjia.zhang@linux.alibaba.com fp:SMTPD_---0VQjskeo_1664185014) by smtp.aliyun-inc.com; Mon, 26 Sep 2022 17:36:55 +0800 From: Tianjia Zhang To: Herbert Xu , "David S. Miller" , Jussi Kivilinna , Ard Biesheuvel , Catalin Marinas , Will Deacon , Maxime Coquelin , Alexandre Torgue , Eric Biggers , linux-crypto@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com Subject: [PATCH 16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation Date: Mon, 26 Sep 2022 17:36:20 +0800 Message-Id: <20220926093620.99898-17-tianjia.zhang@linux.alibaba.com> X-Mailer: git-send-email 2.24.3 (Apple Git-128) In-Reply-To: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> References: <20220926093620.99898-1-tianjia.zhang@linux.alibaba.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220926_023659_677947_809CA2CC X-CRM114-Status: GOOD ( 17.59 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Scalable Vector Extension (SVE) is the next-generation SIMD extension for arm64. SVE allows CPU implementations to choose a flexible vector length: it can vary from a minimum of 128 bits up to a maximum of 2048 bits, in 128-bit increments. The SVE design guarantees that the same application can run on different implementations that support SVE without recompiling the code. SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to extend and improve it. Similar to the Crypto Extension instructions available alongside the NEON instruction set, SVE also provides cryptography acceleration instructions, and these are likewise an optional part of the instruction set. This patch uses the SM4 cryptography acceleration instructions together with SVE2 instructions to optimize the SM4 algorithm for the ECB/CBC/CFB/CTR modes. Since CBC/CFB encryption cannot be parallelized, the Crypto Extension instructions are used for those paths. Since no test environment with a vector length (VL) greater than 128 bits was available, the performance data was obtained on a machine whose VL is 128 bits; because this driver is enabled when the VL is greater than 128 bits, these numbers are for reference only. The data shows little difference between the Crypto Extension implementation and SVE at VL = 128 bits; the optimization effect will be more obvious at VL = 256 bits or longer. Benchmark on T-Head Yitian-710 2.75 GHz; the data comes from mode 218 of tcrypt and is compared against the Crypto Extension implementation. The abscissas are blocks of different lengths.
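Before the tcrypt numbers, a small standalone C sketch (illustrative only, not part of the patch) of how many SM4 blocks each pass of the new code handles as VL grows; the divisors mirror the rdvl x5, #1 / LSR #1/#2/#4 arithmetic used by the 8x/4x/1x loops in sm4-sve-ce-core.S below.

#include <stdio.h>

#define SM4_BLOCK_SIZE 16

/* Each SVE Z register holds VL/16 SM4 blocks; the 8x, 4x and 1x paths in the
 * assembly consume VL/2, VL/4 and VL/16 blocks per iteration respectively
 * (x5 holds VL in bytes, and the loops subtract x5 LSR #1, #2 or #4 blocks). */
static void sm4_sve_block_counts(unsigned int vl_bytes)
{
	unsigned int per_zreg = vl_bytes / SM4_BLOCK_SIZE;
	unsigned int per_4x_pass = vl_bytes / 4;
	unsigned int per_8x_pass = vl_bytes / 2;

	printf("VL = %4u bits: %2u blocks/Z-reg, %3u blocks per 4x pass, %3u blocks per 8x pass\n",
	       vl_bytes * 8, per_zreg, per_4x_pass, per_8x_pass);
}

int main(void)
{
	unsigned int vl_bytes;

	/* SVE permits VL from 128 up to 2048 bits in 128-bit increments;
	 * only a few representative values are printed here. */
	for (vl_bytes = 16; vl_bytes <= 64; vl_bytes += 16)
		sm4_sve_block_counts(vl_bytes);
	return 0;
}

This is consistent with the note above: at VL = 128 bits each pass covers relatively few blocks, so the SVE path has little advantage over the Crypto Extension code, while longer VLs process proportionally more blocks per iteration.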
The data is tabulated and the unit is Mb/s: sm4-ce | 16 64 128 256 1024 1420 4096 ------------+-------------------------------------------------------------- ECB enc | 315.18 1162.65 1815.66 2553.50 3692.91 3727.20 4001.93 ECB dec | 316.06 1172.97 1817.81 2554.66 3692.18 3786.54 4001.93 CBC enc | 304.82 629.54 768.65 864.72 953.90 963.32 974.06 CBC dec | 306.05 1142.53 1805.11 2481.67 3522.06 3587.87 3790.99 CFB enc | 309.48 635.70 774.44 865.85 950.62 952.68 968.24 CFB dec | 315.98 1170.38 1828.75 2509.72 3543.63 3539.40 3793.25 CTR enc | 285.83 1036.59 1583.50 2147.26 2933.54 2954.66 3041.14 CTR dec | 285.29 1037.47 1584.67 2145.51 2934.10 2950.89 3041.62 sm4-sve-ce (VL = 128 bits) ECB enc | 310.00 1154.70 1813.26 2579.74 3766.90 3869.45 4100.26 ECB dec | 315.60 1176.22 1838.06 2593.69 3774.95 3878.42 4098.83 CBC enc | 303.44 622.65 764.67 861.40 953.18 963.05 973.77 CBC dec | 302.13 1091.15 1689.10 2267.79 3182.84 3242.68 3408.92 CFB enc | 296.62 620.41 762.94 858.96 948.18 956.04 967.67 CFB dec | 291.23 1065.50 1637.33 2228.12 3158.52 3213.35 3403.83 CTR enc | 272.27 959.35 1466.34 1934.24 2562.80 2595.87 2695.15 CTR dec | 273.40 963.65 1471.83 1938.97 2563.12 2597.25 2694.54 Signed-off-by: Tianjia Zhang --- arch/arm64/crypto/Kconfig | 19 + arch/arm64/crypto/Makefile | 3 + arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++ arch/arm64/crypto/sm4-sve-ce-glue.c | 332 +++++++++ 4 files changed, 1382 insertions(+) create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig index 6793d5bc3ee5..bbb5a7a08af5 100644 --- a/arch/arm64/crypto/Kconfig +++ b/arch/arm64/crypto/Kconfig @@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK - ARMv8 Crypto Extensions - NEON (Advanced SIMD) extensions +config CRYPTO_SM4_ARM64_SVE_CE_BLK + tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)" + depends on KERNEL_MODE_NEON + select CRYPTO_SKCIPHER + select CRYPTO_SM4 + select CRYPTO_SM4_ARM64_CE_BLK + help + Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016) + with block cipher modes: + - ECB (Electronic Codebook) mode (NIST SP800-38A) + - CBC (Cipher Block Chaining) mode (NIST SP800-38A) + - CFB (Cipher Feedback) mode (NIST SP800-38A) + - CTR (Counter) mode (NIST SP800-38A) + + Architecture: arm64 using: + - ARMv8 Crypto Extensions + - ARMv9 cryptography acceleration with SVE2 + - NEON (Advanced SIMD) extensions + config CRYPTO_SM4_ARM64_NEON_BLK tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)" depends on KERNEL_MODE_NEON diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile index 4818e204c2ac..355dd9053434 100644 --- a/arch/arm64/crypto/Makefile +++ b/arch/arm64/crypto/Makefile @@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o +obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o +sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o + obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S new file mode 100644 index 000000000000..caecbdf2536c --- /dev/null +++ b/arch/arm64/crypto/sm4-sve-ce-core.S @@ -0,0 +1,1028 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2 + * as specified 
in + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html + * + * Copyright (C) 2022, Alibaba Group. + * Copyright (C) 2022 Tianjia Zhang + */ + +#include +#include + +.arch armv8-a+crypto+sve+sve2 + +.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31 + .set .Lv\b\().4s, \b +.endr + +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \ + 16, 24, 25, 26, 27, 28, 29, 30, 31 + .set .Lz\b\().s, \b +.endr + +.macro sm4e, vd, vn + .inst 0xcec08400 | (.L\vn << 5) | .L\vd +.endm + +.macro sm4e_sve, zd, zn + .inst 0x4523e000 | (.L\zn << 5) | .L\zd +.endm + + +/* Register macros */ + +#define RCTR z16 +#define RCTRv v16 +#define RIV z16 +#define RIVv v16 +#define RSWAP128 z17 +#define RZERO z18 +#define RLE128_INC z19 + +#define RTMP0 z20 +#define RTMP0v v20 +#define RTMP1 z21 +#define RTMP2 z22 +#define RTMP3 z23 + + +/* Helper macros. */ + +#define SM4_PREPARE(ptr) \ + adr_l x7, .Lbswap128_mask; \ + ptrue p0.b, ALL; \ + rdvl x5, #1; \ + ld1b {RSWAP128.b}, p0/z, [x7]; \ + \ + ld1 {v24.16b-v27.16b}, [ptr], #64; \ + ld1 {v28.16b-v31.16b}, [ptr]; \ + dup z24.q, z24.q[0]; \ + dup z25.q, z25.q[0]; \ + dup z26.q, z26.q[0]; \ + dup z27.q, z27.q[0]; \ + dup z28.q, z28.q[0]; \ + dup z29.q, z29.q[0]; \ + dup z30.q, z30.q[0]; \ + dup z31.q, z31.q[0]; + +#define SM4_SVE_CE_CRYPT_BLK(b0) \ + revb b0.s, p0/m, b0.s; \ + sm4e_sve b0.s, z24.s; \ + sm4e_sve b0.s, z25.s; \ + sm4e_sve b0.s, z26.s; \ + sm4e_sve b0.s, z27.s; \ + sm4e_sve b0.s, z28.s; \ + sm4e_sve b0.s, z29.s; \ + sm4e_sve b0.s, z30.s; \ + sm4e_sve b0.s, z31.s; \ + tbl b0.b, {b0.b}, RSWAP128.b; \ + revb b0.s, p0/m, b0.s; + +#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3) \ + revb b0.s, p0/m, b0.s; \ + revb b1.s, p0/m, b1.s; \ + revb b2.s, p0/m, b2.s; \ + revb b3.s, p0/m, b3.s; \ + sm4e_sve b0.s, z24.s; \ + sm4e_sve b1.s, z24.s; \ + sm4e_sve b2.s, z24.s; \ + sm4e_sve b3.s, z24.s; \ + sm4e_sve b0.s, z25.s; \ + sm4e_sve b1.s, z25.s; \ + sm4e_sve b2.s, z25.s; \ + sm4e_sve b3.s, z25.s; \ + sm4e_sve b0.s, z26.s; \ + sm4e_sve b1.s, z26.s; \ + sm4e_sve b2.s, z26.s; \ + sm4e_sve b3.s, z26.s; \ + sm4e_sve b0.s, z27.s; \ + sm4e_sve b1.s, z27.s; \ + sm4e_sve b2.s, z27.s; \ + sm4e_sve b3.s, z27.s; \ + sm4e_sve b0.s, z28.s; \ + sm4e_sve b1.s, z28.s; \ + sm4e_sve b2.s, z28.s; \ + sm4e_sve b3.s, z28.s; \ + sm4e_sve b0.s, z29.s; \ + sm4e_sve b1.s, z29.s; \ + sm4e_sve b2.s, z29.s; \ + sm4e_sve b3.s, z29.s; \ + sm4e_sve b0.s, z30.s; \ + sm4e_sve b1.s, z30.s; \ + sm4e_sve b2.s, z30.s; \ + sm4e_sve b3.s, z30.s; \ + sm4e_sve b0.s, z31.s; \ + sm4e_sve b1.s, z31.s; \ + sm4e_sve b2.s, z31.s; \ + sm4e_sve b3.s, z31.s; \ + tbl b0.b, {b0.b}, RSWAP128.b; \ + tbl b1.b, {b1.b}, RSWAP128.b; \ + tbl b2.b, {b2.b}, RSWAP128.b; \ + tbl b3.b, {b3.b}, RSWAP128.b; \ + revb b0.s, p0/m, b0.s; \ + revb b1.s, p0/m, b1.s; \ + revb b2.s, p0/m, b2.s; \ + revb b3.s, p0/m, b3.s; + +#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7) \ + revb b0.s, p0/m, b0.s; \ + revb b1.s, p0/m, b1.s; \ + revb b2.s, p0/m, b2.s; \ + revb b3.s, p0/m, b3.s; \ + revb b4.s, p0/m, b4.s; \ + revb b5.s, p0/m, b5.s; \ + revb b6.s, p0/m, b6.s; \ + revb b7.s, p0/m, b7.s; \ + sm4e_sve b0.s, z24.s; \ + sm4e_sve b1.s, z24.s; \ + sm4e_sve b2.s, z24.s; \ + sm4e_sve b3.s, z24.s; \ + sm4e_sve b4.s, z24.s; \ + sm4e_sve b5.s, z24.s; \ + sm4e_sve b6.s, z24.s; \ + sm4e_sve b7.s, z24.s; \ + sm4e_sve b0.s, z25.s; \ + sm4e_sve b1.s, z25.s; \ + sm4e_sve b2.s, z25.s; \ + sm4e_sve b3.s, z25.s; \ + sm4e_sve b4.s, z25.s; \ + sm4e_sve b5.s, z25.s; \ + sm4e_sve b6.s, z25.s; \ + sm4e_sve b7.s, z25.s; \ + sm4e_sve b0.s, z26.s; \ + 
sm4e_sve b1.s, z26.s; \ + sm4e_sve b2.s, z26.s; \ + sm4e_sve b3.s, z26.s; \ + sm4e_sve b4.s, z26.s; \ + sm4e_sve b5.s, z26.s; \ + sm4e_sve b6.s, z26.s; \ + sm4e_sve b7.s, z26.s; \ + sm4e_sve b0.s, z27.s; \ + sm4e_sve b1.s, z27.s; \ + sm4e_sve b2.s, z27.s; \ + sm4e_sve b3.s, z27.s; \ + sm4e_sve b4.s, z27.s; \ + sm4e_sve b5.s, z27.s; \ + sm4e_sve b6.s, z27.s; \ + sm4e_sve b7.s, z27.s; \ + sm4e_sve b0.s, z28.s; \ + sm4e_sve b1.s, z28.s; \ + sm4e_sve b2.s, z28.s; \ + sm4e_sve b3.s, z28.s; \ + sm4e_sve b4.s, z28.s; \ + sm4e_sve b5.s, z28.s; \ + sm4e_sve b6.s, z28.s; \ + sm4e_sve b7.s, z28.s; \ + sm4e_sve b0.s, z29.s; \ + sm4e_sve b1.s, z29.s; \ + sm4e_sve b2.s, z29.s; \ + sm4e_sve b3.s, z29.s; \ + sm4e_sve b4.s, z29.s; \ + sm4e_sve b5.s, z29.s; \ + sm4e_sve b6.s, z29.s; \ + sm4e_sve b7.s, z29.s; \ + sm4e_sve b0.s, z30.s; \ + sm4e_sve b1.s, z30.s; \ + sm4e_sve b2.s, z30.s; \ + sm4e_sve b3.s, z30.s; \ + sm4e_sve b4.s, z30.s; \ + sm4e_sve b5.s, z30.s; \ + sm4e_sve b6.s, z30.s; \ + sm4e_sve b7.s, z30.s; \ + sm4e_sve b0.s, z31.s; \ + sm4e_sve b1.s, z31.s; \ + sm4e_sve b2.s, z31.s; \ + sm4e_sve b3.s, z31.s; \ + sm4e_sve b4.s, z31.s; \ + sm4e_sve b5.s, z31.s; \ + sm4e_sve b6.s, z31.s; \ + sm4e_sve b7.s, z31.s; \ + tbl b0.b, {b0.b}, RSWAP128.b; \ + tbl b1.b, {b1.b}, RSWAP128.b; \ + tbl b2.b, {b2.b}, RSWAP128.b; \ + tbl b3.b, {b3.b}, RSWAP128.b; \ + tbl b4.b, {b4.b}, RSWAP128.b; \ + tbl b5.b, {b5.b}, RSWAP128.b; \ + tbl b6.b, {b6.b}, RSWAP128.b; \ + tbl b7.b, {b7.b}, RSWAP128.b; \ + revb b0.s, p0/m, b0.s; \ + revb b1.s, p0/m, b1.s; \ + revb b2.s, p0/m, b2.s; \ + revb b3.s, p0/m, b3.s; \ + revb b4.s, p0/m, b4.s; \ + revb b5.s, p0/m, b5.s; \ + revb b6.s, p0/m, b6.s; \ + revb b7.s, p0/m, b7.s; + +#define SM4_CE_CRYPT_BLK(b0) \ + rev32 b0.16b, b0.16b; \ + sm4e b0.4s, v24.4s; \ + sm4e b0.4s, v25.4s; \ + sm4e b0.4s, v26.4s; \ + sm4e b0.4s, v27.4s; \ + sm4e b0.4s, v28.4s; \ + sm4e b0.4s, v29.4s; \ + sm4e b0.4s, v30.4s; \ + sm4e b0.4s, v31.4s; \ + rev64 b0.4s, b0.4s; \ + ext b0.16b, b0.16b, b0.16b, #8; \ + rev32 b0.16b, b0.16b; + +#define inc_le128(zctr) \ + mov RCTRv.d[1], x8; \ + mov RCTRv.d[0], x7; \ + mov zctr.d, RLE128_INC.d; \ + dup RCTR.q, RCTR.q[0]; \ + adds x8, x8, x5, LSR #4; \ + adclt zctr.d, RCTR.d, RZERO.d; \ + adclt RCTR.d, zctr.d, RZERO.d; \ + adc x7, x7, xzr; \ + trn1 zctr.d, RCTR.d, zctr.d; \ + revb zctr.d, p0/m, zctr.d; + +#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3) \ + mov v8.d[1], x8; \ + mov v8.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr0.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v9.d[1], x8; \ + mov v9.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr1.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v10.d[1], x8; \ + mov v10.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr2.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v11.d[1], x8; \ + mov v11.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr3.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + dup z8.q, z8.q[0]; \ + dup z9.q, z9.q[0]; \ + dup z10.q, z10.q[0]; \ + dup z11.q, z11.q[0]; \ + adclt zctr0.d, z8.d, RZERO.d; \ + adclt zctr1.d, z9.d, RZERO.d; \ + adclt zctr2.d, z10.d, RZERO.d; \ + adclt zctr3.d, z11.d, RZERO.d; \ + adclt z8.d, zctr0.d, RZERO.d; \ + adclt z9.d, zctr1.d, RZERO.d; \ + adclt z10.d, zctr2.d, RZERO.d; \ + adclt z11.d, zctr3.d, RZERO.d; \ + trn1 zctr0.d, z8.d, zctr0.d; \ + trn1 zctr1.d, z9.d, zctr1.d; \ + trn1 zctr2.d, z10.d, zctr2.d; \ + trn1 zctr3.d, z11.d, zctr3.d; \ + revb zctr0.d, p0/m, zctr0.d; \ + revb zctr1.d, p0/m, zctr1.d; \ + revb zctr2.d, p0/m, zctr2.d; \ + revb zctr3.d, p0/m, zctr3.d; + +#define 
inc_le128_8x(zctr0, zctr1, zctr2, zctr3, \ + zctr4, zctr5, zctr6, zctr7) \ + mov v8.d[1], x8; \ + mov v8.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr0.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v9.d[1], x8; \ + mov v9.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr1.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v10.d[1], x8; \ + mov v10.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr2.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v11.d[1], x8; \ + mov v11.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr3.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v12.d[1], x8; \ + mov v12.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr4.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v13.d[1], x8; \ + mov v13.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr5.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v14.d[1], x8; \ + mov v14.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr6.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + mov v15.d[1], x8; \ + mov v15.d[0], x7; \ + adds x8, x8, x5, LSR #4; \ + mov zctr7.d, RLE128_INC.d; \ + adc x7, x7, xzr; \ + dup z8.q, z8.q[0]; \ + dup z9.q, z9.q[0]; \ + dup z10.q, z10.q[0]; \ + dup z11.q, z11.q[0]; \ + dup z12.q, z12.q[0]; \ + dup z13.q, z13.q[0]; \ + dup z14.q, z14.q[0]; \ + dup z15.q, z15.q[0]; \ + adclt zctr0.d, z8.d, RZERO.d; \ + adclt zctr1.d, z9.d, RZERO.d; \ + adclt zctr2.d, z10.d, RZERO.d; \ + adclt zctr3.d, z11.d, RZERO.d; \ + adclt zctr4.d, z12.d, RZERO.d; \ + adclt zctr5.d, z13.d, RZERO.d; \ + adclt zctr6.d, z14.d, RZERO.d; \ + adclt zctr7.d, z15.d, RZERO.d; \ + adclt z8.d, zctr0.d, RZERO.d; \ + adclt z9.d, zctr1.d, RZERO.d; \ + adclt z10.d, zctr2.d, RZERO.d; \ + adclt z11.d, zctr3.d, RZERO.d; \ + adclt z12.d, zctr4.d, RZERO.d; \ + adclt z13.d, zctr5.d, RZERO.d; \ + adclt z14.d, zctr6.d, RZERO.d; \ + adclt z15.d, zctr7.d, RZERO.d; \ + trn1 zctr0.d, z8.d, zctr0.d; \ + trn1 zctr1.d, z9.d, zctr1.d; \ + trn1 zctr2.d, z10.d, zctr2.d; \ + trn1 zctr3.d, z11.d, zctr3.d; \ + trn1 zctr4.d, z12.d, zctr4.d; \ + trn1 zctr5.d, z13.d, zctr5.d; \ + trn1 zctr6.d, z14.d, zctr6.d; \ + trn1 zctr7.d, z15.d, zctr7.d; \ + revb zctr0.d, p0/m, zctr0.d; \ + revb zctr1.d, p0/m, zctr1.d; \ + revb zctr2.d, p0/m, zctr2.d; \ + revb zctr3.d, p0/m, zctr3.d; \ + revb zctr4.d, p0/m, zctr4.d; \ + revb zctr5.d, p0/m, zctr5.d; \ + revb zctr6.d, p0/m, zctr6.d; \ + revb zctr7.d, p0/m, zctr7.d; + + +.align 3 +SYM_FUNC_START(sm4_sve_ce_crypt) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * w3: nblocks + */ + uxtw x3, w3 + SM4_PREPARE(x0) + +.Lcrypt_loop_8x: + sub x3, x3, x5, LSR #1 /* x3 - (8 * VL) */ + tbnz x3, #63, .Lcrypt_4x + + ld1b {z0.b}, p0/z, [x2] + ld1b {z1.b}, p0/z, [x2, #1, MUL VL] + ld1b {z2.b}, p0/z, [x2, #2, MUL VL] + ld1b {z3.b}, p0/z, [x2, #3, MUL VL] + ld1b {z4.b}, p0/z, [x2, #4, MUL VL] + ld1b {z5.b}, p0/z, [x2, #5, MUL VL] + ld1b {z6.b}, p0/z, [x2, #6, MUL VL] + ld1b {z7.b}, p0/z, [x2, #7, MUL VL] + + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7) + + st1b {z0.b}, p0, [x1] + st1b {z1.b}, p0, [x1, #1, MUL VL] + st1b {z2.b}, p0, [x1, #2, MUL VL] + st1b {z3.b}, p0, [x1, #3, MUL VL] + st1b {z4.b}, p0, [x1, #4, MUL VL] + st1b {z5.b}, p0, [x1, #5, MUL VL] + st1b {z6.b}, p0, [x1, #6, MUL VL] + st1b {z7.b}, p0, [x1, #7, MUL VL] + + addvl x2, x2, #8 + addvl x1, x1, #8 + + cbz x3, .Lcrypt_end + b .Lcrypt_loop_8x + +.Lcrypt_4x: + add x3, x3, x5, LSR #1 + cmp x3, x5, LSR #2 + blt .Lcrypt_loop_1x + + sub x3, x3, x5, LSR #2 /* x3 - (4 * VL) */ + + ld1b {z0.b}, p0/z, [x2] + ld1b {z1.b}, p0/z, [x2, #1, MUL VL] + ld1b {z2.b}, p0/z, [x2, #2, 
MUL VL] + ld1b {z3.b}, p0/z, [x2, #3, MUL VL] + + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3) + + st1b {z0.b}, p0, [x1] + st1b {z1.b}, p0, [x1, #1, MUL VL] + st1b {z2.b}, p0, [x1, #2, MUL VL] + st1b {z3.b}, p0, [x1, #3, MUL VL] + + addvl x2, x2, #4 + addvl x1, x1, #4 + + cbz x3, .Lcrypt_end + +.Lcrypt_loop_1x: + cmp x3, x5, LSR #4 + blt .Lcrypt_ce_loop_1x + + sub x3, x3, x5, LSR #4 /* x3 - VL */ + + ld1b {z0.b}, p0/z, [x2] + + SM4_SVE_CE_CRYPT_BLK(z0) + + st1b {z0.b}, p0, [x1] + + addvl x2, x2, #1 + addvl x1, x1, #1 + + cbz x3, .Lcrypt_end + b .Lcrypt_loop_1x + +.Lcrypt_ce_loop_1x: + sub x3, x3, #1 + + ld1 {v0.16b}, [x2], #16 + SM4_CE_CRYPT_BLK(v0) + st1 {v0.16b}, [x1], #16 + + cbnz x3, .Lcrypt_ce_loop_1x + +.Lcrypt_end: + ret +SYM_FUNC_END(sm4_sve_ce_crypt) + +.align 3 +SYM_FUNC_START(sm4_sve_ce_cbc_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * w4: nblocks + */ + uxtw x4, w4 + SM4_PREPARE(x0) + + ld1 {RIVv.16b}, [x3] + ext RIV.b, RIV.b, RIV.b, #16 + +.Lcbc_dec_loop_8x: + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */ + tbnz x4, #63, .Lcbc_dec_4x + + ld1b {z15.b}, p0/z, [x2] + ld1b {z14.b}, p0/z, [x2, #1, MUL VL] + ld1b {z13.b}, p0/z, [x2, #2, MUL VL] + ld1b {z12.b}, p0/z, [x2, #3, MUL VL] + ld1b {z11.b}, p0/z, [x2, #4, MUL VL] + ld1b {z10.b}, p0/z, [x2, #5, MUL VL] + ld1b {z9.b}, p0/z, [x2, #6, MUL VL] + ld1b {z8.b}, p0/z, [x2, #7, MUL VL] + rev z0.b, z15.b + rev z1.b, z14.b + rev z2.b, z13.b + rev z3.b, z12.b + rev z4.b, z11.b + rev z5.b, z10.b + rev z6.b, z9.b + rev z7.b, z8.b + rev RTMP0.b, RIV.b + ext z7.b, z7.b, z6.b, #16 + ext z6.b, z6.b, z5.b, #16 + ext z5.b, z5.b, z4.b, #16 + ext z4.b, z4.b, z3.b, #16 + ext z3.b, z3.b, z2.b, #16 + ext z2.b, z2.b, z1.b, #16 + ext z1.b, z1.b, z0.b, #16 + ext z0.b, z0.b, RTMP0.b, #16 + rev z7.b, z7.b + rev z6.b, z6.b + rev z5.b, z5.b + rev z4.b, z4.b + rev z3.b, z3.b + rev z2.b, z2.b + rev z1.b, z1.b + rev z0.b, z0.b + mov RIV.d, z8.d + + SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8) + + eor z0.d, z0.d, z15.d + eor z1.d, z1.d, z14.d + eor z2.d, z2.d, z13.d + eor z3.d, z3.d, z12.d + eor z4.d, z4.d, z11.d + eor z5.d, z5.d, z10.d + eor z6.d, z6.d, z9.d + eor z7.d, z7.d, z8.d + st1b {z0.b}, p0, [x1] + st1b {z1.b}, p0, [x1, #1, MUL VL] + st1b {z2.b}, p0, [x1, #2, MUL VL] + st1b {z3.b}, p0, [x1, #3, MUL VL] + st1b {z4.b}, p0, [x1, #4, MUL VL] + st1b {z5.b}, p0, [x1, #5, MUL VL] + st1b {z6.b}, p0, [x1, #6, MUL VL] + st1b {z7.b}, p0, [x1, #7, MUL VL] + + addvl x2, x2, #8 + addvl x1, x1, #8 + + cbz x4, .Lcbc_dec_end + b .Lcbc_dec_loop_8x + +.Lcbc_dec_4x: + add x4, x4, x5, LSR #1 + cmp x4, x5, LSR #2 + blt .Lcbc_dec_loop_1x + + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */ + + ld1b {z15.b}, p0/z, [x2] + ld1b {z14.b}, p0/z, [x2, #1, MUL VL] + ld1b {z13.b}, p0/z, [x2, #2, MUL VL] + ld1b {z12.b}, p0/z, [x2, #3, MUL VL] + rev z0.b, z15.b + rev z1.b, z14.b + rev z2.b, z13.b + rev z3.b, z12.b + rev RTMP0.b, RIV.b + ext z3.b, z3.b, z2.b, #16 + ext z2.b, z2.b, z1.b, #16 + ext z1.b, z1.b, z0.b, #16 + ext z0.b, z0.b, RTMP0.b, #16 + rev z3.b, z3.b + rev z2.b, z2.b + rev z1.b, z1.b + rev z0.b, z0.b + mov RIV.d, z12.d + + SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12) + + eor z0.d, z0.d, z15.d + eor z1.d, z1.d, z14.d + eor z2.d, z2.d, z13.d + eor z3.d, z3.d, z12.d + st1b {z0.b}, p0, [x1] + st1b {z1.b}, p0, [x1, #1, MUL VL] + st1b {z2.b}, p0, [x1, #2, MUL VL] + st1b {z3.b}, p0, [x1, #3, MUL VL] + + addvl x2, x2, #4 + addvl x1, x1, #4 + + cbz x4, .Lcbc_dec_end + +.Lcbc_dec_loop_1x: + cmp x4, x5, LSR #4 + blt 
.Lcbc_dec_ce + + sub x4, x4, x5, LSR #4 /* x4 - VL */ + + ld1b {z15.b}, p0/z, [x2] + rev RTMP0.b, RIV.b + rev z0.b, z15.b + ext z0.b, z0.b, RTMP0.b, #16 + rev z0.b, z0.b + mov RIV.d, z15.d + + SM4_SVE_CE_CRYPT_BLK(z15) + + eor z0.d, z0.d, z15.d + st1b {z0.b}, p0, [x1] + + addvl x2, x2, #1 + addvl x1, x1, #1 + + cbz x4, .Lcbc_dec_end + b .Lcbc_dec_loop_1x + +.Lcbc_dec_ce: + rev RIV.s, RIV.s + tbl RIV.b, {RIV.b}, RSWAP128.b + +.Lcbc_dec_ce_loop_1x: + sub x4, x4, #1 + + ld1 {v15.16b}, [x2], #16 + mov v0.16b, RIVv.16b + mov RIVv.16b, v15.16b + SM4_CE_CRYPT_BLK(v15) + eor v0.16b, v0.16b, v15.16b + st1 {v0.16b}, [x1], #16 + + cbnz x4, .Lcbc_dec_ce_loop_1x + + ext RIV.b, RIV.b, RIV.b, #16 + +.Lcbc_dec_end: + /* store new IV */ + rev RIV.s, RIV.s + tbl RIV.b, {RIV.b}, RSWAP128.b + st1 {RIVv.16b}, [x3] + + ret +SYM_FUNC_END(sm4_sve_ce_cbc_dec) + +.align 3 +SYM_FUNC_START(sm4_sve_ce_cfb_dec) + /* input: + * x0: round key array, CTX + * x1: dst + * x2: src + * x3: iv (big endian, 128 bit) + * w4: nblocks + */ + uxtw x4, w4 + SM4_PREPARE(x0) + + ld1 {RIVv.16b}, [x3] + ext RIV.b, RIV.b, RIV.b, #16 + +.Lcfb_dec_loop_8x: + sub x4, x4, x5, LSR #1 /* x4 - (8 * VL) */ + tbnz x4, #63, .Lcfb_dec_4x + + ld1b {z15.b}, p0/z, [x2] + ld1b {z14.b}, p0/z, [x2, #1, MUL VL] + ld1b {z13.b}, p0/z, [x2, #2, MUL VL] + ld1b {z12.b}, p0/z, [x2, #3, MUL VL] + ld1b {z11.b}, p0/z, [x2, #4, MUL VL] + ld1b {z10.b}, p0/z, [x2, #5, MUL VL] + ld1b {z9.b}, p0/z, [x2, #6, MUL VL] + ld1b {z8.b}, p0/z, [x2, #7, MUL VL] + rev z0.b, z15.b + rev z1.b, z14.b + rev z2.b, z13.b + rev z3.b, z12.b + rev z4.b, z11.b + rev z5.b, z10.b + rev z6.b, z9.b + rev z7.b, z8.b + rev RTMP0.b, RIV.b + ext z7.b, z7.b, z6.b, #16 + ext z6.b, z6.b, z5.b, #16 + ext z5.b, z5.b, z4.b, #16 + ext z4.b, z4.b, z3.b, #16 + ext z3.b, z3.b, z2.b, #16 + ext z2.b, z2.b, z1.b, #16 + ext z1.b, z1.b, z0.b, #16 + ext z0.b, z0.b, RTMP0.b, #16 + rev z7.b, z7.b + rev z6.b, z6.b + rev z5.b, z5.b + rev z4.b, z4.b + rev z3.b, z3.b + rev z2.b, z2.b + rev z1.b, z1.b + rev z0.b, z0.b + mov RIV.d, z8.d + + SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7) + + eor z0.d, z0.d, z15.d + eor z1.d, z1.d, z14.d + eor z2.d, z2.d, z13.d + eor z3.d, z3.d, z12.d + eor z4.d, z4.d, z11.d + eor z5.d, z5.d, z10.d + eor z6.d, z6.d, z9.d + eor z7.d, z7.d, z8.d + st1b {z0.b}, p0, [x1] + st1b {z1.b}, p0, [x1, #1, MUL VL] + st1b {z2.b}, p0, [x1, #2, MUL VL] + st1b {z3.b}, p0, [x1, #3, MUL VL] + st1b {z4.b}, p0, [x1, #4, MUL VL] + st1b {z5.b}, p0, [x1, #5, MUL VL] + st1b {z6.b}, p0, [x1, #6, MUL VL] + st1b {z7.b}, p0, [x1, #7, MUL VL] + + addvl x2, x2, #8 + addvl x1, x1, #8 + + cbz x4, .Lcfb_dec_end + b .Lcfb_dec_loop_8x + +.Lcfb_dec_4x: + add x4, x4, x5, LSR #1 + cmp x4, x5, LSR #2 + blt .Lcfb_dec_loop_1x + + sub x4, x4, x5, LSR #2 /* x4 - (4 * VL) */ + + ld1b {z15.b}, p0/z, [x2] + ld1b {z14.b}, p0/z, [x2, #1, MUL VL] + ld1b {z13.b}, p0/z, [x2, #2, MUL VL] + ld1b {z12.b}, p0/z, [x2, #3, MUL VL] + rev z0.b, z15.b + rev z1.b, z14.b + rev z2.b, z13.b + rev z3.b, z12.b + rev RTMP0.b, RIV.b + ext z3.b, z3.b, z2.b, #16 + ext z2.b, z2.b, z1.b, #16 + ext z1.b, z1.b, z0.b, #16 + ext z0.b, z0.b, RTMP0.b, #16 + rev z3.b, z3.b + rev z2.b, z2.b + rev z1.b, z1.b + rev z0.b, z0.b + mov RIV.d, z12.d + + SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3) + + eor z0.d, z0.d, z15.d + eor z1.d, z1.d, z14.d + eor z2.d, z2.d, z13.d + eor z3.d, z3.d, z12.d + st1b {z0.b}, p0, [x1] + st1b {z1.b}, p0, [x1, #1, MUL VL] + st1b {z2.b}, p0, [x1, #2, MUL VL] + st1b {z3.b}, p0, [x1, #3, MUL VL] + + addvl x2, x2, #4 + addvl x1, x1, #4 + + cbz x4, 
.Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcfb_dec_ce
+
+	sub		x4, x4, x5, LSR #4	/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_1x
+
+.Lcfb_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcfb_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcfb_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cfb_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	dup		RZERO.d, #0
+	adr_l		x6, .Lle128_inc
+	ld1b		{RLE128_INC.b}, p0/z, [x6]
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+.Lctr_loop_8x:
+	sub		x4, x4, x5, LSR #1	/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lctr_4x
+
+	inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z14.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z15.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+	eor		z4.d, z4.d, z12.d
+	eor		z5.d, z5.d, z13.d
+	eor		z6.d, z6.d, z14.d
+	eor		z7.d, z7.d, z15.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_8x
+
+.Lctr_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lctr_loop_1x
+
+	sub		x4, x4, x5, LSR #2	/* x4 - (4 * VL) */
+
+	inc_le128_4x(z0, z1, z2, z3)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lctr_end
+
+.Lctr_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lctr_ce_loop_1x
+
+	sub		x4, x4, x5, LSR #4	/* x4 - VL */
+
+	inc_le128(z0)
+	ld1b		{z8.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_1x
+
+.Lctr_ce_loop_1x:
+	sub		x4, x4, #1
+
+	/* inc_le128 for CE */
+	mov		v0.d[1], x8
+	mov		v0.d[0], x7
+	adds		x8, x8, #1
+	rev64		v0.16b, v0.16b
+	adc		x7, x7, xzr
+
+	ld1		{v8.16b}, [x2], #16
+
+	SM4_CE_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lctr_ce_loop_1x
+
+.Lctr_end:
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_get_vl)
+	/* VL in bytes */
+	rdvl		x0, #1
+
+	ret
+SYM_FUNC_END(sm4_sve_get_vl)
+
+
+	.section	".rodata", "a"
+	.align 4
+.Lbswap128_mask:
+	.byte	0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+	.byte	0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+	.byte	0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
+	.byte	0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
+	.byte	0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
+	.byte	0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
+	.byte	0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
+	.byte	0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
+	.byte	0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
+	.byte	0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
+	.byte	0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
+	.byte	0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
+	.byte	0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
+	.byte	0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
+	.byte	0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
+	.byte	0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
+	.byte	0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
+	.byte	0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
+	.byte	0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
+	.byte	0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
+	.byte	0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
+	.byte	0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
+	.byte	0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
+	.byte	0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
+	.byte	0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
+	.byte	0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
+	.byte	0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
+	.byte	0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
+	.byte	0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
+	.byte	0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
+	.byte	0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
+	.byte	0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
+
+.Lle128_inc:
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
new file mode 100644
index 000000000000..fc797b72b5f0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
@@ -0,0 +1,332 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "sm4-ce.h"
+
+asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
+				 const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nblocks);
+asmlinkage unsigned int sm4_sve_get_vl(void);
+
+
+static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
+
+static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_crypt(rkey, dst, src, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_enc);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_dec);
+}
+
+static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
+		     void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int cbc_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
+}
+
+static int cbc_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
+}
+
+static int cfb_crypt(struct skcipher_request *req,
+		     void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cfb_crypt(ctx->rkey_enc, dst, src,
+				      walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static int cfb_encrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_ce_cfb_enc);
+}
+
+static int cfb_decrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_sve_ce_cfb_dec);
+}
+
+static int ctr_crypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
+					     walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_inc(walk.iv, SM4_BLOCK_SIZE);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static struct skcipher_alg sm4_algs[] = {
+	{
+		.base = {
+			.cra_name = "ecb(sm4)",
+			.cra_driver_name = "ecb-sm4-sve-ce",
+			.cra_priority = 500,
+			.cra_blocksize = SM4_BLOCK_SIZE,
+			.cra_ctxsize = sizeof(struct sm4_ctx),
+			.cra_module = THIS_MODULE,
+		},
+		.min_keysize = SM4_KEY_SIZE,
+		.max_keysize = SM4_KEY_SIZE,
+		.setkey = sm4_setkey,
+		.encrypt = ecb_encrypt,
+		.decrypt = ecb_decrypt,
+	}, {
+		.base = {
+			.cra_name = "cbc(sm4)",
+			.cra_driver_name = "cbc-sm4-sve-ce",
+			.cra_priority = 500,
+			.cra_blocksize = SM4_BLOCK_SIZE,
+			.cra_ctxsize = sizeof(struct sm4_ctx),
+			.cra_module = THIS_MODULE,
+		},
+		.min_keysize = SM4_KEY_SIZE,
+		.max_keysize = SM4_KEY_SIZE,
+		.ivsize = SM4_BLOCK_SIZE,
+		.setkey = sm4_setkey,
+		.encrypt = cbc_encrypt,
+		.decrypt = cbc_decrypt,
+	}, {
+		.base = {
+			.cra_name = "cfb(sm4)",
+			.cra_driver_name = "cfb-sm4-sve-ce",
+			.cra_priority = 500,
+			.cra_blocksize = 1,
+			.cra_ctxsize = sizeof(struct sm4_ctx),
+			.cra_module = THIS_MODULE,
+		},
+		.min_keysize = SM4_KEY_SIZE,
+		.max_keysize = SM4_KEY_SIZE,
+		.ivsize = SM4_BLOCK_SIZE,
+		.chunksize = SM4_BLOCK_SIZE,
+		.setkey = sm4_setkey,
+		.encrypt = cfb_encrypt,
+		.decrypt = cfb_decrypt,
+	}, {
+		.base = {
+			.cra_name = "ctr(sm4)",
+			.cra_driver_name = "ctr-sm4-sve-ce",
+			.cra_priority = 500,
+			.cra_blocksize = 1,
+			.cra_ctxsize = sizeof(struct sm4_ctx),
+			.cra_module = THIS_MODULE,
+		},
+		.min_keysize = SM4_KEY_SIZE,
+		.max_keysize = SM4_KEY_SIZE,
+		.ivsize = SM4_BLOCK_SIZE,
+		.chunksize = SM4_BLOCK_SIZE,
+		.setkey = sm4_setkey,
+		.encrypt = ctr_crypt,
+		.decrypt = ctr_crypt,
+	}
+};
+
+static int __init sm4_sve_ce_init(void)
+{
+	if (sm4_sve_get_vl() <= 16)
+		return -ENODEV;
+
+	return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+static void __exit sm4_sve_ce_exit(void)
+{
+	crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
+module_exit(sm4_sve_ce_exit);
+
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
+MODULE_ALIAS_CRYPTO("sm4-sve-ce");
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("ecb(sm4)");
+MODULE_ALIAS_CRYPTO("cbc(sm4)");
+MODULE_ALIAS_CRYPTO("cfb(sm4)");
+MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_AUTHOR("Tianjia Zhang ");
+MODULE_LICENSE("GPL v2");
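
Note on usage: the routines exported by the assembly file are never called directly by other kernel code; consumers go through the generic skcipher API and pick up the "ctr-sm4-sve-ce"/"cbc-sm4-sve-ce" drivers automatically once this module is registered at priority 500. The snippet below is an illustrative sketch only and is not part of this patch: the function name sm4_ctr_demo_encrypt and its calling context are invented for the example, while the crypto API calls themselves (crypto_alloc_skcipher(), crypto_wait_req(), etc.) are the standard in-kernel interface.

/* Illustrative in-kernel consumer of "ctr(sm4)"; not part of this patch.
 * buf must be linearly mapped memory (not stack), len in bytes, iv is
 * updated in place to the next counter value.
 */
#include <linux/crypto.h>
#include <linux/scatterlist.h>
#include <crypto/skcipher.h>
#include <crypto/sm4.h>

static int sm4_ctr_demo_encrypt(const u8 key[SM4_KEY_SIZE],
				u8 iv[SM4_BLOCK_SIZE],
				u8 *buf, unsigned int len)
{
	struct crypto_skcipher *tfm;
	struct skcipher_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	/* Resolves to the highest-priority ctr(sm4) implementation. */
	tfm = crypto_alloc_skcipher("ctr(sm4)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_skcipher_setkey(tfm, key, SM4_KEY_SIZE);
	if (err)
		goto out_free_tfm;

	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out_free_tfm;
	}

	sg_init_one(&sg, buf, len);
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
				      CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, len, iv);

	/* Synchronous one-shot encryption in place. */
	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

	skcipher_request_free(req);
out_free_tfm:
	crypto_free_skcipher(tfm);
	return err;
}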