
[16/16] crypto: arm64/sm4 - add ARMv9 SVE cryptography acceleration implementation

Message ID 20220926093620.99898-17-tianjia.zhang@linux.alibaba.com (mailing list archive)
State Superseded
Delegated to: Herbert Xu
Series Optimizing SM3 and SM4 algorithms using NEON/CE/SVE instructions

Commit Message

tianjia.zhang Sept. 26, 2022, 9:36 a.m. UTC
Scalable Vector Extension (SVE) is the next-generation SIMD extension for
arm64. SVE allows implementations to choose a flexible vector length: it
can vary from a minimum of 128 bits up to a maximum of 2048 bits, in
128-bit increments. The SVE design guarantees that the same application
can run on different implementations that support SVE, without the need
to recompile the code.
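
For illustration only, here is a minimal vector-length-agnostic loop
written in C with the ACLE SVE intrinsics (not part of this patch;
xor_bytes_sve() is a made-up example). The same binary adapts to whatever
VL the hardware provides, because svcntb() and the while-predicates query
it at run time:

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] ^ b[i] for n bytes, VL-agnostic (compile with +sve) */
static void xor_bytes_sve(uint8_t *dst, const uint8_t *a,
                          const uint8_t *b, size_t n)
{
        for (size_t i = 0; i < n; i += svcntb()) {
                /* the predicate also covers a partial tail in the last pass */
                svbool_t pg = svwhilelt_b8_u64(i, n);
                svuint8_t va = svld1_u8(pg, a + i);
                svuint8_t vb = svld1_u8(pg, b + i);

                svst1_u8(pg, dst + i, sveor_u8_x(pg, va, vb));
        }
}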

SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to
extend and improve it. Similar to the Crypto Extension available to the
NEON instruction set, SVE also provides cryptography acceleration
instructions, and these are likewise an optional part of the instruction
set.
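
Because these instructions are optional, their presence has to be probed
at run time. As a hedged illustration (not part of this patch, and
assuming the arm64 Linux uapi hwcap definitions), userspace can check the
corresponding hwcap bits; the kernel driver would key off the matching
cpufeature instead:

#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>      /* HWCAP2_SVE2, HWCAP2_SVESM4 (arm64 uapi) */

int main(void)
{
        unsigned long hwcap2 = getauxval(AT_HWCAP2);

        printf("SVE2    : %s\n", (hwcap2 & HWCAP2_SVE2) ? "yes" : "no");
        printf("SVE SM4 : %s\n", (hwcap2 & HWCAP2_SVESM4) ? "yes" : "no");
        return 0;
}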

This patch uses the SM4 cryptography acceleration instructions together
with SVE2 instructions to optimize the SM4 algorithm for the
ECB/CBC/CFB/CTR modes. Since CBC/CFB encryption cannot be parallelized
across blocks, the Crypto Extension instructions are used for those
paths.
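
The serialization is inherent to the mode, as the sketch below shows (not
the driver code; sm4_encrypt_block() is a hypothetical single-block
primitive): every ciphertext block is an input to the next encryption, so
the blocks form a strict dependency chain. Decryption has no such chain,
which is why only the decryption paths take the wide SVE route.

#include <stdint.h>
#include <string.h>

void sm4_encrypt_block(uint8_t out[16], const uint8_t in[16]); /* hypothetical */

static void cbc_encrypt(uint8_t *dst, const uint8_t *src,
                        uint8_t iv[16], unsigned int nblocks)
{
        const uint8_t *prev = iv;                    /* C[-1] = IV */

        for (unsigned int i = 0; i < nblocks; i++) {
                uint8_t tmp[16];

                for (int j = 0; j < 16; j++)
                        tmp[j] = src[i * 16 + j] ^ prev[j];   /* P[i] ^ C[i-1] */
                sm4_encrypt_block(dst + i * 16, tmp);         /* needs C[i-1] first */
                prev = dst + i * 16;                          /* feeds block i+1 */
        }
        if (nblocks)
                memcpy(iv, prev, 16);                /* hand back the new IV */
}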

Since no test environment with a Vector Length (VL) greater than 128
bits was available, the performance data was obtained on a machine whose
VL is 128 bits. Because this driver is intended for use when the VL is
greater than 128 bits, these numbers are for reference only. The data
shows little difference between the Crypto Extension implementation and
the SVE implementation at VL = 128 bits; the optimization is expected to
become more visible at VL = 256 bits or longer.
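
As a rough sanity check on that expectation (back-of-the-envelope
arithmetic, not measured data): an SM4 block is 128 bits, so each Z
register holds VL/128 blocks and the 8-register main loop processes
8 * VL/128 blocks per iteration, e.g.:

/* 8 blocks at VL=128 (same as sm4-ce), 16 at VL=256, 128 at VL=2048 */
static inline unsigned int sm4_blocks_per_8x_iter(unsigned int vl_bits)
{
        return 8 * (vl_bits / 128);
}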

Benchmarked on a T-Head Yitian-710 at 2.75 GHz. The data comes from
tcrypt mode 218 and is compared against the Crypto Extension
implementation. The column headers are block sizes in bytes and the unit
is Mb/s:

sm4-ce      |      16       64      128      256     1024     1420     4096
------------+--------------------------------------------------------------
    ECB enc |  315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
    ECB dec |  316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
    CBC enc |  304.82   629.54   768.65   864.72   953.90   963.32   974.06
    CBC dec |  306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
    CFB enc |  309.48   635.70   774.44   865.85   950.62   952.68   968.24
    CFB dec |  315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
    CTR enc |  285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
    CTR dec |  285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62

sm4-sve-ce (VL = 128 bits)
            |      16       64      128      256     1024     1420     4096
------------+--------------------------------------------------------------
    ECB enc |  310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
    ECB dec |  315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
    CBC enc |  303.44   622.65   764.67   861.40   953.18   963.05   973.77
    CBC dec |  302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
    CFB enc |  296.62   620.41   762.94   858.96   948.18   956.04   967.67
    CFB dec |  291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
    CTR enc |  272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
    CTR dec |  273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
 arch/arm64/crypto/Kconfig           |   19 +
 arch/arm64/crypto/Makefile          |    3 +
 arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
 arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++++
 4 files changed, 1382 insertions(+)
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
 create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c

Comments

Ard Biesheuvel Sept. 26, 2022, 10:02 a.m. UTC | #1
(cc Mark Brown)

Hello Tianjia,

On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
<tianjia.zhang@linux.alibaba.com> wrote:
>
> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
> arm64. SVE allows implementations to choose a flexible vector length: it
> can vary from a minimum of 128 bits up to a maximum of 2048 bits, in
> 128-bit increments. The SVE design guarantees that the same application
> can run on different implementations that support SVE, without the need
> to recompile the code.
>
> SVE was originally introduced in ARMv8, and ARMv9 introduced SVE2 to
> extend and improve it. Similar to the Crypto Extension available to the
> NEON instruction set, SVE also provides cryptography acceleration
> instructions, and these are likewise an optional part of the instruction
> set.
>
> This patch uses the SM4 cryptography acceleration instructions together
> with SVE2 instructions to optimize the SM4 algorithm for the
> ECB/CBC/CFB/CTR modes. Since CBC/CFB encryption cannot be parallelized
> across blocks, the Crypto Extension instructions are used for those
> paths.
>

Given that we currently do not support the use of SVE in kernel mode,
this patch cannot be accepted at this time (but the rest of the series
looks reasonable to me, although I have only skimmed over the patches).

In view of the disappointing benchmark results below, I don't think
this is worth the hassle at the moment. If we can find a case where
using SVE in kernel mode truly makes a [favorable] difference, we can
revisit this, but not without a thorough analysis of the impact that
supporting SVE in the kernel will have. Also, the fact that SVE may
also cover cryptographic extensions does not necessarily imply that a
micro-architecture will perform those crypto transformations in
parallel, so the performance may be the same even if VL > 128.

In summary, please drop this patch for now, and once there are more
encouraging performance numbers, please resubmit it as part of a
series that explicitly enables SVE in kernel mode on arm64, and
documents the requirements and constraints.

I have cc'ed Mark, who has been working on the SVE support and might
have something to add here as well.

Thanks,
Ard.



> Since no test environment with a Vector Length (VL) greater than 128
> bits was available, the performance data was obtained on a machine whose
> VL is 128 bits. Because this driver is intended for use when the VL is
> greater than 128 bits, these numbers are for reference only. The data
> shows little difference between the Crypto Extension implementation and
> the SVE implementation at VL = 128 bits; the optimization is expected to
> become more visible at VL = 256 bits or longer.
>
> Benchmarked on a T-Head Yitian-710 at 2.75 GHz. The data comes from
> tcrypt mode 218 and is compared against the Crypto Extension
> implementation. The column headers are block sizes in bytes and the unit
> is Mb/s:
>
> sm4-ce      |      16       64      128      256     1024     1420     4096
> ------------+--------------------------------------------------------------
>     ECB enc |  315.18  1162.65  1815.66  2553.50  3692.91  3727.20  4001.93
>     ECB dec |  316.06  1172.97  1817.81  2554.66  3692.18  3786.54  4001.93
>     CBC enc |  304.82   629.54   768.65   864.72   953.90   963.32   974.06
>     CBC dec |  306.05  1142.53  1805.11  2481.67  3522.06  3587.87  3790.99
>     CFB enc |  309.48   635.70   774.44   865.85   950.62   952.68   968.24
>     CFB dec |  315.98  1170.38  1828.75  2509.72  3543.63  3539.40  3793.25
>     CTR enc |  285.83  1036.59  1583.50  2147.26  2933.54  2954.66  3041.14
>     CTR dec |  285.29  1037.47  1584.67  2145.51  2934.10  2950.89  3041.62
>
> sm4-sve-ce (VL = 128 bits)
>             |      16       64      128      256     1024     1420     4096
> ------------+--------------------------------------------------------------
>     ECB enc |  310.00  1154.70  1813.26  2579.74  3766.90  3869.45  4100.26
>     ECB dec |  315.60  1176.22  1838.06  2593.69  3774.95  3878.42  4098.83
>     CBC enc |  303.44   622.65   764.67   861.40   953.18   963.05   973.77
>     CBC dec |  302.13  1091.15  1689.10  2267.79  3182.84  3242.68  3408.92
>     CFB enc |  296.62   620.41   762.94   858.96   948.18   956.04   967.67
>     CFB dec |  291.23  1065.50  1637.33  2228.12  3158.52  3213.35  3403.83
>     CTR enc |  272.27   959.35  1466.34  1934.24  2562.80  2595.87  2695.15
>     CTR dec |  273.40   963.65  1471.83  1938.97  2563.12  2597.25  2694.54
>
> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> ---
>  arch/arm64/crypto/Kconfig           |   19 +
>  arch/arm64/crypto/Makefile          |    3 +
>  arch/arm64/crypto/sm4-sve-ce-core.S | 1028 +++++++++++++++++++++++++++
>  arch/arm64/crypto/sm4-sve-ce-glue.c |  332 +++++++++
>  4 files changed, 1382 insertions(+)
>  create mode 100644 arch/arm64/crypto/sm4-sve-ce-core.S
>  create mode 100644 arch/arm64/crypto/sm4-sve-ce-glue.c
>
> diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
> index 6793d5bc3ee5..bbb5a7a08af5 100644
> --- a/arch/arm64/crypto/Kconfig
> +++ b/arch/arm64/crypto/Kconfig
> @@ -249,6 +249,25 @@ config CRYPTO_SM4_ARM64_CE_BLK
>           - ARMv8 Crypto Extensions
>           - NEON (Advanced SIMD) extensions
>
> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> +       depends on KERNEL_MODE_NEON
> +       select CRYPTO_SKCIPHER
> +       select CRYPTO_SM4
> +       select CRYPTO_SM4_ARM64_CE_BLK
> +       help
> +         Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
> +         with block cipher modes:
> +         - ECB (Electronic Codebook) mode (NIST SP800-38A)
> +         - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
> +         - CFB (Cipher Feedback) mode (NIST SP800-38A)
> +         - CTR (Counter) mode (NIST SP800-38A)
> +
> +         Architecture: arm64 using:
> +         - ARMv8 Crypto Extensions
> +         - ARMv9 cryptography acceleration with SVE2
> +         - NEON (Advanced SIMD) extensions
> +
>  config CRYPTO_SM4_ARM64_NEON_BLK
>         tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
>         depends on KERNEL_MODE_NEON
> diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
> index 4818e204c2ac..355dd9053434 100644
> --- a/arch/arm64/crypto/Makefile
> +++ b/arch/arm64/crypto/Makefile
> @@ -38,6 +38,9 @@ sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
>  obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
>  sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
>
> +obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
> +sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
> +
>  obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
>  ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
>
> diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
> new file mode 100644
> index 000000000000..caecbdf2536c
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-core.S
> @@ -0,0 +1,1028 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/assembler.h>
> +
> +.arch  armv8-a+crypto+sve+sve2
> +
> +.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
> +       .set .Lv\b\().4s, \b
> +.endr
> +
> +.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
> +               16, 24, 25, 26, 27, 28, 29, 30, 31
> +       .set .Lz\b\().s, \b
> +.endr
> +
> +.macro sm4e, vd, vn
> +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> +.endm
> +
> +.macro sm4e_sve, zd, zn
> +       .inst 0x4523e000 | (.L\zn << 5) | .L\zd
> +.endm
> +
> +
> +/* Register macros */
> +
> +#define RCTR        z16
> +#define RCTRv       v16
> +#define RIV         z16
> +#define RIVv        v16
> +#define RSWAP128    z17
> +#define RZERO       z18
> +#define RLE128_INC  z19
> +
> +#define RTMP0       z20
> +#define RTMP0v      v20
> +#define RTMP1       z21
> +#define RTMP2       z22
> +#define RTMP3       z23
> +
> +
> +/* Helper macros. */
> +
> +#define SM4_PREPARE(ptr)                                       \
> +               adr_l           x7, .Lbswap128_mask;            \
> +               ptrue           p0.b, ALL;                      \
> +               rdvl            x5, #1;                         \
> +               ld1b            {RSWAP128.b}, p0/z, [x7];       \
> +                                                               \
> +               ld1             {v24.16b-v27.16b}, [ptr], #64;  \
> +               ld1             {v28.16b-v31.16b}, [ptr];       \
> +               dup             z24.q, z24.q[0];                \
> +               dup             z25.q, z25.q[0];                \
> +               dup             z26.q, z26.q[0];                \
> +               dup             z27.q, z27.q[0];                \
> +               dup             z28.q, z28.q[0];                \
> +               dup             z29.q, z29.q[0];                \
> +               dup             z30.q, z30.q[0];                \
> +               dup             z31.q, z31.q[0];
> +
> +#define SM4_SVE_CE_CRYPT_BLK(b0)                               \
> +               revb            b0.s, p0/m, b0.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3)                  \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b1.s, z24.s;                    \
> +               sm4e_sve        b2.s, z24.s;                    \
> +               sm4e_sve        b3.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b1.s, z25.s;                    \
> +               sm4e_sve        b2.s, z25.s;                    \
> +               sm4e_sve        b3.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b1.s, z26.s;                    \
> +               sm4e_sve        b2.s, z26.s;                    \
> +               sm4e_sve        b3.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b1.s, z27.s;                    \
> +               sm4e_sve        b2.s, z27.s;                    \
> +               sm4e_sve        b3.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b1.s, z28.s;                    \
> +               sm4e_sve        b2.s, z28.s;                    \
> +               sm4e_sve        b3.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b1.s, z29.s;                    \
> +               sm4e_sve        b2.s, z29.s;                    \
> +               sm4e_sve        b3.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b1.s, z30.s;                    \
> +               sm4e_sve        b2.s, z30.s;                    \
> +               sm4e_sve        b3.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               sm4e_sve        b1.s, z31.s;                    \
> +               sm4e_sve        b2.s, z31.s;                    \
> +               sm4e_sve        b3.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               tbl             b1.b, {b1.b}, RSWAP128.b;       \
> +               tbl             b2.b, {b2.b}, RSWAP128.b;       \
> +               tbl             b3.b, {b3.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;
> +
> +#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)  \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               revb            b4.s, p0/m, b4.s;               \
> +               revb            b5.s, p0/m, b5.s;               \
> +               revb            b6.s, p0/m, b6.s;               \
> +               revb            b7.s, p0/m, b7.s;               \
> +               sm4e_sve        b0.s, z24.s;                    \
> +               sm4e_sve        b1.s, z24.s;                    \
> +               sm4e_sve        b2.s, z24.s;                    \
> +               sm4e_sve        b3.s, z24.s;                    \
> +               sm4e_sve        b4.s, z24.s;                    \
> +               sm4e_sve        b5.s, z24.s;                    \
> +               sm4e_sve        b6.s, z24.s;                    \
> +               sm4e_sve        b7.s, z24.s;                    \
> +               sm4e_sve        b0.s, z25.s;                    \
> +               sm4e_sve        b1.s, z25.s;                    \
> +               sm4e_sve        b2.s, z25.s;                    \
> +               sm4e_sve        b3.s, z25.s;                    \
> +               sm4e_sve        b4.s, z25.s;                    \
> +               sm4e_sve        b5.s, z25.s;                    \
> +               sm4e_sve        b6.s, z25.s;                    \
> +               sm4e_sve        b7.s, z25.s;                    \
> +               sm4e_sve        b0.s, z26.s;                    \
> +               sm4e_sve        b1.s, z26.s;                    \
> +               sm4e_sve        b2.s, z26.s;                    \
> +               sm4e_sve        b3.s, z26.s;                    \
> +               sm4e_sve        b4.s, z26.s;                    \
> +               sm4e_sve        b5.s, z26.s;                    \
> +               sm4e_sve        b6.s, z26.s;                    \
> +               sm4e_sve        b7.s, z26.s;                    \
> +               sm4e_sve        b0.s, z27.s;                    \
> +               sm4e_sve        b1.s, z27.s;                    \
> +               sm4e_sve        b2.s, z27.s;                    \
> +               sm4e_sve        b3.s, z27.s;                    \
> +               sm4e_sve        b4.s, z27.s;                    \
> +               sm4e_sve        b5.s, z27.s;                    \
> +               sm4e_sve        b6.s, z27.s;                    \
> +               sm4e_sve        b7.s, z27.s;                    \
> +               sm4e_sve        b0.s, z28.s;                    \
> +               sm4e_sve        b1.s, z28.s;                    \
> +               sm4e_sve        b2.s, z28.s;                    \
> +               sm4e_sve        b3.s, z28.s;                    \
> +               sm4e_sve        b4.s, z28.s;                    \
> +               sm4e_sve        b5.s, z28.s;                    \
> +               sm4e_sve        b6.s, z28.s;                    \
> +               sm4e_sve        b7.s, z28.s;                    \
> +               sm4e_sve        b0.s, z29.s;                    \
> +               sm4e_sve        b1.s, z29.s;                    \
> +               sm4e_sve        b2.s, z29.s;                    \
> +               sm4e_sve        b3.s, z29.s;                    \
> +               sm4e_sve        b4.s, z29.s;                    \
> +               sm4e_sve        b5.s, z29.s;                    \
> +               sm4e_sve        b6.s, z29.s;                    \
> +               sm4e_sve        b7.s, z29.s;                    \
> +               sm4e_sve        b0.s, z30.s;                    \
> +               sm4e_sve        b1.s, z30.s;                    \
> +               sm4e_sve        b2.s, z30.s;                    \
> +               sm4e_sve        b3.s, z30.s;                    \
> +               sm4e_sve        b4.s, z30.s;                    \
> +               sm4e_sve        b5.s, z30.s;                    \
> +               sm4e_sve        b6.s, z30.s;                    \
> +               sm4e_sve        b7.s, z30.s;                    \
> +               sm4e_sve        b0.s, z31.s;                    \
> +               sm4e_sve        b1.s, z31.s;                    \
> +               sm4e_sve        b2.s, z31.s;                    \
> +               sm4e_sve        b3.s, z31.s;                    \
> +               sm4e_sve        b4.s, z31.s;                    \
> +               sm4e_sve        b5.s, z31.s;                    \
> +               sm4e_sve        b6.s, z31.s;                    \
> +               sm4e_sve        b7.s, z31.s;                    \
> +               tbl             b0.b, {b0.b}, RSWAP128.b;       \
> +               tbl             b1.b, {b1.b}, RSWAP128.b;       \
> +               tbl             b2.b, {b2.b}, RSWAP128.b;       \
> +               tbl             b3.b, {b3.b}, RSWAP128.b;       \
> +               tbl             b4.b, {b4.b}, RSWAP128.b;       \
> +               tbl             b5.b, {b5.b}, RSWAP128.b;       \
> +               tbl             b6.b, {b6.b}, RSWAP128.b;       \
> +               tbl             b7.b, {b7.b}, RSWAP128.b;       \
> +               revb            b0.s, p0/m, b0.s;               \
> +               revb            b1.s, p0/m, b1.s;               \
> +               revb            b2.s, p0/m, b2.s;               \
> +               revb            b3.s, p0/m, b3.s;               \
> +               revb            b4.s, p0/m, b4.s;               \
> +               revb            b5.s, p0/m, b5.s;               \
> +               revb            b6.s, p0/m, b6.s;               \
> +               revb            b7.s, p0/m, b7.s;
> +
> +#define SM4_CE_CRYPT_BLK(b0)                                   \
> +               rev32           b0.16b, b0.16b;                 \
> +               sm4e            b0.4s, v24.4s;                  \
> +               sm4e            b0.4s, v25.4s;                  \
> +               sm4e            b0.4s, v26.4s;                  \
> +               sm4e            b0.4s, v27.4s;                  \
> +               sm4e            b0.4s, v28.4s;                  \
> +               sm4e            b0.4s, v29.4s;                  \
> +               sm4e            b0.4s, v30.4s;                  \
> +               sm4e            b0.4s, v31.4s;                  \
> +               rev64           b0.4s, b0.4s;                   \
> +               ext             b0.16b, b0.16b, b0.16b, #8;     \
> +               rev32           b0.16b, b0.16b;
> +
> +#define inc_le128(zctr)                                                \
> +               mov             RCTRv.d[1], x8;                 \
> +               mov             RCTRv.d[0], x7;                 \
> +               mov             zctr.d, RLE128_INC.d;           \
> +               dup             RCTR.q, RCTR.q[0];              \
> +               adds            x8, x8, x5, LSR #4;             \
> +               adclt           zctr.d, RCTR.d, RZERO.d;        \
> +               adclt           RCTR.d, zctr.d, RZERO.d;        \
> +               adc             x7, x7, xzr;                    \
> +               trn1            zctr.d, RCTR.d, zctr.d;         \
> +               revb            zctr.d, p0/m, zctr.d;
> +
> +#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3)               \
> +               mov             v8.d[1], x8;                    \
> +               mov             v8.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr0.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v9.d[1], x8;                    \
> +               mov             v9.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr1.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v10.d[1], x8;                   \
> +               mov             v10.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr2.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v11.d[1], x8;                   \
> +               mov             v11.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr3.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               dup             z8.q, z8.q[0];                  \
> +               dup             z9.q, z9.q[0];                  \
> +               dup             z10.q, z10.q[0];                \
> +               dup             z11.q, z11.q[0];                \
> +               adclt           zctr0.d, z8.d, RZERO.d;         \
> +               adclt           zctr1.d, z9.d, RZERO.d;         \
> +               adclt           zctr2.d, z10.d, RZERO.d;        \
> +               adclt           zctr3.d, z11.d, RZERO.d;        \
> +               adclt           z8.d, zctr0.d, RZERO.d;         \
> +               adclt           z9.d, zctr1.d, RZERO.d;         \
> +               adclt           z10.d, zctr2.d, RZERO.d;        \
> +               adclt           z11.d, zctr3.d, RZERO.d;        \
> +               trn1            zctr0.d, z8.d, zctr0.d;         \
> +               trn1            zctr1.d, z9.d, zctr1.d;         \
> +               trn1            zctr2.d, z10.d, zctr2.d;        \
> +               trn1            zctr3.d, z11.d, zctr3.d;        \
> +               revb            zctr0.d, p0/m, zctr0.d;         \
> +               revb            zctr1.d, p0/m, zctr1.d;         \
> +               revb            zctr2.d, p0/m, zctr2.d;         \
> +               revb            zctr3.d, p0/m, zctr3.d;
> +
> +#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3,               \
> +                    zctr4, zctr5, zctr6, zctr7)                \
> +               mov             v8.d[1], x8;                    \
> +               mov             v8.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr0.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v9.d[1], x8;                    \
> +               mov             v9.d[0], x7;                    \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr1.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v10.d[1], x8;                   \
> +               mov             v10.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr2.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v11.d[1], x8;                   \
> +               mov             v11.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr3.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v12.d[1], x8;                   \
> +               mov             v12.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr4.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v13.d[1], x8;                   \
> +               mov             v13.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr5.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v14.d[1], x8;                   \
> +               mov             v14.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr6.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               mov             v15.d[1], x8;                   \
> +               mov             v15.d[0], x7;                   \
> +               adds            x8, x8, x5, LSR #4;             \
> +               mov             zctr7.d, RLE128_INC.d;          \
> +               adc             x7, x7, xzr;                    \
> +               dup             z8.q, z8.q[0];                  \
> +               dup             z9.q, z9.q[0];                  \
> +               dup             z10.q, z10.q[0];                \
> +               dup             z11.q, z11.q[0];                \
> +               dup             z12.q, z12.q[0];                \
> +               dup             z13.q, z13.q[0];                \
> +               dup             z14.q, z14.q[0];                \
> +               dup             z15.q, z15.q[0];                \
> +               adclt           zctr0.d, z8.d, RZERO.d;         \
> +               adclt           zctr1.d, z9.d, RZERO.d;         \
> +               adclt           zctr2.d, z10.d, RZERO.d;        \
> +               adclt           zctr3.d, z11.d, RZERO.d;        \
> +               adclt           zctr4.d, z12.d, RZERO.d;        \
> +               adclt           zctr5.d, z13.d, RZERO.d;        \
> +               adclt           zctr6.d, z14.d, RZERO.d;        \
> +               adclt           zctr7.d, z15.d, RZERO.d;        \
> +               adclt           z8.d, zctr0.d, RZERO.d;         \
> +               adclt           z9.d, zctr1.d, RZERO.d;         \
> +               adclt           z10.d, zctr2.d, RZERO.d;        \
> +               adclt           z11.d, zctr3.d, RZERO.d;        \
> +               adclt           z12.d, zctr4.d, RZERO.d;        \
> +               adclt           z13.d, zctr5.d, RZERO.d;        \
> +               adclt           z14.d, zctr6.d, RZERO.d;        \
> +               adclt           z15.d, zctr7.d, RZERO.d;        \
> +               trn1            zctr0.d, z8.d, zctr0.d;         \
> +               trn1            zctr1.d, z9.d, zctr1.d;         \
> +               trn1            zctr2.d, z10.d, zctr2.d;        \
> +               trn1            zctr3.d, z11.d, zctr3.d;        \
> +               trn1            zctr4.d, z12.d, zctr4.d;        \
> +               trn1            zctr5.d, z13.d, zctr5.d;        \
> +               trn1            zctr6.d, z14.d, zctr6.d;        \
> +               trn1            zctr7.d, z15.d, zctr7.d;        \
> +               revb            zctr0.d, p0/m, zctr0.d;         \
> +               revb            zctr1.d, p0/m, zctr1.d;         \
> +               revb            zctr2.d, p0/m, zctr2.d;         \
> +               revb            zctr3.d, p0/m, zctr3.d;         \
> +               revb            zctr4.d, p0/m, zctr4.d;         \
> +               revb            zctr5.d, p0/m, zctr5.d;         \
> +               revb            zctr6.d, p0/m, zctr6.d;         \
> +               revb            zctr7.d, p0/m, zctr7.d;
> +
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_crypt)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   w3: nblocks
> +        */
> +       uxtw            x3, w3
> +       SM4_PREPARE(x0)
> +
> +.Lcrypt_loop_8x:
> +       sub             x3, x3, x5, LSR #1              /* x3 - (8 * VL) */
> +       tbnz            x3, #63, .Lcrypt_4x
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +       ld1b            {z1.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z2.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z3.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z4.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z5.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z6.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z7.b}, p0/z, [x2, #7, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x3, .Lcrypt_end
> +       b               .Lcrypt_loop_8x
> +
> +.Lcrypt_4x:
> +       add             x3, x3, x5, LSR #1
> +       cmp             x3, x5, LSR #2
> +       blt             .Lcrypt_loop_1x
> +
> +       sub             x3, x3, x5, LSR #2              /* x3 - (4 * VL) */
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +       ld1b            {z1.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z2.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z3.b}, p0/z, [x2, #3, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x3, .Lcrypt_end
> +
> +.Lcrypt_loop_1x:
> +       cmp             x3, x5, LSR #4
> +       blt             .Lcrypt_ce_loop_1x
> +
> +       sub             x3, x3, x5, LSR #4              /* x3 - VL */
> +
> +       ld1b            {z0.b}, p0/z, [x2]
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x3, .Lcrypt_end
> +       b               .Lcrypt_loop_1x
> +
> +.Lcrypt_ce_loop_1x:
> +       sub             x3, x3, #1
> +
> +       ld1             {v0.16b}, [x2], #16
> +       SM4_CE_CRYPT_BLK(v0)
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x3, .Lcrypt_ce_loop_1x
> +
> +.Lcrypt_end:
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cbc_dec)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: iv (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       ld1             {RIVv.16b}, [x3]
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lcbc_dec_4x
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z9.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z8.b}, p0/z, [x2, #7, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             z4.b, z11.b
> +       rev             z5.b, z10.b
> +       rev             z6.b, z9.b
> +       rev             z7.b, z8.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z7.b, z7.b, z6.b, #16
> +       ext             z6.b, z6.b, z5.b, #16
> +       ext             z5.b, z5.b, z4.b, #16
> +       ext             z4.b, z4.b, z3.b, #16
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z7.b, z7.b
> +       rev             z6.b, z6.b
> +       rev             z5.b, z5.b
> +       rev             z4.b, z4.b
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z8.d
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       eor             z4.d, z4.d, z11.d
> +       eor             z5.d, z5.d, z10.d
> +       eor             z6.d, z6.d, z9.d
> +       eor             z7.d, z7.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lcbc_dec_end
> +       b               .Lcbc_dec_loop_8x
> +
> +.Lcbc_dec_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lcbc_dec_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z12.d
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lcbc_dec_end
> +
> +.Lcbc_dec_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lcbc_dec_ce
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       rev             RTMP0.b, RIV.b
> +       rev             z0.b, z15.b
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z15.d
> +
> +       SM4_SVE_CE_CRYPT_BLK(z15)
> +
> +       eor             z0.d, z0.d, z15.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lcbc_dec_end
> +       b               .Lcbc_dec_loop_1x
> +
> +.Lcbc_dec_ce:
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcbc_dec_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       ld1             {v15.16b}, [x2], #16
> +       mov             v0.16b, RIVv.16b
> +       mov             RIVv.16b, v15.16b
> +       SM4_CE_CRYPT_BLK(v15)
> +       eor             v0.16b, v0.16b, v15.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lcbc_dec_ce_loop_1x
> +
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcbc_dec_end:
> +       /* store new IV */
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +       st1             {RIVv.16b}, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_cbc_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_cfb_dec)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: iv (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       ld1             {RIVv.16b}, [x3]
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lcfb_dec_4x
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z9.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z8.b}, p0/z, [x2, #7, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             z4.b, z11.b
> +       rev             z5.b, z10.b
> +       rev             z6.b, z9.b
> +       rev             z7.b, z8.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z7.b, z7.b, z6.b, #16
> +       ext             z6.b, z6.b, z5.b, #16
> +       ext             z5.b, z5.b, z4.b, #16
> +       ext             z4.b, z4.b, z3.b, #16
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z7.b, z7.b
> +       rev             z6.b, z6.b
> +       rev             z5.b, z5.b
> +       rev             z4.b, z4.b
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z8.d
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       eor             z4.d, z4.d, z11.d
> +       eor             z5.d, z5.d, z10.d
> +       eor             z6.d, z6.d, z9.d
> +       eor             z7.d, z7.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lcfb_dec_end
> +       b               .Lcfb_dec_loop_8x
> +
> +.Lcfb_dec_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lcfb_dec_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       ld1b            {z14.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #3, MUL VL]
> +       rev             z0.b, z15.b
> +       rev             z1.b, z14.b
> +       rev             z2.b, z13.b
> +       rev             z3.b, z12.b
> +       rev             RTMP0.b, RIV.b
> +       ext             z3.b, z3.b, z2.b, #16
> +       ext             z2.b, z2.b, z1.b, #16
> +       ext             z1.b, z1.b, z0.b, #16
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z3.b, z3.b
> +       rev             z2.b, z2.b
> +       rev             z1.b, z1.b
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z12.d
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       eor             z0.d, z0.d, z15.d
> +       eor             z1.d, z1.d, z14.d
> +       eor             z2.d, z2.d, z13.d
> +       eor             z3.d, z3.d, z12.d
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lcfb_dec_end
> +
> +.Lcfb_dec_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lcfb_dec_ce
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       ld1b            {z15.b}, p0/z, [x2]
> +       rev             RTMP0.b, RIV.b
> +       rev             z0.b, z15.b
> +       ext             z0.b, z0.b, RTMP0.b, #16
> +       rev             z0.b, z0.b
> +       mov             RIV.d, z15.d
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       eor             z0.d, z0.d, z15.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lcfb_dec_end
> +       b               .Lcfb_dec_loop_1x
> +
> +.Lcfb_dec_ce:
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +
> +.Lcfb_dec_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       ld1             {v15.16b}, [x2], #16
> +       mov             v0.16b, RIVv.16b
> +       mov             RIVv.16b, v15.16b
> +       SM4_CE_CRYPT_BLK(v0)
> +       eor             v0.16b, v0.16b, v15.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lcfb_dec_ce_loop_1x
> +
> +       ext             RIV.b, RIV.b, RIV.b, #16
> +
> +.Lcfb_dec_end:
> +       /* store new IV */
> +       rev             RIV.s, RIV.s
> +       tbl             RIV.b, {RIV.b}, RSWAP128.b
> +       st1             {RIVv.16b}, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_cfb_dec)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
> +       /* input:
> +        *   x0: round key array, CTX
> +        *   x1: dst
> +        *   x2: src
> +        *   x3: ctr (big endian, 128 bit)
> +        *   w4: nblocks
> +        */
> +       uxtw            x4, w4
> +       SM4_PREPARE(x0)
> +
> +       dup             RZERO.d, #0
> +       adr_l           x6, .Lle128_inc
> +       ld1b            {RLE128_INC.b}, p0/z, [x6]
> +
> +       ldp             x7, x8, [x3]
> +       rev             x7, x7
> +       rev             x8, x8
> +
> +.Lctr_loop_8x:
> +       sub             x4, x4, x5, LSR #1              /* x4 - (8 * VL) */
> +       tbnz            x4, #63, .Lctr_4x
> +
> +       inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       ld1b            {z8.b}, p0/z, [x2]
> +       ld1b            {z9.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #3, MUL VL]
> +       ld1b            {z12.b}, p0/z, [x2, #4, MUL VL]
> +       ld1b            {z13.b}, p0/z, [x2, #5, MUL VL]
> +       ld1b            {z14.b}, p0/z, [x2, #6, MUL VL]
> +       ld1b            {z15.b}, p0/z, [x2, #7, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
> +
> +       eor             z0.d, z0.d, z8.d
> +       eor             z1.d, z1.d, z9.d
> +       eor             z2.d, z2.d, z10.d
> +       eor             z3.d, z3.d, z11.d
> +       eor             z4.d, z4.d, z12.d
> +       eor             z5.d, z5.d, z13.d
> +       eor             z6.d, z6.d, z14.d
> +       eor             z7.d, z7.d, z15.d
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +       st1b            {z4.b}, p0, [x1, #4, MUL VL]
> +       st1b            {z5.b}, p0, [x1, #5, MUL VL]
> +       st1b            {z6.b}, p0, [x1, #6, MUL VL]
> +       st1b            {z7.b}, p0, [x1, #7, MUL VL]
> +
> +       addvl           x2, x2, #8
> +       addvl           x1, x1, #8
> +
> +       cbz             x4, .Lctr_end
> +       b               .Lctr_loop_8x
> +
> +.Lctr_4x:
> +       add             x4, x4, x5, LSR #1
> +       cmp             x4, x5, LSR #2
> +       blt             .Lctr_loop_1x
> +
> +       sub             x4, x4, x5, LSR #2              /* x4 - (4 * VL) */
> +
> +       inc_le128_4x(z0, z1, z2, z3)
> +
> +       ld1b            {z8.b}, p0/z, [x2]
> +       ld1b            {z9.b}, p0/z, [x2, #1, MUL VL]
> +       ld1b            {z10.b}, p0/z, [x2, #2, MUL VL]
> +       ld1b            {z11.b}, p0/z, [x2, #3, MUL VL]
> +
> +       SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
> +
> +       eor             z0.d, z0.d, z8.d
> +       eor             z1.d, z1.d, z9.d
> +       eor             z2.d, z2.d, z10.d
> +       eor             z3.d, z3.d, z11.d
> +
> +       st1b            {z0.b}, p0, [x1]
> +       st1b            {z1.b}, p0, [x1, #1, MUL VL]
> +       st1b            {z2.b}, p0, [x1, #2, MUL VL]
> +       st1b            {z3.b}, p0, [x1, #3, MUL VL]
> +
> +       addvl           x2, x2, #4
> +       addvl           x1, x1, #4
> +
> +       cbz             x4, .Lctr_end
> +
> +.Lctr_loop_1x:
> +       cmp             x4, x5, LSR #4
> +       blt             .Lctr_ce_loop_1x
> +
> +       sub             x4, x4, x5, LSR #4              /* x4 - VL */
> +
> +       inc_le128(z0)
> +       ld1b            {z8.b}, p0/z, [x2]
> +
> +       SM4_SVE_CE_CRYPT_BLK(z0)
> +
> +       eor             z0.d, z0.d, z8.d
> +       st1b            {z0.b}, p0, [x1]
> +
> +       addvl           x2, x2, #1
> +       addvl           x1, x1, #1
> +
> +       cbz             x4, .Lctr_end
> +       b               .Lctr_loop_1x
> +
> +.Lctr_ce_loop_1x:
> +       sub             x4, x4, #1
> +
> +       /* inc_le128 for CE */
> +       mov             v0.d[1], x8
> +       mov             v0.d[0], x7
> +       adds            x8, x8, #1
> +       rev64           v0.16b, v0.16b
> +       adc             x7, x7, xzr
> +
> +       ld1             {v8.16b}, [x2], #16
> +
> +       SM4_CE_CRYPT_BLK(v0)
> +
> +       eor             v0.16b, v0.16b, v8.16b
> +       st1             {v0.16b}, [x1], #16
> +
> +       cbnz            x4, .Lctr_ce_loop_1x
> +
> +.Lctr_end:
> +       /* store new CTR */
> +       rev             x7, x7
> +       rev             x8, x8
> +       stp             x7, x8, [x3]
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
> +
> +.align 3
> +SYM_FUNC_START(sm4_sve_get_vl)
> +       /* VL in bytes */
> +       rdvl            x0, #1
> +
> +       ret
> +SYM_FUNC_END(sm4_sve_get_vl)
> +
> +
> +       .section        ".rodata", "a"
> +       .align 4
> +.Lbswap128_mask:
> +       .byte           0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
> +       .byte           0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
> +       .byte           0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
> +       .byte           0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
> +       .byte           0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
> +       .byte           0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
> +       .byte           0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
> +       .byte           0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
> +       .byte           0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
> +       .byte           0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
> +       .byte           0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
> +       .byte           0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
> +       .byte           0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
> +       .byte           0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
> +       .byte           0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
> +       .byte           0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
> +       .byte           0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
> +       .byte           0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
> +       .byte           0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
> +       .byte           0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
> +       .byte           0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
> +       .byte           0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
> +       .byte           0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
> +       .byte           0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
> +       .byte           0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
> +       .byte           0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
> +       .byte           0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
> +       .byte           0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
> +       .byte           0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
> +       .byte           0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
> +       .byte           0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
> +       .byte           0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
> +
> +.Lle128_inc:
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> +       .byte           0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
> diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
> new file mode 100644
> index 000000000000..fc797b72b5f0
> --- /dev/null
> +++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
> @@ -0,0 +1,332 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
> + * as specified in
> + * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
> + *
> + * Copyright (C) 2022, Alibaba Group.
> + * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/crypto.h>
> +#include <linux/kernel.h>
> +#include <linux/cpufeature.h>
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <crypto/sm4.h>
> +#include "sm4-ce.h"
> +
> +asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
> +                                const u8 *src, unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
> +                                  const u8 *src, u8 *iv,
> +                                  unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
> +                                  const u8 *src, u8 *iv,
> +                                  unsigned int nblocks);
> +asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
> +                                    const u8 *src, u8 *iv,
> +                                    unsigned int nblocks);
> +asmlinkage unsigned int sm4_sve_get_vl(void);
> +
> +
> +static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
> +                     unsigned int key_len)
> +{
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       if (key_len != SM4_KEY_SIZE)
> +               return -EINVAL;
> +
> +       kernel_neon_begin();
> +       sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
> +                         crypto_sm4_fk, crypto_sm4_ck);
> +       kernel_neon_end();
> +
> +       return 0;
> +}
> +
> +static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
> +{
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_sve_ce_crypt(rkey, dst, src, nblocks);
> +
> +                       kernel_neon_end();
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> +       }
> +
> +       return err;
> +}
> +
> +static int ecb_encrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return ecb_crypt(req, ctx->rkey_enc);
> +}
> +
> +static int ecb_decrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return ecb_crypt(req, ctx->rkey_dec);
> +}
> +
> +static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
> +                    void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
> +                               const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
> +       }
> +
> +       return err;
> +}
> +
> +static int cbc_encrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
> +}
> +
> +static int cbc_decrypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +       return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
> +}
> +
> +static int cfb_crypt(struct skcipher_request *req,
> +                    void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
> +                               const u8 *src, u8 *iv, unsigned int nblocks))
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_cfb_crypt(ctx->rkey_enc, dst, src,
> +                                     walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +
> +                       dst += nblocks * SM4_BLOCK_SIZE;
> +                       src += nblocks * SM4_BLOCK_SIZE;
> +                       nbytes -= nblocks * SM4_BLOCK_SIZE;
> +               }
> +
> +               /* tail */
> +               if (walk.nbytes == walk.total && nbytes > 0) {
> +                       u8 keystream[SM4_BLOCK_SIZE];
> +
> +                       sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> +                       crypto_xor_cpy(dst, src, keystream, nbytes);
> +                       nbytes = 0;
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes);
> +       }
> +
> +       return err;
> +}
> +
> +static int cfb_encrypt(struct skcipher_request *req)
> +{
> +       return cfb_crypt(req, sm4_ce_cfb_enc);
> +}
> +
> +static int cfb_decrypt(struct skcipher_request *req)
> +{
> +       return cfb_crypt(req, sm4_sve_ce_cfb_dec);
> +}
> +
> +static int ctr_crypt(struct skcipher_request *req)
> +{
> +       struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +       struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
> +       struct skcipher_walk walk;
> +       unsigned int nbytes;
> +       int err;
> +
> +       err = skcipher_walk_virt(&walk, req, false);
> +
> +       while ((nbytes = walk.nbytes) > 0) {
> +               const u8 *src = walk.src.virt.addr;
> +               u8 *dst = walk.dst.virt.addr;
> +               unsigned int nblocks;
> +
> +               nblocks = nbytes / SM4_BLOCK_SIZE;
> +               if (nblocks) {
> +                       kernel_neon_begin();
> +
> +                       sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
> +                                            walk.iv, nblocks);
> +
> +                       kernel_neon_end();
> +
> +                       dst += nblocks * SM4_BLOCK_SIZE;
> +                       src += nblocks * SM4_BLOCK_SIZE;
> +                       nbytes -= nblocks * SM4_BLOCK_SIZE;
> +               }
> +
> +               /* tail */
> +               if (walk.nbytes == walk.total && nbytes > 0) {
> +                       u8 keystream[SM4_BLOCK_SIZE];
> +
> +                       sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
> +                       crypto_inc(walk.iv, SM4_BLOCK_SIZE);
> +                       crypto_xor_cpy(dst, src, keystream, nbytes);
> +                       nbytes = 0;
> +               }
> +
> +               err = skcipher_walk_done(&walk, nbytes);
> +       }
> +
> +       return err;
> +}
> +
> +static struct skcipher_alg sm4_algs[] = {
> +       {
> +               .base = {
> +                       .cra_name               = "ecb(sm4)",
> +                       .cra_driver_name        = "ecb-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = SM4_BLOCK_SIZE,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = ecb_encrypt,
> +               .decrypt        = ecb_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "cbc(sm4)",
> +                       .cra_driver_name        = "cbc-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = SM4_BLOCK_SIZE,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = cbc_encrypt,
> +               .decrypt        = cbc_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "cfb(sm4)",
> +                       .cra_driver_name        = "cfb-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = 1,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .chunksize      = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = cfb_encrypt,
> +               .decrypt        = cfb_decrypt,
> +       }, {
> +               .base = {
> +                       .cra_name               = "ctr(sm4)",
> +                       .cra_driver_name        = "ctr-sm4-sve-ce",
> +                       .cra_priority           = 500,
> +                       .cra_blocksize          = 1,
> +                       .cra_ctxsize            = sizeof(struct sm4_ctx),
> +                       .cra_module             = THIS_MODULE,
> +               },
> +               .min_keysize    = SM4_KEY_SIZE,
> +               .max_keysize    = SM4_KEY_SIZE,
> +               .ivsize         = SM4_BLOCK_SIZE,
> +               .chunksize      = SM4_BLOCK_SIZE,
> +               .setkey         = sm4_setkey,
> +               .encrypt        = ctr_crypt,
> +               .decrypt        = ctr_crypt,
> +       }
> +};
> +
> +static int __init sm4_sve_ce_init(void)
> +{
> +       if (sm4_sve_get_vl() <= 16)
> +               return -ENODEV;
> +
> +       return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +static void __exit sm4_sve_ce_exit(void)
> +{
> +       crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
> +}
> +
> +module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
> +module_exit(sm4_sve_ce_exit);
> +
> +MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
> +MODULE_ALIAS_CRYPTO("sm4-sve-ce");
> +MODULE_ALIAS_CRYPTO("sm4");
> +MODULE_ALIAS_CRYPTO("ecb(sm4)");
> +MODULE_ALIAS_CRYPTO("cbc(sm4)");
> +MODULE_ALIAS_CRYPTO("cfb(sm4)");
> +MODULE_ALIAS_CRYPTO("ctr(sm4)");
> +MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
> +MODULE_LICENSE("GPL v2");
> --
> 2.24.3 (Apple Git-128)
>
Mark Brown Sept. 26, 2022, 5:14 p.m. UTC | #2
On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:

> Given that we currently do not support the use of SVE in kernel mode,
> this patch cannot be accepted at this time (but the rest of the series
> looks reasonable to me, although I have only skimmed over the patches)

> In view of the disappointing benchmark results below, I don't think
> this is worth the hassle at the moment. If we can find a case where
> using SVE in kernel mode truly makes a [favorable] difference, we can
> revisit this, but not without a thorough analysis of the impact it
> will have to support SVE in the kernel. Also, the fact that SVE may

The kernel code doesn't really distinguish between FPSIMD and SVE in
terms of state management, and with the sharing of the V and Z registers
the architecture is very similar too so it shouldn't be too much hassle,
the only thing we should need is some management for the VL when
starting kernel mode SVE (probably just setting the maximum VL as a
first pass).
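
As a rough sketch of that first pass (purely illustrative: neither
kernel_sve_begin() nor the VL-setting helper below exists today, the
names are hypothetical):

/*
 * Hypothetical sketch only.  Reuse the FPSIMD state handling of
 * kernel_neon_begin() and additionally pin the vector length so that
 * in-kernel SVE code always sees a consistent (maximum) VL.
 */
static inline void kernel_sve_begin(void)
{
	kernel_neon_begin();		/* save/own the V/Z register state */
	sve_kernel_set_max_vl();	/* hypothetical: program ZCR_EL1.LEN to the maximum VL */
}

static inline void kernel_sve_end(void)
{
	kernel_neon_end();
}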

The current code should *work*, and on a system with only a single
supported VL it'd be equivalent since setting the VL is a noop.  It'd
just mean that any kernel mode SVE would end up using whatever VL
happened to have been set last on the PE, which could result in
inconsistent performance.

> also cover cryptographic extensions does not necessarily imply that a
> micro-architecture will perform those crypto transformations in
> parallel and so the performance may be the same even if VL > 128.

Indeed, though so long as the performance is comparable I guess it
doesn't really hurt - if we run into situations where for some
implementations SVE performs worse then we'd need to do something more
complicated than just using SVE if it's available but...

> In summary, please drop this patch for now, and once there are more
> encouraging performance numbers, please resubmit it as part of a
> series that explicitly enables SVE in kernel mode on arm64, and
> documents the requirements and constraints.

...in any case, as you say, until there are cases where SVE does better
for some in-kernel use case we probably just shouldn't merge things.

Having said that I have been tempted to put together a branch which has
a kernel_sve_begin() implementation and collects proposed algorithm
implementations so they're there for people to experiment with as new
hardware becomes available.  There's clearly interest in trying to use
SVE in kernel and it makes sense to try to avoid common pitfalls and
reduce duplication of effort.

A couple of very minor comments on the patch:

> > +config CRYPTO_SM4_ARM64_SVE_CE_BLK
> > +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
> > +       depends on KERNEL_MODE_NEON
> > +       select CRYPTO_SKCIPHER
> > +       select CRYPTO_SM4
> > +       select CRYPTO_SM4_ARM64_CE_BLK
> > +       help

Our current baseline binutils version requirement predates SVE support,
so we'd either need to manually encode all SVE instructions used or add
a suitable dependency.  The dependency seems a lot more reasonable here,
and we could require a new enough version to avoid the manual encoding
that is done in the patch (though I've not checked how new a version
that'd end up requiring, it might be unreasonable so perhaps just
depending on binutils having basic SVE support and continuing with the
manual encoding might be more helpful).

> > +.macro sm4e, vd, vn
> > +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
> > +.endm

For any manual encodings that do get left it'd be good to note the
binutils and LLVM versions which support the instruction so we can
hopefully at some point switch to assembling them normally.

> > +static int __init sm4_sve_ce_init(void)
> > +{
> > +       if (sm4_sve_get_vl() <= 16)
> > +               return -ENODEV;

I'm not clear what this check is attempting to guard against - what's
the issue with larger VLs?

If it is needed then we already have a sve_get_vl() in the core kernel
which we should probably be making available to modules rather than
having them open code something (eg, making it a static inline rather
than putting it in asm).
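
For illustration, a static inline along those lines might look roughly
like the sketch below (assuming the toolchain accepts the RDVL
mnemonic; otherwise it would need a manual encoding):

/* Sketch only: return the current SVE vector length in bytes. */
static inline unsigned int sve_get_vl(void)
{
	u64 vl;

	asm volatile("rdvl %0, #1" : "=r" (vl));

	return vl;
}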
tianjia.zhang Sept. 27, 2022, 4:26 a.m. UTC | #3
Hi Ard,

On 9/26/22 6:02 PM, Ard Biesheuvel wrote:
> (cc Mark Brown)
> 
> Hello Tianjia,
> 
> On Mon, 26 Sept 2022 at 11:37, Tianjia Zhang
> <tianjia.zhang@linux.alibaba.com> wrote:
>>
>> Scalable Vector Extension (SVE) is the next-generation SIMD extension for
>> arm64. SVE allows flexible vector length implementations with a range of
>> possible values in CPU implementations. The vector length can vary from a
>> minimum of 128 bits up to a maximum of 2048 bits, at 128-bit increments.
>> The SVE design guarantees that the same application can run on different
>> implementations that support SVE, without the need to recompile the code.
>>
>> SVE was originally introduced by ARMv8, and ARMv9 introduced SVE2 to
>> expand and improve it. Similar to the Crypto Extension supported by the
>> NEON instruction set for the algorithm, SVE also supports the similar
>> instructions, called cryptography acceleration instructions, but this is
>> also optional instruction set.
>>
>> This patch uses SM4 cryptography acceleration instructions and SVE2
>> instructions to optimize the SM4 algorithm for ECB/CBC/CFB/CTR modes.
>> Since the encryption of CBC/CFB cannot be parallelized, the Crypto
>> Extension instruction is used.
>>
> 
> Given that we currently do not support the use of SVE in kernel mode,
> this patch cannot be accepted at this time (but the rest of the series
> looks reasonable to me, although I have only skimmed over the patches)
> 
> In view of the disappointing benchmark results below, I don't think
> this is worth the hassle at the moment. If we can find a case where
> using SVE in kernel mode truly makes a [favorable] difference, we can
> revisit this, but not without a thorough analysis of the impact it
> will have to support SVE in the kernel. Also, the fact that SVE may
> also cover cryptographic extensions does not necessarily imply that a
> micro-architecture will perform those crypto transformations in
> parallel and so the performance may be the same even if VL > 128.
> 
> In summary, please drop this patch for now, and once there are more
> encouraging performance numbers, please resubmit it as part of a
> series that explicitly enables SVE in kernel mode on arm64, and
> documents the requirements and constraints.
> 
> I have cc'ed Mark who has been working on the SVE support., who might
> have something to add here as well.
> 
> Thanks,
> Ard.
> 
> 

Thanks for your reply. The current performance of SVE is indeed
unsatisfactory. One reason is that the SVE implementation has to handle
more complex data shifting operations, for example in CBC/CFB mode, and
in CTR mode it needs more instructions to complete the 128-bit counter
increment, while the CE implementation does not have these
complications.
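
For context, the per-block operation the CTR path has to emulate is a
128-bit big-endian counter increment, roughly what the generic
crypto_inc() helper does byte by byte. A minimal C illustration (not
the SVE code itself, just the operation being emulated):

/* Illustration only: increment a 128-bit big-endian counter in place. */
static void ctr128_inc_be(u8 ctr[16])
{
	int i;

	for (i = 15; i >= 0; i--)
		if (++ctr[i] != 0)	/* stop once a byte no longer wraps to zero */
			break;
}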

In addition, I naively thought that with a 256-bit VL the performance
would simply double compared to 128-bit, but at present this is not the
case. It is probably not worth using SVE until there are significantly
better performance numbers. I'll follow your advice and drop this
patch.

Best regards,
Tianjia
tianjia.zhang Sept. 27, 2022, 4:30 a.m. UTC | #4
Hi Mark,

On 9/27/22 1:14 AM, Mark Brown wrote:
> On Mon, Sep 26, 2022 at 12:02:04PM +0200, Ard Biesheuvel wrote:
> 
>> Given that we currently do not support the use of SVE in kernel mode,
>> this patch cannot be accepted at this time (but the rest of the series
>> looks reasonable to me, although I have only skimmed over the patches)
> 
>> In view of the disappointing benchmark results below, I don't think
>> this is worth the hassle at the moment. If we can find a case where
>> using SVE in kernel mode truly makes a [favorable] difference, we can
>> revisit this, but not without a thorough analysis of the impact it
>> will have to support SVE in the kernel. Also, the fact that SVE may
> 
> The kernel code doesn't really distinguish between FPSIMD and SVE in
> terms of state management, and with the sharing of the V and Z registers
> the architecture is very similar too so it shouldn't be too much hassle,
> the only thing we should need is some management for the VL when
> starting kernel mode SVE (probably just setting the maximum VL as a
> first pass).
> 
> The current code should *work* and on a system with only a single VL
> supported it'd be equivalent since setting the VL is a noop, it'd just
> mean that any kernel mode SVE would end up using whatever the last VL
> set on the PE happened to be in which could result in inconsistent
> performance.
> 
>> also cover cryptographic extensions does not necessarily imply that a
>> micro-architecture will perform those crypto transformations in
>> parallel and so the performance may be the same even if VL > 128.
> 
> Indeed, though so long as the performance is comparable I guess it
> doesn't really hurt - if we run into situations where for some
> implementations SVE performs worse then we'd need to do something more
> complicated than just using SVE if it's available but...
> 
>> In summary, please drop this patch for now, and once there are more
>> encouraging performance numbers, please resubmit it as part of a
>> series that explicitly enables SVE in kernel mode on arm64, and
>> documents the requirements and constraints.
> 
> ...in any case as you say until there are cases where SVE does better
> for some in kernel use case we probably just shouldn't merge things.
> 
> Having said that I have been tempted to put together a branch which has
> a kernel_sve_begin() implementation and collects proposed algorithm
> implementations so they're there for people to experiment with as new
> hardware becomes available.  There's clearly interest in trying to use
> SVE in kernel and it makes sense to try to avoid common pitfalls and
> reduce duplication of effort.
> 

Your reply helped me a lot. I did encounter problems in a qemu
environment with a VL larger than 128 bits, but when I tested the same
code with the pure user-mode library libgcrypt it appeared to work
normally, so maybe it is just a coincidence that it works fine on the
128-bit physical machine.

I am looking forward to your experimental branch, and I believe that
there will be breakthroughs in hardware in the near future.

> A couple of very minor comments on the patch:
> 
>>> +config CRYPTO_SM4_ARM64_SVE_CE_BLK
>>> +       tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
>>> +       depends on KERNEL_MODE_NEON
>>> +       select CRYPTO_SKCIPHER
>>> +       select CRYPTO_SM4
>>> +       select CRYPTO_SM4_ARM64_CE_BLK
>>> +       help
> 
> Our current baseline binutils version requirement predates SVE support
> so we'd either need to manually encode all SVE instructions used or add
> suitable dependency.  The dependency seems a lot more reasonable here,
> and we could require a new enough version to avoid the manual encoding
> that is done in the patch (though I've not checked how new a version
> that'd end up requiring, it might be unreasonable so perhaps just
> depending on binutils having basic SVE support and continuing with the
> manual encoding might be more helpful).
> 
>>> +.macro sm4e, vd, vn
>>> +       .inst 0xcec08400 | (.L\vn << 5) | .L\vd
>>> +.endm
> 
> For any manual encodings that do get left it'd be good to note the
> binutils and LLVM versions which support the instruction so we can
> hopefully at some point switch to assembling them normally.
> 
>>> +static int __init sm4_sve_ce_init(void)
>>> +{
>>> +       if (sm4_sve_get_vl() <= 16)
>>> +               return -ENODEV;
> 
> I'm not clear what this check is attempting to guard against - what's
> the issue with larger VLs?

Since I have no physical environment with a larger VL, this check is
based on my naive assumption that performance at a 256-bit VL should
theoretically be twice that at 128 bits. Because SVE needs extra
instructions for the data shifting and CTR incrementing operations, the
idea was that SVE only brings a performance improvement when VL is
greater than or equal to 256 bits, and that falling back to CE is the
better choice otherwise.

Now it seems that this assumption itself is not valid, so I will drop
this patch for now.

> 
> If it is needed then we already have a sve_get_vl() in the core kernel
> which we should probably be making available to modules rather than
> having them open code something (eg, making it a static inline rather
> than putting it in asm).

Yes, I agree, exporting sve_get_vl() to the module is the more
appropriate approach.
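
As a sketch of that direction (assuming such a core-kernel sve_get_vl()
helper were exported to modules), the module init check could then
become:

static int __init sm4_sve_ce_init(void)
{
	/* only register when the vector length is wider than 128 bits (16 bytes) */
	if (sve_get_vl() <= 16)
		return -ENODEV;

	return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
}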

Best regards,
Tianjia

Patch

diff --git a/arch/arm64/crypto/Kconfig b/arch/arm64/crypto/Kconfig
index 6793d5bc3ee5..bbb5a7a08af5 100644
--- a/arch/arm64/crypto/Kconfig
+++ b/arch/arm64/crypto/Kconfig
@@ -249,6 +249,25 @@  config CRYPTO_SM4_ARM64_CE_BLK
 	  - ARMv8 Crypto Extensions
 	  - NEON (Advanced SIMD) extensions
 
+config CRYPTO_SM4_ARM64_SVE_CE_BLK
+	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (ARMv9 cryptography acceleration with SVE2)"
+	depends on KERNEL_MODE_NEON
+	select CRYPTO_SKCIPHER
+	select CRYPTO_SM4
+	select CRYPTO_SM4_ARM64_CE_BLK
+	help
+	  Length-preserving ciphers: SM4 cipher algorithms (OSCCA GB/T 32907-2016)
+	  with block cipher modes:
+	  - ECB (Electronic Codebook) mode (NIST SP800-38A)
+	  - CBC (Cipher Block Chaining) mode (NIST SP800-38A)
+	  - CFB (Cipher Feedback) mode (NIST SP800-38A)
+	  - CTR (Counter) mode (NIST SP800-38A)
+
+	  Architecture: arm64 using:
+	  - ARMv8 Crypto Extensions
+	  - ARMv9 cryptography acceleration with SVE2
+	  - NEON (Advanced SIMD) extensions
+
 config CRYPTO_SM4_ARM64_NEON_BLK
 	tristate "Ciphers: SM4, modes: ECB/CBC/CFB/CTR (NEON)"
 	depends on KERNEL_MODE_NEON
diff --git a/arch/arm64/crypto/Makefile b/arch/arm64/crypto/Makefile
index 4818e204c2ac..355dd9053434 100644
--- a/arch/arm64/crypto/Makefile
+++ b/arch/arm64/crypto/Makefile
@@ -38,6 +38,9 @@  sm4-ce-gcm-y := sm4-ce-gcm-glue.o sm4-ce-gcm-core.o
 obj-$(CONFIG_CRYPTO_SM4_ARM64_NEON_BLK) += sm4-neon.o
 sm4-neon-y := sm4-neon-glue.o sm4-neon-core.o
 
+obj-$(CONFIG_CRYPTO_SM4_ARM64_SVE_CE_BLK) += sm4-sve-ce.o
+sm4-sve-ce-y := sm4-sve-ce-glue.o sm4-sve-ce-core.o
+
 obj-$(CONFIG_CRYPTO_GHASH_ARM64_CE) += ghash-ce.o
 ghash-ce-y := ghash-ce-glue.o ghash-ce-core.o
 
diff --git a/arch/arm64/crypto/sm4-sve-ce-core.S b/arch/arm64/crypto/sm4-sve-ce-core.S
new file mode 100644
index 000000000000..caecbdf2536c
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-core.S
@@ -0,0 +1,1028 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm for ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+.arch	armv8-a+crypto+sve+sve2
+
+.irp b, 0, 15, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lv\b\().4s, \b
+.endr
+
+.irp b, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, \
+		16, 24, 25, 26, 27, 28, 29, 30, 31
+	.set .Lz\b\().s, \b
+.endr
+
+.macro sm4e, vd, vn
+	.inst 0xcec08400 | (.L\vn << 5) | .L\vd
+.endm
+
+.macro sm4e_sve, zd, zn
+	.inst 0x4523e000 | (.L\zn << 5) | .L\zd
+.endm
+
+
+/* Register macros */
+
+#define RCTR        z16
+#define RCTRv       v16
+#define RIV         z16
+#define RIVv        v16
+#define RSWAP128    z17
+#define RZERO       z18
+#define RLE128_INC  z19
+
+#define RTMP0       z20
+#define RTMP0v      v20
+#define RTMP1       z21
+#define RTMP2       z22
+#define RTMP3       z23
+
+
+/* Helper macros. */
+
+#define SM4_PREPARE(ptr)					\
+		adr_l		x7, .Lbswap128_mask;		\
+		ptrue		p0.b, ALL;			\
+		rdvl		x5, #1;				\
+		ld1b		{RSWAP128.b}, p0/z, [x7];	\
+								\
+		ld1		{v24.16b-v27.16b}, [ptr], #64;	\
+		ld1		{v28.16b-v31.16b}, [ptr];	\
+		dup		z24.q, z24.q[0];		\
+		dup		z25.q, z25.q[0];		\
+		dup		z26.q, z26.q[0];		\
+		dup		z27.q, z27.q[0];		\
+		dup		z28.q, z28.q[0];		\
+		dup		z29.q, z29.q[0];		\
+		dup		z30.q, z30.q[0];		\
+		dup		z31.q, z31.q[0];
+
+#define SM4_SVE_CE_CRYPT_BLK(b0)				\
+		revb		b0.s, p0/m, b0.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;
+
+#define SM4_SVE_CE_CRYPT_BLK4(b0, b1, b2, b3)			\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b1.s, z24.s;			\
+		sm4e_sve	b2.s, z24.s;			\
+		sm4e_sve	b3.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b1.s, z25.s;			\
+		sm4e_sve	b2.s, z25.s;			\
+		sm4e_sve	b3.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b1.s, z26.s;			\
+		sm4e_sve	b2.s, z26.s;			\
+		sm4e_sve	b3.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b1.s, z27.s;			\
+		sm4e_sve	b2.s, z27.s;			\
+		sm4e_sve	b3.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b1.s, z28.s;			\
+		sm4e_sve	b2.s, z28.s;			\
+		sm4e_sve	b3.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b1.s, z29.s;			\
+		sm4e_sve	b2.s, z29.s;			\
+		sm4e_sve	b3.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b1.s, z30.s;			\
+		sm4e_sve	b2.s, z30.s;			\
+		sm4e_sve	b3.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		sm4e_sve	b1.s, z31.s;			\
+		sm4e_sve	b2.s, z31.s;			\
+		sm4e_sve	b3.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		tbl		b1.b, {b1.b}, RSWAP128.b;	\
+		tbl		b2.b, {b2.b}, RSWAP128.b;	\
+		tbl		b3.b, {b3.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;
+
+#define SM4_SVE_CE_CRYPT_BLK8(b0, b1, b2, b3, b4, b5, b6, b7)	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		revb		b4.s, p0/m, b4.s;		\
+		revb		b5.s, p0/m, b5.s;		\
+		revb		b6.s, p0/m, b6.s;		\
+		revb		b7.s, p0/m, b7.s;		\
+		sm4e_sve	b0.s, z24.s;			\
+		sm4e_sve	b1.s, z24.s;			\
+		sm4e_sve	b2.s, z24.s;			\
+		sm4e_sve	b3.s, z24.s;			\
+		sm4e_sve	b4.s, z24.s;			\
+		sm4e_sve	b5.s, z24.s;			\
+		sm4e_sve	b6.s, z24.s;			\
+		sm4e_sve	b7.s, z24.s;			\
+		sm4e_sve	b0.s, z25.s;			\
+		sm4e_sve	b1.s, z25.s;			\
+		sm4e_sve	b2.s, z25.s;			\
+		sm4e_sve	b3.s, z25.s;			\
+		sm4e_sve	b4.s, z25.s;			\
+		sm4e_sve	b5.s, z25.s;			\
+		sm4e_sve	b6.s, z25.s;			\
+		sm4e_sve	b7.s, z25.s;			\
+		sm4e_sve	b0.s, z26.s;			\
+		sm4e_sve	b1.s, z26.s;			\
+		sm4e_sve	b2.s, z26.s;			\
+		sm4e_sve	b3.s, z26.s;			\
+		sm4e_sve	b4.s, z26.s;			\
+		sm4e_sve	b5.s, z26.s;			\
+		sm4e_sve	b6.s, z26.s;			\
+		sm4e_sve	b7.s, z26.s;			\
+		sm4e_sve	b0.s, z27.s;			\
+		sm4e_sve	b1.s, z27.s;			\
+		sm4e_sve	b2.s, z27.s;			\
+		sm4e_sve	b3.s, z27.s;			\
+		sm4e_sve	b4.s, z27.s;			\
+		sm4e_sve	b5.s, z27.s;			\
+		sm4e_sve	b6.s, z27.s;			\
+		sm4e_sve	b7.s, z27.s;			\
+		sm4e_sve	b0.s, z28.s;			\
+		sm4e_sve	b1.s, z28.s;			\
+		sm4e_sve	b2.s, z28.s;			\
+		sm4e_sve	b3.s, z28.s;			\
+		sm4e_sve	b4.s, z28.s;			\
+		sm4e_sve	b5.s, z28.s;			\
+		sm4e_sve	b6.s, z28.s;			\
+		sm4e_sve	b7.s, z28.s;			\
+		sm4e_sve	b0.s, z29.s;			\
+		sm4e_sve	b1.s, z29.s;			\
+		sm4e_sve	b2.s, z29.s;			\
+		sm4e_sve	b3.s, z29.s;			\
+		sm4e_sve	b4.s, z29.s;			\
+		sm4e_sve	b5.s, z29.s;			\
+		sm4e_sve	b6.s, z29.s;			\
+		sm4e_sve	b7.s, z29.s;			\
+		sm4e_sve	b0.s, z30.s;			\
+		sm4e_sve	b1.s, z30.s;			\
+		sm4e_sve	b2.s, z30.s;			\
+		sm4e_sve	b3.s, z30.s;			\
+		sm4e_sve	b4.s, z30.s;			\
+		sm4e_sve	b5.s, z30.s;			\
+		sm4e_sve	b6.s, z30.s;			\
+		sm4e_sve	b7.s, z30.s;			\
+		sm4e_sve	b0.s, z31.s;			\
+		sm4e_sve	b1.s, z31.s;			\
+		sm4e_sve	b2.s, z31.s;			\
+		sm4e_sve	b3.s, z31.s;			\
+		sm4e_sve	b4.s, z31.s;			\
+		sm4e_sve	b5.s, z31.s;			\
+		sm4e_sve	b6.s, z31.s;			\
+		sm4e_sve	b7.s, z31.s;			\
+		tbl		b0.b, {b0.b}, RSWAP128.b;	\
+		tbl		b1.b, {b1.b}, RSWAP128.b;	\
+		tbl		b2.b, {b2.b}, RSWAP128.b;	\
+		tbl		b3.b, {b3.b}, RSWAP128.b;	\
+		tbl		b4.b, {b4.b}, RSWAP128.b;	\
+		tbl		b5.b, {b5.b}, RSWAP128.b;	\
+		tbl		b6.b, {b6.b}, RSWAP128.b;	\
+		tbl		b7.b, {b7.b}, RSWAP128.b;	\
+		revb		b0.s, p0/m, b0.s;		\
+		revb		b1.s, p0/m, b1.s;		\
+		revb		b2.s, p0/m, b2.s;		\
+		revb		b3.s, p0/m, b3.s;		\
+		revb		b4.s, p0/m, b4.s;		\
+		revb		b5.s, p0/m, b5.s;		\
+		revb		b6.s, p0/m, b6.s;		\
+		revb		b7.s, p0/m, b7.s;
+
+#define SM4_CE_CRYPT_BLK(b0)					\
+		rev32		b0.16b, b0.16b;			\
+		sm4e		b0.4s, v24.4s;			\
+		sm4e		b0.4s, v25.4s;			\
+		sm4e		b0.4s, v26.4s;			\
+		sm4e		b0.4s, v27.4s;			\
+		sm4e		b0.4s, v28.4s;			\
+		sm4e		b0.4s, v29.4s;			\
+		sm4e		b0.4s, v30.4s;			\
+		sm4e		b0.4s, v31.4s;			\
+		rev64		b0.4s, b0.4s;			\
+		ext		b0.16b, b0.16b, b0.16b, #8;	\
+		rev32		b0.16b, b0.16b;
+
+#define inc_le128(zctr)						\
+		mov		RCTRv.d[1], x8;			\
+		mov		RCTRv.d[0], x7;			\
+		mov		zctr.d, RLE128_INC.d;		\
+		dup		RCTR.q, RCTR.q[0];		\
+		adds		x8, x8, x5, LSR #4;		\
+		adclt		zctr.d, RCTR.d, RZERO.d;	\
+		adclt		RCTR.d, zctr.d, RZERO.d;	\
+		adc		x7, x7, xzr;			\
+		trn1		zctr.d, RCTR.d, zctr.d;		\
+		revb		zctr.d, p0/m, zctr.d;
+
+#define inc_le128_4x(zctr0, zctr1, zctr2, zctr3)		\
+		mov		v8.d[1], x8;			\
+		mov		v8.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr0.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v9.d[1], x8;			\
+		mov		v9.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr1.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v10.d[1], x8;			\
+		mov		v10.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr2.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v11.d[1], x8;			\
+		mov		v11.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr3.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		dup		z8.q, z8.q[0];			\
+		dup		z9.q, z9.q[0];			\
+		dup		z10.q, z10.q[0];		\
+		dup		z11.q, z11.q[0];		\
+		adclt		zctr0.d, z8.d, RZERO.d;		\
+		adclt		zctr1.d, z9.d, RZERO.d;		\
+		adclt		zctr2.d, z10.d, RZERO.d;	\
+		adclt		zctr3.d, z11.d, RZERO.d;	\
+		adclt		z8.d, zctr0.d, RZERO.d;		\
+		adclt		z9.d, zctr1.d, RZERO.d;		\
+		adclt		z10.d, zctr2.d, RZERO.d;	\
+		adclt		z11.d, zctr3.d, RZERO.d;	\
+		trn1		zctr0.d, z8.d, zctr0.d;		\
+		trn1		zctr1.d, z9.d, zctr1.d;		\
+		trn1		zctr2.d, z10.d, zctr2.d;	\
+		trn1		zctr3.d, z11.d, zctr3.d;	\
+		revb		zctr0.d, p0/m, zctr0.d;		\
+		revb		zctr1.d, p0/m, zctr1.d;		\
+		revb		zctr2.d, p0/m, zctr2.d;		\
+		revb		zctr3.d, p0/m, zctr3.d;
+
+#define inc_le128_8x(zctr0, zctr1, zctr2, zctr3,		\
+		     zctr4, zctr5, zctr6, zctr7)		\
+		mov		v8.d[1], x8;			\
+		mov		v8.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr0.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v9.d[1], x8;			\
+		mov		v9.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr1.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v10.d[1], x8;			\
+		mov		v10.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr2.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v11.d[1], x8;			\
+		mov		v11.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr3.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v12.d[1], x8;			\
+		mov		v12.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr4.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v13.d[1], x8;			\
+		mov		v13.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr5.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v14.d[1], x8;			\
+		mov		v14.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr6.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		mov		v15.d[1], x8;			\
+		mov		v15.d[0], x7;			\
+		adds		x8, x8, x5, LSR #4;		\
+		mov		zctr7.d, RLE128_INC.d;		\
+		adc		x7, x7, xzr;			\
+		dup		z8.q, z8.q[0];			\
+		dup		z9.q, z9.q[0];			\
+		dup		z10.q, z10.q[0];		\
+		dup		z11.q, z11.q[0];		\
+		dup		z12.q, z12.q[0];		\
+		dup		z13.q, z13.q[0];		\
+		dup		z14.q, z14.q[0];		\
+		dup		z15.q, z15.q[0];		\
+		adclt		zctr0.d, z8.d, RZERO.d;		\
+		adclt		zctr1.d, z9.d, RZERO.d;		\
+		adclt		zctr2.d, z10.d, RZERO.d;	\
+		adclt		zctr3.d, z11.d, RZERO.d;	\
+		adclt		zctr4.d, z12.d, RZERO.d;	\
+		adclt		zctr5.d, z13.d, RZERO.d;	\
+		adclt		zctr6.d, z14.d, RZERO.d;	\
+		adclt		zctr7.d, z15.d, RZERO.d;	\
+		adclt		z8.d, zctr0.d, RZERO.d;		\
+		adclt		z9.d, zctr1.d, RZERO.d;		\
+		adclt		z10.d, zctr2.d, RZERO.d;	\
+		adclt		z11.d, zctr3.d, RZERO.d;	\
+		adclt		z12.d, zctr4.d, RZERO.d;	\
+		adclt		z13.d, zctr5.d, RZERO.d;	\
+		adclt		z14.d, zctr6.d, RZERO.d;	\
+		adclt		z15.d, zctr7.d, RZERO.d;	\
+		trn1		zctr0.d, z8.d, zctr0.d;		\
+		trn1		zctr1.d, z9.d, zctr1.d;		\
+		trn1		zctr2.d, z10.d, zctr2.d;	\
+		trn1		zctr3.d, z11.d, zctr3.d;	\
+		trn1		zctr4.d, z12.d, zctr4.d;	\
+		trn1		zctr5.d, z13.d, zctr5.d;	\
+		trn1		zctr6.d, z14.d, zctr6.d;	\
+		trn1		zctr7.d, z15.d, zctr7.d;	\
+		revb		zctr0.d, p0/m, zctr0.d;		\
+		revb		zctr1.d, p0/m, zctr1.d;		\
+		revb		zctr2.d, p0/m, zctr2.d;		\
+		revb		zctr3.d, p0/m, zctr3.d;		\
+		revb		zctr4.d, p0/m, zctr4.d;		\
+		revb		zctr5.d, p0/m, zctr5.d;		\
+		revb		zctr6.d, p0/m, zctr6.d;		\
+		revb		zctr7.d, p0/m, zctr7.d;
+
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   w3: nblocks
+	 */
+	uxtw		x3, w3
+	SM4_PREPARE(x0)
+
+.Lcrypt_loop_8x:
+	sub		x3, x3, x5, LSR #1		/* x3 - (8 * VL) */
+	tbnz		x3, #63, .Lcrypt_4x
+
+	ld1b		{z0.b}, p0/z, [x2]
+	ld1b		{z1.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z2.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z3.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z4.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z5.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z6.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z7.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x3, .Lcrypt_end
+	b		.Lcrypt_loop_8x
+
+.Lcrypt_4x:
+	add		x3, x3, x5, LSR #1
+	cmp		x3, x5, LSR #2
+	blt		.Lcrypt_loop_1x
+
+	sub		x3, x3, x5, LSR #2		/* x3 - (4 * VL) */
+
+	ld1b		{z0.b}, p0/z, [x2]
+	ld1b		{z1.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z2.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z3.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x3, .Lcrypt_end
+
+.Lcrypt_loop_1x:
+	cmp		x3, x5, LSR #4
+	blt		.Lcrypt_ce_loop_1x
+
+	sub		x3, x3, x5, LSR #4		/* x3 - VL */
+
+	ld1b		{z0.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x3, .Lcrypt_end
+	b		.Lcrypt_loop_1x
+
+.Lcrypt_ce_loop_1x:
+	sub		x3, x3, #1
+
+	ld1		{v0.16b}, [x2], #16
+	SM4_CE_CRYPT_BLK(v0)
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x3, .Lcrypt_ce_loop_1x
+
+.Lcrypt_end:
+	ret
+SYM_FUNC_END(sm4_sve_ce_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cbc_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	ld1		{RIVv.16b}, [x3]
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lcbc_dec_4x
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z9.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z8.b}, p0/z, [x2, #7, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		z4.b, z11.b
+	rev		z5.b, z10.b
+	rev		z6.b, z9.b
+	rev		z7.b, z8.b
+	rev		RTMP0.b, RIV.b
+	ext		z7.b, z7.b, z6.b, #16
+	ext		z6.b, z6.b, z5.b, #16
+	ext		z5.b, z5.b, z4.b, #16
+	ext		z4.b, z4.b, z3.b, #16
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z7.b, z7.b
+	rev		z6.b, z6.b
+	rev		z5.b, z5.b
+	rev		z4.b, z4.b
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z8.d
+
+	SM4_SVE_CE_CRYPT_BLK8(z15, z14, z13, z12, z11, z10, z9, z8)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	eor		z4.d, z4.d, z11.d
+	eor		z5.d, z5.d, z10.d
+	eor		z6.d, z6.d, z9.d
+	eor		z7.d, z7.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_8x
+
+.Lcbc_dec_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lcbc_dec_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		RTMP0.b, RIV.b
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z12.d
+
+	SM4_SVE_CE_CRYPT_BLK4(z15, z14, z13, z12)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lcbc_dec_end
+
+.Lcbc_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcbc_dec_ce
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z15)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcbc_dec_end
+	b		.Lcbc_dec_loop_1x
+
+.Lcbc_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcbc_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v15)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcbc_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcbc_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cbc_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_cfb_dec)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: iv (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	ld1		{RIVv.16b}, [x3]
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lcfb_dec_4x
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z9.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z8.b}, p0/z, [x2, #7, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		z4.b, z11.b
+	rev		z5.b, z10.b
+	rev		z6.b, z9.b
+	rev		z7.b, z8.b
+	rev		RTMP0.b, RIV.b
+	ext		z7.b, z7.b, z6.b, #16
+	ext		z6.b, z6.b, z5.b, #16
+	ext		z5.b, z5.b, z4.b, #16
+	ext		z4.b, z4.b, z3.b, #16
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z7.b, z7.b
+	rev		z6.b, z6.b
+	rev		z5.b, z5.b
+	rev		z4.b, z4.b
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z8.d
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	eor		z4.d, z4.d, z11.d
+	eor		z5.d, z5.d, z10.d
+	eor		z6.d, z6.d, z9.d
+	eor		z7.d, z7.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_8x
+
+.Lcfb_dec_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lcfb_dec_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	ld1b		{z14.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #3, MUL VL]
+	rev		z0.b, z15.b
+	rev		z1.b, z14.b
+	rev		z2.b, z13.b
+	rev		z3.b, z12.b
+	rev		RTMP0.b, RIV.b
+	ext		z3.b, z3.b, z2.b, #16
+	ext		z2.b, z2.b, z1.b, #16
+	ext		z1.b, z1.b, z0.b, #16
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z3.b, z3.b
+	rev		z2.b, z2.b
+	rev		z1.b, z1.b
+	rev		z0.b, z0.b
+	mov		RIV.d, z12.d
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z15.d
+	eor		z1.d, z1.d, z14.d
+	eor		z2.d, z2.d, z13.d
+	eor		z3.d, z3.d, z12.d
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lcfb_dec_end
+
+.Lcfb_dec_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lcfb_dec_ce
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	ld1b		{z15.b}, p0/z, [x2]
+	rev		RTMP0.b, RIV.b
+	rev		z0.b, z15.b
+	ext		z0.b, z0.b, RTMP0.b, #16
+	rev		z0.b, z0.b
+	mov		RIV.d, z15.d
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z15.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lcfb_dec_end
+	b		.Lcfb_dec_loop_1x
+
+.Lcfb_dec_ce:
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+
+.Lcfb_dec_ce_loop_1x:
+	sub		x4, x4, #1
+
+	ld1		{v15.16b}, [x2], #16
+	mov		v0.16b, RIVv.16b
+	mov		RIVv.16b, v15.16b
+	SM4_CE_CRYPT_BLK(v0)
+	eor		v0.16b, v0.16b, v15.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lcfb_dec_ce_loop_1x
+
+	ext		RIV.b, RIV.b, RIV.b, #16
+
+.Lcfb_dec_end:
+	/* store new IV */
+	rev		RIV.s, RIV.s
+	tbl		RIV.b, {RIV.b}, RSWAP128.b
+	st1		{RIVv.16b}, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_cfb_dec)
+
+.align 3
+SYM_FUNC_START(sm4_sve_ce_ctr_crypt)
+	/* input:
+	 *   x0: round key array, CTX
+	 *   x1: dst
+	 *   x2: src
+	 *   x3: ctr (big endian, 128 bit)
+	 *   w4: nblocks
+	 */
+	uxtw		x4, w4
+	SM4_PREPARE(x0)
+
+	dup		RZERO.d, #0
+	adr_l		x6, .Lle128_inc
+	ld1b		{RLE128_INC.b}, p0/z, [x6]
+
+	ldp		x7, x8, [x3]
+	rev		x7, x7
+	rev		x8, x8
+
+.Lctr_loop_8x:
+	sub		x4, x4, x5, LSR #1		/* x4 - (8 * VL) */
+	tbnz		x4, #63, .Lctr_4x
+
+	inc_le128_8x(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+	ld1b		{z12.b}, p0/z, [x2, #4, MUL VL]
+	ld1b		{z13.b}, p0/z, [x2, #5, MUL VL]
+	ld1b		{z14.b}, p0/z, [x2, #6, MUL VL]
+	ld1b		{z15.b}, p0/z, [x2, #7, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK8(z0, z1, z2, z3, z4, z5, z6, z7)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+	eor		z4.d, z4.d, z12.d
+	eor		z5.d, z5.d, z13.d
+	eor		z6.d, z6.d, z14.d
+	eor		z7.d, z7.d, z15.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+	st1b		{z4.b}, p0, [x1, #4, MUL VL]
+	st1b		{z5.b}, p0, [x1, #5, MUL VL]
+	st1b		{z6.b}, p0, [x1, #6, MUL VL]
+	st1b		{z7.b}, p0, [x1, #7, MUL VL]
+
+	addvl		x2, x2, #8
+	addvl		x1, x1, #8
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_8x
+
+.Lctr_4x:
+	add		x4, x4, x5, LSR #1
+	cmp		x4, x5, LSR #2
+	blt		.Lctr_loop_1x
+
+	sub		x4, x4, x5, LSR #2		/* x4 - (4 * VL) */
+
+	inc_le128_4x(z0, z1, z2, z3)
+
+	ld1b		{z8.b}, p0/z, [x2]
+	ld1b		{z9.b}, p0/z, [x2, #1, MUL VL]
+	ld1b		{z10.b}, p0/z, [x2, #2, MUL VL]
+	ld1b		{z11.b}, p0/z, [x2, #3, MUL VL]
+
+	SM4_SVE_CE_CRYPT_BLK4(z0, z1, z2, z3)
+
+	eor		z0.d, z0.d, z8.d
+	eor		z1.d, z1.d, z9.d
+	eor		z2.d, z2.d, z10.d
+	eor		z3.d, z3.d, z11.d
+
+	st1b		{z0.b}, p0, [x1]
+	st1b		{z1.b}, p0, [x1, #1, MUL VL]
+	st1b		{z2.b}, p0, [x1, #2, MUL VL]
+	st1b		{z3.b}, p0, [x1, #3, MUL VL]
+
+	addvl		x2, x2, #4
+	addvl		x1, x1, #4
+
+	cbz		x4, .Lctr_end
+
+.Lctr_loop_1x:
+	cmp		x4, x5, LSR #4
+	blt		.Lctr_ce_loop_1x
+
+	sub		x4, x4, x5, LSR #4		/* x4 - VL */
+
+	inc_le128(z0)
+	ld1b		{z8.b}, p0/z, [x2]
+
+	SM4_SVE_CE_CRYPT_BLK(z0)
+
+	eor		z0.d, z0.d, z8.d
+	st1b		{z0.b}, p0, [x1]
+
+	addvl		x2, x2, #1
+	addvl		x1, x1, #1
+
+	cbz		x4, .Lctr_end
+	b		.Lctr_loop_1x
+
+.Lctr_ce_loop_1x:
+	sub		x4, x4, #1
+
+	/* inc_le128 for CE */
+	mov		v0.d[1], x8
+	mov		v0.d[0], x7
+	adds		x8, x8, #1
+	rev64		v0.16b, v0.16b
+	adc		x7, x7, xzr
+
+	ld1		{v8.16b}, [x2], #16
+
+	SM4_CE_CRYPT_BLK(v0)
+
+	eor		v0.16b, v0.16b, v8.16b
+	st1		{v0.16b}, [x1], #16
+
+	cbnz		x4, .Lctr_ce_loop_1x
+
+.Lctr_end:
+	/* store new CTR */
+	rev		x7, x7
+	rev		x8, x8
+	stp		x7, x8, [x3]
+
+	ret
+SYM_FUNC_END(sm4_sve_ce_ctr_crypt)
+
+.align 3
+SYM_FUNC_START(sm4_sve_get_vl)
+	/* VL in bytes */
+	rdvl		x0, #1
+
+	ret
+SYM_FUNC_END(sm4_sve_get_vl)
+
+
+	.section	".rodata", "a"
+	.align 4
+.Lbswap128_mask:
+	.byte		0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
+	.byte		0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03
+	.byte		0x1c, 0x1d, 0x1e, 0x1f, 0x18, 0x19, 0x1a, 0x1b
+	.byte		0x14, 0x15, 0x16, 0x17, 0x10, 0x11, 0x12, 0x13
+	.byte		0x2c, 0x2d, 0x2e, 0x2f, 0x28, 0x29, 0x2a, 0x2b
+	.byte		0x24, 0x25, 0x26, 0x27, 0x20, 0x21, 0x22, 0x23
+	.byte		0x3c, 0x3d, 0x3e, 0x3f, 0x38, 0x39, 0x3a, 0x3b
+	.byte		0x34, 0x35, 0x36, 0x37, 0x30, 0x31, 0x32, 0x33
+	.byte		0x4c, 0x4d, 0x4e, 0x4f, 0x48, 0x49, 0x4a, 0x4b
+	.byte		0x44, 0x45, 0x46, 0x47, 0x40, 0x41, 0x42, 0x43
+	.byte		0x5c, 0x5d, 0x5e, 0x5f, 0x58, 0x59, 0x5a, 0x5b
+	.byte		0x54, 0x55, 0x56, 0x57, 0x50, 0x51, 0x52, 0x53
+	.byte		0x6c, 0x6d, 0x6e, 0x6f, 0x68, 0x69, 0x6a, 0x6b
+	.byte		0x64, 0x65, 0x66, 0x67, 0x60, 0x61, 0x62, 0x63
+	.byte		0x7c, 0x7d, 0x7e, 0x7f, 0x78, 0x79, 0x7a, 0x7b
+	.byte		0x74, 0x75, 0x76, 0x77, 0x70, 0x71, 0x72, 0x73
+	.byte		0x8c, 0x8d, 0x8e, 0x8f, 0x88, 0x89, 0x8a, 0x8b
+	.byte		0x84, 0x85, 0x86, 0x87, 0x80, 0x81, 0x82, 0x83
+	.byte		0x9c, 0x9d, 0x9e, 0x9f, 0x98, 0x99, 0x9a, 0x9b
+	.byte		0x94, 0x95, 0x96, 0x97, 0x90, 0x91, 0x92, 0x93
+	.byte		0xac, 0xad, 0xae, 0xaf, 0xa8, 0xa9, 0xaa, 0xab
+	.byte		0xa4, 0xa5, 0xa6, 0xa7, 0xa0, 0xa1, 0xa2, 0xa3
+	.byte		0xbc, 0xbd, 0xbe, 0xbf, 0xb8, 0xb9, 0xba, 0xbb
+	.byte		0xb4, 0xb5, 0xb6, 0xb7, 0xb0, 0xb1, 0xb2, 0xb3
+	.byte		0xcc, 0xcd, 0xce, 0xcf, 0xc8, 0xc9, 0xca, 0xcb
+	.byte		0xc4, 0xc5, 0xc6, 0xc7, 0xc0, 0xc1, 0xc2, 0xc3
+	.byte		0xdc, 0xdd, 0xde, 0xdf, 0xd8, 0xd9, 0xda, 0xdb
+	.byte		0xd4, 0xd5, 0xd6, 0xd7, 0xd0, 0xd1, 0xd2, 0xd3
+	.byte		0xec, 0xed, 0xee, 0xef, 0xe8, 0xe9, 0xea, 0xeb
+	.byte		0xe4, 0xe5, 0xe6, 0xe7, 0xe0, 0xe1, 0xe2, 0xe3
+	.byte		0xfc, 0xfd, 0xfe, 0xff, 0xf8, 0xf9, 0xfa, 0xfb
+	.byte		0xf4, 0xf5, 0xf6, 0xf7, 0xf0, 0xf1, 0xf2, 0xf3
+
+.Lle128_inc:
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0d, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
+	.byte		0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
diff --git a/arch/arm64/crypto/sm4-sve-ce-glue.c b/arch/arm64/crypto/sm4-sve-ce-glue.c
new file mode 100644
index 000000000000..fc797b72b5f0
--- /dev/null
+++ b/arch/arm64/crypto/sm4-sve-ce-glue.c
@@ -0,0 +1,332 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * SM4 Cipher Algorithm, using ARMv9 Crypto Extensions with SVE2
+ * as specified in
+ * https://tools.ietf.org/id/draft-ribose-cfrg-sm4-10.html
+ *
+ * Copyright (C) 2022, Alibaba Group.
+ * Copyright (C) 2022 Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
+ */
+
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/cpufeature.h>
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <crypto/sm4.h>
+#include "sm4-ce.h"
+
+asmlinkage void sm4_sve_ce_crypt(const u32 *rkey, u8 *dst,
+				 const u8 *src, unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cbc_dec(const u32 *rkey_dec, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_cfb_dec(const u32 *rkey_enc, u8 *dst,
+				   const u8 *src, u8 *iv,
+				   unsigned int nblocks);
+asmlinkage void sm4_sve_ce_ctr_crypt(const u32 *rkey_enc, u8 *dst,
+				     const u8 *src, u8 *iv,
+				     unsigned int nblocks);
+asmlinkage unsigned int sm4_sve_get_vl(void);
+
+
+static int sm4_setkey(struct crypto_skcipher *tfm, const u8 *key,
+		      unsigned int key_len)
+{
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	if (key_len != SM4_KEY_SIZE)
+		return -EINVAL;
+
+	kernel_neon_begin();
+	sm4_ce_expand_key(key, ctx->rkey_enc, ctx->rkey_dec,
+			  crypto_sm4_fk, crypto_sm4_ck);
+	kernel_neon_end();
+
+	return 0;
+}
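sm4_ce_expand_key() derives both round-key arrays from the single 128-bit key
inside one kernel_neon_begin()/kernel_neon_end() section. For SM4 the
decryption schedule is the encryption schedule in reverse order, matching the
generic sm4_expandkey() helper; a standalone sketch of that relationship, for
illustration only:

#include <stdint.h>

/* Illustration only: the key-schedule contract the CE helper is expected to
 * meet -- decryption round keys are the encryption round keys reversed. */
static void sm4_reverse_schedule(const uint32_t rkey_enc[32],
				 uint32_t rkey_dec[32])
{
	int i;

	for (i = 0; i < 32; i++)
		rkey_dec[i] = rkey_enc[31 - i];
}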
+
+static int ecb_crypt(struct skcipher_request *req, const u32 *rkey)
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_crypt(rkey, dst, src, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
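ecb_crypt() hands all full blocks of the current walk chunk to the assembly in
one call and reports the sub-block remainder back through skcipher_walk_done().
ECB has no chaining, so every 16-byte block is transformed independently, which
is what lets the SVE routine spread blocks across vector lanes. A standalone
sketch of the mode, with a hypothetical enc_block callback (illustration only):

#include <stdint.h>

/* Illustration only: ECB applies the block cipher to each 16-byte block
 * independently; there is no data dependency between blocks. */
static void ecb_sketch(void (*enc_block)(const uint32_t *rk, uint8_t *out,
					 const uint8_t *in),
		       const uint32_t *rk, uint8_t *dst, const uint8_t *src,
		       unsigned int nblocks)
{
	while (nblocks--) {
		enc_block(rk, dst, src);
		src += 16;
		dst += 16;
	}
}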
+
+static int ecb_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_enc);
+}
+
+static int ecb_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return ecb_crypt(req, ctx->rkey_dec);
+}
+
+static int cbc_crypt(struct skcipher_request *req, const u32 *rkey,
+		     void (*sm4_cbc_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cbc_crypt(rkey, dst, src, walk.iv, nblocks);
+
+			kernel_neon_end();
+		}
+
+		err = skcipher_walk_done(&walk, nbytes % SM4_BLOCK_SIZE);
+	}
+
+	return err;
+}
+
+static int cbc_encrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_enc, sm4_ce_cbc_enc);
+}
+
+static int cbc_decrypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return cbc_crypt(req, ctx->rkey_dec, sm4_sve_ce_cbc_dec);
+}
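The asymmetry here is deliberate: CBC encryption has to feed each ciphertext
block into the next and therefore stays on the serial sm4_ce_cbc_enc helper,
whereas CBC decryption only needs the previous ciphertext block for a final
XOR, so all the block-cipher invocations can run in parallel in
sm4_sve_ce_cbc_dec. A standalone sketch of the decryption recurrence
P[i] = D(C[i]) ^ C[i-1], with a hypothetical dec_block callback (illustration
only):

#include <stdint.h>
#include <string.h>

/* Illustration only: CBC decryption; the dec_block() calls carry no
 * dependency on each other, only the final XOR uses C[i-1]. */
static void cbc_dec_sketch(void (*dec_block)(const uint32_t *rk, uint8_t *out,
					     const uint8_t *in),
			   const uint32_t *rk, uint8_t *dst, const uint8_t *src,
			   uint8_t iv[16], unsigned int nblocks)
{
	while (nblocks--) {
		uint8_t c[16], tmp[16];
		int i;

		memcpy(c, src, 16);		/* keep C[i]; dst may alias src */
		dec_block(rk, tmp, c);
		for (i = 0; i < 16; i++)
			dst[i] = tmp[i] ^ iv[i];
		memcpy(iv, c, 16);		/* C[i] is the next chaining value */
		src += 16;
		dst += 16;
	}
}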
+
+static int cfb_crypt(struct skcipher_request *req,
+		     void (*sm4_cfb_crypt)(const u32 *rkey, u8 *dst,
+				const u8 *src, u8 *iv, unsigned int nblocks))
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_cfb_crypt(ctx->rkey_enc, dst, src,
+				      walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
+
+static int cfb_encrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_ce_cfb_enc);
+}
+
+static int cfb_decrypt(struct skcipher_request *req)
+{
+	return cfb_crypt(req, sm4_sve_ce_cfb_dec);
+}
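CFB only ever runs the cipher in the encryption direction: the keystream for
block i is E(C[i-1]), so the keystream blocks for decryption can be computed
in parallel, while encryption remains serial and keeps using sm4_ce_cfb_enc.
The tail path in cfb_crypt() XORs a partial final block against one extra
keystream block, which is also why cra_blocksize is 1 for this mode. A
standalone sketch of CFB decryption, including a partial tail, with a
hypothetical enc_block callback (illustration only):

#include <stdint.h>
#include <string.h>

/* Illustration only: CFB decryption with a possible partial final block. */
static void cfb_dec_sketch(void (*enc_block)(const uint32_t *rk, uint8_t *out,
					     const uint8_t *in),
			   const uint32_t *rk, uint8_t *dst, const uint8_t *src,
			   uint8_t iv[16], unsigned int nbytes)
{
	while (nbytes) {
		uint8_t ks[16];
		unsigned int i, n = nbytes < 16 ? nbytes : 16;

		enc_block(rk, ks, iv);		/* keystream = E(previous C) */
		if (n == 16)
			memcpy(iv, src, 16);	/* full blocks feed back */
		for (i = 0; i < n; i++)
			dst[i] = src[i] ^ ks[i];
		src += n;
		dst += n;
		nbytes -= n;
	}
}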
+
+static int ctr_crypt(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct sm4_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while ((nbytes = walk.nbytes) > 0) {
+		const u8 *src = walk.src.virt.addr;
+		u8 *dst = walk.dst.virt.addr;
+		unsigned int nblocks;
+
+		nblocks = nbytes / SM4_BLOCK_SIZE;
+		if (nblocks) {
+			kernel_neon_begin();
+
+			sm4_sve_ce_ctr_crypt(ctx->rkey_enc, dst, src,
+					     walk.iv, nblocks);
+
+			kernel_neon_end();
+
+			dst += nblocks * SM4_BLOCK_SIZE;
+			src += nblocks * SM4_BLOCK_SIZE;
+			nbytes -= nblocks * SM4_BLOCK_SIZE;
+		}
+
+		/* tail */
+		if (walk.nbytes == walk.total && nbytes > 0) {
+			u8 keystream[SM4_BLOCK_SIZE];
+
+			sm4_ce_crypt_block(ctx->rkey_enc, keystream, walk.iv);
+			crypto_inc(walk.iv, SM4_BLOCK_SIZE);
+			crypto_xor_cpy(dst, src, keystream, nbytes);
+			nbytes = 0;
+		}
+
+		err = skcipher_walk_done(&walk, nbytes);
+	}
+
+	return err;
+}
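ctr_crypt() treats the IV as a 128-bit big-endian counter: block i is XORed
with E(IV + i). The bulk assembly routine is expected to advance the counter
for the blocks it consumes, and crypto_inc() performs the same big-endian
increment for the final partial block. A standalone sketch of the mode, with a
hypothetical enc_block callback (illustration only):

#include <stdint.h>

/* Illustration only: big-endian counter increment, as crypto_inc() does. */
static void be128_inc(uint8_t ctr[16])
{
	int i;

	for (i = 15; i >= 0; i--)
		if (++ctr[i])
			break;
}

/* Illustration only: CTR keystream generation and XOR, partial tail included. */
static void ctr_sketch(void (*enc_block)(const uint32_t *rk, uint8_t *out,
					 const uint8_t *in),
		       const uint32_t *rk, uint8_t *dst, const uint8_t *src,
		       uint8_t ctr[16], unsigned int nbytes)
{
	while (nbytes) {
		uint8_t ks[16];
		unsigned int i, n = nbytes < 16 ? nbytes : 16;

		enc_block(rk, ks, ctr);
		be128_inc(ctr);
		for (i = 0; i < n; i++)
			dst[i] = src[i] ^ ks[i];
		src += n;
		dst += n;
		nbytes -= n;
	}
}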
+
+static struct skcipher_alg sm4_algs[] = {
+	{
+		.base = {
+			.cra_name		= "ecb(sm4)",
+			.cra_driver_name	= "ecb-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= ecb_encrypt,
+		.decrypt	= ecb_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "cbc(sm4)",
+			.cra_driver_name	= "cbc-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= SM4_BLOCK_SIZE,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= cbc_encrypt,
+		.decrypt	= cbc_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "cfb(sm4)",
+			.cra_driver_name	= "cfb-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.chunksize	= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= cfb_encrypt,
+		.decrypt	= cfb_decrypt,
+	}, {
+		.base = {
+			.cra_name		= "ctr(sm4)",
+			.cra_driver_name	= "ctr-sm4-sve-ce",
+			.cra_priority		= 500,
+			.cra_blocksize		= 1,
+			.cra_ctxsize		= sizeof(struct sm4_ctx),
+			.cra_module		= THIS_MODULE,
+		},
+		.min_keysize	= SM4_KEY_SIZE,
+		.max_keysize	= SM4_KEY_SIZE,
+		.ivsize		= SM4_BLOCK_SIZE,
+		.chunksize	= SM4_BLOCK_SIZE,
+		.setkey		= sm4_setkey,
+		.encrypt	= ctr_crypt,
+		.decrypt	= ctr_crypt,
+	}
+};
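With cra_priority set to 500, these instances are meant to outrank the other
SM4 skcipher providers (the generic C, NEON and CE drivers register with lower
priorities), so a plain algorithm-name lookup picks this module once it is
loaded and the vector-length check below passes. A minimal sketch of a
synchronous in-kernel user of the "cbc(sm4)" instance, using only the standard
skcipher API (illustration only, not part of the patch; the buffer must be
reachable through a scatterlist, i.e. not on the stack):

#include <crypto/skcipher.h>
#include <crypto/sm4.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/err.h>

/* Illustration only: one in-place CBC encryption pass over buf[0..len). */
static int sm4_cbc_encrypt_once(const u8 key[SM4_KEY_SIZE],
				u8 iv[SM4_BLOCK_SIZE],
				u8 *buf, unsigned int len)
{
	struct crypto_skcipher *tfm;
	struct skcipher_request *req;
	struct scatterlist sg;
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	tfm = crypto_alloc_skcipher("cbc(sm4)", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	err = crypto_skcipher_setkey(tfm, key, SM4_KEY_SIZE);
	if (err)
		goto out_free_tfm;

	req = skcipher_request_alloc(tfm, GFP_KERNEL);
	if (!req) {
		err = -ENOMEM;
		goto out_free_tfm;
	}

	sg_init_one(&sg, buf, len);
	skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
				      CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);
	skcipher_request_set_crypt(req, &sg, &sg, len, iv);

	err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

	skcipher_request_free(req);
out_free_tfm:
	crypto_free_skcipher(tfm);
	return err;
}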
+
+static int __init sm4_sve_ce_init(void)
+{
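+	/*
+	 * sm4_sve_get_vl() reports the SVE vector length in bytes.  The SVE
+	 * path is only worthwhile when the vector length exceeds the 128-bit
+	 * minimum, so skip registration otherwise and leave the work to the
+	 * existing sm4-ce driver.
+	 */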
+	if (sm4_sve_get_vl() <= 16)
+		return -ENODEV;
+
+	return crypto_register_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+static void __exit sm4_sve_ce_exit(void)
+{
+	crypto_unregister_skciphers(sm4_algs, ARRAY_SIZE(sm4_algs));
+}
+
+module_cpu_feature_match(SVESM4, sm4_sve_ce_init);
+module_exit(sm4_sve_ce_exit);
+
+MODULE_DESCRIPTION("SM4 ECB/CBC/CFB/CTR using ARMv9 Crypto Extensions with SVE2");
+MODULE_ALIAS_CRYPTO("sm4-sve-ce");
+MODULE_ALIAS_CRYPTO("sm4");
+MODULE_ALIAS_CRYPTO("ecb(sm4)");
+MODULE_ALIAS_CRYPTO("cbc(sm4)");
+MODULE_ALIAS_CRYPTO("cfb(sm4)");
+MODULE_ALIAS_CRYPTO("ctr(sm4)");
+MODULE_AUTHOR("Tianjia Zhang <tianjia.zhang@linux.alibaba.com>");
+MODULE_LICENSE("GPL v2");