mbox series

[v3,0/4] Introduce x86 assembler accelerated implementation for SM4 algorithm

Message ID 20210720034642.19230-1-tianjia.zhang@linux.alibaba.com (mailing list archive)
Headers show
Series Introduce x86 assembler accelerated implementation for SM4 algorithm | expand

Message

tianjia.zhang July 20, 2021, 3:46 a.m. UTC
This patchset extracts the public SM4 algorithm as a separate library,
At the same time, the acceleration implementation of SM4 in arm64 was
adjusted to adapt to this SM4 library. Then introduces an accelerated
implementation of the instruction set on x86_64.

This optimization supports the four modes of SM4, ECB, CBC, CFB, and
CTR. Since CBC and CFB do not support multiple block parallel
encryption, the optimization effect is not obvious. And all selftests
have passed already.

The main algorithm implementation comes from SM4 AES-NI work by
libgcrypt and Markku-Juhani O. Saarinen at:
https://github.com/mjosaarinen/sm4ni

Benchmark on Intel Xeon Cascadelake, the data comes from the mode 218
and mode 518 of tcrypt. The abscissas are blocks of different lengths.
The data is tabulated and the unit is Mb/s:

sm4-generic   |    16      64     128     256    1024    1420    4096
      ECB enc | 40.99   46.50   48.05   48.41   49.20   49.25   49.28
      ECB dec | 41.07   46.99   48.15   48.67   49.20   49.25   49.29
      CBC enc | 37.71   45.28   46.77   47.60   48.32   48.37   48.40
      CBC dec | 36.48   44.82   46.43   47.45   48.23   48.30   48.36
      CFB enc | 37.94   44.84   46.12   46.94   47.57   47.46   47.68
      CFB dec | 37.50   42.84   43.74   44.37   44.85   44.80   44.96
      CTR enc | 39.20   45.63   46.75   47.49   48.09   47.85   48.08
      CTR dec | 39.64   45.70   46.72   47.47   47.98   47.88   48.06
sm4-aesni-avx
      ECB enc | 33.75  134.47  221.64  243.43  264.05  251.58  258.13
      ECB dec | 34.02  134.92  223.11  245.14  264.12  251.04  258.33
      CBC enc | 38.85   46.18   47.67   48.34   49.00   48.96   49.14
      CBC dec | 33.54  131.29  223.88  245.27  265.50  252.41  263.78
      CFB enc | 38.70   46.10   47.58   48.29   49.01   48.94   49.19
      CFB dec | 32.79  128.40  223.23  244.87  265.77  253.31  262.79
      CTR enc | 32.58  122.23  220.29  241.16  259.57  248.32  256.69
      CTR dec | 32.81  122.47  218.99  241.54  258.42  248.58  256.61

---
v3 changes:
  * Remove single block algorithm that does not greatly improve performance
  * Remove accelerated for sm4 key expand, which is not performance-critical
  * Fix the warning on arm64/sm4-ce

v2 changes:
  * SM4 library functions use "sm4_" prefix instead of "crypto_" prefix
  * sm4-aesni-avx supports accelerated implementation of four specific modes
  * tcrypt benchmark supports sm4-aesni-avx
  * fixes of other reviews


Tianjia Zhang (4):
  crypto: sm4 - create SM4 library based on sm4 generic code
  crypto: arm64/sm4-ce - Make dependent on sm4 library instead of
    sm4-generic
  crypto: x86/sm4 - add AES-NI/AVX/x86_64 implementation
  crypto: tcrypt - add the asynchronous speed test for SM4

 arch/arm64/crypto/Kconfig              |   2 +-
 arch/arm64/crypto/sm4-ce-glue.c        |  20 +-
 arch/x86/crypto/Makefile               |   3 +
 arch/x86/crypto/sm4-aesni-avx-asm_64.S | 589 +++++++++++++++++++++++++
 arch/x86/crypto/sm4_aesni_avx_glue.c   | 459 +++++++++++++++++++
 crypto/Kconfig                         |  22 +
 crypto/sm4_generic.c                   | 180 +-------
 crypto/tcrypt.c                        |  26 +-
 include/crypto/sm4.h                   |  25 +-
 lib/crypto/Kconfig                     |   3 +
 lib/crypto/Makefile                    |   3 +
 lib/crypto/sm4.c                       | 176 ++++++++
 12 files changed, 1330 insertions(+), 178 deletions(-)
 create mode 100644 arch/x86/crypto/sm4-aesni-avx-asm_64.S
 create mode 100644 arch/x86/crypto/sm4_aesni_avx_glue.c
 create mode 100644 lib/crypto/sm4.c

Comments

Herbert Xu July 30, 2021, 3:09 a.m. UTC | #1
On Tue, Jul 20, 2021 at 11:46:38AM +0800, Tianjia Zhang wrote:
> This patchset extracts the public SM4 algorithm as a separate library,
> At the same time, the acceleration implementation of SM4 in arm64 was
> adjusted to adapt to this SM4 library. Then introduces an accelerated
> implementation of the instruction set on x86_64.
> 
> This optimization supports the four modes of SM4, ECB, CBC, CFB, and
> CTR. Since CBC and CFB do not support multiple block parallel
> encryption, the optimization effect is not obvious. And all selftests
> have passed already.
> 
> The main algorithm implementation comes from SM4 AES-NI work by
> libgcrypt and Markku-Juhani O. Saarinen at:
> https://github.com/mjosaarinen/sm4ni
> 
> Benchmark on Intel Xeon Cascadelake, the data comes from the mode 218
> and mode 518 of tcrypt. The abscissas are blocks of different lengths.
> The data is tabulated and the unit is Mb/s:
> 
> sm4-generic   |    16      64     128     256    1024    1420    4096
>       ECB enc | 40.99   46.50   48.05   48.41   49.20   49.25   49.28
>       ECB dec | 41.07   46.99   48.15   48.67   49.20   49.25   49.29
>       CBC enc | 37.71   45.28   46.77   47.60   48.32   48.37   48.40
>       CBC dec | 36.48   44.82   46.43   47.45   48.23   48.30   48.36
>       CFB enc | 37.94   44.84   46.12   46.94   47.57   47.46   47.68
>       CFB dec | 37.50   42.84   43.74   44.37   44.85   44.80   44.96
>       CTR enc | 39.20   45.63   46.75   47.49   48.09   47.85   48.08
>       CTR dec | 39.64   45.70   46.72   47.47   47.98   47.88   48.06
> sm4-aesni-avx
>       ECB enc | 33.75  134.47  221.64  243.43  264.05  251.58  258.13
>       ECB dec | 34.02  134.92  223.11  245.14  264.12  251.04  258.33
>       CBC enc | 38.85   46.18   47.67   48.34   49.00   48.96   49.14
>       CBC dec | 33.54  131.29  223.88  245.27  265.50  252.41  263.78
>       CFB enc | 38.70   46.10   47.58   48.29   49.01   48.94   49.19
>       CFB dec | 32.79  128.40  223.23  244.87  265.77  253.31  262.79
>       CTR enc | 32.58  122.23  220.29  241.16  259.57  248.32  256.69
>       CTR dec | 32.81  122.47  218.99  241.54  258.42  248.58  256.61
> 
> ---
> v3 changes:
>   * Remove single block algorithm that does not greatly improve performance
>   * Remove accelerated for sm4 key expand, which is not performance-critical
>   * Fix the warning on arm64/sm4-ce
> 
> v2 changes:
>   * SM4 library functions use "sm4_" prefix instead of "crypto_" prefix
>   * sm4-aesni-avx supports accelerated implementation of four specific modes
>   * tcrypt benchmark supports sm4-aesni-avx
>   * fixes of other reviews
> 
> 
> Tianjia Zhang (4):
>   crypto: sm4 - create SM4 library based on sm4 generic code
>   crypto: arm64/sm4-ce - Make dependent on sm4 library instead of
>     sm4-generic
>   crypto: x86/sm4 - add AES-NI/AVX/x86_64 implementation
>   crypto: tcrypt - add the asynchronous speed test for SM4
> 
>  arch/arm64/crypto/Kconfig              |   2 +-
>  arch/arm64/crypto/sm4-ce-glue.c        |  20 +-
>  arch/x86/crypto/Makefile               |   3 +
>  arch/x86/crypto/sm4-aesni-avx-asm_64.S | 589 +++++++++++++++++++++++++
>  arch/x86/crypto/sm4_aesni_avx_glue.c   | 459 +++++++++++++++++++
>  crypto/Kconfig                         |  22 +
>  crypto/sm4_generic.c                   | 180 +-------
>  crypto/tcrypt.c                        |  26 +-
>  include/crypto/sm4.h                   |  25 +-
>  lib/crypto/Kconfig                     |   3 +
>  lib/crypto/Makefile                    |   3 +
>  lib/crypto/sm4.c                       | 176 ++++++++
>  12 files changed, 1330 insertions(+), 178 deletions(-)
>  create mode 100644 arch/x86/crypto/sm4-aesni-avx-asm_64.S
>  create mode 100644 arch/x86/crypto/sm4_aesni_avx_glue.c
>  create mode 100644 lib/crypto/sm4.c

All applied.  Thanks.