[0/2] add AES-NI/AVX2/x86_64 implementation

Message ID 20210818033117.91717-1-tianjia.zhang@linux.alibaba.com (mailing list archive)

Message

tianjia.zhang Aug. 18, 2021, 3:31 a.m. UTC
This patch set exports some of the common functions implemented by
the SM4 AESNI/AVX algorithm and reuses them to accelerate the
AESNI/AVX2 implementation.

The main algorithm implementation comes from SM4 AES-NI work by
libgcrypt and Markku-Juhani O. Saarinen at:
https://github.com/mjosaarinen/sm4ni

Benchmarks were run on an Intel i5-6200U at 2.30GHz and compare three
implementations: pure software sm4-generic, aesni/avx acceleration,
and aesni/avx2 acceleration. The data comes from tcrypt modes 218
and 518. The columns are the block sizes tested, and the unit is
Mb/s:

block-size  |    16      64     128     256    1024    1420    4096
sm4-generic
    ECB enc | 60.94   70.41   72.27   73.02   73.87   73.58   73.59
    ECB dec | 61.87   70.53   72.15   73.09   73.89   73.92   73.86
    CBC enc | 56.71   66.31   68.05   69.84   70.02   70.12   70.24
    CBC dec | 54.54   65.91   68.22   69.51   70.63   70.79   70.82
    CFB enc | 57.21   67.24   69.10   70.25   70.73   70.52   71.42
    CFB dec | 57.22   64.74   66.31   67.24   67.40   67.64   67.58
    CTR enc | 59.47   68.64   69.91   71.02   71.86   71.61   71.95
    CTR dec | 59.94   68.77   69.95   71.00   71.84   71.55   71.95
sm4-aesni-avx
    ECB enc | 44.95  177.35  292.06  316.98  339.48  322.27  330.59
    ECB dec | 45.28  178.66  292.31  317.52  339.59  322.52  331.16
    CBC enc | 57.75   67.68   69.72   70.60   71.48   71.63   71.74
    CBC dec | 44.32  176.83  284.32  307.24  328.61  312.61  325.82
    CFB enc | 57.81   67.64   69.63   70.55   71.40   71.35   71.70
    CFB dec | 43.14  167.78  282.03  307.20  328.35  318.24  325.95
    CTR enc | 42.35  163.32  279.11  302.93  320.86  310.56  317.93
    CTR dec | 42.39  162.81  278.49  302.37  321.11  310.33  318.37
sm4-aesni-avx2
    ECB enc | 45.19  177.41  292.42  316.12  339.90  322.53  330.54
    ECB dec | 44.83  178.90  291.45  317.31  339.85  322.55  331.07
    CBC enc | 57.66   67.62   69.73   70.55   71.58   71.66   71.77
    CBC dec | 44.34  176.86  286.10  501.68  559.58  483.87  527.46
    CFB enc | 57.43   67.60   69.61   70.52   71.43   71.28   71.65
    CFB dec | 43.12  167.75  268.09  499.33  558.35  490.36  524.73
    CTR enc | 42.42  163.39  256.17  493.95  552.45  481.58  517.19
    CTR dec | 42.49  163.11  256.36  493.34  552.62  481.49  516.83

The benchmark data shows that at a block size of 1024, AVX2 is about
70% faster than AVX for CBC decryption, CFB decryption and CTR, and
about 7.7 times as fast as the pure software sm4-generic
implementation. ECB, CBC encryption and CFB encryption show no gain
over AVX.
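
As a quick check of the figures quoted above, the ratios can be
recomputed from the 1024 column of the table. This is only a sketch:
the numbers are copied from the table, and the helper names
gain_percent and speedup are made up for this illustration.

```python
# Throughput at block size 1024, copied from the table above (Mb/s).
generic = {"CBC dec": 70.63, "CFB dec": 67.40, "CTR enc": 71.86, "CTR dec": 71.84}
avx = {"CBC dec": 328.61, "CFB dec": 328.35, "CTR enc": 320.86, "CTR dec": 321.11}
avx2 = {"CBC dec": 559.58, "CFB dec": 558.35, "CTR enc": 552.45, "CTR dec": 552.62}


def gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def speedup(new, old):
    """How many times as fast `new` is compared to `old`."""
    return new / old


for mode in avx2:
    print(f"{mode}: +{gain_percent(avx2[mode], avx[mode]):.0f}% vs AVX, "
          f"{speedup(avx2[mode], generic[mode]):.1f}x vs sm4-generic")
```

The gains over AVX come out at roughly 70% to 72%, and the 7.7x figure
corresponds to the CTR rows.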

Tianjia Zhang (2):
  crypto: x86/sm4 - export reusable AESNI/AVX functions
  crypto: x86/sm4 - add AES-NI/AVX2/x86_64 implementation

 arch/x86/crypto/Makefile                |   3 +
 arch/x86/crypto/sm4-aesni-avx2-asm_64.S | 497 ++++++++++++++++++++++++
 arch/x86/crypto/sm4-avx.h               |  24 ++
 arch/x86/crypto/sm4_aesni_avx2_glue.c   | 169 ++++++++
 arch/x86/crypto/sm4_aesni_avx_glue.c    |  92 +++--
 crypto/Kconfig                          |  22 ++
 6 files changed, 775 insertions(+), 32 deletions(-)
 create mode 100644 arch/x86/crypto/sm4-aesni-avx2-asm_64.S
 create mode 100644 arch/x86/crypto/sm4-avx.h
 create mode 100644 arch/x86/crypto/sm4_aesni_avx2_glue.c

Comments

David Laight Aug. 20, 2021, 10:03 a.m. UTC | #1
From: Tianjia Zhang
> Sent: 18 August 2021 04:31
> 
> This patchsets exported some of the common functions implemented by
> the SM4 AESNI/AVX algorithm, and reused these functions to achieve
> the acceleration of AESNI/AVX2 implementation.

These functions need bracketing by kernel_fpu_enable()
(or whatever it is called.)
That will significantly affect the performance.

Also the functions look pretty big (I don't know how big
the generic ones are) and will take time to load into the I$
and will displace other code.

So while a hot-cache benchmark might show improvements
for repeated calls, it isn't obvious that any significant
gain will be made for real-life calls, which could easily
be of single buffers.
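
The small-buffer concern can be looked at against the cover letter's
own table. The sketch below uses only the ECB enc rows for sm4-generic
and sm4-aesni-avx2 as posted. It is still a hot-cache tcrypt
measurement, so it says nothing about the I$ effects mentioned above.

```python
# ECB enc throughput (Mb/s) per block size, copied from the cover letter.
block_sizes = [16, 64, 128, 256, 1024, 1420, 4096]
generic_ecb_enc = [60.94, 70.41, 72.27, 73.02, 73.87, 73.58, 73.59]
avx2_ecb_enc = [45.19, 177.41, 292.42, 316.12, 339.90, 322.53, 330.54]

# Ratio > 1.0 means the AVX2 implementation is faster than sm4-generic.
ratios = {
    size: fast / slow
    for size, fast, slow in zip(block_sizes, avx2_ecb_enc, generic_ecb_enc)
}

# Block sizes at which the accelerated path is slower than generic.
slower = [size for size, ratio in ratios.items() if ratio < 1.0]

for size, ratio in ratios.items():
    print(f"{size:5d} bytes: avx2/generic = {ratio:.2f}")
print("accelerated path slower at:", slower)
```

In the posted numbers, only the 16-byte column (a single SM4 block)
has the accelerated implementation behind sm4-generic.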

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
tianjia.zhang Aug. 22, 2021, 5:20 a.m. UTC | #2
Hi David,

On 8/20/21 6:03 PM, David Laight wrote:
> From: Tianjia Zhang
>> Sent: 18 August 2021 04:31
>>
>> This patchsets exported some of the common functions implemented by
>> the SM4 AESNI/AVX algorithm, and reused these functions to achieve
>> the acceleration of AESNI/AVX2 implementation.
> 
> These functions need bracketing by kernel_fpu_enable()
> (or whatever it is called.)
> That will significantly affect the performance.
> 
> Also the functions look pretty big (I don't know how big
> the generic ones are) and will take time to load into the I$
> and will displace other code.
> 
> So while a hot-cache benchmark might show improvements
> for repeated calls, it isn't obvious that any significant
> gain will be made for real-life calls, which could easily
> be of single buffers.
> 
> 	David

Yes, bracketing these functions with kernel_fpu_begin() and
kernel_fpu_end() does affect performance. It seems the kernel has no
other option, so the way to improve performance is to process as much
data as possible inside a single kernel_fpu_begin()/kernel_fpu_end()
section. The code already does this.
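
The batching idea can be sketched with a toy model that only counts
how many FPU sections are entered. This is not kernel code and is not
taken from the patch: FpuSectionCounter, process_per_block,
process_batched and max_blocks_per_section are hypothetical names
standing in for kernel_fpu_begin()/kernel_fpu_end() and the glue code.

```python
BLOCK_SIZE = 16  # SM4 block size in bytes


class FpuSectionCounter:
    """Stand-in for kernel_fpu_begin()/kernel_fpu_end(); it only counts sections."""

    def __init__(self):
        self.sections = 0
        self.active = False

    def begin(self):
        assert not self.active, "sections must not nest"
        self.active = True
        self.sections += 1

    def end(self):
        assert self.active, "end() without begin()"
        self.active = False


def process_per_block(data_len, fpu):
    """Worst case: one begin/end pair around every single block."""
    for _ in range(data_len // BLOCK_SIZE):
        fpu.begin()
        # (SIMD work on one block would happen here)
        fpu.end()


def process_batched(data_len, fpu, max_blocks_per_section):
    """Batched: as many blocks as allowed inside each begin/end pair."""
    remaining = data_len // BLOCK_SIZE
    while remaining > 0:
        fpu.begin()
        remaining -= min(remaining, max_blocks_per_section)
        fpu.end()


per_block = FpuSectionCounter()
process_per_block(4096, per_block)

batched = FpuSectionCounter()
process_batched(4096, batched, max_blocks_per_section=256)

print("per-block sections:", per_block.sections)
print("batched sections:  ", batched.sections)
```

The model ignores what a section actually costs. It only shows that
the number of begin/end pairs drops from one per block to one per
batch, which is the overhead being amortized.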

Best regards,
Tianjia
Herbert Xu Aug. 27, 2021, 8:35 a.m. UTC | #3
On Wed, Aug 18, 2021 at 11:31:15AM +0800, Tianjia Zhang wrote:
> This patchsets exported some of the common functions implemented by
> the SM4 AESNI/AVX algorithm, and reused these functions to achieve
> the acceleration of AESNI/AVX2 implementation.

All applied.  Thanks.