[v2,00/19] crypto: x86 - fix RCU stalls

Message ID 20221012215931.3896-1-elliott@hpe.com (mailing list archive)

Message

Elliott, Robert (Servers) Oct. 12, 2022, 9:59 p.m. UTC
This series fixes the RCU stalls triggered by the x86 crypto
modules discussed in
https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

Two root causes were:
- too much data processed between kernel_fpu_begin and
  kernel_fpu_end calls (which are heavily used by the x86
  optimized drivers)
- tcrypt not calling cond_resched during speed test loops
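
To illustrate the first root cause, here is a minimal user-space sketch of the pattern, not the code from the patches: the work is split so that no more than a fixed number of bytes is processed between a begin/end pair. The stubs stub_fpu_begin(), stub_fpu_end() and stub_transform(), and the helper process_limited(), are hypothetical names for illustration only; in the kernel the begin/end pair would be kernel_fpu_begin()/kernel_fpu_end().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for kernel_fpu_begin()/kernel_fpu_end().
 * They only count how many FPU sections were opened.
 */
static unsigned int fpu_section_count;

static void stub_fpu_begin(void)
{
	fpu_section_count++;
}

static void stub_fpu_end(void)
{
}

/* Hypothetical stand-in for an optimized routine: consumes one chunk. */
static size_t stub_transform(const unsigned char *data, size_t len)
{
	(void)data;
	return len;
}

/*
 * Process len bytes, never handling more than bytes_per_fpu bytes
 * inside one FPU section. A limit of 0 means unlimited (one section
 * for the whole request). Returns the number of bytes processed.
 */
size_t process_limited(const unsigned char *data, size_t len,
		       size_t bytes_per_fpu)
{
	size_t done = 0;

	assert(data != NULL || len == 0);

	while (done < len) {
		size_t chunk = len - done;

		if (bytes_per_fpu && chunk > bytes_per_fpu)
			chunk = bytes_per_fpu;

		stub_fpu_begin();
		done += stub_transform(data + done, chunk);
		stub_fpu_end();
		/* between sections the task can be preempted again */
	}
	return done;
}
```

With a 100-byte buffer and a 30-byte limit this opens four sections (30 + 30 + 30 + 10 bytes) instead of one.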

These problems have always been lurking, but improving the
loading of the x86/sha512 module made the stalls occur frequently
during boot when SHA-512 is used for module signature checking.

Fixing these problems makes it safer to improve the loading of
the rest of the x86 modules in the same way as the sha512 module.

This series only handles the x86 modules.

Testing
=======
The most effective testing was by enabling
  CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y

which creates random test vectors and compares the results
of the CPU-optimized function to the generic function,
and running two threads of repeated modprobe commands
to exercise those tests:
  watch -n 0 modprobe tcrypt mode=200
  watch -n 0 ./tcrypt_sweep

where tcrypt_sweep walks through all the test modes:
#!/usr/bin/perl
use strict;
use warnings;

# collect every "case N:" test mode from the tcrypt source
my @modes;

open(my $source, "<", "/home/me/linux/crypto/tcrypt.c") or die $!;
while (<$source>) {
        if (/^\s+case ([0-9]+):$/) {
                push @modes, $1;
        }
}
close($source);

foreach (@modes) {
        print "$_ ";

        # don't run mode 0, which runs 1-199
        # don't run mode 300, which runs 301-399
        # don't run mode 400, which runs 401-499
        if (($_ eq "0") || ($_ eq "300") || ($_ eq "400")) {
                system "echo \"===== Skipping special modprobe tcrypt mode=$_\" > /dev/kmsg";
        } else {
                system "echo \"Running modprobe tcrypt mode=$_\" > /dev/kmsg";
                system "modprobe tcrypt mode=$_";
        }
}



Robert Elliott (19):
  crypto: tcrypt - test crc32
  crypto: tcrypt - test nhpoly1305
  crypto: tcrypt - reschedule during cycles speed tests
  crypto: x86/sha - limit FPU preemption
  crypto: x86/crc - limit FPU preemption
  crypto: x86/sm3 - limit FPU preemption
  crypto: x86/ghash - restructure FPU context saving
  crypto: x86/ghash - limit FPU preemption
  crypto: x86 - use common macro for FPU limit
  crypto: x86/sha1, sha256 - load based on CPU features
  crypto: x86/crc - load based on CPU features
  crypto: x86/sm3 - load based on CPU features
  crypto: x86/ghash - load based on CPU features
  crypto: x86 - load based on CPU features
  crypto: x86 - add pr_fmt to all modules
  crypto: x86 - print CPU optimized loaded messages
  crypto: x86 - standardize suboptimal prints
  crypto: x86 - standardize not loaded prints
  crypto: x86/sha - register only the best function

 arch/x86/crypto/aegis128-aesni-glue.c      |  21 ++-
 arch/x86/crypto/aesni-intel_glue.c         |  31 ++--
 arch/x86/crypto/aria_aesni_avx_glue.c      |  19 +-
 arch/x86/crypto/blake2s-glue.c             |  34 +++-
 arch/x86/crypto/blowfish_glue.c            |  19 +-
 arch/x86/crypto/camellia_aesni_avx2_glue.c |  25 ++-
 arch/x86/crypto/camellia_aesni_avx_glue.c  |  24 ++-
 arch/x86/crypto/camellia_glue.c            |  20 ++-
 arch/x86/crypto/cast5_avx_glue.c           |  21 ++-
 arch/x86/crypto/cast6_avx_glue.c           |  21 ++-
 arch/x86/crypto/chacha_glue.c              |  35 +++-
 arch/x86/crypto/crc32-pclmul_asm.S         |   6 +-
 arch/x86/crypto/crc32-pclmul_glue.c        |  37 ++--
 arch/x86/crypto/crc32c-intel_glue.c        |  51 ++++--
 arch/x86/crypto/crct10dif-pclmul_glue.c    |  54 ++++--
 arch/x86/crypto/curve25519-x86_64.c        |  27 ++-
 arch/x86/crypto/des3_ede_glue.c            |  16 +-
 arch/x86/crypto/ghash-clmulni-intel_glue.c |  40 +++--
 arch/x86/crypto/nhpoly1305-avx2-glue.c     |  27 ++-
 arch/x86/crypto/nhpoly1305-sse2-glue.c     |  23 ++-
 arch/x86/crypto/poly1305_glue.c            |  64 +++++--
 arch/x86/crypto/polyval-clmulni_glue.c     |  14 +-
 arch/x86/crypto/serpent_avx2_glue.c        |  25 ++-
 arch/x86/crypto/serpent_avx_glue.c         |  21 ++-
 arch/x86/crypto/serpent_sse2_glue.c        |  19 +-
 arch/x86/crypto/sha1_ssse3_glue.c          | 188 +++++++++++--------
 arch/x86/crypto/sha256_ssse3_glue.c        | 198 ++++++++++++---------
 arch/x86/crypto/sha512_ssse3_glue.c        | 154 +++++++++-------
 arch/x86/crypto/sm3_avx_glue.c             |  52 +++++-
 arch/x86/crypto/sm4_aesni_avx2_glue.c      |  25 ++-
 arch/x86/crypto/sm4_aesni_avx_glue.c       |  23 ++-
 arch/x86/crypto/twofish_avx_glue.c         |  25 ++-
 arch/x86/crypto/twofish_glue.c             |  19 +-
 arch/x86/crypto/twofish_glue_3way.c        |  26 ++-
 crypto/tcrypt.c                            |  56 +++---
 35 files changed, 1060 insertions(+), 400 deletions(-)

Comments

Elliott, Robert (Servers) Nov. 1, 2022, 9:34 p.m. UTC | #1
> -----Original Message-----
> From: Elliott, Robert (Servers) <elliott@hpe.com>
> Sent: Wednesday, October 12, 2022 4:59 PM
> To: herbert@gondor.apana.org.au; davem@davemloft.net;
> tim.c.chen@linux.intel.com; ap420073@gmail.com; ardb@kernel.org;
> linux-crypto@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: Elliott, Robert (Servers) <elliott@hpe.com>
> Subject: [PATCH v2 00/19] crypto: x86 - fix RCU stalls
> 
> This series fixes the RCU stalls triggered by the x86 crypto
> modules discussed in
> https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

I've instrumented all the x86 crypto modules, including ways to
experiment with different loop sizes. Here are some results with
the hash functions.

Key:
    calls = number of kernel_fpu_begin()/end() calls made by the module
     cost = number of CPU cycles consumed by those calls (overhead)
maxcycles = number of CPU cycles between those calls in FPU context
      bpf = bytes_per_fpu loop size
      KiB = bpf expressed in KiB
   maxlen = maximum number of bytes per loop via update()
  maxlen2 = maximum number of bytes per loop via finup()

This is on a 2.2 GHz Cascade Lake CPU, where each cycle is nominally
0.45 ns.  The CPU does not support SHA-NI instructions, so those
results are missing.
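
For reference, the cycle-to-time conversions quoted in this message can be reproduced with a small helper. This is only a sketch: it assumes the CPU really runs at its nominal 2.2 GHz (no turbo or frequency scaling), and cycles_to_us() is a hypothetical name, not something from the series.

```c
#include <assert.h>

/*
 * Convert a cycle count to microseconds at a nominal clock rate
 * given in MHz (cycles per microsecond).
 */
double cycles_to_us(unsigned long long cycles, double mhz)
{
	assert(mhz > 0.0);
	return (double)cycles / mhz;
}
```

For example, cycles_to_us(2044882, 2200.0) is roughly 929 us and cycles_to_us(164510724, 2200.0) is roughly 74.8 ms.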

Here are the results from a boot with the avx2 bytes_per_fpu values set
to 0 (unlimited - original behavior).

Booting includes:
  - processing 2.3 GB of SHA-512 kernel module hashes
  - crypto self-tests
  - crypto extra self-tests (CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y)

   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    3641      177182        10230        0    0     4096        0      __ghash-pclmulqdqni          ghash_clmulni_intel
    2242      150516         1684        0    0     8112        0             crc32-pclmul                 crc32_pclmul
    1008       43800        22404        0    0     8068     8105             crc32c-intel                 crc32c_intel
    2565      179734         4286        0    0     7791     8027         crct10dif-pclmul             crct10dif_pclmul
    1603       77112         2414        0    0     8132        0          nhpoly1305-avx2              nhpoly1305_avx2
    1671       81108         9390   203776  199     8109        0          nhpoly1305-sse2              nhpoly1305_sse2
    1977      103598         5314        0    0     8112        0            poly1305-simd              poly1305_x86_64
   26744     1251756         2046        0    0     8096        0          polyval-clmulni              polyval_clmulni
   14669      682428        65462    30720   30      251     8096                 sha1-avx                   sha1_ssse3
   14669      682428        65462        0    0     7170        0                sha1-avx2                   sha1_ssse3
   14669      682428        65462    34816   34        0        0               sha1-shani                   sha1_ssse3
   14669      682428        65462    26624   26     8089     8164               sha1-ssse3                   sha1_ssse3
   26768     1230100       144902    11264   11     8130     8159               sha224-avx                 sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146              sha224-avx2                 sha256_ssse3
   26768     1230100       144902    13312   13        0        0             sha224-shani                 sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168             sha224-ssse3                 sha256_ssse3
   26768     1230100       144902    11264   11     8130     8159               sha256-avx                 sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146              sha256-avx2                 sha256_ssse3
   26768     1230100       144902    13312   13        0        0             sha256-shani                 sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168             sha256-ssse3                 sha256_ssse3
   29157     2044882    164510724    17408   17        0     8127               sha384-avx                 sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432              sha384-avx2                 sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055             sha384-ssse3                 sha512_ssse3
   29157     2044882    164510724    17408   17        0     8127               sha512-avx                 sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432              sha512-avx2                 sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055             sha512-ssse3                 sha512_ssse3
    4314      193456       124918        0    0     7672     8101                  sm3-avx               sm3_avx_x86_64

The self-tests only test small data sets (even the extra tests
limit themselves to PAGE_SIZE * 2) so only the sha512_ssse3
module was stressed with large requests.

The cost of the kernel_fpu_begin()/end() calls (2044882 cycles) was
929 us, and the longest time in FPU context (164510724) was 75 ms.
I think the biggest file it encounters is:
-rw-r--r--. 1 root root 48186713 Nov  1 13:14 /lib/modules/6.0.0+/kernel/fs/xfs/xfs.ko


I added tcrypt tests to exercise each driver ten times with 1 MiB data,
and that exposes all the drivers to larger requests.

bigbuf tests with no limits:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    1000      156354      1484434        0    0  1048576        0      __ghash-pclmulqdqni          ghash_clmulni_intel
    1000      150386       221710        0    0  1048576        0             crc32-pclmul                 crc32_pclmul
    1000      104890       114000        0    0  1048576        0             crc32c-intel                 crc32c_intel
    1000      169596       182904        0    0  1048576        0         crct10dif-pclmul             crct10dif_pclmul
    1000      122842       267568        0    0  1048576        0          nhpoly1305-avx2              nhpoly1305_avx2
    1000      190530       453118        0    0  1048576        0          nhpoly1305-sse2              nhpoly1305_sse2
    1000      134682       431264        0    0  1048576        0            poly1305-simd              poly1305_x86_64
    8000      387206       215922        0    0  1048576        0          polyval-clmulni              polyval_clmulni
    6000      562932      2831190        0    0  1048576        0                 sha1-avx                   sha1_ssse3
    6000      562932      2831190        0    0  1048576        0                sha1-avx2                   sha1_ssse3
    6000      562932      2831190    34816   34        0        0               sha1-shani                   sha1_ssse3
    6000      562932      2831190        0    0  1048576        0               sha1-ssse3                   sha1_ssse3
   12000     1212742      6558712        0    0  1048576        0               sha224-avx                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0              sha224-avx2                 sha256_ssse3
   12000     1212742      6558712    13312   13        0        0             sha224-shani                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0             sha224-ssse3                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0               sha256-avx                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0              sha256-avx2                 sha256_ssse3
   12000     1212742      6558712    13312   13        0        0             sha256-shani                 sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0             sha256-ssse3                 sha256_ssse3
   12006     1250296      4621038        0    0  1048576        0               sha384-avx                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416              sha384-avx2                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0             sha384-ssse3                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0               sha512-avx                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416              sha512-avx2                 sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0             sha512-ssse3                 sha512_ssse3
    2000      221468      6236756        0    0  1048576        0                  sm3-avx               sm3_avx_x86_64

Setting bpf limits based on those results narrows the maxcycles in
FPU context. I've seen results vary from 81912 cycles (37 us) to
about 102 us - not real tight, but much better than ranging up
to 75 ms.

bigbuf tests with bytes_per_fpu limits as shown:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
   21000     1002372       138558    51200   50    51200        0      __ghash-pclmulqdqni          ghash_clmulni_intel
    2000      220666       226806   646912  631   646912        0             crc32-pclmul                 crc32_pclmul
    2000      255110       105968   895232  874   895232        0             crc32c-intel                 crc32c_intel
    2000      218942       107930   626944  612   626944        0         crct10dif-pclmul             crct10dif_pclmul
    4000      208170       141356   345088  337   345088        0          nhpoly1305-avx2              nhpoly1305_avx2
    6000      285286       105072   203520  198   203520        0          nhpoly1305-sse2              nhpoly1305_sse2
    5000      368866       162262   222976  217   222976        0            poly1305-simd              poly1305_x86_64
   10000      457010       142362   402688  393   402688        0          polyval-clmulni              polyval_clmulni
  108000     6048076       160670    30720   30    30720        0                 sha1-avx                   sha1_ssse3
  108000     6048076       160670    34816   34    34816        0                sha1-avx2                   sha1_ssse3
  108000     6048076       160670    27392   26    27392        0               sha1-ssse3                   sha1_ssse3
  520000    23646576       196462    11520   11    11520        0               sha224-avx                 sha256_ssse3
  520000    23646576       196462    14080   13    14080        0              sha224-avx2                 sha256_ssse3
  520000    23646576       196462    11776   11    11776        0             sha224-ssse3                 sha256_ssse3
  520000    23646576       196462    11520   11    11520        0               sha256-avx                 sha256_ssse3
  520000    23646576       196462    14080   13    14080        0              sha256-avx2                 sha256_ssse3
  520000    23646576       196462    11776   11    11776        0             sha256-ssse3                 sha256_ssse3
  356156    18242860       226538    17152   16    17152        0               sha384-avx                 sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480              sha384-avx2                 sha512_ssse3
  356156    18242860       226538    17408   17    17408        0             sha384-ssse3                 sha512_ssse3
  356156    18242860       226538    17152   16    17152        0               sha512-avx                 sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480              sha512-avx2                 sha512_ssse3
  356156    18242860       226538    17408   17    17408        0             sha512-ssse3                 sha512_ssse3
   93000     4537164       138924    11520   11    11520        0                  sm3-avx               sm3_avx_x86_64

If I reboot with sha512-avx2 set to 20 KiB, the sha512-avx2
maxcycles can still be high (e.g., 2 ms). That's much
better than the original 75 ms, but still not in the 50 us range.

I set /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to
"performance" in .bash_profile, but that's not effective during
boot, so maybe that is the source of variability.

Example boot with 20 KiB limit:
   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2                algorithm                       module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
  161011    16232280      4049644    20480   20        0    20480              sha512-avx2                 sha512_ssse3

Limiting it to 1 KiB does reduce maxcycles to the us range, but
the cost of all the extra calls soars.
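
The trade-off can be estimated with simple arithmetic. The sketch below assumes each request is split into ceil(len / bpf) FPU sections and that every section costs a fixed number of cycles for the begin/end pair. The function names and the fixed per-section cost are assumptions for illustration, not measurements or code from the series.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* FPU sections needed for one request; bytes_per_fpu == 0 means unlimited. */
size_t fpu_sections_for(size_t len, size_t bytes_per_fpu)
{
	if (len == 0)
		return 0;
	if (bytes_per_fpu == 0)
		return 1;
	/* ceil(len / bytes_per_fpu) without risking overflow */
	return len / bytes_per_fpu + (len % bytes_per_fpu != 0);
}

/*
 * Estimated begin/end overhead in cycles for one request, assuming a
 * fixed (hypothetical) cost per section.
 */
unsigned long long fpu_overhead_cycles(size_t len, size_t bytes_per_fpu,
				       unsigned long long cycles_per_section)
{
	unsigned long long sections = fpu_sections_for(len, bytes_per_fpu);

	/* guard the multiplication against overflow */
	assert(cycles_per_section == 0 ||
	       sections <= ULLONG_MAX / cycles_per_section);
	return sections * cycles_per_section;
}
```

For a 1 MiB request, a 51200-byte limit gives 21 sections, which is consistent with the __ghash-pclmulqdqni row going from 1000 to 21000 calls, while a 1 KiB limit gives 1024 sections per request.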

So, for v3 of the series, I plan to propose values ranging from:
  - 11 to 20 KiB for sha* and sm3
  - 200 to 400 KiB for *poly*
  - 600 to 800 KiB for crc*

v3 will only cover the hash functions - skcipher and aead
have some unique challenges that we can tackle later.