Message ID | 20221012215931.3896-1-elliott@hpe.com (mailing list archive) |
---|---|
Series | crypto: x86 - fix RCU stalls |
> -----Original Message-----
> From: Elliott, Robert (Servers) <elliott@hpe.com>
> Sent: Wednesday, October 12, 2022 4:59 PM
> To: herbert@gondor.apana.org.au; davem@davemloft.net;
> tim.c.chen@linux.intel.com; ap420073@gmail.com; ardb@kernel.org;
> linux-crypto@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: Elliott, Robert (Servers) <elliott@hpe.com>
> Subject: [PATCH v2 00/19] crypto: x86 - fix RCU stalls
>
> This series fixes the RCU stalls triggered by the x86 crypto
> modules discussed in
> https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

I've instrumented all the x86 crypto modules, including ways to experiment with different loop sizes. Here are some results with the hash functions.

Key:
  calls     = number of kernel_fpu_begin()/end() calls made by the module
  cost      = number of CPU cycles consumed by those calls (overhead)
  maxcycles = number of CPU cycles between those calls in FPU context
  bpf       = bytes_per_fpu loop size
  KiB       = bpf expressed in KiB
  maxlen    = maximum number of bytes per loop via update()
  maxlen2   = maximum number of bytes per loop via finup()

This is on a 2.2 GHz Cascade Lake CPU, where each cycle is nominally 0.45 ns. The CPU does not support SHA-NI instructions, so those results are missing.

Here are the results from a boot with the avx2 bytes_per_fpu values set to 0 (unlimited - original behavior).
Booting includes:
  - processing 2.3 GB of SHA-512 kernel module hashes
  - crypto self-tests
  - crypto extra self-tests (CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y)

   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    3641      177182        10230        0    0     4096        0 __ghash-pclmulqdqni      ghash_clmulni_intel
    2242      150516         1684        0    0     8112        0 crc32-pclmul             crc32_pclmul
    1008       43800        22404        0    0     8068     8105 crc32c-intel             crc32c_intel
    2565      179734         4286        0    0     7791     8027 crct10dif-pclmul         crct10dif_pclmul
    1603       77112         2414        0    0     8132        0 nhpoly1305-avx2          nhpoly1305_avx2
    1671       81108         9390   203776  199     8109        0 nhpoly1305-sse2          nhpoly1305_sse2
    1977      103598         5314        0    0     8112        0 poly1305-simd            poly1305_x86_64
   26744     1251756         2046        0    0     8096        0 polyval-clmulni          polyval_clmulni
   14669      682428        65462    30720   30      251     8096 sha1-avx                 sha1_ssse3
   14669      682428        65462        0    0     7170        0 sha1-avx2                sha1_ssse3
   14669      682428        65462    34816   34        0        0 sha1-shani               sha1_ssse3
   14669      682428        65462    26624   26     8089     8164 sha1-ssse3               sha1_ssse3
   26768     1230100       144902    11264   11     8130     8159 sha224-avx               sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146 sha224-avx2              sha256_ssse3
   26768     1230100       144902    13312   13        0        0 sha224-shani             sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168 sha224-ssse3             sha256_ssse3
   26768     1230100       144902    11264   11     8130     8159 sha256-avx               sha256_ssse3
   26768     1230100       144902    13312   13     8078     8146 sha256-avx2              sha256_ssse3
   26768     1230100       144902    13312   13        0        0 sha256-shani             sha256_ssse3
   26768     1230100       144902    11264   11     8068     8168 sha256-ssse3             sha256_ssse3
   29157     2044882    164510724    17408   17        0     8127 sha384-avx               sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432 sha384-avx2              sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055 sha384-ssse3             sha512_ssse3
   29157     2044882    164510724    17408   17        0     8127 sha512-avx               sha512_ssse3
   29157     2044882    164510724        0    0        0 48175432 sha512-avx2              sha512_ssse3
   29157     2044882    164510724    17408   17        0     8055 sha512-ssse3             sha512_ssse3
    4314      193456       124918        0    0     7672     8101 sm3-avx                  sm3_avx_x86_64

The self-tests only test small data sets (even the extra tests limit themselves to PAGE_SIZE * 2), so only the sha512_ssse3 module was stressed with large requests.

The cost of the kernel_fpu_begin()/end() calls (2044882 cycles) was 929 us, and the longest time in FPU context (164510724 cycles) was 75 ms.

I think the biggest file it encounters is:
-rw-r--r--. 1 root root 48186713 Nov 1 13:14 /lib/modules/6.0.0+/kernel/fs/xfs/xfs.ko

I added tcrypt tests to exercise each driver ten times with 1 MiB data, and that exposes all the drivers to larger requests.

bigbuf tests with no limits:

   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
    1000      156354      1484434        0    0  1048576        0 __ghash-pclmulqdqni      ghash_clmulni_intel
    1000      150386       221710        0    0  1048576        0 crc32-pclmul             crc32_pclmul
    1000      104890       114000        0    0  1048576        0 crc32c-intel             crc32c_intel
    1000      169596       182904        0    0  1048576        0 crct10dif-pclmul         crct10dif_pclmul
    1000      122842       267568        0    0  1048576        0 nhpoly1305-avx2          nhpoly1305_avx2
    1000      190530       453118        0    0  1048576        0 nhpoly1305-sse2          nhpoly1305_sse2
    1000      134682       431264        0    0  1048576        0 poly1305-simd            poly1305_x86_64
    8000      387206       215922        0    0  1048576        0 polyval-clmulni          polyval_clmulni
    6000      562932      2831190        0    0  1048576        0 sha1-avx                 sha1_ssse3
    6000      562932      2831190        0    0  1048576        0 sha1-avx2                sha1_ssse3
    6000      562932      2831190    34816   34        0        0 sha1-shani               sha1_ssse3
    6000      562932      2831190        0    0  1048576        0 sha1-ssse3               sha1_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha224-avx               sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha224-avx2              sha256_ssse3
   12000     1212742      6558712    13312   13        0        0 sha224-shani             sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha224-ssse3             sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha256-avx               sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha256-avx2              sha256_ssse3
   12000     1212742      6558712    13312   13        0        0 sha256-shani             sha256_ssse3
   12000     1212742      6558712        0    0  1048576        0 sha256-ssse3             sha256_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha384-avx               sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416 sha384-avx2              sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha384-ssse3             sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha512-avx               sha512_ssse3
   12006     1250296      4621038        0    0  1048576  1037416 sha512-avx2              sha512_ssse3
   12006     1250296      4621038        0    0  1048576        0 sha512-ssse3             sha512_ssse3
    2000      221468      6236756        0    0  1048576        0 sm3-avx                  sm3_avx_x86_64

Setting bpf limits based on those results narrows the maxcycles in FPU context. I've seen results vary from 81912 (37 us) to (102 us) - not real tight, but much better than ranging up to 75 ms.

bigbuf tests with bytes_per_fpu limits as shown:

   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
   21000     1002372       138558    51200   50    51200        0 __ghash-pclmulqdqni      ghash_clmulni_intel
    2000      220666       226806   646912  631   646912        0 crc32-pclmul             crc32_pclmul
    2000      255110       105968   895232  874   895232        0 crc32c-intel             crc32c_intel
    2000      218942       107930   626944  612   626944        0 crct10dif-pclmul         crct10dif_pclmul
    4000      208170       141356   345088  337   345088        0 nhpoly1305-avx2          nhpoly1305_avx2
    6000      285286       105072   203520  198   203520        0 nhpoly1305-sse2          nhpoly1305_sse2
    5000      368866       162262   222976  217   222976        0 poly1305-simd            poly1305_x86_64
   10000      457010       142362   402688  393   402688        0 polyval-clmulni          polyval_clmulni
  108000     6048076       160670    30720   30    30720        0 sha1-avx                 sha1_ssse3
  108000     6048076       160670    34816   34    34816        0 sha1-avx2                sha1_ssse3
  108000     6048076       160670    27392   26    27392        0 sha1-ssse3               sha1_ssse3
  520000    23646576       196462    11520   11    11520        0 sha224-avx               sha256_ssse3
  520000    23646576       196462    14080   13    14080        0 sha224-avx2              sha256_ssse3
  520000    23646576       196462    11776   11    11776        0 sha224-ssse3             sha256_ssse3
  520000    23646576       196462    11520   11    11520        0 sha256-avx               sha256_ssse3
  520000    23646576       196462    14080   13    14080        0 sha256-avx2              sha256_ssse3
  520000    23646576       196462    11776   11    11776        0 sha256-ssse3             sha256_ssse3
  356156    18242860       226538    17152   16    17152        0 sha384-avx               sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480 sha384-avx2              sha512_ssse3
  356156    18242860       226538    17408   17    17408        0 sha384-ssse3             sha512_ssse3
  356156    18242860       226538    17152   16    17152        0 sha512-avx               sha512_ssse3
  356156    18242860       226538    20480   20    20480    20480 sha512-avx2              sha512_ssse3
  356156    18242860       226538    17408   17    17408        0 sha512-ssse3             sha512_ssse3
   93000     4537164       138924    11520   11    11520        0 sm3-avx                  sm3_avx_x86_64

If I reboot with sha512-avx2 set to 20 KiB, the sha512-avx2 maxcycles can still correspond to a long time (e.g., 2 ms). That's much better than the original 75 ms, but still not in the 50 us range.

I set /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor to "performance" in .bash_profile, but that's not effective during boot, so maybe that is the source of variability.

Example boot with 20 KiB limit:

   calls        cost    maxcycles      bpf  KiB   maxlen  maxlen2 algorithm                module
======== =========== ============ ======== ==== ======== ======== ======================== ============================
  161011    16232280      4049644    20480   20        0    20480 sha512-avx2              sha512_ssse3

Limiting it to 1 KiB does reduce maxcycles to the us range, but the cost of all the extra calls soars.

So, for v3 of the series, I plan to propose values ranging from:
  - 11 to 20 KiB for sha* and sm3
  - 200 to 400 KiB for *poly*
  - 600 to 800 KiB for crc*

v3 will only cover the hash functions - skcipher and aead have some unique challenges that we can tackle later.
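The trade-off described above (shorter FPU sections versus more kernel_fpu_begin()/end() calls) can be sketched with a little arithmetic. The following is only an illustrative userspace sketch, not code from the series: the helper names `cycles_to_us` and `fpu_sections` are hypothetical, the 2.2 GHz clock is the nominal figure quoted above, and the section count assumes a single request split evenly at bytes_per_fpu boundaries, which ignores block-size rounding and per-call behavior in the real drivers.

```python
# Illustrative sketch only; names and the even-split model are assumptions.

CPU_HZ = 2.2e9  # nominal clock of the test CPU quoted in the message


def cycles_to_us(cycles, hz=CPU_HZ):
    """Convert a cycle count to microseconds at a nominal clock rate."""
    return cycles / hz * 1e6


def fpu_sections(length, bytes_per_fpu):
    """Estimate how many FPU sections cover `length` bytes of one request.

    bytes_per_fpu == 0 is treated as "unlimited" (a single section),
    matching the meaning of bpf = 0 in the tables.
    """
    if length == 0:
        return 0
    if bytes_per_fpu == 0:
        return 1
    # Integer ceiling division: the last section may be shorter.
    return (length + bytes_per_fpu - 1) // bytes_per_fpu


if __name__ == "__main__":
    # Figures taken from the boot results with no limits.
    print("call overhead: %.0f us" % cycles_to_us(2044882))
    print("longest FPU section: %.1f ms" % (cycles_to_us(164510724) / 1000))

    # One 1 MiB request under different limits.
    for bpf in (0, 20480, 1024):
        print("bpf=%6d -> %4d sections" % (bpf, fpu_sections(1048576, bpf)))
```

Under these assumptions a 1 MiB request needs one section with no limit, 52 sections at 20 KiB, and 1024 sections at 1 KiB, which is the direction of the cost increase reported for small limits.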