Message ID | 20220813230431.2666-1-elliott@hpe.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Herbert Xu |
Series | crypto: x86/sha512 - load based on CPU features |
Robert Elliott <elliott@hpe.com> wrote:
> x86 optimized crypto modules built as modules rather than built-in
> to the kernel end up as .ko files in the filesystem, e.g., in
> /usr/lib/modules. If the filesystem itself is a module, these might
> not be available when the crypto API is initialized, resulting in
> the generic implementation being used (e.g., sha512_transform rather
> than sha512_transform_avx2).
>
> In one test case, CPU utilization in the sha512 function dropped
> from 15.34% to 7.18% after forcing loading of the optimized module.
>
> Add module aliases for this x86 optimized crypto module based on CPU
> feature bits so udev gets a chance to load them later in the boot
> process when the filesystems are all running.
>
> Signed-off-by: Robert Elliott <elliott@hpe.com>
> ---
>  arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)

Patch applied. Thanks.
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, August 19, 2022 6:05 AM
> Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features
>
> Robert Elliott <elliott@hpe.com> wrote:
> > x86 optimized crypto modules built as modules rather than built-in
> > to the kernel end up as .ko files in the filesystem, e.g., in
> > /usr/lib/modules. If the filesystem itself is a module, these might
> > not be available when the crypto API is initialized, resulting in
> > the generic implementation being used (e.g., sha512_transform rather
> > than sha512_transform_avx2).
> >
> > In one test case, CPU utilization in the sha512 function dropped
> > from 15.34% to 7.18% after forcing loading of the optimized module.
> >
> > Add module aliases for this x86 optimized crypto module based on CPU
> > feature bits so udev gets a chance to load them later in the boot
> > process when the filesystems are all running.
> >
> > Signed-off-by: Robert Elliott <elliott@hpe.com>
> > ---
> >  arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
>
> Patch applied. Thanks.

I'll post a series that applies this technique to all the other
x86 modules, if there are no problems reported with sha512.

Do other architectures have the same issue, or is something else
done to ensure that modules are loaded?

MODULE_DEVICE_TABLE() is used by three other architecture-specific
modules:

1. One arm (32-bit) module, matching on certain Arm extensions:
     arch/arm/crypto/crc32-ce-glue.c:MODULE_DEVICE_TABLE(cpu, crc32_cpu_feature);

2. One arm64 module, matching on certain Arm extensions:
     arch/arm64/crypto/ghash-ce-glue.c:MODULE_DEVICE_TABLE(cpu, ghash_cpu_feature);

3. All of the sparc modules share a global solution:
     arch/sparc/crypto/crop_devid.c:MODULE_DEVICE_TABLE(of, crypto_opcode_match);

   /* This is a dummy device table linked into all of the crypto
    * opcode drivers.  It serves to trigger the module autoloading
    * mechanisms in userspace which scan the OF device tree and
    * load any modules which have device table entries that
    * match OF device nodes.
    */

   Each module .c file includes that .c file:
     aes_glue.c:#include "crop_devid.c"
     camellia_glue.c:#include "crop_devid.c"
     crc32c_glue.c:#include "crop_devid.c"
     des_glue.c:#include "crop_devid.c"
     md5_glue.c:#include "crop_devid.c"
     sha1_glue.c:#include "crop_devid.c"
     sha256_glue.c:#include "crop_devid.c"
     sha512_glue.c:#include "crop_devid.c"
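For illustration, the arm-style table mentioned above boils down to
roughly this (a sketch, not a verbatim copy of crc32-ce-glue.c; the
exact feature entries there may differ):

	#include <linux/cpufeature.h>
	#include <linux/module.h>

	/* Sketch: udev compares these entries against the CPU's
	 * exported feature modalias and autoloads the module when a
	 * listed feature (illustrative names here) is present.
	 */
	static const struct cpu_feature crc32_cpu_feature[] = {
		{ cpu_feature(CRC32) },
		{ cpu_feature(PMULL) },
		{ }
	};
	MODULE_DEVICE_TABLE(cpu, crc32_cpu_feature);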
> -----Original Message-----
> From: Elliott, Robert (Servers) <elliott@hpe.com>
> Sent: Friday, August 19, 2022 5:37 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Subject: RE: [PATCH] crypto: x86/sha512 - load based on CPU features
>
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Friday, August 19, 2022 6:05 AM
> > Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features
> >
> > Robert Elliott <elliott@hpe.com> wrote:
> > > x86 optimized crypto modules built as modules rather than built-in
> > > to the kernel end up as .ko files in the filesystem, e.g., in
> > > /usr/lib/modules. If the filesystem itself is a module, these might
> > > not be available when the crypto API is initialized, resulting in
> > > the generic implementation being used (e.g., sha512_transform rather
> > > than sha512_transform_avx2).
> > >
> > > In one test case, CPU utilization in the sha512 function dropped
> > > from 15.34% to 7.18% after forcing loading of the optimized module.
> > >
> > > Add module aliases for this x86 optimized crypto module based on CPU
> > > feature bits so udev gets a chance to load them later in the boot
> > > process when the filesystems are all running.
> > >
> > > Signed-off-by: Robert Elliott <elliott@hpe.com>
> > > ---
> > >  arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> >
> > Patch applied. Thanks.
>
> I'll post a series that applies this technique to all the other
> x86 modules, if there are no problems reported with sha512.

Suggestion: please revert the sha512-x86 patch for a while. I've been
running with this technique applied to all x86 modules for a few weeks
and noticed a potential problem.

I've seen several cases of "rcu_preempt detected expedited stalls"
with stack dumps pointing to the x86 crypto functions:

  rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 12-... } 22 jiffies s: 277 root: 0x1/.

(It doesn't always happen; I haven't found a definitive recipe to
trigger the problem yet.)

Three instances occurred during boot, with the stack trace pointing to
the sha512-x86 function (my system is set to use SHA-512 for module
signing, so it's using that algorithm a lot):

  module_sig_check/mod_verify_sig/
  pkcs7_verify/pkcs7_digest/
  sha512_finup/sha512_base_do_update

I also triggered three of them by running "modprobe tcrypt mode=n"
looping from 0 to 1000, all in the aes-x86 module:

  tcrypt: testing encryption speed of sync skcipher cts(cbc(aes)) using cts(cbc(aes-aesni))
  tcrypt: testing encryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)
  tcrypt: testing decryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)

In that case the stack traces pointed to:

  do_test/prepare_alloc_pages/test_skcipher_speed.cold

Theory: I noticed the proposed aria-x86 module has these calls
surrounding its main data processing functions:

  kernel_fpu_begin();
  kernel_fpu_end();

The sha-x86 and aes-x86 implementations use those as well. For example,
sha512_update() is structured as:

  kernel_fpu_begin();
  sha512_base_do_update(desc, data, len, sha512_xform);
  kernel_fpu_end();

and aesni_encrypt() is structured as:

  kernel_fpu_begin();
  aesni_enc(ctx, dst, src);
  kernel_fpu_end();

I noticed that kernel_fpu_begin() includes this:

  preempt_disable();

and kernel_fpu_end() has:

  preempt_enable();

Is that sufficient to cause the RCU problems?
Although preempt_disable isn't disabling interrupts, it's blocking the
scheduler from using the CPUs (a few of the arm AES functions mess with
interrupts, but none of the others seem to do so). So, I suspect that
could lead to RCU problems and soft lockups, but not hard lockups.

Do these functions need to break up their processing into smaller
chunks (e.g., a few megabytes), calling kernel_fpu_end() periodically
to allow the scheduler to take over the CPUs if needed? If so, what
chunk size would be appropriate?

Example stack trace:
[   29.729811] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 12-... } 22 jiffies s: 277 root: 0x1/.
[   29.729815] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-13:0x1000/.
[   29.729818] Task dump for CPU 12:
[   29.729819] task:modprobe        state:R  running task     stack:    0 pid: 1703 ppid:  1702 flags:0x0000400a
[   29.729822] Call Trace:
[   29.729823]  <TASK>
[   29.729825]  ? sysvec_apic_timer_interrupt+0xab/0xc0
[   29.729830]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[   29.729835]  ? loop1+0x3f2/0x98f [sha512_ssse3]
[   29.729839]  ? crypto_create_tfm_node+0x33/0x100
[   29.729843]  ? nowork+0x6/0x6 [sha512_ssse3]
[   29.729844]  ? sha512_base_do_update.isra.0+0xeb/0x160 [sha512_ssse3]
[   29.729847]  ? nowork+0x6/0x6 [sha512_ssse3]
[   29.729849]  ? sha512_finup.part.0+0x1de/0x230 [sha512_ssse3]
[   29.729851]  ? pkcs7_digest+0xaf/0x1f0
[   29.729854]  ? pkcs7_verify+0x61/0x540
[   29.729856]  ? verify_pkcs7_message_sig+0x4a/0xe0
[   29.729859]  ? pkcs7_parse_message+0x174/0x1b0
[   29.729861]  ? verify_pkcs7_signature+0x4c/0x80
[   29.729862]  ? mod_verify_sig+0x74/0x90
[   29.729867]  ? module_sig_check+0x87/0xd0
[   29.729868]  ? load_module+0x4e/0x1fc0
[   29.729871]  ? xfs_file_read_iter+0x70/0xe0 [xfs]
[   29.729955]  ? __kernel_read+0x118/0x290
[   29.729959]  ? ima_post_read_file+0xac/0xc0
[   29.729962]  ? kernel_read_file+0x211/0x2a0
[   29.729965]  ? __do_sys_finit_module+0x93/0xf0
[   29.729967]  ? __do_sys_finit_module+0x93/0xf0
[   29.729969]  ? do_syscall_64+0x58/0x80
[   29.729971]  ? syscall_exit_to_user_mode+0x17/0x40
[   29.729973]  ? do_syscall_64+0x67/0x80
[   29.729975]  ? exit_to_user_mode_prepare+0x18f/0x1f0
[   29.729976]  ? syscall_exit_to_user_mode+0x17/0x40
[   29.729977]  ? do_syscall_64+0x67/0x80
[   29.729979]  ? syscall_exit_to_user_mode+0x17/0x40
[   29.729980]  ? do_syscall_64+0x67/0x80
[   29.729982]  ? exc_page_fault+0x70/0x170
[   29.729983]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   29.729986]  </TASK>
[   29.753810] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { } 236 jiffies s: 281 root: 0x0/.
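For reference, the mechanism suspected above reduces to roughly this
(a conceptual sketch, not the real arch/x86 implementation, which also
saves and restores FPU register state):

	/* Sketch: kernel_fpu_begin() disables preemption, and it stays
	 * disabled until kernel_fpu_end(), so everything done between
	 * the two calls runs in one non-preemptible section.
	 */
	void kernel_fpu_begin(void)
	{
		preempt_disable();	/* scheduler locked out from here... */
		/* ... prepare the FPU for kernel use ... */
	}

	void kernel_fpu_end(void)
	{
		/* ... finish kernel FPU use ... */
		preempt_enable();	/* ...until here */
	}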
On Fri, Aug 26, 2022 at 02:40:58AM +0000, Elliott, Robert (Servers) wrote:
>
> Suggestion: please revert the sha512-x86 patch for a while.

This problem would have existed anyway if the module was built
into the kernel.

> Do these functions need to break up their processing into smaller chunks
> (e.g., a few Megabytes), calling kernel_fpu_end() periodically to
> allow the scheduler to take over the CPUs if needed? If so, what
> chunk size would be appropriate?

Yes these should be limited to 4K each. It appears that all the
sha* helpers in arch/x86/crypto have the same problem.

Cheers,
On Fri, 2022-08-26 at 10:52 +0800, Herbert Xu wrote:
> On Fri, Aug 26, 2022 at 02:40:58AM +0000, Elliott, Robert (Servers) wrote:
> > Suggestion: please revert the sha512-x86 patch for a while.
>
> This problem would have existed anyway if the module was built
> into the kernel.
>
> > Do these functions need to break up their processing into smaller chunks
> > (e.g., a few Megabytes), calling kernel_fpu_end() periodically to
> > allow the scheduler to take over the CPUs if needed? If so, what
> > chunk size would be appropriate?
>
> Yes these should be limited to 4K each. It appears that all the
> sha* helpers in arch/x86/crypto have the same problem.
>

I think limiting each chunk to 4K is reasonable to prevent the CPU from
being hogged by the crypto code for too long.

Tim
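A minimal sketch of what that chunking could look like in the sha512
glue code (illustrative only: FPU_CHUNK_BYTES and the function name
are hypothetical; sha512_base_do_update() buffers partial blocks, so
splitting at arbitrary byte boundaries is safe):

	#include <asm/fpu/api.h>	/* kernel_fpu_begin/end() */
	#include <crypto/sha512_base.h>
	#include <linux/minmax.h>

	/* Hypothetical per-FPU-section limit, per the 4K suggestion. */
	#define FPU_CHUNK_BYTES 4096

	static int sha512_update_chunked(struct shash_desc *desc, const u8 *data,
					 unsigned int len, sha512_block_fn *block_fn)
	{
		/* Drop out of the FPU section after each chunk so the
		 * scheduler and RCU get a chance to use this CPU.
		 */
		while (len) {
			unsigned int chunk = min_t(unsigned int, len, FPU_CHUNK_BYTES);

			kernel_fpu_begin();
			sha512_base_do_update(desc, data, chunk, block_fn);
			kernel_fpu_end();

			data += chunk;
			len -= chunk;
		}
		return 0;
	}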
diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c
index 30e70f4fe2f7..6d3b85e53d0e 100644
--- a/arch/x86/crypto/sha512_ssse3_glue.c
+++ b/arch/x86/crypto/sha512_ssse3_glue.c
@@ -36,6 +36,7 @@
 #include <linux/types.h>
 #include <crypto/sha2.h>
 #include <crypto/sha512_base.h>
+#include <asm/cpu_device_id.h>
 #include <asm/simd.h>
 
 asmlinkage void sha512_transform_ssse3(struct sha512_state *state,
@@ -284,6 +285,13 @@ static int register_sha512_avx2(void)
 				ARRAY_SIZE(sha512_avx2_algs));
 	return 0;
 }
+static const struct x86_cpu_id module_cpu_ids[] = {
+	X86_MATCH_FEATURE(X86_FEATURE_AVX2, NULL),
+	X86_MATCH_FEATURE(X86_FEATURE_AVX, NULL),
+	X86_MATCH_FEATURE(X86_FEATURE_SSSE3, NULL),
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, module_cpu_ids);
 
 static void unregister_sha512_avx2(void)
 {
@@ -294,6 +302,8 @@ static void unregister_sha512_avx2(void)
 
 static int __init sha512_ssse3_mod_init(void)
 {
+	if (!x86_match_cpu(module_cpu_ids))
+		return -ENODEV;
 	if (register_sha512_ssse3())
 		goto fail;
x86 optimized crypto modules built as modules rather than built-in
to the kernel end up as .ko files in the filesystem, e.g., in
/usr/lib/modules. If the filesystem itself is a module, these might
not be available when the crypto API is initialized, resulting in
the generic implementation being used (e.g., sha512_transform rather
than sha512_transform_avx2).

In one test case, CPU utilization in the sha512 function dropped
from 15.34% to 7.18% after forcing loading of the optimized module.

Add module aliases for this x86 optimized crypto module based on CPU
feature bits so udev gets a chance to load them later in the boot
process when the filesystems are all running.

Signed-off-by: Robert Elliott <elliott@hpe.com>
---
 arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
 1 file changed, 10 insertions(+)