
[v2,05/19] crypto: x86/crc - limit FPU preemption

Message ID: 20221012215931.3896-6-elliott@hpe.com (mailing list archive)
State: Changes Requested
Delegated to: Herbert Xu
Series: crypto: x86 - fix RCU stalls

Commit Message

Elliott, Robert (Servers) Oct. 12, 2022, 9:59 p.m. UTC
As done by the ECB and CBC helpers in arch/x86/crypto/ecb_cbc_helpers.h,
limit the number of bytes processed between kernel_fpu_begin() and
kernel_fpu_end() calls.

Those functions call preempt_disable() and preempt_enable(), so
the CPU core is unavailable for scheduling while running, leading to:
    rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: ...

Fixes: 78c37d191dd6 ("crypto: crc32 - add crc32 pclmulqdq implementation and wrappers for table implementation")
Fixes: 6a8ce1ef3940 ("crypto: crc32c - Optimize CRC32C calculation with PCLMULQDQ instruction")
Fixes: 0b95a7f85718 ("crypto: crct10dif - Glue code to cast accelerated CRCT10DIF assembly as a crypto transform")
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Robert Elliott <elliott@hpe.com>
---
 arch/x86/crypto/crc32-pclmul_asm.S      |  6 ++--
 arch/x86/crypto/crc32-pclmul_glue.c     | 19 ++++++++----
 arch/x86/crypto/crc32c-intel_glue.c     | 29 ++++++++++++++----
 arch/x86/crypto/crct10dif-pclmul_glue.c | 39 ++++++++++++++++++++-----
 4 files changed, 71 insertions(+), 22 deletions(-)
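The chunking pattern this patch applies to all four files can be sketched in plain user-space C. Here `kernel_fpu_begin`/`kernel_fpu_end` are empty stubs and `crc_step` is a hypothetical stand-in for the PCLMULQDQ routines; this is a minimal illustration of the technique, not the kernel code itself:

```c
#include <stddef.h>
#include <stdint.h>

#define FPU_BYTES 4096U /* max bytes between begin/end, as in the patch */

/* Stubs standing in for the kernel's FPU-context helpers. */
static void kernel_fpu_begin(void) { }
static void kernel_fpu_end(void) { }

/* Stand-in for crc_pcl()/crc32_pclmul_le_16(): any incremental CRC works. */
static uint32_t crc_step(const uint8_t *data, unsigned int len, uint32_t crc)
{
	while (len--)
		crc = (crc >> 1) ^ (-(int32_t)((crc ^ *data++) & 1) & 0xEDB88320);
	return crc;
}

/*
 * Chunked form: same result as one big call, but the "FPU" is held for at
 * most FPU_BYTES of input at a time, so preemption is re-enabled between
 * chunks and the scheduler/RCU can make progress.
 */
static uint32_t crc_chunked(const uint8_t *data, unsigned int len, uint32_t crc)
{
	do {
		unsigned int chunk = len < FPU_BYTES ? len : FPU_BYTES;

		kernel_fpu_begin();
		crc = crc_step(data, chunk, crc);
		kernel_fpu_end();

		len -= chunk;
		data += chunk;
	} while (len);
	return crc;
}
```

Because CRC state threads cleanly from one call to the next, splitting the buffer cannot change the final value; the only cost is the extra begin/end pairs.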

Comments

Herbert Xu Oct. 13, 2022, 2 a.m. UTC | #1
On Wed, Oct 12, 2022 at 04:59:17PM -0500, Robert Elliott wrote:
>
> @@ -170,9 +179,17 @@ static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
>  				u8 *out)
>  {
>  	if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
> -		kernel_fpu_begin();
> -		*(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
> -		kernel_fpu_end();
> +		do {
> +			unsigned int chunk = min(len, FPU_BYTES);
> +
> +			kernel_fpu_begin();
> +			*crcp = crc_pcl(data, chunk, *crcp);

How about storing the intermediate result in a local variable
instead of overwriting *crcp?

Thanks,
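Herbert's suggestion might look like the following. This is a hypothetical sketch with user-space stubs (`crc_pcl` here is a plain bitwise CRC32C stand-in, and the little-endian store is a `memcpy`), not a tested kernel patch:

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;
typedef uint8_t u8;

#define FPU_BYTES 4096U
#define min(a, b) ((a) < (b) ? (a) : (b))

static void kernel_fpu_begin(void) { }
static void kernel_fpu_end(void) { }

/* Stand-in for the PCLMULQDQ-accelerated crc_pcl() (CRC32C, poly 0x82F63B78). */
static u32 crc_pcl(const u8 *data, int len, u32 crc)
{
	while (len--)
		crc = (crc >> 1) ^ (-(int32_t)((crc ^ *data++) & 1) & 0x82F63B78);
	return crc;
}

/* The suggested shape: accumulate in a local, leave *crcp untouched. */
static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data,
				    unsigned int len, u8 *out)
{
	u32 crc = *crcp;	/* local intermediate instead of writing *crcp */

	do {
		unsigned int chunk = min(len, FPU_BYTES);

		kernel_fpu_begin();
		crc = crc_pcl(data, chunk, crc);
		kernel_fpu_end();

		len -= chunk;
		data += chunk;
	} while (len);
	crc = ~crc;
	memcpy(out, &crc, sizeof(crc));	/* stands in for ~cpu_to_le32() store */
	return 0;
}
```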
Elliott, Robert (Servers) Oct. 13, 2022, 10:34 p.m. UTC | #2
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Wednesday, October 12, 2022 9:00 PM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Cc: davem@davemloft.net; tim.c.chen@linux.intel.com; ap420073@gmail.com;
> ardb@kernel.org; linux-crypto@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 05/19] crypto: x86/crc - limit FPU preemption
> 
> On Wed, Oct 12, 2022 at 04:59:17PM -0500, Robert Elliott wrote:
> >
> > @@ -170,9 +179,17 @@ static int __crc32c_pcl_intel_finup(u32 *crcp, const u8
> *data, unsigned int len,
> >  				u8 *out)
> >  {
> >  	if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
> > -		kernel_fpu_begin();
> > -		*(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
> > -		kernel_fpu_end();
> > +		do {
> > +			unsigned int chunk = min(len, FPU_BYTES);
> > +
> > +			kernel_fpu_begin();
> > +			*crcp = crc_pcl(data, chunk, *crcp);
> 
> How about storing the intermediate result in a local variable
> instead of overwriting *crcp?
> 
> Thanks,

The _update function does so, and it's not marked const here,
so it seemed prudent to keep it up to date.

Do the callers understand it's no longer valid after finup, or
is there any case they might treat finup like an update and try
again?
David Laight Oct. 14, 2022, 4:02 a.m. UTC | #3
From: Robert Elliott
> Sent: 12 October 2022 22:59
> 
> As done by the ECB and CBC helpers in arch/x86/crypto/ecb_cbc_helpers.h,
> limit the number of bytes processed between kernel_fpu_begin() and
> kernel_fpu_end() calls.
> 
> Those functions call preempt_disable() and preempt_enable(), so
> the CPU core is unavailable for scheduling while running, leading to:
>     rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: ...

How long were the buffers being processed when the rcu stall was reported?
It looks like you are adding kernel_fpu_end(); kernel_fpu_begin()
pairs every 4096 bytes.
I'd guess the crc instruction runs at 4 bytes/clock
(or at least gets somewhere near that).
So you are talking of few thousand clocks at most.
A PCI read from a device can easily take much longer than that.
So I'm surprised you need to do such small buffers to avoid
rcu stalls.

The kernel_fpu_end(); kernel_fpu_begin() pair will also cost.
(Maybe not as much as the first kernel_fpu_begin() ?)

Some performance figures might be enlightening.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
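Plugging in David's assumed throughput gives a rough sense of the scale involved. This is a back-of-envelope model only (assumed 4 bytes/clock and the patch's 4096-byte chunk, not a measurement):

```c
/* Rough cost model for one FPU_BYTES chunk (assumptions, not measurements). */
static unsigned int clocks_per_chunk(unsigned int chunk_bytes,
				     unsigned int bytes_per_clock)
{
	return chunk_bytes / bytes_per_clock;
}

/*
 * With the patch's 4096-byte chunk and the guessed 4 bytes/clock, that is
 * 4096 / 4 = 1024 clocks of CRC work between kernel_fpu_begin() and
 * kernel_fpu_end() -- roughly a third of a microsecond at 3 GHz -- before
 * counting the begin/end overhead itself, which is why David asks whether
 * such small chunks are really needed to avoid RCU stalls.
 */
```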
Yujie Liu Oct. 24, 2022, 2:03 a.m. UTC | #4
Greetings,

FYI, we noticed ltp.fsopen01.fail due to commit (built with gcc-11):

commit: 0c664cbc906012f02c5bf128cf2dff854cca65c7 ("[PATCH v2 05/19] crypto: x86/crc - limit FPU preemption")
url: https://github.com/intel-lab-lkp/linux/commits/Robert-Elliott/crypto-tcrypt-test-crc32/20221013-065919
base: https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git master
patch link: https://lore.kernel.org/linux-crypto/20221012215931.3896-6-elliott@hpe.com
patch subject: [PATCH v2 05/19] crypto: x86/crc - limit FPU preemption

in testcase: ltp
version: ltp-x86_64-14c1f76-1_20221009
with following parameters:

	disk: 1HDD
	fs: ext4
	test: syscalls-07

test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.
test-url: http://linux-test-project.github.io/

on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory

this commit caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):


<<<test_start>>>
tag=fsopen01 stime=1666383665
cmdline="fsopen01"
contacts=""
analysis=exit
<<<test_output>>>
...
tst_test.c:1599: TINFO: === Testing on btrfs ===
tst_test.c:1064: TINFO: Formatting /dev/loop0 with btrfs opts='' extra opts=''
fsopen01.c:42: TFAIL: fsconfig(FSCONFIG_CMD_CREATE) failed: EINVAL (22)
fsopen01.c:42: TFAIL: fsconfig(FSCONFIG_CMD_CREATE) failed: EINVAL (22)
...

Summary:
passed   12
failed   2
broken   0
skipped  0
warnings 0
<<<execution_status>>>
initiation_status="ok"
duration=3 termination_type=exited termination_id=1 corefile=no
cutime=3 cstime=47
<<<test_end>>>


[  152.413919][ T4912] BTRFS: device fsid 05e51863-81c3-4c32-9e24-3d49d849f724 devid 1 transid 6 /dev/loop0 scanned by mkfs.btrfs (4912)
[  152.429076][ T4851] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
[  152.438743][ T4851] BTRFS info (device loop0): using free space tree
[  152.449103][    T8] BTRFS warning (device loop0): checksum verify failed on logical 22036480 mirror 1 wanted 0xc4a1f4f3 found 0x76f09a51 level 0
[  152.463363][   T35] BTRFS warning (device loop0): checksum verify failed on logical 22036480 mirror 2 wanted 0xc4a1f4f3 found 0x76f09a51 level 0
[  152.477446][ T4851] BTRFS error (device loop0): failed to read chunk root
[  152.486164][ T4851] BTRFS error (device loop0): open_ctree failed


If you fix the issue, kindly add the following tags:
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Link: https://lore.kernel.org/r/202210240920.a0dfb6a3-yujie.liu@intel.com


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if you come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.

Patch

diff --git a/arch/x86/crypto/crc32-pclmul_asm.S b/arch/x86/crypto/crc32-pclmul_asm.S
index ca53e96996ac..9abd861636c3 100644
--- a/arch/x86/crypto/crc32-pclmul_asm.S
+++ b/arch/x86/crypto/crc32-pclmul_asm.S
@@ -72,15 +72,15 @@ 
 .text
 /**
  *      Calculate crc32
- *      BUF - buffer (16 bytes aligned)
- *      LEN - sizeof buffer (16 bytes aligned), LEN should be grater than 63
+ *      BUF - buffer - must be 16 bytes aligned
+ *      LEN - sizeof buffer - must be multiple of 16 bytes and greater than 63
  *      CRC - initial crc32
  *      return %eax crc32
  *      uint crc32_pclmul_le_16(unsigned char const *buffer,
  *	                     size_t len, uint crc32)
  */
 
-SYM_FUNC_START(crc32_pclmul_le_16) /* buffer and buffer size are 16 bytes aligned */
+SYM_FUNC_START(crc32_pclmul_le_16)
 	movdqa  (BUF), %xmm1
 	movdqa  0x10(BUF), %xmm2
 	movdqa  0x20(BUF), %xmm3
diff --git a/arch/x86/crypto/crc32-pclmul_glue.c b/arch/x86/crypto/crc32-pclmul_glue.c
index 98cf3b4e4c9f..38539c6edfe5 100644
--- a/arch/x86/crypto/crc32-pclmul_glue.c
+++ b/arch/x86/crypto/crc32-pclmul_glue.c
@@ -46,6 +46,8 @@ 
 #define SCALE_F			16L	/* size of xmm register */
 #define SCALE_F_MASK		(SCALE_F - 1)
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 u32 crc32_pclmul_le_16(unsigned char const *buffer, size_t len, u32 crc32);
 
 static u32 __attribute__((pure))
@@ -70,12 +72,19 @@  static u32 __attribute__((pure))
 	iquotient = len & (~SCALE_F_MASK);
 	iremainder = len & SCALE_F_MASK;
 
-	kernel_fpu_begin();
-	crc = crc32_pclmul_le_16(p, iquotient, crc);
-	kernel_fpu_end();
+	do {
+		unsigned int chunk = min(iquotient, FPU_BYTES);
+
+		kernel_fpu_begin();
+		crc = crc32_pclmul_le_16(p, chunk, crc);
+		kernel_fpu_end();
+
+		iquotient -= chunk;
+		p += chunk;
+	} while (iquotient >= PCLMUL_MIN_LEN);
 
-	if (iremainder)
-		crc = crc32_le(crc, p + iquotient, iremainder);
+	if (iquotient || iremainder)
+		crc = crc32_le(crc, p, iquotient + iremainder);
 
 	return crc;
 }
diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
index feccb5254c7e..ece620227057 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -41,6 +41,8 @@ 
  */
 #define CRC32C_PCL_BREAKEVEN	512
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage unsigned int crc_pcl(const u8 *buffer, int len,
 				unsigned int crc_init);
 #endif /* CONFIG_X86_64 */
@@ -158,9 +160,16 @@  static int crc32c_pcl_intel_update(struct shash_desc *desc, const u8 *data,
 	 * overcome kernel fpu state save/restore overhead
 	 */
 	if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		*crcp = crc_pcl(data, len, *crcp);
-		kernel_fpu_end();
+		do {
+			unsigned int chunk = min(len, FPU_BYTES);
+
+			kernel_fpu_begin();
+			*crcp = crc_pcl(data, chunk, *crcp);
+			kernel_fpu_end();
+
+			len -= chunk;
+			data += chunk;
+		} while (len);
 	} else
 		*crcp = crc32c_intel_le_hw(*crcp, data, len);
 	return 0;
@@ -170,9 +179,17 @@  static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
 				u8 *out)
 {
 	if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		*(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
-		kernel_fpu_end();
+		do {
+			unsigned int chunk = min(len, FPU_BYTES);
+
+			kernel_fpu_begin();
+			*crcp = crc_pcl(data, chunk, *crcp);
+			kernel_fpu_end();
+
+			len -= chunk;
+			data += chunk;
+		} while (len);
+		*(__le32 *)out = ~cpu_to_le32(*crcp);
 	} else
 		*(__le32 *)out =
 			~cpu_to_le32(crc32c_intel_le_hw(*crcp, data, len));
diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5af9f4..54a537fc88ee 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -34,6 +34,10 @@ 
 #include <asm/cpu_device_id.h>
 #include <asm/simd.h>
 
+#define PCLMUL_MIN_LEN 16U /* minimum size of buffer for crc_t10dif_pcl */
+
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);
 
 struct chksum_desc_ctx {
@@ -54,10 +58,19 @@  static int chksum_update(struct shash_desc *desc, const u8 *data,
 {
 	struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
 
-	if (length >= 16 && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
-		kernel_fpu_end();
+	if (length >= PCLMUL_MIN_LEN && crypto_simd_usable()) {
+		do {
+			unsigned int chunk = min(length, FPU_BYTES);
+
+			kernel_fpu_begin();
+			ctx->crc = crc_t10dif_pcl(ctx->crc, data, chunk);
+			kernel_fpu_end();
+
+			length -= chunk;
+			data += chunk;
+		} while (length >= PCLMUL_MIN_LEN);
+		if (length)
+			ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
 	} else
 		ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
 	return 0;
@@ -73,10 +86,20 @@  static int chksum_final(struct shash_desc *desc, u8 *out)
 
 static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
 {
-	if (len >= 16 && crypto_simd_usable()) {
-		kernel_fpu_begin();
-		*(__u16 *)out = crc_t10dif_pcl(crc, data, len);
-		kernel_fpu_end();
+	if (len >= PCLMUL_MIN_LEN && crypto_simd_usable()) {
+		do {
+			unsigned int chunk = min(len, FPU_BYTES);
+
+			kernel_fpu_begin();
+			crc = crc_t10dif_pcl(crc, data, chunk);
+			kernel_fpu_end();
+
+			len -= chunk;
+			data += chunk;
+		} while (len >= PCLMUL_MIN_LEN);
+		if (len)
+			crc = crc_t10dif_generic(crc, data, len);
+		*(__u16 *)out = crc;
 	} else
 		*(__u16 *)out = crc_t10dif_generic(crc, data, len);
 	return 0;