
[v6,3/4] riscv: Add checksum library

Message ID 20230915-optimize_checksum-v6-3-14a6cf61c618@rivosinc.com (mailing list archive)
State Superseded
Series riscv: Add fine-tuned checksum functions

Checks

Context Check Description
conchuod/cover_letter success Series has a cover letter
conchuod/tree_selection success Guessed tree name to be for-next at HEAD 0bb80ecc33a8
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 5 and now 5
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 1244 this patch: 1244
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 379 this patch: 379
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 25 this patch: 25
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch warning CHECK: extern prototypes should be avoided in .h files WARNING: Avoid line continuations in quoted strings WARNING: IS_ENABLED(__LITTLE_ENDIAN) is normally used as IS_ENABLED(CONFIG___LITTLE_ENDIAN) WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: else is not generally useful after a break or return WARNING: unnecessary whitespace before a quoted newline
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Charlie Jenkins Sept. 15, 2023, 5:01 p.m. UTC
Provide 32-bit and 64-bit versions of do_csum. When compiled for 32-bit,
it will load from the buffer in groups of 32 bits, and when compiled for
64-bit, it will load in groups of 64 bits.

Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/riscv/include/asm/checksum.h |  12 +++
 arch/riscv/lib/Makefile           |   1 +
 arch/riscv/lib/csum.c             | 198 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 211 insertions(+)

Comments

David Laight Sept. 16, 2023, 9:32 a.m. UTC | #1
From: Charlie Jenkins
> Sent: 15 September 2023 18:01
> 
> Provide 32-bit and 64-bit versions of do_csum. When compiled for 32-bit,
> it will load from the buffer in groups of 32 bits, and when compiled for
> 64-bit, it will load in groups of 64 bits.
> 
...
> +	/*
> +	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> +	 * faster than doing 32-bit reads on architectures that support larger
> +	 * reads.
> +	 */
> +	while (len > 0) {
> +		csum += data;
> +		csum += csum < data;
> +		len -= sizeof(unsigned long);
> +		ptr += 1;
> +		data = *ptr;
> +	}

I think you'd be better adding the 'carry' bits in a separate
variable.
It reduces the register dependency chain length in the loop.
(Helps if the cpu can execute two instructions in one clock.)

The masked misaligned data values are max 24 bits
(if 

You'll also almost certainly remove at least one instruction
from the loop by comparing against the end address rather than
changing 'len'.

So ending up with (something like):
	end = buff + length;
	...
	while (++ptr < end) {
		csum += data;
		carry += csum < data;
		data = ptr[-1];
	}
(Although a do-while loop tends to generate better code
and gcc will pretty much always make that transformation.)
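
As a stand-alone sketch (illustrative names only, and it assumes the
alignment code leaves at least one full word to sum) that is roughly:

	/* Sketch only: carry kept separately, loop bounded by an end pointer. */
	static unsigned long csum_words(const unsigned long *ptr,
					const unsigned long *end)
	{
		unsigned long csum = 0, carry = 0;

		do {
			unsigned long data = *ptr++;

			csum += data;
			carry += csum < data;
		} while (ptr < end);

		csum += carry;
		csum += csum < carry;	/* propagate the final carry */
		return csum;		/* caller folds this down to 16 bits */
	}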

I think that is 4 instructions per word (load, add, cmp+set, add).
In principle they could be completely pipelined and all
execute (for different loop iterations) in the same clock.
(But that is pretty unlikely to happen - even x86 isn't that good.)
But taking two clocks is quite plausible.
Plus 2 instructions per loop (inc, cmp+jmp).
They might execute in parallel, but unrolling once
may be required.

...
> +	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> +	    riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
...
> +		}
> +end:
> +		return csum >> 16;
> +	}

Is it really worth doing all that to save (I think) 4 instructions?
(shift, shift, or with rotate twice).
There is much more to be gained by careful inspection
of the loop (even leaving it in C).

> +
> +#ifndef CONFIG_32BIT
> +	csum += (csum >> 32) | (csum << 32);
> +	csum >>= 32;
> +#endif
> +	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));

Use ror64() and ror32().
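
Something like this, as a sketch only (the helper name is illustrative;
ror32()/ror64() live in <linux/bitops.h>, swab32() in <linux/swab.h>):

	/* Sketch only: the same fold written with the rotate helpers. */
	static unsigned int fold_csum_long(unsigned long csum, unsigned int offset)
	{
	#ifndef CONFIG_32BIT
		csum += ror64(csum, 32);	/* fold the upper 32 bits in */
		csum >>= 32;
	#endif
		csum = (unsigned int)csum + ror32((unsigned int)csum, 16);
		if (offset & 1)			/* buff started on an odd byte */
			return (unsigned short)swab32(csum);
		return csum >> 16;
	}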

	David

> +	if (offset & 1)
> +		return (unsigned short)swab32(csum);
> +	return csum >> 16;
> +}
> 
> --
> 2.42.0

Charlie Jenkins Sept. 19, 2023, 2:58 a.m. UTC | #2
On Sat, Sep 16, 2023 at 09:32:40AM +0000, David Laight wrote:
> From: Charlie Jenkins
> > Sent: 15 September 2023 18:01
> > 
> > Provide 32-bit and 64-bit versions of do_csum. When compiled for 32-bit,
> > it will load from the buffer in groups of 32 bits, and when compiled for
> > 64-bit, it will load in groups of 64 bits.
> > 
> ...
> > +	/*
> > +	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
> > +	 * faster than doing 32-bit reads on architectures that support larger
> > +	 * reads.
> > +	 */
> > +	while (len > 0) {
> > +		csum += data;
> > +		csum += csum < data;
> > +		len -= sizeof(unsigned long);
> > +		ptr += 1;
> > +		data = *ptr;
> > +	}
> 
> I think you'd be better adding the 'carry' bits in a separate
> variable.
> It reduces the register dependency chain length in the loop.
> (Helps if the cpu can execute two instructions in one clock.)
> 
> The masked misaligned data values are max 24 bits
> (if 
> 
> You'll also almost certainly remove at least one instruction
> from the loop by comparing against the end address rather than
> changing 'len'.
> 
> So ending up with (something like):
> 	end = buff + length;
> 	...
> 	while (++ptr < end) {
> 		csum += data;
> 		carry += csum < data;
> 		data = ptr[-1];
> 	}
> (Although a do-while loop tends to generate better code
> and gcc will pretty much always make that transformation.)
> 
> I think that is 4 instructions per word (load, add, cmp+set, add).
> In principle they could be completely pipelined and all
> execute (for different loop iterations) in the same clock.
> (But that is pretty unlikely to happen - even x86 isn't that good.)
> But taking two clocks is quite plausible.
> Plus 2 instructions per loop (inc, cmp+jmp).
> They might execute in parallel, but unrolling once
> may be required.
> 
It looks like GCC actually ends up generating 7 total instructions:
ffffffff808d2acc:	97b6                	add	a5,a5,a3
ffffffff808d2ace:	00d7b533          	sltu	a0,a5,a3
ffffffff808d2ad2:	0721                	add	a4,a4,8
ffffffff808d2ad4:	86be                	mv	a3,a5
ffffffff808d2ad6:	962a                	add	a2,a2,a0
ffffffff808d2ad8:	ff873783          	ld	a5,-8(a4)
ffffffff808d2adc:	feb768e3          	bltu	a4,a1,ffffffff808d2acc <do_csum+0x34>

This mv instruction could be avoided if the registers were shuffled
around, but perhaps this way reduces some dependency chains.
> ...
> > +	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
> > +	    riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
> ...
> > +		}
> > +end:
> > +		return csum >> 16;
> > +	}
> 
> Is it really worth doing all that to save (I think) 4 instructions?
> (shift, shift, or with rotate twice).
> There is much more to be gained by careful inspection
> of the loop (even leaving it in C).
> 

The main benefit was from using rev8 to replace swab32. However, now
that I am looking at the assembly in the kernel, it does not match the
asm I get from an out-of-kernel test case, so rev8 might not be
beneficial. I am going to have to look at this more to figure out what
is happening.

> > +
> > +#ifndef CONFIG_32BIT
> > +	csum += (csum >> 32) | (csum << 32);
> > +	csum >>= 32;
> > +#endif
> > +	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
> 
> Use ror64() and ror32().
> 
> 	David
> 

Good idea.

- Charlie

> > +	if (offset & 1)
> > +		return (unsigned short)swab32(csum);
> > +	return csum >> 16;
> > +}
> > 
> > --
> > 2.42.0
> 
David Laight Sept. 19, 2023, 8 a.m. UTC | #3
...
> > So ending up with (something like):
> > 	end = buff + length;
> > 	...
> > 	while (++ptr < end) {
> > 		csum += data;
> > 		carry += csum < data;
> > 		data = ptr[-1];
> > 	}
> > (Although a do-while loop tends to generate better code
> > and gcc will pretty much always make that transformation.)
> >
> > I think that is 4 instructions per word (load, add, cmp+set, add).
> > In principle they could be completely pipelined and all
> > execute (for different loop iterations) in the same clock.
> > (But that is pretty unlikely to happen - even x86 isn't that good.)
> > But taking two clocks is quite plausible.
> > Plus 2 instructions per loop (inc, cmp+jmp).
> > They might execute in parallel, but unrolling once
> > may be required.
> >
> It looks like GCC actually ends up generating 7 total instructions:
> ffffffff808d2acc:	97b6                	add	a5,a5,a3
> ffffffff808d2ace:	00d7b533          	sltu	a0,a5,a3
> ffffffff808d2ad2:	0721                	add	a4,a4,8
> ffffffff808d2ad4:	86be                	mv	a3,a5
> ffffffff808d2ad6:	962a                	add	a2,a2,a0
> ffffffff808d2ad8:	ff873783          	ld	a5,-8(a4)
> ffffffff808d2adc:	feb768e3          	bltu	a4,a1,ffffffff808d2acc <do_csum+0x34>
> 
> This mv instruction could be avoided if the registers were shuffled
> around, but perhaps this way reduces some dependency chains.

gcc managed to do 'data += csum' so had to add 'csum = data'.
If you unroll once that might go away.
It might then be 10 instructions for 16 bytes.
Although you then need slightly larger alignment code.
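
Unrolled once, the inner loop might look roughly like this (sketch only,
illustrative names, assuming the prologue leaves an even number of full
words to sum):

	static unsigned long csum_words_x2(const unsigned long *ptr,
					   const unsigned long *end)
	{
		unsigned long csum = 0, carry = 0;

		do {
			unsigned long d0 = ptr[0], d1 = ptr[1];

			csum += d0;
			carry += csum < d0;
			csum += d1;
			carry += csum < d1;
			ptr += 2;
		} while (ptr < end);

		csum += carry;
		csum += csum < carry;
		return csum;
	}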

	David

Charlie Jenkins Sept. 19, 2023, 6:04 p.m. UTC | #4
On Tue, Sep 19, 2023 at 08:00:12AM +0000, David Laight wrote:
> ...
> > > So ending up with (something like):
> > > 	end = buff + length;
> > > 	...
> > > 	while (++ptr < end) {
> > > 		csum += data;
> > > 		carry += csum < data;
> > > 		data = ptr[-1];
> > > 	}
> > > (Although a do-while loop tends to generate better code
> > > and gcc will pretty much always make that transformation.)
> > >
> > > I think that is 4 instructions per word (load, add, cmp+set, add).
> > > In principle they could be completely pipelined and all
> > > execute (for different loop iterations) in the same clock.
> > > (But that is pretty unlikely to happen - even x86 isn't that good.)
> > > But taking two clocks is quite plausible.
> > > Plus 2 instructions per loop (inc, cmp+jmp).
> > > They might execute in parallel, but unrolling once
> > > may be required.
> > >
> > It looks like GCC actually ends up generating 7 total instructions:
> > ffffffff808d2acc:	97b6                	add	a5,a5,a3
> > ffffffff808d2ace:	00d7b533          	sltu	a0,a5,a3
> > ffffffff808d2ad2:	0721                	add	a4,a4,8
> > ffffffff808d2ad4:	86be                	mv	a3,a5
> > ffffffff808d2ad6:	962a                	add	a2,a2,a0
> > ffffffff808d2ad8:	ff873783          	ld	a5,-8(a4)
> > ffffffff808d2adc:	feb768e3          	bltu	a4,a1,ffffffff808d2acc <do_csum+0x34>
> > 
> > This mv instruction could be avoided if the registers were shuffled
> > around, but perhaps this way reduces some dependency chains.
> 
> gcc managed to do 'data += csum' so had to add 'csum = data'.
> If you unroll once that might go away.
> It might then be 10 instructions for 16 bytes.
> Although you then need slightly larger alignment code.
> 
> 	David
> 
> 
I messed with it a bit and couldn't get the mv to go away. I would expect
the mv to be very cheap, so it should be fine. I would also like to avoid
adding too much to the alignment code since it is already large, and I
assume that buff will be aligned more often than not.

Interestingly, the mv does not appear before gcc 12, and does not appear with clang.

- Charlie

Patch

diff --git a/arch/riscv/include/asm/checksum.h b/arch/riscv/include/asm/checksum.h
index dc0dd89f2a13..7fcd07edb8b3 100644
--- a/arch/riscv/include/asm/checksum.h
+++ b/arch/riscv/include/asm/checksum.h
@@ -12,6 +12,18 @@ 
 
 #define ip_fast_csum ip_fast_csum
 
+extern unsigned int do_csum(const unsigned char *buff, int len);
+#define do_csum do_csum
+
+/* Default version is sufficient for 32 bit */
+#ifdef CONFIG_64BIT
+#define _HAVE_ARCH_IPV6_CSUM
+__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+			const struct in6_addr *daddr,
+			__u32 len, __u8 proto, __wsum sum);
+#endif
+
+// Define riscv versions of functions before importing asm-generic/checksum.h
 #include <asm-generic/checksum.h>
 
 /*
diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile
index 26cb2502ecf8..2aa1a4ad361f 100644
--- a/arch/riscv/lib/Makefile
+++ b/arch/riscv/lib/Makefile
@@ -6,6 +6,7 @@  lib-y			+= memmove.o
 lib-y			+= strcmp.o
 lib-y			+= strlen.o
 lib-y			+= strncmp.o
+lib-y			+= csum.o
 lib-$(CONFIG_MMU)	+= uaccess.o
 lib-$(CONFIG_64BIT)	+= tishift.o
 lib-$(CONFIG_RISCV_ISA_ZICBOZ)	+= clear_page.o
diff --git a/arch/riscv/lib/csum.c b/arch/riscv/lib/csum.c
new file mode 100644
index 000000000000..1fda96d2bd8d
--- /dev/null
+++ b/arch/riscv/lib/csum.c
@@ -0,0 +1,198 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * IP checksum library
+ *
+ * Influenced by arch/arm64/lib/csum.c
+ * Copyright (C) 2023 Rivos Inc.
+ */
+#include <linux/bitops.h>
+#include <linux/compiler.h>
+#include <linux/kasan-checks.h>
+#include <linux/kernel.h>
+
+#include <net/checksum.h>
+
+/* Default version is sufficient for 32 bit */
+#ifndef CONFIG_32BIT
+__sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+			const struct in6_addr *daddr,
+			__u32 len, __u8 proto, __wsum csum)
+{
+	unsigned int ulen, uproto;
+	unsigned long sum = csum;
+
+	sum += saddr->s6_addr32[0];
+	sum += saddr->s6_addr32[1];
+	sum += saddr->s6_addr32[2];
+	sum += saddr->s6_addr32[3];
+
+	sum += daddr->s6_addr32[0];
+	sum += daddr->s6_addr32[1];
+	sum += daddr->s6_addr32[2];
+	sum += daddr->s6_addr32[3];
+
+	ulen = htonl((unsigned int)len);
+	sum += ulen;
+
+	uproto = htonl(proto);
+	sum += uproto;
+
+	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
+	    IS_ENABLED(CONFIG_RISCV_ALTERNATIVE)) {
+		unsigned long fold_temp;
+
+		/*
+		 * Zbb is likely available when the kernel is compiled with Zbb
+		 * support, so nop when Zbb is available and jump when Zbb is
+		 * not available.
+		 */
+		asm_volatile_goto(ALTERNATIVE("j %l[no_zbb]", "nop", 0,
+					      RISCV_ISA_EXT_ZBB, 1)
+				  :
+				  :
+				  :
+				  : no_zbb);
+		asm(".option push					\n\
+		.option arch,+zbb					\n\
+			rori	%[fold_temp], %[sum], 32		\n\
+			add	%[sum], %[fold_temp], %[sum]		\n\
+			srli	%[sum], %[sum], 32			\n\
+			not	%[fold_temp], %[sum]			\n\
+			roriw	%[sum], %[sum], 16			\n\
+			subw	%[sum], %[fold_temp], %[sum]		\n\
+		.option pop"
+		: [sum] "+r" (sum), [fold_temp] "=&r" (fold_temp));
+		return (__force __sum16)(sum >> 16);
+	}
+no_zbb:
+	sum += (sum >> 32) | (sum << 32);
+	sum >>= 32;
+	return csum_fold((__force __wsum)sum);
+}
+EXPORT_SYMBOL(csum_ipv6_magic);
+#endif // !CONFIG_32BIT
+
+#ifdef CONFIG_32BIT
+#define OFFSET_MASK 3
+#elif CONFIG_64BIT
+#define OFFSET_MASK 7
+#endif
+
+/*
+ * Perform a checksum on an arbitrary memory address.
+ * Algorithm accounts for buff being misaligned.
+ * If buff is not aligned, will over-read bytes but not use the bytes that it
+ * shouldn't. The same thing will occur on the tail-end of the read.
+ */
+unsigned int __no_sanitize_address do_csum(const unsigned char *buff, int len)
+{
+	unsigned int offset, shift;
+	unsigned long csum = 0, data;
+	const unsigned long *ptr;
+
+	if (unlikely(len <= 0))
+		return 0;
+	/*
+	 * To align the address, grab the whole first byte in buff.
+	 * Since it is inside of a same byte, it will never cross pages or cache
+	 * lines.
+	 * Directly call KASAN with the alignment we will be using.
+	 */
+	offset = (unsigned long)buff & OFFSET_MASK;
+	kasan_check_read(buff, len);
+	ptr = (const unsigned long *)(buff - offset);
+	len = len + offset - sizeof(unsigned long);
+
+	/*
+	 * Clear the most significant bits that were over-read if buff was not
+	 * aligned.
+	 */
+	shift = offset * 8;
+	data = *ptr;
+	if (IS_ENABLED(__LITTLE_ENDIAN))
+		data = (data >> shift) << shift;
+	else
+		data = (data << shift) >> shift;
+
+	/*
+	 * Do 32-bit reads on RV32 and 64-bit reads otherwise. This should be
+	 * faster than doing 32-bit reads on architectures that support larger
+	 * reads.
+	 */
+	while (len > 0) {
+		csum += data;
+		csum += csum < data;
+		len -= sizeof(unsigned long);
+		ptr += 1;
+		data = *ptr;
+	}
+
+	/*
+	 * Perform alignment (and over-read) bytes on the tail if any bytes
+	 * leftover.
+	 */
+	shift = len * -8;
+	if (IS_ENABLED(__LITTLE_ENDIAN))
+		data = (data << shift) >> shift;
+	else
+		data = (data >> shift) << shift;
+
+	csum += data;
+	csum += csum < data;
+
+	if (IS_ENABLED(CONFIG_RISCV_ISA_ZBB) &&
+	    riscv_has_extension_likely(RISCV_ISA_EXT_ZBB)) {
+		unsigned int fold_temp;
+
+		if (IS_ENABLED(CONFIG_32BIT)) {
+			asm_volatile_goto(".option push			\n\
+			.option arch,+zbb				\n\
+				rori	%[fold_temp], %[csum], 16	\n\
+				andi	%[offset], %[offset], 1		\n\
+				add	%[csum], %[fold_temp], %[csum]	\n\
+				beq	%[offset], zero, %l[end]	\n\
+				rev8	%[csum], %[csum]		\n\
+				zext.h	%[csum], %[csum]		\n\
+			.option pop"
+				: [csum] "+r" (csum),
+				  [fold_temp] "=&r" (fold_temp)
+				: [offset] "r" (offset)
+				:
+				: end);
+
+			return csum;
+		} else {
+			asm_volatile_goto(".option push			\n\
+			.option arch,+zbb				\n\
+				rori	%[fold_temp], %[csum], 32	\n\
+				add	%[csum], %[fold_temp], %[csum]	\n\
+				srli	%[csum], %[csum], 32		\n\
+				roriw	%[fold_temp], %[csum], 16	\n\
+				addw	%[csum], %[fold_temp], %[csum]	\n\
+				andi	%[offset], %[offset], 1		\n\
+				beq	%[offset], zero, %l[end]	\n\
+				rev8	%[csum], %[csum]		\n\
+				srli	%[csum], %[csum], 32		\n\
+				zext.h	%[csum], %[csum]		\n\
+			.option pop"
+				: [csum] "+r" (csum),
+				  [fold_temp] "=&r" (fold_temp)
+				: [offset] "r" (offset)
+				:
+				: end);
+
+			return csum;
+		}
+end:
+		return csum >> 16;
+	}
+
+#ifndef CONFIG_32BIT
+	csum += (csum >> 32) | (csum << 32);
+	csum >>= 32;
+#endif
+	csum = (unsigned int)csum + (((unsigned int)csum >> 16) | ((unsigned int)csum << 16));
+	if (offset & 1)
+		return (unsigned short)swab32(csum);
+	return csum >> 16;
+}
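
For reference, and not part of the patch: a naive user-space sketch of the
value do_csum() is expected to return for an aligned buffer on a
little-endian configuration, useful as a rough cross-check when testing the
optimized routine. The helper name is illustrative.

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Naive reference: the 16-bit ones'-complement sum of the data taken
	 * as little-endian 16-bit words, with an odd trailing byte counted as
	 * the low byte of a final word.
	 */
	static uint16_t csum_reference_le(const unsigned char *buff, size_t len)
	{
		uint64_t sum = 0;
		size_t i;

		for (i = 0; i + 1 < len; i += 2)
			sum += (uint64_t)buff[i] | ((uint64_t)buff[i + 1] << 8);
		if (len & 1)
			sum += buff[len - 1];

		/* Fold carries back in until the sum fits in 16 bits. */
		while (sum >> 16)
			sum = (sum & 0xffff) + (sum >> 16);

		return (uint16_t)sum;
	}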