[v2,1/3] lib/string: optimized memcpy

Message ID	20210702123153.14093-2-mcroce@linux.microsoft.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=3Ayx=L2=lists.infradead.org=linux-riscv-bounces+linux-riscv=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A2693613F5 From: Matteo Croce <mcroce@linux.microsoft.com> To: linux-kernel@vger.kernel.org, Nick Kossifidis <mick@ics.forth.gr>, Guo Ren <guoren@kernel.org>, Christoph Hellwig <hch@infradead.org>, David Laight <David.Laight@aculab.com>, Palmer Dabbelt <palmer@dabbelt.com>, Emil Renner Berthing <kernel@esmil.dk>, Drew Fustini <drew@beagleboard.org> Cc: linux-arch@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>, Nick Desaulniers <ndesaulniers@google.com>, linux-riscv@lists.infradead.org Subject: [PATCH v2 1/3] lib/string: optimized memcpy Date: Fri, 2 Jul 2021 14:31:51 +0200 Message-Id: <20210702123153.14093-2-mcroce@linux.microsoft.com> In-Reply-To: <20210702123153.14093-1-mcroce@linux.microsoft.com> References: <20210702123153.14093-1-mcroce@linux.microsoft.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org> Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org
Series	lib/string: optimized mem* functions \| expand [v2,0/3] lib/string: optimized mem* functions [v2,1/3] lib/string: optimized memcpy [v2,2/3] lib/string: optimized memmove [v2,3/3] lib/string: optimized memset

Message ID

20210702123153.14093-2-mcroce@linux.microsoft.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A2693613F5
From: Matteo Croce <mcroce@linux.microsoft.com>
To: linux-kernel@vger.kernel.org, Nick Kossifidis <mick@ics.forth.gr>,
 Guo Ren <guoren@kernel.org>, Christoph Hellwig <hch@infradead.org>,
 David Laight <David.Laight@aculab.com>,
 Palmer Dabbelt <palmer@dabbelt.com>,
 Emil Renner Berthing <kernel@esmil.dk>, Drew Fustini <drew@beagleboard.org>
Cc: linux-arch@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>,
 Nick Desaulniers <ndesaulniers@google.com>, linux-riscv@lists.infradead.org
Subject: [PATCH v2 1/3] lib/string: optimized memcpy
Date: Fri,  2 Jul 2021 14:31:51 +0200
Message-Id: <20210702123153.14093-2-mcroce@linux.microsoft.com>
In-Reply-To: <20210702123153.14093-1-mcroce@linux.microsoft.com>
References: <20210702123153.14093-1-mcroce@linux.microsoft.com>
MIME-Version: 1.0
Precedence: list
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org>
Errors-To: 
 linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org

Series

lib/string: optimized mem* functions | expand

Commit Message

Matteo Croce July 2, 2021, 12:31 p.m. UTC

From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic memcpy() to copy a word at time, without generating
unaligned accesses.

The procedure is made of three steps:
First copy data one byte at time until the destination buffer is aligned
to a long boundary.
Then copy the data one long at time shifting the current and the next long
to compose a long at every cycle.
Finally, copy the remainder one byte at time.

This is the improvement on RISC-V:

original aligned:	 75 Mb/s
original unaligned:	 75 Mb/s
new aligned:		114 Mb/s
new unaligned:		107 Mb/s

and this the binary size increase according to bloat-o-meter:

Function     old     new   delta
memcpy        36     324    +288


Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 3 deletions(-)

Comments

Ben Dooks July 2, 2021, 2:37 p.m. UTC | #1

On 02/07/2021 13:31, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>
> 
> Rewrite the generic memcpy() to copy a word at time, without generating
> unaligned accesses.
> 
> The procedure is made of three steps:
> First copy data one byte at time until the destination buffer is aligned
> to a long boundary.
> Then copy the data one long at time shifting the current and the next long
> to compose a long at every cycle.
> Finally, copy the remainder one byte at time.
> 
> This is the improvement on RISC-V:
> 
> original aligned:	 75 Mb/s
> original unaligned:	 75 Mb/s
> new aligned:		114 Mb/s
> new unaligned:		107 Mb/s
> 
> and this the binary size increase according to bloat-o-meter:
> 
> Function     old     new   delta
> memcpy        36     324    +288
> 
> 
> Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> ---
>   lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 77 insertions(+), 3 deletions(-)

Doesn't arch/riscv/lib/memcpy.S also exist for an architecture
optimised version? I would have thought the lib/string.c version
was not being used?

Matteo Croce July 2, 2021, 2:44 p.m. UTC | #2

On Fri, Jul 2, 2021 at 4:37 PM Ben Dooks <ben.dooks@codethink.co.uk> wrote:
>
> On 02/07/2021 13:31, Matteo Croce wrote:
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Rewrite the generic memcpy() to copy a word at time, without generating
> > unaligned accesses.
> >
> > The procedure is made of three steps:
> > First copy data one byte at time until the destination buffer is aligned
> > to a long boundary.
> > Then copy the data one long at time shifting the current and the next long
> > to compose a long at every cycle.
> > Finally, copy the remainder one byte at time.
> >
> > This is the improvement on RISC-V:
> >
> > original aligned:      75 Mb/s
> > original unaligned:    75 Mb/s
> > new aligned:          114 Mb/s
> > new unaligned:                107 Mb/s
> >
> > and this the binary size increase according to bloat-o-meter:
> >
> > Function     old     new   delta
> > memcpy        36     324    +288
> >
> >
> > Signed-off-by: Matteo Croce <mcroce@microsoft.com>
> > ---
> >   lib/string.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 77 insertions(+), 3 deletions(-)
>
> Doesn't arch/riscv/lib/memcpy.S also exist for an architecture
> optimised version? I would have thought the lib/string.c version
> was not being used?
>
>

Yes, but this series started as C replacement for the assembly one,
which generates unaligned accesses.
Unfortunately the existing RISC-V processors can't handle unaligned
accesses, so they are emulated with a terrible slowdown.
Then, since there wasn't any riscv specific code, it was proposed as
generic code:

Discussion: https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

diff --git a/lib/string.c b/lib/string.c
index 546d59711a12..caeef4264c43 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -33,6 +33,23 @@ 
 #include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
+#define BYTES_LONG	sizeof(long)
+#define WORD_MASK	(BYTES_LONG - 1)
+#define MIN_THRESHOLD	(BYTES_LONG * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	const unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
 #ifndef __HAVE_ARCH_STRNCASECMP
 /**
  * strncasecmp - Case insensitive, length-limited string comparison
@@ -869,6 +886,13 @@  EXPORT_SYMBOL(memset64);
 #endif
 
 #ifndef __HAVE_ARCH_MEMCPY
+
+#ifdef __BIG_ENDIAN
+#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((BYTES_LONG - (d)) * 8))
+#else
+#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((BYTES_LONG - (d)) * 8))
+#endif
+
 /**
  * memcpy - Copy one area of memory to another
  * @dest: Where to copy to
@@ -880,14 +904,64 @@  EXPORT_SYMBOL(memset64);
  */
 void *memcpy(void *dest, const void *src, size_t count)
 {
-	char *tmp = dest;
-	const char *s = src;
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		if (count < MIN_THRESHOLD)
+			goto copy_remainder;
+
+		/* Copy a byte at time until destination is aligned. */
+		for (; d.as_uptr & WORD_MASK; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & WORD_MASK;
+	}
+
+	if (distance) {
+		unsigned long last, next;
 
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word align it
+		 * and shift data to compensate for distance, in order to do
+		 * word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = MERGE_UL(last, next, distance);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*d.as_ulong++ = *s.as_ulong++;
+	}
+
+copy_remainder:
 	while (count--)
-		*tmp++ = *s++;
+		*d.as_u8++ = *s.as_u8++;
+
 	return dest;
 }
 EXPORT_SYMBOL(memcpy);
+
+#undef MERGE_UL
+
 #endif
 
 #ifndef __HAVE_ARCH_MEMMOVE

[v2,1/3] lib/string: optimized memcpy

Commit Message

Comments

Patch