
[v2,0/3] lib/string: optimized mem* functions

Message ID 20210702123153.14093-1-mcroce@linux.microsoft.com (mailing list archive)

Message

Matteo Croce July 2, 2021, 12:31 p.m. UTC
From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic mem{cpy,move,set} so that memory is accessed with
the widest size possible, but without doing unaligned accesses.
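
As a minimal sketch of the approach (illustrative only, not the exact
patch code; sketch_memcpy() is a made-up name): word-sized accesses are
used only when source and destination share the same offset within a
word, so no unaligned access is ever issued.

#include <stddef.h>
#include <stdint.h>

void *sketch_memcpy(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Word copies are possible only if both pointers share the same
	 * offset within a word; otherwise one of them would be unaligned. */
	if ((((uintptr_t)d ^ (uintptr_t)s) % sizeof(long)) == 0) {
		/* Byte-copy the unaligned head. */
		while (count && ((uintptr_t)d % sizeof(long))) {
			*d++ = *s++;
			count--;
		}
		/* Copy one word at a time. */
		while (count >= sizeof(long)) {
			*(long *)d = *(const long *)s;
			d += sizeof(long);
			s += sizeof(long);
			count -= sizeof(long);
		}
	}
	/* Byte-copy the tail (or the whole buffer, if misaligned). */
	while (count--)
		*d++ = *s++;
	return dest;
}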

This was originally posted as C string functions for RISC-V[1], but as
there was no specific RISC-V code, it was proposed for the generic
lib/string.c implementation.

Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
and HAVE_EFFICIENT_UNALIGNED_ACCESS.
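
(For reference, the generic versions in lib/string.c are only compiled
when the architecture does not provide its own; before this series the
generic memcpy() was roughly the byte-at-a-time loop below, which is
why undefining the arch macros exercises this code.)

#ifndef __HAVE_ARCH_MEMCPY
void *memcpy(void *dest, const void *src, size_t count)
{
	char *tmp = dest;
	const char *s = src;

	/* Pre-series generic fallback: one byte per iteration. */
	while (count--)
		*tmp++ = *s++;
	return dest;
}
EXPORT_SYMBOL(memcpy);
#endif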

This is the measured throughput of memcpy() and memset() on a RISC-V
machine with a 32 MB buffer:

memcpy:
original aligned:	 75 Mb/s
original unaligned:	 75 Mb/s
new aligned:		114 Mb/s
new unaligned:		107 Mb/s

memset:
original aligned:	140 Mb/s
original unaligned:	140 Mb/s
new aligned:		241 Mb/s
new unaligned:		241 Mb/s
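
(The benchmark itself is not part of the series; a minimal in-kernel
measurement could look roughly like this sketch, where bench_memcpy()
is a made-up helper.)

#include <linux/ktime.h>
#include <linux/string.h>

/* Hypothetical helper: time 'iters' copies of a 'len'-byte buffer
 * and return the elapsed milliseconds. */
static s64 bench_memcpy(void *dst, const void *src, size_t len, int iters)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < iters; i++)
		memcpy(dst, src, len);

	return ktime_ms_delta(ktime_get(), start);
}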

The size increase is negligible:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
Function                                     old     new   delta
memcpy                                        29     351    +322
memset                                        29     117     +88
strlcat                                       68      78     +10
strlcpy                                       50      57      +7
memmove                                       56      50      -6
Total: Before=8556964, After=8557385, chg +0.00%

These functions will be used for RISC-V initially.

[1] https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

Matteo Croce (3):
  lib/string: optimized memcpy
  lib/string: optimized memmove
  lib/string: optimized memset

 lib/string.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 113 insertions(+), 17 deletions(-)

Comments

Andrew Morton July 10, 2021, 9:31 p.m. UTC | #1
On Fri,  2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:

> From: Matteo Croce <mcroce@microsoft.com>
> 
> Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> the widest size possible, but without doing unaligned accesses.
> 
> This was originally posted as C string functions for RISC-V[1], but as
> there was no specific RISC-V code, it was proposed for the generic
> lib/string.c implementation.
> 
> Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> 
> These are the performances of memcpy() and memset() of a RISC-V machine
> on a 32 mbyte buffer:
> 
> memcpy:
> original aligned:	 75 Mb/s
> original unaligned:	 75 Mb/s
> new aligned:		114 Mb/s
> new unaligned:		107 Mb/s
> 
> memset:
> original aligned:	140 Mb/s
> original unaligned:	140 Mb/s
> new aligned:		241 Mb/s
> new unaligned:		241 Mb/s

Did you record the x86_64 performance?


Which other architectures are affected by this change?
Matteo Croce July 10, 2021, 11:07 p.m. UTC | #2
On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Fri,  2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > the widest size possible, but without doing unaligned accesses.
> >
> > This was originally posted as C string functions for RISC-V[1], but as
> > there was no specific RISC-V code, it was proposed for the generic
> > lib/string.c implementation.
> >
> > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> >
> > These are the performances of memcpy() and memset() of a RISC-V machine
> > on a 32 mbyte buffer:
> >
> > memcpy:
> > original aligned:    75 Mb/s
> > original unaligned:  75 Mb/s
> > new aligned:        114 Mb/s
> > new unaligned:      107 Mb/s
> >
> > memset:
> > original aligned:   140 Mb/s
> > original unaligned: 140 Mb/s
> > new aligned:        241 Mb/s
> > new unaligned:      241 Mb/s
>
> Did you record the x86_64 performance?
>
>
> Which other architectures are affected by this change?

x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
and has optimized implementations in arch/x86/lib.
Anyway, I was curious and tested them on x86_64 too; there was zero
gain over the current generic ones.

The only architecture which will use all three functions will be
riscv, while memmove() will also be used by arc, h8300, hexagon, ia64,
openrisc and parisc.
Keep in mind that memmove() isn't anything special: it just calls
memcpy() when possible (e.g. when the buffers don't overlap) and falls
back to a byte-by-byte copy otherwise.
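
A minimal sketch of that current behaviour (illustrative only;
sketch_memmove() is a made-up name to avoid clashing with the real
symbol):

#include <stddef.h>
#include <string.h>

void *sketch_memmove(void *dest, const void *src, size_t count)
{
	char *d = dest;
	const char *s = src;

	/* dest starts at or before src (a forward copy is safe), or the
	 * buffers do not overlap at all: hand off to memcpy(). */
	if (d <= s || s + count <= d)
		return memcpy(dest, src, count);

	/* dest overlaps the tail of src: copy backwards, byte by byte. */
	d += count;
	s += count;
	while (count--)
		*--d = *--s;
	return dest;
}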

In the future we could write two functions, one which copies forward
and another which copies backward, and call the right one depending on
the buffers' positions.
Then we could alias memcpy() and memmove(), as proposed by Linus:

https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132
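
A hypothetical shape for that follow-up (illustrative only;
future_memmove() is a made-up name, and the byte loops stand in for
the two word-at-a-time copies, one per direction):

#include <stddef.h>
#include <stdint.h>

void *future_memmove(void *dest, const void *src, size_t count)
{
	char *d = dest;
	const char *s = src;

	/* A forward copy is safe unless dest lands inside [src, src + count). */
	if ((uintptr_t)dest - (uintptr_t)src >= count) {
		while (count--)		/* stand-in for copy_fwd(): low to high */
			*d++ = *s++;
	} else {
		d += count;
		s += count;
		while (count--)		/* stand-in for copy_bwd(): high to low */
			*--d = *--s;
	}
	return dest;
}

/* With both directions handled, memcpy() could then simply be an alias
 * of memmove(), which is what the linked proposal suggests. */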

Regards,
David Laight July 12, 2021, 8:15 a.m. UTC | #3
From: Matteo Croce
> Sent: 11 July 2021 00:08
> 
> On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Fri,  2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:
> >
> > > From: Matteo Croce <mcroce@microsoft.com>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > > the widest size possible, but without doing unaligned accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1], but as
> > > there was no specific RISC-V code, it was proposed for the generic
> > > lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > These are the performances of memcpy() and memset() of a RISC-V machine
> > > on a 32 mbyte buffer:
> > >
> > > memcpy:
> > > original aligned:    75 Mb/s
> > > original unaligned:  75 Mb/s
> > > new aligned:        114 Mb/s
> > > new unaligned:      107 Mb/s
> > >
> > > memset:
> > > original aligned:   140 Mb/s
> > > original unaligned: 140 Mb/s
> > > new aligned:        241 Mb/s
> > > new unaligned:      241 Mb/s
> >
> > Did you record the x86_64 performance?
> >
> >
> > Which other architectures are affected by this change?
> 
> x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
> and has optimized implementations in arch/x86/lib.
> Anyway, I was curious and I tested them on x86_64 too, there was zero
> gain over the generic ones.

x86 performance (and attainable performance) does depend on the CPU
micro-architecture.

Any recent 'desktop' Intel CPU will almost certainly manage to reorder
the execution of almost any copy loop and attain one write per clock.
(Even the trivial 'while (count--) *dest++ = *src++;' loop.)

The same isn't true of the Atom-based CPUs that may be found in small
servers. These are no slouches (e.g. 4 cores at 2.4GHz) but they have
only limited out-of-order execution and so are much more sensitive to
instruction ordering.

	David
