Message ID: 20210702123153.14093-1-mcroce@linux.microsoft.com (mailing list archive)
Series: lib/string: optimized mem* functions
On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:

> From: Matteo Croce <mcroce@microsoft.com>
>
> Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> the widest size possible, but without doing unaligned accesses.
>
> This was originally posted as C string functions for RISC-V[1], but as
> there was no specific RISC-V code, it was proposed for the generic
> lib/string.c implementation.
>
> Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> and HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> This is the performance of memcpy() and memset() on a RISC-V machine
> with a 32 MiB buffer:
>
> memcpy:
> original aligned:    75 MB/s
> original unaligned:  75 MB/s
> new aligned:        114 MB/s
> new unaligned:      107 MB/s
>
> memset:
> original aligned:   140 MB/s
> original unaligned: 140 MB/s
> new aligned:        241 MB/s
> new unaligned:      241 MB/s

Did you record the x86_64 performance?

Which other architectures are affected by this change?
On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:
>
> > From: Matteo Croce <mcroce@microsoft.com>
> >
> > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > the widest size possible, but without doing unaligned accesses.
> >
> > This was originally posted as C string functions for RISC-V[1], but as
> > there was no specific RISC-V code, it was proposed for the generic
> > lib/string.c implementation.
> >
> > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> >
> > This is the performance of memcpy() and memset() on a RISC-V machine
> > with a 32 MiB buffer:
> >
> > memcpy:
> > original aligned:    75 MB/s
> > original unaligned:  75 MB/s
> > new aligned:        114 MB/s
> > new unaligned:      107 MB/s
> >
> > memset:
> > original aligned:   140 MB/s
> > original unaligned: 140 MB/s
> > new aligned:        241 MB/s
> > new unaligned:      241 MB/s
>
> Did you record the x86_64 performance?
>
> Which other architectures are affected by this change?

x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
and has optimized implementations in arch/x86/lib.
Anyway, I was curious and tested them on x86_64 too: there was zero
gain over the generic ones.

The only architecture which will use all three functions is riscv,
while memmove() will also be used by arc, h8300, hexagon, ia64,
openrisc and parisc.

Keep in mind that memmove() isn't anything special: it just calls
memcpy() when possible (i.e. when the buffers don't overlap) and falls
back to a byte-by-byte copy otherwise.
In future we could write two functions, one which copies forward and
another which copies backward, and call the right one depending on the
buffers' positions.
Then we could alias memcpy() and memmove(), as proposed by Linus:

https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132

Regards,
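[Editorial note: the memmove() strategy described above — delegate to memcpy() when the buffers don't overlap, fall back to a byte-by-byte copy otherwise — can be sketched in userspace roughly as follows. This is an illustration of the idea only, not the code from the series; `my_memmove` is a hypothetical name.]

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the memmove() dispatch described in the mail: use memcpy()
 * when the buffers don't overlap (so the optimized word-at-a-time path
 * can run), otherwise pick a copy direction that reads every byte
 * before it is overwritten.  Not the actual lib/string.c code. */
static void *my_memmove(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Non-overlapping buffers: delegate to memcpy(). */
	if (d + n <= s || s + n <= d)
		return memcpy(dest, src, n);

	if (d < s) {
		/* dest below src: a forward copy is safe. */
		while (n--)
			*d++ = *s++;
	} else {
		/* dest overlaps the end of src: copy backwards so each
		 * source byte is read before it is overwritten. */
		while (n--)
			d[n] = s[n];
	}
	return dest;
}
```

A forward/backward split like this is also the first step toward the Linus suggestion quoted above: with both directions implemented, memcpy() and memmove() could share one entry point.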
From: Matteo Croce
> Sent: 11 July 2021 00:08
>
> On Sat, Jul 10, 2021 at 11:31 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 2 Jul 2021 14:31:50 +0200 Matteo Croce <mcroce@linux.microsoft.com> wrote:
> >
> > > From: Matteo Croce <mcroce@microsoft.com>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed with
> > > the widest size possible, but without doing unaligned accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1], but as
> > > there was no specific RISC-V code, it was proposed for the generic
> > > lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
> > > and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > This is the performance of memcpy() and memset() on a RISC-V machine
> > > with a 32 MiB buffer:
> > >
> > > memcpy:
> > > original aligned:    75 MB/s
> > > original unaligned:  75 MB/s
> > > new aligned:        114 MB/s
> > > new unaligned:      107 MB/s
> > >
> > > memset:
> > > original aligned:   140 MB/s
> > > original unaligned: 140 MB/s
> > > new aligned:        241 MB/s
> > > new unaligned:      241 MB/s
> >
> > Did you record the x86_64 performance?
> >
> > Which other architectures are affected by this change?
>
> x86_64 won't use these functions because it defines __HAVE_ARCH_MEMCPY
> and has optimized implementations in arch/x86/lib.
> Anyway, I was curious and tested them on x86_64 too: there was zero
> gain over the generic ones.

x86 performance (and attainable performance) does depend on the cpu
micro-architecture.
Any recent 'desktop' Intel cpu will almost certainly manage to re-order
the execution of almost any copy loop and attain one write per clock.
(Even the trivial 'while (count--) *dest++ = *src++;' loop.)

The same isn't true of the Atom-based cpus that may be found on small
servers.
These are no slouches (e.g. 4 cores at 2.4GHz) but only have limited
out-of-order execution and so are much more sensitive to instruction
ordering.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
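[Editorial note: the comparison David describes — the trivial byte loop against a word-at-a-time loop — can be reproduced with a small userspace harness along these lines. This is a hypothetical sketch, not from the thread; on a wide out-of-order core the two loops may time similarly, while an in-order core should favour the word loop.]

```c
#include <stddef.h>
#include <time.h>

/* The trivial copy loop quoted above. */
static void byte_copy(unsigned char *dest, const unsigned char *src,
		      size_t count)
{
	while (count--)
		*dest++ = *src++;
}

/* Word-at-a-time copy.  Assumes both buffers are long-aligned and
 * count is a multiple of sizeof(long). */
static void word_copy(unsigned char *dest, const unsigned char *src,
		      size_t count)
{
	for (; count >= sizeof(long); count -= sizeof(long)) {
		*(long *)dest = *(const long *)src;
		dest += sizeof(long);
		src += sizeof(long);
	}
}

/* Time one copy routine over a buffer, returning seconds elapsed. */
static double bench_copy(void (*copy)(unsigned char *,
				      const unsigned char *, size_t),
			 unsigned char *dst, const unsigned char *src,
			 size_t count)
{
	clock_t t = clock();

	copy(dst, src, count);
	return (double)(clock() - t) / CLOCKS_PER_SEC;
}
```

A meaningful run needs a large buffer (the thread used 32 MiB) and several repetitions, since clock() resolution is far too coarse for small copies.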
From: Matteo Croce <mcroce@microsoft.com>

Rewrite the generic mem{cpy,move,set} so that memory is accessed with
the widest size possible, but without doing unaligned accesses.

This was originally posted as C string functions for RISC-V[1], but as
there was no specific RISC-V code, it was proposed for the generic
lib/string.c implementation.

Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
and HAVE_EFFICIENT_UNALIGNED_ACCESS.

This is the performance of memcpy() and memset() on a RISC-V machine
with a 32 MiB buffer:

memcpy:
original aligned:    75 MB/s
original unaligned:  75 MB/s
new aligned:        114 MB/s
new unaligned:      107 MB/s

memset:
original aligned:   140 MB/s
original unaligned: 140 MB/s
new aligned:        241 MB/s
new unaligned:      241 MB/s

The size increase is negligible:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
Function                                     old     new   delta
memcpy                                        29     351    +322
memset                                        29     117     +88
strlcat                                       68      78     +10
strlcpy                                       50      57      +7
memmove                                       56      50      -6
Total: Before=8556964, After=8557385, chg +0.00%

These functions will be used for RISC-V initially.

[1] https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

Matteo Croce (3):
  lib/string: optimized memcpy
  lib/string: optimized memmove
  lib/string: optimized memset

 lib/string.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 113 insertions(+), 17 deletions(-)
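[Editorial note: the "widest size possible, but without doing unaligned accesses" approach in the cover letter can be sketched in userspace roughly as below. This is a simplified illustration, not the code from the series: `word_memcpy` is a hypothetical name, and it only takes the word-wide path when src and dest can reach word alignment together, falling back to byte copies otherwise, whereas the real patch handles more cases.]

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified sketch: byte copies until the pointers hit a word
 * boundary, aligned long-at-a-time copies for the bulk, byte copies
 * for the tail.  No unaligned loads or stores are ever issued. */
static void *word_memcpy(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	const uintptr_t mask = sizeof(long) - 1;

	/* The word path only works when both pointers share the same
	 * misalignment, so they become aligned at the same point. */
	if (((uintptr_t)d & mask) == ((uintptr_t)s & mask)) {
		/* Head: byte copies up to the first word boundary. */
		while (n && ((uintptr_t)d & mask)) {
			*d++ = *s++;
			n--;
		}
		/* Body: aligned, word-at-a-time copies. */
		for (; n >= sizeof(long); n -= sizeof(long)) {
			*(long *)d = *(const long *)s;
			d += sizeof(long);
			s += sizeof(long);
		}
	}
	/* Tail, or mutually misaligned buffers: byte copies. */
	while (n--)
		*d++ = *s++;
	return dest;
}
```

The benchmark deltas in the cover letter (aligned gaining more than unaligned on RISC-V) are consistent with this shape: aligned buffers spend almost the whole copy in the word loop, while misaligned ones pay for the slow path.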