[0/1] riscv: better network performance with memcpy, uaccess

Message ID CACuRN0NjftJDUAsF2pkXbx0jnJ=bba9+j-hJA8Mjj0r4RVicLA@mail.gmail.com (mailing list archive)

Message

Akira Tsukamoto June 4, 2021, 9:53 a.m. UTC
I am adding a cover letter to explain the history and details, since the
improvement comes from combining this patch with Gary's memcpy patch [1].

The iperf3 benchmark results below compare Gary's memcpy patch and my
uaccess optimization patch. All results are from the same base kernel,
the same rootfs and the same BeagleV beta board.

First column  : BeagleV 5.13-rc4 kernel [2]
Second column : Palmer's memcpy in C + my uaccess patch [3]
Third column  : Gary's memcpy + my uaccess patch [4]

--- TCP recv ---
686 Mbits/sec  |  700 Mbits/sec  |  904 Mbits/sec
683 Mbits/sec  |  701 Mbits/sec  |  898 Mbits/sec
695 Mbits/sec  |  702 Mbits/sec  |  905 Mbits/sec

--- TCP send ---
383 Mbits/sec  |  390 Mbits/sec  |  393 Mbits/sec
384 Mbits/sec  |  393 Mbits/sec  |  392 Mbits/sec

--- UDP send ---
307 Mbits/sec  |  358 Mbits/sec  |  402 Mbits/sec
307 Mbits/sec  |  359 Mbits/sec  |  402 Mbits/sec

--- UDP recv ---
630 Mbits/sec  |  799 Mbits/sec  |  875 Mbits/sec
730 Mbits/sec  |  796 Mbits/sec  |  873 Mbits/sec


The uaccess patch reduces read-after-write (RAW) pipeline stalls by
unrolling the loads and stores.
The main reason for using assembler in uaccess.S is that
__asm_to/copy_from_user() must handle page faults manually inside the
functions.
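
To illustrate the unrolling idea, here is a minimal sketch in C. It is not
the assembly from the patch: the name copy_words_unrolled() and the unroll
factor of four are assumptions made for the example, both pointers are
assumed to be 8-byte aligned, and the page-fault handling that uaccess.S
needs is left out. A C compiler is also free to schedule these accesses
differently, which is one more reason the real code is written in assembler.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only (hypothetical helper, not from the patch):
 * copy `words` 64-bit words from src to dst. Both pointers are assumed
 * to be 8-byte aligned and the buffers must not overlap.
 */
void copy_words_unrolled(uint64_t *dst, const uint64_t *src, size_t words)
{
    /*
     * Main loop: issue four independent loads first, then four stores,
     * so that no store has to wait for the load immediately before it.
     */
    while (words >= 4) {
        uint64_t a = src[0];
        uint64_t b = src[1];
        uint64_t c = src[2];
        uint64_t d = src[3];

        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;

        src += 4;
        dst += 4;
        words -= 4;
    }

    /* Tail: fewer than four words remain, copy them one at a time. */
    while (words > 0) {
        *dst++ = *src++;
        words--;
    }
}
```

A simple word-by-word loop loads a register and stores it in the very next
instruction; grouping the loads gives each one time to complete before its
value is needed.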

The above result combines Gary's memcpy, which speeds things up by
reducing switching between S-mode and M-mode, with my uaccess patch, which
reduces pipeline stalls when user space makes syscalls with large data.

We had a discussion of improving network performance on the BeagleV beta
board with Palmer.

Palmer suggested using C-based string routines, which check for unaligned
addresses and use an 8-byte aligned copy if both src and dest are aligned,
and otherwise fall back to the current copy function.
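
A rough sketch of that kind of routine, for illustration only: the function
name memcpy_aligned_or_bytes() is made up for this example, and the plain
byte loop stands in for "the current copy function". It is not the code
from the branch in [3].

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only (hypothetical name): use 8-byte copies when both
 * pointers are 8-byte aligned, otherwise fall back to a byte copy.
 * Like the kernel, this assumes the build does not rely on strict
 * aliasing for the word-sized accesses.
 */
void *memcpy_aligned_or_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast path only if BOTH addresses are multiples of 8. */
    if ((((uintptr_t)d | (uintptr_t)s) & 7) == 0) {
        uint64_t *dw = (uint64_t *)d;
        const uint64_t *sw = (const uint64_t *)s;

        while (n >= 8) {
            *dw++ = *sw++;
            n -= 8;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Fallback and tail: byte copy (stand-in for the existing routine). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

With this structure, any copy where either pointer is misaligned gets no
benefit from the fast path, which matters for network buffers.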

Gary's assembly version of memcpy improves on this by not using unaligned
64-bit accesses: it reads with aligned accesses and then shifts the data
by the offset, because every misaligned access is trapped and switches to
OpenSBI in M-mode. The main speedup comes from avoiding switching between
S-mode (kernel) and M-mode (OpenSBI).
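
The shifting technique can be sketched in C as follows. This is an
illustration under stated assumptions, not Gary's assembly: the name
memcpy_shift_sketch() is hypothetical, a little-endian machine is assumed
(as on RISC-V Linux), and the first aligned load may touch up to 7 bytes
in front of src. Those bytes sit in the same aligned 8-byte word, so the
load cannot cross a page boundary, but strictly portable C would not allow
it.

```c
#include <stddef.h>
#include <stdint.h>

#define WORD_BYTES sizeof(uint64_t)

/*
 * Illustrative only (hypothetical name): copy using aligned 64-bit
 * loads and stores only, merging two aligned source words with shifts
 * when src and dst have different alignment. Assumes little-endian.
 */
void *memcpy_shift_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte-copy until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % WORD_BYTES) != 0) {
        *d++ = *s++;
        n--;
    }

    size_t off = (uintptr_t)s % WORD_BYTES;

    if (off == 0) {
        /* Source is aligned too: plain word copy. */
        while (n >= WORD_BYTES) {
            *(uint64_t *)d = *(const uint64_t *)s;
            d += WORD_BYTES;
            s += WORD_BYTES;
            n -= WORD_BYTES;
        }
    } else if (n >= 2 * WORD_BYTES) {
        /* Round src down to the aligned word that contains it. */
        const uint64_t *sw = (const uint64_t *)(s - off);
        unsigned int lo = (unsigned int)(8 * off); /* bits to drop */
        unsigned int hi = 64 - lo;                 /* bits to borrow */
        uint64_t cur = *sw++;

        /*
         * n >= 16 guarantees the next aligned word lies entirely
         * inside the source range, so we never read past its end.
         */
        while (n >= 2 * WORD_BYTES) {
            uint64_t next = *sw++;

            *(uint64_t *)d = (cur >> lo) | (next << hi);
            cur = next;
            d += WORD_BYTES;
            s += WORD_BYTES;
            n -= WORD_BYTES;
        }
    }

    /* Tail: remaining bytes one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Every 64-bit access here is naturally aligned, so none of them would trap
into M-mode; the cost is two shifts and an OR per word.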

Processing network packets requires a lot of unaligned accesses for the
packet headers, and the header format cannot be redesigned to be aligned.
In addition, user applications pass large packet data to send()/recv() and
sendto()/recvfrom() so that fewer function calls are needed to read and
write the data, as an optimization.

Akira

[1] https://lkml.org/lkml/2021/2/16/778
[2] https://github.com/mcd500/linux-jh7100/tree/starlight-sdimproved
[3] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-palmer-string
[4] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-gary

Akira Tsukamoto (1):
  riscv: prevent pipeline stall in __asm_to/copy_from_user

 arch/riscv/lib/uaccess.S | 106 +++++++++++++++++++++++++++------------
 1 file changed, 73 insertions(+), 33 deletions(-)

--
2.17.1

Comments

Palmer Dabbelt June 4, 2021, 4:19 p.m. UTC | #1
Makes sense.  I'm still not opposed to moving to a C version, but it'd 
need to be a fairly complicated one.  I think having a fast C memcpy 
would likely benefit a handful of architectures, as everything we're 
talking about is an algorithmic improvement that can be expressed in C.

Given that the simple memcpy doesn't perform well for your workload, I'm 
fine taking the assembly version.

Thanks!

Akira Tsukamoto June 5, 2021, 8:02 a.m. UTC | #2
On Sat, Jun 5, 2021 at 1:19 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> Makes sense.  I'm still not opposed to moving to a C version, but it'd
> need to be a fairly complicated one.  I think having a fast C memcpy
> would likely benefit a handful of architectures, as everything we're
> talking about is an algorithmic improvement that can be expressed in C.
>
> Given that the simple memcpy doesn't perform well for your workload, I'm
> fine taking the assembly version.

Thanks for merging them.

I agree that having a fast C memcpy would benefit many architectures.
I will make patches for lib/string.c by extending your memcpy and send
them after I finish other priorities. The current functions in lib/string.c
use a byte copy, while most Linux-capable CPUs have moved to 64 bits.

Akira
