
riscv: fix memmove and optimise memcpy when misalign

Message ID 20210216225555.4976-1-gary@garyguo.net (mailing list archive)
State New, archived
Series riscv: fix memmove and optimise memcpy when misalign

Commit Message

Gary Guo Feb. 16, 2021, 10:55 p.m. UTC
Commit 04091d6 introduces an assembly version of memmove, but
it does not take misalignment into account (it checks whether the
length is a multiple of the machine word size, but the pointers
also need to be aligned). As a result it generates misaligned
load/store for the majority of cases and causes a significant
performance regression on hardware that traps misaligned
load/store and emulates them in firmware.

The current behaviour of memcpy is to check whether the src and
dest pointers are co-aligned (i.e. congruent modulo SZREG). If
they are, it copies data word-by-word after first aligning the
pointers to a word boundary. If src and dest are not co-aligned,
however, a byte-wise copy is performed.

This patch fixes memmove and optimises memcpy for the misaligned
case. It first aligns the destination pointer to a word boundary
regardless of whether src and dest are co-aligned. If they are,
a word-wise copy is performed. If they are not co-aligned, it
loads two adjacent words from src and uses shifts to assemble a
full machine word. Some additional assembly-level
micro-optimisation is also performed so that more instructions
can be compressed (e.g. preferring a0 to t6).
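
For illustration, the misaligned path described above is roughly
equivalent to the following C sketch (not part of the patch; the helper
name is made up, and a little-endian machine with a word-aligned
destination is assumed):

#include <stddef.h>
#include <stdint.h>

/* Copy 'words' machine words to a word-aligned dst from a misaligned src. */
static void copy_misaligned_words(unsigned long *dst,
				  const unsigned char *src, size_t words)
{
	size_t off = (uintptr_t)src % sizeof(unsigned long);	/* 1..SZREG-1 */
	const unsigned long *ws = (const unsigned long *)(src - off);
	unsigned long cur = *ws++;	/* same word as the first needed byte */
	unsigned int rshift = 8 * off;
	unsigned int lshift = 8 * (sizeof(unsigned long) - off);

	while (words--) {
		unsigned long next = *ws++;
		/* combine the tail of 'cur' with the head of 'next' */
		*dst++ = (cur >> rshift) | (next << lshift);
		cur = next;
	}
}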

In my testing this speeds up memcpy by 4-5x when src and dest
are not co-aligned (which is quite common in networking), and
speeds up memmove by more than 1000x by avoiding traps to
firmware.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 arch/riscv/lib/memcpy.S  | 223 ++++++++++++++++++++++++---------------
 arch/riscv/lib/memmove.S | 176 ++++++++++++++++++++----------
 2 files changed, 257 insertions(+), 142 deletions(-)

Comments

Bin Meng May 13, 2021, 8:13 a.m. UTC | #1
On Wed, Feb 17, 2021 at 7:00 AM Gary Guo <gary@garyguo.net> wrote:
>
> Commit 04091d6 introduces an assembly version of memmove, but
> it does not take misalignment into account (it checks whether the
> length is a multiple of the machine word size, but the pointers
> also need to be aligned). As a result it generates misaligned
> load/store for the majority of cases and causes a significant
> performance regression on hardware that traps misaligned
> load/store and emulates them in firmware.
>
> The current behaviour of memcpy is to check whether the src and
> dest pointers are co-aligned (i.e. congruent modulo SZREG). If
> they are, it copies data word-by-word after first aligning the
> pointers to a word boundary. If src and dest are not co-aligned,
> however, a byte-wise copy is performed.
>
> This patch fixes memmove and optimises memcpy for the misaligned
> case. It first aligns the destination pointer to a word boundary
> regardless of whether src and dest are co-aligned. If they are,
> a word-wise copy is performed. If they are not co-aligned, it
> loads two adjacent words from src and uses shifts to assemble a
> full machine word. Some additional assembly-level
> micro-optimisation is also performed so that more instructions
> can be compressed (e.g. preferring a0 to t6).
>
> In my testing this speeds up memcpy by 4-5x when src and dest
> are not co-aligned (which is quite common in networking), and
> speeds up memmove by more than 1000x by avoiding traps to
> firmware.
>
> Signed-off-by: Gary Guo <gary@garyguo.net>
> ---
>  arch/riscv/lib/memcpy.S  | 223 ++++++++++++++++++++++++---------------
>  arch/riscv/lib/memmove.S | 176 ++++++++++++++++++++----------
>  2 files changed, 257 insertions(+), 142 deletions(-)
>

Looks like this patch remains unapplied.

This patch fixed a boot failure of U-Boot SPL on the SiFive Unleashed
board, built from the latest U-Boot sources, which recently took the
assembly version of the mem* routines from the Linux kernel.
The misaligned load happens in the original memmove() implementation,
which does not handle alignment correctly. With this patch, U-Boot SPL
boots again.

Tested-by: Bin Meng <bmeng.cn@gmail.com>

Regards,
Bin
Gary Guo May 22, 2021, 10:22 p.m. UTC | #2
On Tue, 16 Feb 2021 22:55:51 +0000
Gary Guo <gary@garyguo.net> wrote:

> Commit 04091d6 introduces an assembly version of memmove, but
> it does not take misalignment into account (it checks whether the
> length is a multiple of the machine word size, but the pointers
> also need to be aligned). As a result it generates misaligned
> load/store for the majority of cases and causes a significant
> performance regression on hardware that traps misaligned
> load/store and emulates them in firmware.
>
> The current behaviour of memcpy is to check whether the src and
> dest pointers are co-aligned (i.e. congruent modulo SZREG). If
> they are, it copies data word-by-word after first aligning the
> pointers to a word boundary. If src and dest are not co-aligned,
> however, a byte-wise copy is performed.
>
> This patch fixes memmove and optimises memcpy for the misaligned
> case. It first aligns the destination pointer to a word boundary
> regardless of whether src and dest are co-aligned. If they are,
> a word-wise copy is performed. If they are not co-aligned, it
> loads two adjacent words from src and uses shifts to assemble a
> full machine word. Some additional assembly-level
> micro-optimisation is also performed so that more instructions
> can be compressed (e.g. preferring a0 to t6).
>
> In my testing this speeds up memcpy by 4-5x when src and dest
> are not co-aligned (which is quite common in networking), and
> speeds up memmove by more than 1000x by avoiding traps to
> firmware.
> 
> Signed-off-by: Gary Guo <gary@garyguo.net>
> ---
>  arch/riscv/lib/memcpy.S  | 223 ++++++++++++++++++++++++---------------
>  arch/riscv/lib/memmove.S | 176 ++++++++++++++++++++----------
>  2 files changed, 257 insertions(+), 142 deletions(-)
>
> [...]

Ping. It's been 3 months since submission and I would really like to see
this applied, or at least get some feedback.

Link to the original patch in archive:
https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/

- Gary
Palmer Dabbelt May 23, 2021, 1:47 a.m. UTC | #3
On Sat, 22 May 2021 15:22:56 PDT (-0700), gary@garyguo.net wrote:
> On Tue, 16 Feb 2021 22:55:51 +0000
> Gary Guo <gary@garyguo.net> wrote:
>
>> Commit 04091d6 introduces an assembly version of memmove, but
>> it does not take misalignment into account (it checks whether the
>> length is a multiple of the machine word size, but the pointers
>> also need to be aligned). As a result it generates misaligned
>> load/store for the majority of cases and causes a significant
>> performance regression on hardware that traps misaligned
>> load/store and emulates them in firmware.
>>
>> The current behaviour of memcpy is to check whether the src and
>> dest pointers are co-aligned (i.e. congruent modulo SZREG). If
>> they are, it copies data word-by-word after first aligning the
>> pointers to a word boundary. If src and dest are not co-aligned,
>> however, a byte-wise copy is performed.
>>
>> This patch fixes memmove and optimises memcpy for the misaligned
>> case. It first aligns the destination pointer to a word boundary
>> regardless of whether src and dest are co-aligned. If they are,
>> a word-wise copy is performed. If they are not co-aligned, it
>> loads two adjacent words from src and uses shifts to assemble a
>> full machine word. Some additional assembly-level
>> micro-optimisation is also performed so that more instructions
>> can be compressed (e.g. preferring a0 to t6).
>>
>> In my testing this speeds up memcpy by 4-5x when src and dest
>> are not co-aligned (which is quite common in networking), and
>> speeds up memmove by more than 1000x by avoiding traps to
>> firmware.
>>
>> Signed-off-by: Gary Guo <gary@garyguo.net>
>> ---
>>  arch/riscv/lib/memcpy.S  | 223 ++++++++++++++++++++++++---------------
>>  arch/riscv/lib/memmove.S | 176 ++++++++++++++++++++----------
>>  2 files changed, 257 insertions(+), 142 deletions(-)
>>
>> [...]
>
> Ping. It's been 3 months since submission and I would really like to see
> this applied, or at least get some feedback.
>
> Link to the original patch in archive:
> https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/

Sorry, I thought I replied to this one but it must have gotten lost 
somewhere.

IMO the right way to go here is to just move to C-based string routines, 
at least until we get to the point where we're seriously optimizing for 
specific processors.  We went with the C-based string routines in glibc 
as part of the upstreaming process and found only some small performance 
differences when compared to the hand-written assembly, and they're way 
easier to maintain.

IIRC Linux only has trivial C string routines in lib; I think the best 
way to go about that would be to put higher-performance versions in there.  
That will allow other ports to use them.
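
To make that concrete, a higher-performance generic C routine of the
kind I mean might look roughly like the sketch below (co-aligned fast
path plus a byte tail; the function name is made up and this is not the
actual lib/string.c code):

#include <stddef.h>
#include <stdint.h>

/* Sketch of a generic C memcpy: word-at-a-time when src and dst are
 * co-aligned, byte-wise otherwise. */
static void *memcpy_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (((uintptr_t)d % sizeof(long)) == ((uintptr_t)s % sizeof(long))) {
		/* Byte copies until dst is word-aligned. */
		while (n && ((uintptr_t)d % sizeof(long))) {
			*d++ = *s++;
			n--;
		}
		/* Bulk of the copy, one machine word at a time. */
		for (; n >= sizeof(long); n -= sizeof(long)) {
			*(long *)d = *(const long *)s;
			d += sizeof(long);
			s += sizeof(long);
		}
	}
	/* Tail, or the whole copy if the pointers are not co-aligned. */
	while (n--)
		*d++ = *s++;
	return dst;
}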
David Laight May 23, 2021, 5:12 p.m. UTC | #4
From: Palmer Dabbelt
> Sent: 23 May 2021 02:47
...
> IMO the right way to go here is to just move to C-based string routines,
> at least until we get to the point where we're seriously optimizing for
> specific processors.  We went with the C-based string routines in glibc
> as part of the upstreaming process and found only some small performance
> differences when compared to the hand-written assembly, and they're way
> easier to maintain.
> 
> IIRC Linux only has trivial C string routines in lib; I think the best
> way to go about that would be to put higher-performance versions in there.
> That will allow other ports to use them.

I certainly wonder how much benefit these massively unrolled
loops have on modern superscalar processors - especially those
with any form of 'out of order' execution.

It is often easy to write assembler where all the loop
control instructions happen in parallel with the memory
accesses - which cannot be avoided.
Loop unrolling is so 1970s.

Sometimes you need to unroll once.
And maybe interleave the loads and stores.
But after that you can just be trashing the i-cache.
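
In C, the shape I mean looks something like this sketch (unrolled once,
both loads issued before both stores; names are made up and aligned
pointers with a whole number of word pairs are assumed):

#include <stddef.h>

/* Copy loop unrolled once: group the loads, then the stores. */
static void copy_unroll_once(long *d, const long *s, size_t nwords)
{
	for (; nwords >= 2; nwords -= 2) {
		long w0 = s[0];
		long w1 = s[1];
		d[0] = w0;
		d[1] = w1;
		s += 2;
		d += 2;
	}
}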

	David

Gary Guo May 25, 2021, 2:34 p.m. UTC | #5
On Sun, 23 May 2021 17:12:23 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Palmer Dabbelt
> > Sent: 23 May 2021 02:47  
> ...
> > IMO the right way to go here is to just move to C-based string
> > routines, at least until we get to the point where we're seriously
> > optimizing for specific processors.  We went with the C-based
> > string routines in glibc as part of the upstreaming process and
> > found only some small performance differences when compared to the
> > hand-written assembly, and they're way easier to maintain.

I prefer C versions as well, and before commit 04091d6 we were
indeed using the generic C version. The issue is that 04091d6
introduces an assembly version that's very broken. It does not offer
any performance improvement over the C version, and breaks all processors
without hardware misalignment support (yes, firmware is expected to
trap and handle these, but they are painfully slow).

I noticed the issue because I ran Linux on my own firmware and found
that the kernel couldn't boot. I hadn't implemented misalignment
emulation at that time (and just sent the trap to the supervisor).

Because 04091d6 was accepted, my assumption is that we need an assembly
version. So I spent some time writing, testing and optimising the
assembly.

> > 
> > IIRC Linux only has trivial C string routines in lib, I think the
> > best way to go about that would be to put higher-performance versions
> > in there. That will allow other ports to use them.  
> 
> I certainly wonder how much benefit these massively unrolled
> loops have on modern superscalar processors - especially those
> with any form of 'out of order' execution.
> 
> It is often easy to write assembler where all the loop
> control instructions happen in parallel with the memory
> accesses - which cannot be avoided.
> Loop unrolling is so 1970s.
> 
> Sometimes you need to unroll once.
> And maybe interleave the loads and stores.
> But after that you can just be trashing the i-cache.

I didn't introduce the loop unrolling though. The unrolled
assembly was there before this patch, and I didn't even change the
unroll factor. I only added a path to handle the misaligned case.

The diff is large because I made some changes to the
register allocation so that the code is more optimal. I also made a few
cleanups and added a few comments. It might be easier to review if you
apply the patch locally and just look at the file.

- Gary
Bin Meng June 15, 2021, 1:40 p.m. UTC | #6
On Tue, May 25, 2021 at 10:37 PM Gary Guo <gary@garyguo.net> wrote:
>
> On Sun, 23 May 2021 17:12:23 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
>
> > From: Palmer Dabbelt
> > > Sent: 23 May 2021 02:47
> > ...
> > > IMO the right way to go here is to just move to C-based string
> > > routines, at least until we get to the point where we're seriously
> > > optimizing for specific processors.  We went with the C-based
> > > string routines in glibc as part of the upstreaming process and
> > > found only some small performance differences when compared to the
> > > hand-written assembly, and they're way easier to maintain.
>
> I prefer C versions as well, and before commit 04091d6 we were
> indeed using the generic C version. The issue is that 04091d6
> introduces an assembly version that's very broken. It does not offer
> any performance improvement over the C version, and breaks all processors
> without hardware misalignment support (yes, firmware is expected to
> trap and handle these, but they are painfully slow).
>
> I noticed the issue because I ran Linux on my own firmware and found
> that the kernel couldn't boot. I hadn't implemented misalignment
> emulation at that time (and just sent the trap to the supervisor).

Similarly, I noticed exactly the same issue as you with
mainline U-Boot SPL: it suddenly did not boot on the SiFive
Unleashed. Bisecting led to the commit that brought the kernel's buggy
assembly mem* version into U-Boot.

> Because 04091d6 was accepted, my assumption is that we need an assembly
> version.

That's my understanding as well. I thought kernel folks prefer
hand-written assembly over pure C :)

> So I spent some time writing, testing and optimising the assembly.

Thank you for the fix!

>
> > >
> > > IIRC Linux only has trivial C string routines in lib, I think the
> > > best way to go about that would be to put higher-performance versions
> > > in there. That will allow other ports to use them.
> >
> > I certainly wonder how much benefit these massively unrolled
> > loops have on modern superscalar processors - especially those
> > with any form of 'out of order' execution.
> >
> > It is often easy to write assembler where all the loop
> > control instructions happen in parallel with the memory
> > accesses - which cannot be avoided.
> > Loop unrolling is so 1970s.
> >
> > Sometimes you need to unroll once.
> > And maybe interleave the loads and stores.
> > But after that you can just be trashing the i-cache.
>
> I didn't introduce the loop unrolling though. The unrolled
> assembly was there before this patch, and I didn't even change the
> unroll factor. I only added a path to handle the misaligned case.
>
> The diff is large because I made some changes to the
> register allocation so that the code is more optimal. I also made a few
> cleanups and added a few comments. It might be easier to review if you
> apply the patch locally and just look at the file.

I would vote to apply this patch, unless we wanted to do "git revert
04091d6", but that would break KASAN I guess?

Regards,
Bin
David Laight June 15, 2021, 2:08 p.m. UTC | #7
From: Bin Meng
> Sent: 15 June 2021 14:40
...
> > I prefer C versions as well, and actually before commit 04091d6 we are
> > indeed using the generic C version. The issue is that 04091d6
> > introduces an assembly version that's very broken. It does not offer
> > and performance improvement to the C version, and breaks all processors
> > without hardware misalignment support

There may need to be a few C implementations for different cpu
instruction sets.
While the compiler might manage to DTRT (or the wrong thing, given
the right source), using a loop that matches the instruction set
is a good idea.

For instance, x86 can do *(reg_1 + reg_2 * (1|2|4|8) + constant)
so you can increment reg_2 and use it for both buffers while
still unrolling enough to hit memory bandwidth.

With only *(reg_1 + constant) you need to increment both the
source and destination addresses.

OTOH you can save an instruction on x86 by adding to 'reg_2'
until it becomes zero (so you don't need add, cmp and jmp).

But a mips-like instruction set (includes riscv and nios2)
has 'compare and branch' so you only ever need one instruction
at the end of the loop.
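
Roughly, in C, the two loop shapes I have in mind are (sketch only,
made-up names, word-aligned buffers assumed):

#include <stddef.h>

/* One index shared by both buffers, counting up towards zero; maps
 * nicely onto x86 scaled addressing with a single add driving the loop. */
static void copy_words_index(long *d, const long *s, size_t nwords)
{
	long *de = d + nwords;
	const long *se = s + nwords;
	ptrdiff_t i;

	for (i = -(ptrdiff_t)nwords; i != 0; i++)
		de[i] = se[i];
}

/* Pointer-increment form: base+constant addressing plus a final
 * compare-and-branch, as on MIPS-like / RISC-V instruction sets. */
static void copy_words_ptr(long *d, const long *s, size_t nwords)
{
	const long *send = s + nwords;

	while (s != send)
		*d++ = *s++;
}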

Having to handle misaligned copies is another distinct issue.
For some 32-bit CPUs, byte copies may be as fast as any shift
and mask code.

> > (yes, firmware is expected to
> > trap and handle these, but they are painfully slow).

Yes, to the point where the system should just panic and
force you to fix the code.

When I were a lad we forced everyone to fix their code
so it would run on sparc.

	David


Patch

diff --git a/arch/riscv/lib/memcpy.S b/arch/riscv/lib/memcpy.S
index 51ab716253fa..00672c19ad9b 100644
--- a/arch/riscv/lib/memcpy.S
+++ b/arch/riscv/lib/memcpy.S
@@ -9,100 +9,151 @@ 
 /* void *memcpy(void *, const void *, size_t) */
 ENTRY(__memcpy)
 WEAK(memcpy)
-	move t6, a0  /* Preserve return value */
+	/* Save for return value */
+	mv	t6, a0
 
-	/* Defer to byte-oriented copy for small sizes */
-	sltiu a3, a2, 128
-	bnez a3, 4f
-	/* Use word-oriented copy only if low-order bits match */
-	andi a3, t6, SZREG-1
-	andi a4, a1, SZREG-1
-	bne a3, a4, 4f
+	/*
+	 * Register allocation for code below:
+	 * a0 - start of uncopied dst
+	 * a1 - start of uncopied src
+	 * t0 - end of uncopied dst
+	 */
+	add	t0, a0, a2
 
-	beqz a3, 2f  /* Skip if already aligned */
 	/*
-	 * Round to nearest double word-aligned address
-	 * greater than or equal to start address
+	 * Use bytewise copy if too small.
+	 *
+	 * This threshold must be at least 2*SZREG to ensure at least one
+	 * wordwise copy is performed. It is chosen to be 16 because it will
+	 * save at least 7 iterations of bytewise copy, which pays off the
+	 * fixed overhead.
 	 */
-	andi a3, a1, ~(SZREG-1)
-	addi a3, a3, SZREG
-	/* Handle initial misalignment */
-	sub a4, a3, a1
+	li	a3, 16
+	bltu	a2, a3, .Lbyte_copy_tail
+
+	/*
+	 * Bytewise copy first to align a0 to word boundary.
+	 */
+	addi	a2, a0, SZREG-1
+	andi	a2, a2, ~(SZREG-1)
+	beq	a0, a2, 2f
 1:
-	lb a5, 0(a1)
-	addi a1, a1, 1
-	sb a5, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 1b
-	sub a2, a2, a4  /* Update count */
+	lb	a5, 0(a1)
+	addi	a1, a1, 1
+	sb	a5, 0(a0)
+	addi	a0, a0, 1
+	bne	a0, a2, 1b
+2:
+
+	/*
+	 * Now a0 is word-aligned. If a1 is also word aligned, we could perform
+	 * aligned word-wise copy. Otherwise we need to perform misaligned
+	 * word-wise copy.
+	 */
+	andi	a3, a1, SZREG-1
+	bnez	a3, .Lmisaligned_word_copy
 
+	/* Unrolled wordwise copy */
+	addi	t0, t0, -(16*SZREG-1)
+	bgeu	a0, t0, 2f
+1:
+	REG_L	a2,        0(a1)
+	REG_L	a3,    SZREG(a1)
+	REG_L	a4,  2*SZREG(a1)
+	REG_L	a5,  3*SZREG(a1)
+	REG_L	a6,  4*SZREG(a1)
+	REG_L	a7,  5*SZREG(a1)
+	REG_L	t1,  6*SZREG(a1)
+	REG_L	t2,  7*SZREG(a1)
+	REG_L	t3,  8*SZREG(a1)
+	REG_L	t4,  9*SZREG(a1)
+	REG_L	t5, 10*SZREG(a1)
+	REG_S	a2,        0(a0)
+	REG_S	a3,    SZREG(a0)
+	REG_S	a4,  2*SZREG(a0)
+	REG_S	a5,  3*SZREG(a0)
+	REG_S	a6,  4*SZREG(a0)
+	REG_S	a7,  5*SZREG(a0)
+	REG_S	t1,  6*SZREG(a0)
+	REG_S	t2,  7*SZREG(a0)
+	REG_S	t3,  8*SZREG(a0)
+	REG_S	t4,  9*SZREG(a0)
+	REG_S	t5, 10*SZREG(a0)
+	REG_L	a2, 11*SZREG(a1)
+	REG_L	a3, 12*SZREG(a1)
+	REG_L	a4, 13*SZREG(a1)
+	REG_L	a5, 14*SZREG(a1)
+	REG_L	a6, 15*SZREG(a1)
+	addi	a1, a1, 16*SZREG
+	REG_S	a2, 11*SZREG(a0)
+	REG_S	a3, 12*SZREG(a0)
+	REG_S	a4, 13*SZREG(a0)
+	REG_S	a5, 14*SZREG(a0)
+	REG_S	a6, 15*SZREG(a0)
+	addi	a0, a0, 16*SZREG
+	bltu	a0, t0, 1b
 2:
-	andi a4, a2, ~((16*SZREG)-1)
-	beqz a4, 4f
-	add a3, a1, a4
-3:
-	REG_L a4,       0(a1)
-	REG_L a5,   SZREG(a1)
-	REG_L a6, 2*SZREG(a1)
-	REG_L a7, 3*SZREG(a1)
-	REG_L t0, 4*SZREG(a1)
-	REG_L t1, 5*SZREG(a1)
-	REG_L t2, 6*SZREG(a1)
-	REG_L t3, 7*SZREG(a1)
-	REG_L t4, 8*SZREG(a1)
-	REG_L t5, 9*SZREG(a1)
-	REG_S a4,       0(t6)
-	REG_S a5,   SZREG(t6)
-	REG_S a6, 2*SZREG(t6)
-	REG_S a7, 3*SZREG(t6)
-	REG_S t0, 4*SZREG(t6)
-	REG_S t1, 5*SZREG(t6)
-	REG_S t2, 6*SZREG(t6)
-	REG_S t3, 7*SZREG(t6)
-	REG_S t4, 8*SZREG(t6)
-	REG_S t5, 9*SZREG(t6)
-	REG_L a4, 10*SZREG(a1)
-	REG_L a5, 11*SZREG(a1)
-	REG_L a6, 12*SZREG(a1)
-	REG_L a7, 13*SZREG(a1)
-	REG_L t0, 14*SZREG(a1)
-	REG_L t1, 15*SZREG(a1)
-	addi a1, a1, 16*SZREG
-	REG_S a4, 10*SZREG(t6)
-	REG_S a5, 11*SZREG(t6)
-	REG_S a6, 12*SZREG(t6)
-	REG_S a7, 13*SZREG(t6)
-	REG_S t0, 14*SZREG(t6)
-	REG_S t1, 15*SZREG(t6)
-	addi t6, t6, 16*SZREG
-	bltu a1, a3, 3b
-	andi a2, a2, (16*SZREG)-1  /* Update count */
-
-4:
-	/* Handle trailing misalignment */
-	beqz a2, 6f
-	add a3, a1, a2
-
-	/* Use word-oriented copy if co-aligned to word boundary */
-	or a5, a1, t6
-	or a5, a5, a3
-	andi a5, a5, 3
-	bnez a5, 5f
-7:
-	lw a4, 0(a1)
-	addi a1, a1, 4
-	sw a4, 0(t6)
-	addi t6, t6, 4
-	bltu a1, a3, 7b
+	/* Post-loop increment by 16*SZREG-1 and pre-loop decrement by SZREG-1 */
+	addi	t0, t0, 15*SZREG
 
-	ret
+	/* Wordwise copy */
+	bgeu	a0, t0, 2f
+1:
+	REG_L	a5, 0(a1)
+	addi	a1, a1, SZREG
+	REG_S	a5, 0(a0)
+	addi	a0, a0, SZREG
+	bltu	a0, t0, 1b
+2:
+	addi	t0, t0, SZREG-1
 
-5:
-	lb a4, 0(a1)
-	addi a1, a1, 1
-	sb a4, 0(t6)
-	addi t6, t6, 1
-	bltu a1, a3, 5b
-6:
+.Lbyte_copy_tail:
+	/*
+	 * Bytewise copy anything left.
+	 */
+	beq	a0, t0, 2f
+1:
+	lb	a5, 0(a1)
+	addi	a1, a1, 1
+	sb	a5, 0(a0)
+	addi	a0, a0, 1
+	bne	a0, t0, 1b
+2:
+
+	mv	a0, t6
 	ret
+
+.Lmisaligned_word_copy:
+	/*
+	 * Misaligned word-wise copy.
+	 * For misaligned copy we still perform word-wise copy, but we need to
+	 * use the value fetched from the previous iteration and do some shifts.
+	 * This is safe because we wouldn't access more words than necessary.
+	 */
+
+	/* Calculate shifts */
+	slli	t3, a3, 3
+	sub	t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+	/* Load the initial value and align a1 */
+	andi	a1, a1, ~(SZREG-1)
+	REG_L	a5, 0(a1)
+
+	addi	t0, t0, -(SZREG-1)
+	/* At least one iteration will be executed here, no check */
+1:
+	srl	a4, a5, t3
+	REG_L	a5, SZREG(a1)
+	addi	a1, a1, SZREG
+	sll	a2, a5, t4
+	or	a2, a2, a4
+	REG_S	a2, 0(a0)
+	addi	a0, a0, SZREG
+	bltu	a0, t0, 1b
+
+	/* Update pointers to correct value */
+	addi	t0, t0, SZREG-1
+	add	a1, a1, a3
+
+	j	.Lbyte_copy_tail
 END(__memcpy)
diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
index 07d1d2152ba5..fbe6701dbe4a 100644
--- a/arch/riscv/lib/memmove.S
+++ b/arch/riscv/lib/memmove.S
@@ -5,60 +5,124 @@ 
 
 ENTRY(__memmove)
 WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
-
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
-
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
-
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
-
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
-
-byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
+	/*
+	 * Here we determine if forward copy is possible. Forward copy is
+	 * preferred to backward copy as it is more cache friendly.
+	 *
+	 * If a0 >= a1, t0 gives their distance, if t0 >= a2 then we can
+	 *   copy forward.
+	 * If a0 < a1, we can always copy forward. This will make t0 negative,
+	 *   so a *unsigned* comparison will always have t0 >= a2.
+	 *
+	 * For forward copy we just delegate the task to memcpy.
+	 */
+	sub	t0, a0, a1
+	bltu	t0, a2, 1f
+	tail	__memcpy
+1:
+
+	/*
+	 * Register allocation for code below:
+	 * a0 - end of uncopied dst
+	 * a1 - end of uncopied src
+	 * t0 - start of uncopied dst
+	 */
+	mv	t0, a0
+	add	a0, a0, a2
+	add	a1, a1, a2
+
+	/*
+	 * Use bytewise copy if too small.
+	 *
+	 * This threshold must be at least 2*SZREG to ensure at least one
+	 * wordwise copy is performed. It is chosen to be 16 because it will
+	 * save at least 7 iterations of bytewise copy, which pays off the
+	 * fixed overhead.
+	 */
+	li	a3, 16
+	bltu	a2, a3, .Lbyte_copy_tail
+
+	/*
+	 * Bytewise copy first to align t0 to word boundary.
+	 */
+	andi	a2, a0, ~(SZREG-1)
+	beq	a0, a2, 2f
+1:
+	addi	a1, a1, -1
+	lb	a5, 0(a1)
+	addi	a0, a0, -1
+	sb	a5, 0(a0)
+	bne	a0, a2, 1b
+2:
+
+	/*
+	 * Now a0 is word-aligned. If a1 is also word aligned, we could perform
+	 * aligned word-wise copy. Otherwise we need to perform misaligned
+	 * word-wise copy.
+	 */
+	andi	a3, a1, SZREG-1
+	bnez	a3, .Lmisaligned_word_copy
+
+	/* Wordwise copy */
+	addi	t0, t0, SZREG-1
+	bleu	a0, t0, 2f
+1:
+	addi	a1, a1, -SZREG
+	REG_L	a5, 0(a1)
+	addi	a0, a0, -SZREG
+	REG_S	a5, 0(a0)
+	bgtu	a0, t0, 1b
+2:
+	addi	t0, t0, -(SZREG-1)
+
+.Lbyte_copy_tail:
+	/*
+	 * Bytewise copy anything left.
+	 */
+	beq	a0, t0, 2f
+1:
+	addi	a1, a1, -1
+	lb	a5, 0(a1)
+	addi	a0, a0, -1
+	sb	a5, 0(a0)
+	bne	a0, t0, 1b
+2:
+
+	mv	a0, t0
+	ret
+
+.Lmisaligned_word_copy:
+	/*
+	 * Misaligned word-wise copy.
+	 * For misaligned copy we still perform word-wise copy, but we need to
+	 * use the value fetched from the previous iteration and do some shifts.
+	 * This is safe because we wouldn't access more words than necessary.
+	 */
+
+	/* Calculate shifts */
+	slli	t3, a3, 3
+	sub	t4, x0, t3 /* negate is okay as shift will only look at LSBs */
+
+	/* Load the initial value and align a1 */
+	andi	a1, a1, ~(SZREG-1)
+	REG_L	a5, 0(a1)
+
+	addi	t0, t0, SZREG-1
+	/* At least one iteration will be executed here, no check */
+1:
+	sll	a4, a5, t4
+	addi	a1, a1, -SZREG
+	REG_L	a5, 0(a1)
+	srl	a2, a5, t3
+	or	a2, a2, a4
+	addi	a0, a0, -SZREG
+	REG_S	a2, 0(a0)
+	bgtu	a0, t0, 1b
+
+	/* Update pointers to correct value */
+	addi	t0, t0, -(SZREG-1)
+	add	a1, a1, a3
+
+	j	.Lbyte_copy_tail
+
 END(__memmove)