
[v2] Fixed: Misaligned memory access. Fixed pointer comparison.

Message ID 20220123034518.3717116-1-michael@michaelkloos.com (mailing list archive)
State New, archived
Series: [v2] Fixed: Misaligned memory access. Fixed pointer comparison.

Commit Message

Michael T. Kloos Jan. 23, 2022, 3:45 a.m. UTC
Rewrote the riscv memmove() assembly implementation.  The
previous implementation did not check memory alignment and it
compared 2 pointers with a signed comparison.  The misaligned
memory access would cause the kernel to crash on systems that
did not emulate it in firmware and did not support it in hardware.
Firmware emulation is slow and may not exist.  Additionally,
hardware support may not exist and would likely still run slower
than aligned accesses even if it did.  The RISC-V spec does not
guarantee that support for misaligned memory accesses will exist.
It should not be depended on.

This patch now checks for the maximum granularity of co-alignment
between the pointers and copies them with that, using single-byte
copy for any unaligned data at their terminations.  It also now uses
unsigned comparison for the pointers.
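
For illustration only (this C sketch is not part of the patch), the
co-alignment check amounts to the following, mirroring the xor/andi
sequence in the assembly:

	#include <stdint.h>	/* uintptr_t */
	#include <stddef.h>	/* size_t */

	/* Widest access size at which src and dest can be simultaneously
	 * aligned after copying some leading bytes one at a time. */
	static size_t coalignment(const void *dest, const void *src)
	{
		uintptr_t diff = (uintptr_t)dest ^ (uintptr_t)src;

		if (!(diff & 0x7))
			return 8;	/* doubleword copy (64-bit builds) */
		if (!(diff & 0x3))
			return 4;	/* word copy */
		if (!(diff & 0x1))
			return 2;	/* half-word copy */
		return 1;		/* fall back to byte copy */
	}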

Added half-word and, if built for 64-bit, double-word copy.

Migrated to the newer assembler annotations from the now deprecated
ones.

Commit Message Edited on Jan 22 2022: Fixed some typos.

[v2]

Per the kernel test robot, I have fixed the build under clang.  This
was broken due to a difference between gcc and clang: clang requires
explicit zero offsets for the jalr instruction, while gcc allows them
to be omitted if zero.

While I was at it, I clarified the comment for that line of code.

I have fixed some typos in the v1 commit message, but left the 1st
line as it was to allow the email thread to be followed.

Signed-off-by: Michael T. Kloos <michael@michaelkloos.com>
---
 arch/riscv/lib/memmove.S | 264 +++++++++++++++++++++++++++++++--------
 1 file changed, 210 insertions(+), 54 deletions(-)

Comments

David Laight Jan. 23, 2022, 1:31 p.m. UTC | #1
From: michael@michaelkloos.com
> Sent: 23 January 2022 03:45
> 
> Rewrote the riscv memmove() assembly implementation.  The
> previous implementation did not check memory alignment and it
> compared 2 pointers with a signed comparison.  The misaligned
> memory access would cause the kernel to crash on systems that
> did not emulate it in firmware and did not support it in hardware.
> Firmware emulation is slow and may not exist.  Additionally,
> hardware support may not exist and would likely still run slower
> than aligned accesses even if it did.  The RISC-V spec does not
> guarantee that support for misaligned memory accesses will exist.
> It should not be depended on.
...

From the way my email client displays the patch, I think it is
using both tabs and spaces for indentation.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jessica Clarke Jan. 23, 2022, 3:44 p.m. UTC | #2
On 23 Jan 2022, at 03:45, Michael T. Kloos <michael@michaelkloos.com> wrote:
> 
> Rewrote the riscv memmove() assembly implementation.  The
> previous implementation did not check memory alignment and it
> compared 2 pointers with a signed comparison.  The misaligned
> memory access would cause the kernel to crash on systems that
> did not emulate it in firmware and did not support it in hardware.
> Firmware emulation is slow and may not exist.  Additionally,
> hardware support may not exist and would likely still run slower
> than aligned accesses even if it did.  The RISC-V spec does not
> guarantee that support for misaligned memory accesses will exist.
> It should not be depended on.
> 
> This patch now checks for the maximum granularity of co-alignment
> between the pointers and copies them with that, using single-byte
> copy for any unaligned data at their terminations.  It also now uses
> unsigned comparison for the pointers.
> 
> Added half-word and, if built for 64-bit, double-word copy.
> 
> Migrated to the newer assembler annotations from the now deprecated
> ones.
> 
> Commit Message Edited on Jan 22 2022: Fixed some typos.
> 
> [v2]
> 
> Per the kernel test robot, I have fixed the build under clang.  This
> was broken due to a difference between gcc and clang: clang requires
> explicit zero offsets for the jalr instruction, while gcc allows them
> to be omitted if zero.

Unlike LLVM, GCC does not have an assembler; that's binutils's GNU as.

Jess
Michael T. Kloos Jan. 23, 2022, 4:53 p.m. UTC | #3
No.  It only uses tabs.  The previous version used spaces.
Make sure that you are not looking at a line with a '-'.
The only place that spaces, perhaps combined with tabs, appear
at the start of a line in my patch is to align the '*' character
for a multi-line comment.  In this case, tab(s) are followed by
a single space for alignment.  I believe this is correct per
the coding style.  If I am wrong, please let me know.

	----Michael

On 1/23/22 08:31, David Laight wrote:

> From: michael@michaelkloos.com
>> Sent: 23 January 2022 03:45
>>
>> Rewrote the riscv memmove() assembly implementation.  The
>> previous implementation did not check memory alignment and it
>> compared 2 pointers with a signed comparison.  The misaligned
>> memory access would cause the kernel to crash on systems that
>> did not emulate it in firmware and did not support it in hardware.
>> Firmware emulation is slow and may not exist.  Additionally,
>> hardware support may not exist and would likely still run slower
>> than aligned accesses even if it did.  The RISC-V spec does not
>> guarantee that support for misaligned memory accesses will exist.
>> It should not be depended on.
> ...
>
> From the way my email client displays the patch, I think it is
> using both tabs and spaces for indentation.
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
Michael T. Kloos Jan. 23, 2022, 5:15 p.m. UTC | #4
Yes, you are correct, but while I am not that knowledgeable in the kernel
build system, I believe that in many build systems, binutils "as" is usually
called by the "gcc" wrapper, rather than being executed directly.  This
allows for the C preprocessor to be easily and automatically run over *.S
files first.  Binutils "as" doesn't care about suffix.  It just assembles.
"gcc" will check and call the other tools as necessary to build.

Perhaps I should have been more specific in my language.  However,
I was trying to refer to the two different build systems in their entirety,
not their individual sub-components, because I had originally test-built
with gcc, not clang.

	Michael

On 1/23/22 10:44, Jessica Clarke wrote:

> On 23 Jan 2022, at 03:45, Michael T. Kloos <michael@michaelkloos.com> wrote:
>> Rewrote the riscv memmove() assembly implementation.  The
>> previous implementation did not check memory alignment and it
>> compared 2 pointers with a signed comparison.  The misaligned
>> memory access would cause the kernel to crash on systems that
>> did not emulate it in firmware and did not support it in hardware.
>> Firmware emulation is slow and may not exist.  Additionally,
>> hardware support may not exist and would likely still run slower
>> than aligned accesses even if it did.  The RISC-V spec does not
>> guarantee that support for misaligned memory accesses will exist.
>> It should not be depended on.
>>
>> This patch now checks for the maximum granularity of co-alignment
>> between the pointers and copies them with that, using single-byte
>> copy for any unaligned data at their terminations.  It also now uses
>> unsigned comparison for the pointers.
>>
>> Added half-word and, if built for 64-bit, double-word copy.
>>
>> Migrated to the newer assembler annotations from the now deprecated
>> ones.
>>
>> Commit Message Edited on Jan 22 2022: Fixed some typos.
>>
>> [v2]
>>
>> Per the kernel test robot, I have fixed the build under clang.  This
>> was broken due to a difference between gcc and clang: clang requires
>> explicit zero offsets for the jalr instruction, while gcc allows them
>> to be omitted if zero.
> Unlike LLVM, GCC does not have an assembler; that's binutils's GNU as.
>
> Jess
>
David Laight Jan. 23, 2022, 10:35 p.m. UTC | #5
From: michael@michaelkloos.com
> Sent: 23 January 2022 16:54
> 
> No.  It only uses tabs.  The previous version used spaces.

Ah, it was the old one that was wrong.
One of them was.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Michael T. Kloos Jan. 23, 2022, 11:03 p.m. UTC | #6
No problem.  If you have any other concerns about my patch, please let me know.

	Michael

On 1/23/22 17:35, David Laight wrote:

> From: michael@michaelkloos.com
>> Sent: 23 January 2022 16:54
>>
>> No.  It only uses tabs.  The previous version used spaces.
> Ah, it was the old one that was wrong.
> One of them was.
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
David Laight Jan. 24, 2022, 9:21 a.m. UTC | #7
From: michael@michaelkloos.com
> Sent: 23 January 2022 03:45
> 
> Rewrote the riscv memmove() assembly implementation.  The
> previous implementation did not check memory alignment and it
> compared 2 pointers with a signed comparison.  The misaligned
> memory access would cause the kernel to crash on systems that
> did not emulate it in firmware and did not support it in hardware.
> Firmware emulation is slow and may not exist.  Additionally,
> hardware support may not exist and would likely still run slower
> than aligned accesses even if it did.

That may not be true.
On x86 the cost of misaligned accesses is only just measurable.
It isn't even one clock per cache line for reads (e.g. ipcsum).

> The RISC-V spec does not
> guarantee that support for misaligned memory accesses will exist.
> It should not be depended on.
> 
> This patch now checks for the maximum granularity of co-alignment
> between the pointers and copies them with that, using single-byte
> copy for any unaligned data at their terminations.  It also now uses
> unsigned comparison for the pointers.

If the performance of misaligned copies ever matters it is probably
better to use:
	*dst++ = src[0] >> n | src[1] << (64 - n);
for the body of the misaligned loop.
You can always read the aligned src[] even if outside the buffer.
So the only difficult part is writing the odd bytes
and getting 'n' and the direction of the shifts right!
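
A minimal little-endian C sketch of that loop body, assuming dst is
already register-aligned, src is offset by 1..7 bytes within its
aligned word, and reading the aligned words that contain the buffer
ends is acceptable (the helper name is made up):

	#include <stdint.h>
	#include <stddef.h>

	static void copy_misaligned(uint64_t *dst, const uint8_t *src, size_t nwords)
	{
		const uint64_t *s = (const uint64_t *)((uintptr_t)src & ~(uintptr_t)7);
		unsigned int shl = 8 * ((uintptr_t)src & 7);	/* src offset: 1..7 bytes */
		unsigned int shr = 64 - shl;
		uint64_t cur = *s++;		/* aligned read before src start is fine */

		while (nwords--) {
			uint64_t next = *s++;	/* aligned read past the end is fine */
			*dst++ = (cur >> shl) | (next << shr);
			cur = next;
		}
	}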

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Michael T. Kloos Jan. 24, 2022, 7:19 p.m. UTC | #8
> That may not be true.
> On x86 the cost of misaligned accesses is only just measurable.
> It isn't even one clock per cache line for reads (e.g. ipcsum).
I know that the Intel manuals still recommend alignment on x86.  I
haven't tried to measure performance differences yet.

I think the issue here is that RISC-V is designed as a modular
architecture.  Unlike x86, we don't know whether misaligned accesses
will be supported.  I will grant you that if they are
supported by hardware, it will probably be faster to let the hardware
natively take care of it.  However, if the hardware doesn't support
it, the kernel won't be compatible with that hardware.

Now this could be worked around by having the firmware emulate it.
This is not specified in the RISC-V SBI spec.  So depending on this
would mean that the Linux kernel needs additional support from the
firmware beyond what the bare SBI specifies.  The counter-argument for this
is that it is a hardware implementation detail and that the firmware
should therefore handle the translation.  This is already the case
for the time CSRs on platforms where they are implemented in memory
rather than as real CSRs.  However, I don't think that this can be
considered the same thing as memory accesses, which, in my opinion,
are a much more common and primitive operation.  The RISC-V Privileged spec
also lists the time CSRs with the associated CSR addresses.
The CSRs provide needed special functionality that cannot be
achieved without specific driver support for alternative
implementations.  I would consider memory accesses and the
functionality controlled by CSRs to be in very different categories.
Memory accesses are a basic part of a program, defined in the base
RV32I spec.

Firmware emulation would also be dramatically slower.  We are
talking about trapping the exception, saving context, determining
cause, checking permissions, emulating behavior, restoring context,
and finally returning from the trap.  I would be strongly opposed to
requiring this for misaligned memory accesses where the hardware
doesn't support it.  It is likely that we are already dealing with
slower, less capable hardware in that case anyway.  So I would not
recommend adding the additional overhead.

If people are strongly in favor of allowing native handling of
misaligned memory accesses, a possible compromise is the addition
of a CONFIG option such as "Allow Kernel Misaligned Memory Access",
probably under the "Platform Type" section in menuconfig.  This
could be enabled for hardware which supported it, turning on
alternative code-paths that didn't care about alignment, allowing
the hardware or firmware to handle it as needed.  I would
recommend, at least until RISC-V hardware matures, this CONFIG
option default to disabled.
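
As a rough sketch of the idea (the config symbol and helper names
below are hypothetical; nothing here exists in the tree), the
selection might look like this at build time:

	/* Hypothetical sketch: CONFIG_RISCV_MISALIGNED_ACCESS and the
	 * two helpers do not exist; this only illustrates gating the
	 * alternative code paths on a Kconfig option. */
	#include <linux/kconfig.h>
	#include <linux/types.h>

	void *memmove_any_alignment(void *dest, const void *src, size_t count);
	void *memmove_alignment_safe(void *dest, const void *src, size_t count);

	void *memmove(void *dest, const void *src, size_t count)
	{
		if (IS_ENABLED(CONFIG_RISCV_MISALIGNED_ACCESS))
			return memmove_any_alignment(dest, src, count);
		return memmove_alignment_safe(dest, src, count);
	}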

Anyone who has feedback on this, please let me know.

> If the performance of misaligned copies ever matters it is probably
> better to use:
> 	*dst++ = src[0] >> n | src[1] << (64 - n);
> for the body of the misaligned loop.
> You can always read the aligned src[] even if outside the buffer.
> So the only difficult part is writing the odd bytes
> and getting 'n' and the direction of the shifts right!
I had considered this.  I wondered if it would really be faster, given
the additional instructions required on each iteration, and I had
avoided shifts because hardware might implement them with a clocked
shift register rather than a barrel shifter.  But you are probably
correct.  It would
probably still be faster.  Using that instead would also greatly
simplify my code since I wouldn't need a bunch of different loops
for each level of granularity.  I would just do it at the native
XLEN size.  I'll start working on this change.

	Michael

On 1/24/22 04:21, David Laight wrote:

> From: michael@michaelkloos.com
>> Sent: 23 January 2022 03:45
>>
>> Rewrote the riscv memmove() assembly implementation.  The
>> previous implementation did not check memory alignment and it
>> compared 2 pointers with a signed comparison.  The misaligned
>> memory access would cause the kernel to crash on systems that
>> did not emulate it in firmware and did not support it in hardware.
>> Firmware emulation is slow and may not exist.  Additionally,
>> hardware support may not exist and would likely still run slower
>> than aligned accesses even if it did.
> That may not be true.
> On x86 the cost of misaligned accesses is only just measurable.
> It isn't even one clock per cache line for reads (e.g. ipcsum).
>
>> The RISC-V spec does not
>> guarantee that support for misaligned memory accesses will exist.
>> It should not be depended on.
>>
>> This patch now checks for the maximum granularity of co-alignment
>> between the pointers and copies them with that, using single-byte
>> copy for any unaligned data at their terminations.  It also now uses
>> unsigned comparison for the pointers.
> If the performance of misaligned copies ever matters it is probably
> better to use:
> 	*dst++ = src[0] >> n | src[1] << (64 - n);
> for the body of the misaligned loop.
> You can always read the aligned src[] even if outside the buffer.
> So the only difficult part is writing the odd bytes
> and getting 'n' and the direction of the shifts right!
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
David Laight Jan. 24, 2022, 10:28 p.m. UTC | #9
From: michael@michaelkloos.com
> Sent: 24 January 2022 19:19

Re-instating the bit I commented on ..
> > ... Additionally, hardware support may not exist and would likely
> > still run slower than aligned accesses even if it did.
> 
> > That may not be true.
> > On x86 the cost of misaligned accesses is only just measurable.
> > It isn't even one clock per cache line for reads (e.g. ipcsum).

> I know that the Intel manuals still recommend alignment on x86.  I
> haven't tried to measure performance differences yet.

IIRC they recommend aligned writes in particular.
(And don't do misaligned locked accesses that cross page boundaries.)

I've done some measurements for reads and the cost really was minimal.
You'd need to be doing a high proportion of multi-kb misaligned transfers
to cover the cost of any conditional test on aligned transfers.

> I think the issue here is that RISC-V is designed as a modular
> architecture.  Unlike x86, we don't know whether misaligned accesses
> will be supported.  I will grant you that if they are
> supported by hardware, it will probably be faster to let the hardware
> natively take care of it.  However, if the hardware doesn't support
> it, the kernel won't be compatible with that hardware.

Indeed you really don't want to be fixing up alignment faults - ever.
I've no idea why that ever became acceptable.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Patch

diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
index 07d1d2152ba5..db4cfeb59afd 100644
--- a/arch/riscv/lib/memmove.S
+++ b/arch/riscv/lib/memmove.S
@@ -1,64 +1,220 @@ 
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com>
+ */
 
 #include <linux/linkage.h>
 #include <asm/asm.h>
 
-ENTRY(__memmove)
-WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
+SYM_FUNC_START(__memmove)
+SYM_FUNC_START_ALIAS(memmove)
+	/*
+	 * Returns
+	 *   a0 - dest
+	 *
+	 * Parameters
+	 *   a0 - Inclusive first byte of dest
+	 *   a1 - Inclusive first byte of src
+	 *   a2 - Length of copy
+	 *
+	 * Because the return matches the parameter register a0,
+	 * we will not clobber or modify that register.
+	 */
 
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
+	/* Return if nothing to do */
+	beq a0, a1, exit_memmove
+	beqz a2, exit_memmove
 
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
+	/*
+	 * Register Uses
+	 *   a3 - Inclusive first multibyte of src
+	 *   a4 - Non-inclusive last multibyte of src
+	 *   a5 - Non-inclusive last byte of src
+	 *
+	 * During the copy
+	 *      Forward Copy: a1 - Index counter of src
+	 *      Reverse Copy: a5 - Index counter of src
+	 *   Both Copy Modes: t2 - Index counter of dest
+	 *   Both Copy Modes: t1 - Temporary for load-store
+	 *   Both Copy Modes: t0 - Link
+	 */
 
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
+	/*
+	 * Solve for last byte now.  We will solve the rest when
+	 * they are needed for the copy because either byte copy
+	 * does not require any of the others (Wasted effort if
+	 * byte copy gets used) or we do not yet have enough
+	 * information to solve them.
+	 */
+	add  a5, a1, a2
 
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
+	/*
+	 * Byte copy if copying less than SZREG bytes.
+	 * This can cause problems with the bulk copy
+	 * implementation below and is small enough not
+	 * to bother.
+	 */
+	andi t0, a2, -SZREG
+	beqz t0, byte_copy
+
+	/* Determine the maximum granularity of co-alignment. */
+	xor  t0, a0, a1
+#if   SZREG >= 8
+	andi t1, t0, 0x7
+	beqz t1, doubleword_copy
+#endif
+	andi t1, t0, 0x3
+	beqz t1, word_copy
+	andi t1, t0, 0x1
+	beqz t1, halfword_copy
+	/* Fall through to byte copy if nothing larger is found. */
 
 byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
-END(__memmove)
+	bltu a1, a0, byte_copy_reverse
+
+byte_copy_forward:
+	add  t2, a0, zero
+byte_copy_fw_callin:
+	beq  a1, a5, exit_memmove
+	lb   t1, (a1)
+	sb   t1, (t2)
+	addi a1, a1, 1
+	addi t2, t2, 1
+	j byte_copy_fw_callin
+
+byte_copy_reverse:
+	add  t2, a0, a2
+byte_copy_rv_callin:
+	beq  a1, a5, exit_memmove
+	addi a5, a5, -1
+	addi t2, t2, -1
+	lb   t1, (a5)
+	sb   t1, (t2)
+	j byte_copy_rv_callin
+
+exit_memmove:
+	ret
+
+copy_bytes_until_aligned_fw:
+	beq  a1, a3, 1f /* Reuse the return from the other copy loop */
+	lb   t1, (a1)
+	sb   t1, (t2)
+	addi a1, a1, 1
+	addi t2, t2, 1
+	j copy_bytes_until_aligned_fw
+
+copy_bytes_until_aligned_rv:
+	beq  a4, a5, 1f
+	addi a5, a5, -1
+	addi t2, t2, -1
+	lb   t1, (a5)
+	sb   t1, (t2)
+	j copy_bytes_until_aligned_rv
+	1: jalr zero, 0x0(t0) /* Return to multibyte copy loop */
+
+#if   SZREG >= 8
+doubleword_copy:
+	andi a3, a1, -8
+	andi a4, a5, -8
+	beq  a3, a1, 1f
+	addi a3, a3, 8
+	1:
+	bltu a1, a0, doubleword_copy_reverse
+
+doubleword_copy_forward:
+	add  t2, a0, zero
+
+	jal t0, copy_bytes_until_aligned_fw
+
+	1:
+	beq  a1, a4, byte_copy_fw_callin
+	ld   t1, (a1)
+	sd   t1, (t2)
+	addi a1, a1, 8
+	addi t2, t2, 8
+	j 1b
+
+doubleword_copy_reverse:
+	add  t2, a0, a2
+
+	jal t0, copy_bytes_until_aligned_rv
+
+	1:
+	beq  a3, a5, byte_copy_rv_callin
+	addi a5, a5, -8
+	addi t2, t2, -8
+	ld   t1, (a5)
+	sd   t1, (t2)
+	j 1b
+#endif
+
+word_copy:
+	andi a3, a1, -4
+	andi a4, a5, -4
+	beq  a3, a1, 1f
+	addi a3, a3, 4
+	1:
+	bltu a1, a0, word_copy_reverse
+
+word_copy_forward:
+	add  t2, a0, zero
+
+	jal t0, copy_bytes_until_aligned_fw
+
+	1:
+	beq  a1, a4, byte_copy_fw_callin
+	lw   t1, (a1)
+	sw   t1, (t2)
+	addi a1, a1, 4
+	addi t2, t2, 4
+	j 1b
+
+word_copy_reverse:
+	add  t2, a0, a2
+
+	jal t0, copy_bytes_until_aligned_rv
+
+	1:
+	beq  a3, a5, byte_copy_rv_callin
+	addi a5, a5, -4
+	addi t2, t2, -4
+	lw   t1, (a5)
+	sw   t1, (t2)
+	j 1b
+
+halfword_copy:
+	andi a3, a1, -2
+	andi a4, a5, -2
+	beq  a3, a1, 1f
+	addi a3, a3, 2
+	1:
+	bltu a1, a0, halfword_reverse
+
+halfword_forward:
+	add  t2, a0, zero
+
+	jal t0, copy_bytes_until_aligned_fw
+
+	1:
+	beq  a1, a4, byte_copy_fw_callin
+	lh   t1, (a1)
+	sh   t1, (t2)
+	addi a1, a1, 2
+	addi t2, t2, 2
+	j 1b
+
+halfword_reverse:
+	add  t2, a0, a2
+
+	jal t0, copy_bytes_until_aligned_rv
+
+	1:
+	beq  a3, a5, byte_copy_rv_callin
+	addi a5, a5, -2
+	addi t2, t2, -2
+	lh   t1, (a5)
+	sh   t1, (t2)
+	j 1b
+
+SYM_FUNC_END_ALIAS(memmove)
+SYM_FUNC_END(__memmove)