Message ID | 68d3229c208780fdde88ba48e14fcde8c50f6e76.1620738177.git.robin.murphy@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: String function updates |
On Tue, May 11, 2021 at 05:12:37PM +0100, Robin Murphy wrote:
> Although we implement our own assembly version of memchr(), it turns
> out to be barely any better than what GCC can generate for the generic
> C version (and would go wrong if the size_t argument were ever large
> enough to be interpreted as negative). Unfortunately we can't import the
> tuned implementation from the Arm optimized-routines library, since that
> has some Advanced SIMD parts which are not really viable for general
> kernel library code. What we can do, however, is pep things up with some
> relatively straightforward word-at-a-time logic for larger calls.
>
> Adding some timing to optimized-routines' memchr() test for a simple
> benchmark, overall this version comes in around half as fast as the SIMD
> code, but still nearly 4x faster than our existing implementation.
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>

I haven't reviewed the code yet but wondering - could we write this in C
using load_unaligned_zeropad()?
On 2021-05-14 15:55, Catalin Marinas wrote:
> On Tue, May 11, 2021 at 05:12:37PM +0100, Robin Murphy wrote:
>> Although we implement our own assembly version of memchr(), it turns
>> out to be barely any better than what GCC can generate for the generic
>> C version (and would go wrong if the size_t argument were ever large
>> enough to be interpreted as negative). Unfortunately we can't import the
>> tuned implementation from the Arm optimized-routines library, since that
>> has some Advanced SIMD parts which are not really viable for general
>> kernel library code. What we can do, however, is pep things up with some
>> relatively straightforward word-at-a-time logic for larger calls.
>>
>> Adding some timing to optimized-routines' memchr() test for a simple
>> benchmark, overall this version comes in around half as fast as the SIMD
>> code, but still nearly 4x faster than our existing implementation.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>
> I haven't reviewed the code yet but wondering - could we write this in C
> using load_unaligned_zeropad()?

I've had a hack around with a couple of C implementations this afternoon,
and they seem to come out roughly 85% as fast as this asm version. I'm not
sure how much extra overhead load_unaligned_zeropad() would add with
wiggling PSTATE.TCO all the time, though.

Robin.
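For readers curious what such a C version might look like, the sketch below is purely illustrative: it is not Robin's experimental code, the function name is invented, and it deliberately avoids load_unaligned_zeropad() by aligning up front and finishing byte-by-byte. It uses the same zero-byte trick as the assembly: broadcast the target byte, XOR it into each loaded word, and test `(x - 0x01..01) & ~x & 0x80..80`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_80 0x8080808080808080ULL

/* Illustrative sketch only -- not the kernel implementation. */
static void *memchr_word_at_a_time(const void *src, int c, size_t n)
{
	const unsigned char *p = src;
	uint64_t repchr = (unsigned char)c * REP8_01;

	/* Head: go byte-by-byte until p is 8-byte aligned. */
	while (n && ((uintptr_t)p & 7)) {
		if (*p == (unsigned char)c)
			return (void *)p;
		p++;
		n--;
	}

	/* Body: test one 64-bit word per iteration. */
	while (n >= 8) {
		uint64_t x;

		memcpy(&x, p, 8);		/* aligned load */
		x ^= repchr;			/* matching bytes become 0x00 */
		if ((x - REP8_01) & ~x & REP8_80)
			break;			/* some byte matched: locate it below */
		p += 8;
		n -= 8;
	}

	/* Tail (or the word containing the match): byte-by-byte. */
	for (; n; p++, n--)
		if (*p == (unsigned char)c)
			return (void *)p;

	return NULL;
}
```

A kernel variant could replace the head/tail byte loops with load_unaligned_zeropad(), which is presumably where the PSTATE.TCO overhead Robin mentions above would come in.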
diff --git a/arch/arm64/lib/memchr.S b/arch/arm64/lib/memchr.S
index edf6b970a277..7c2276fdab54 100644
--- a/arch/arm64/lib/memchr.S
+++ b/arch/arm64/lib/memchr.S
@@ -1,9 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0-only */
 /*
- * Based on arch/arm/lib/memchr.S
- *
- * Copyright (C) 1995-2000 Russell King
- * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2021 Arm Ltd.
  */

 #include <linux/linkage.h>
@@ -19,16 +16,60 @@
  * Returns:
  *	x0 - address of first occurrence of 'c' or 0
  */
+
+#define L(label) .L ## label
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+
+#define srcin		x0
+#define chrin		w1
+#define cntin		x2
+
+#define result		x0
+
+#define wordcnt		x3
+#define rep01		x4
+#define repchr		x5
+#define cur_word	x6
+#define cur_byte	w6
+#define tmp		x7
+#define tmp2		x8
+
+	.p2align 4
+	nop
 SYM_FUNC_START_WEAK_PI(memchr)
-	and	w1, w1, #0xff
-1:	subs	x2, x2, #1
-	b.mi	2f
-	ldrb	w3, [x0], #1
-	cmp	w3, w1
-	b.ne	1b
-	sub	x0, x0, #1
+	and	chrin, chrin, #0xff
+	lsr	wordcnt, cntin, #3
+	cbz	wordcnt, L(byte_loop)
+	mov	rep01, #REP8_01
+	mul	repchr, x1, rep01
+	and	cntin, cntin, #7
+L(word_loop):
+	ldr	cur_word, [srcin], #8
+	sub	wordcnt, wordcnt, #1
+	eor	cur_word, cur_word, repchr
+	sub	tmp, cur_word, rep01
+	orr	tmp2, cur_word, #REP8_7f
+	bics	tmp, tmp, tmp2
+	b.ne	L(found_word)
+	cbnz	wordcnt, L(word_loop)
+L(byte_loop):
+	cbz	cntin, L(not_found)
+	ldrb	cur_byte, [srcin], #1
+	sub	cntin, cntin, #1
+	cmp	cur_byte, chrin
+	b.ne	L(byte_loop)
+	sub	srcin, srcin, #1
 	ret
-2:	mov	x0, #0
+L(found_word):
+CPU_LE(	rev	tmp, tmp)
+	clz	tmp, tmp
+	sub	tmp, tmp, #64
+	add	result, srcin, tmp, asr #3
+	ret
+L(not_found):
+	mov	result, #0
 	ret
 SYM_FUNC_END_PI(memchr)
 EXPORT_SYMBOL_NOKASAN(memchr)
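To unpack the new L(found_word) path: after the eor, a matching byte in cur_word is 0x00, and the sub/orr/bics sequence computes (x - REP8_01) & ~(x | REP8_7f), which sets bit 7 of every byte of x that is zero. On little-endian, rev plus clz then yields 8 times the index of the lowest-addressed matching byte, and subtracting 64 followed by an arithmetic shift right by 3 turns that into a negative byte offset from the post-incremented srcin. A small C model of that arithmetic (function name invented, GCC/Clang builtins standing in for rev/clz) might look like this:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL

/* Little-endian model of one word-loop iteration plus L(found_word). */
static const char *check_word_le(const char *p, unsigned char c)
{
	uint64_t x, tmp;

	memcpy(&x, p, 8);			/* ldr  cur_word, [srcin], #8 */
	x ^= c * REP8_01;			/* eor: matching bytes -> 0x00 */
	tmp = (x - REP8_01) & ~(x | REP8_7f);	/* sub / orr / bics */
	if (!tmp)
		return NULL;			/* b.ne not taken: no match in this word */

	/* L(found_word): rev + clz give 8 * (index of first matching byte). */
	int clz = __builtin_clzll(__builtin_bswap64(tmp));

	/* add result, srcin, tmp, asr #3 -- srcin is already p + 8 in the asm. */
	return (p + 8) + (clz - 64) / 8;
}

int main(void)
{
	const char buf[8] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };

	printf("%td\n", check_word_le(buf, 'd') - buf);	/* prints 3 */
	return 0;
}
```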
Although we implement our own assembly version of memchr(), it turns
out to be barely any better than what GCC can generate for the generic
C version (and would go wrong if the size_t argument were ever large
enough to be interpreted as negative). Unfortunately we can't import the
tuned implementation from the Arm optimized-routines library, since that
has some Advanced SIMD parts which are not really viable for general
kernel library code. What we can do, however, is pep things up with some
relatively straightforward word-at-a-time logic for larger calls.

Adding some timing to optimized-routines' memchr() test for a simple
benchmark, overall this version comes in around half as fast as the SIMD
code, but still nearly 4x faster than our existing implementation.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---
 arch/arm64/lib/memchr.S | 65 +++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 12 deletions(-)
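As a footnote on the "interpreted as negative" remark: the old byte loop decrements the count with subs/b.mi, i.e. it tests the size_t count as a signed value. A rough C model of that behaviour (illustrative only; the cast mimics the asm's use of the N flag and is implementation-defined in standard C):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the old loop: "1: subs x2, x2, #1; b.mi 2f; ldrb ...".
 * For any n of 2^63 + 1 or more, n - 1 still looks negative to the signed
 * test, so the old code returned NULL immediately without reading a byte.
 */
static void *old_memchr_model(const void *s, int c, size_t n)
{
	const unsigned char *p = s;

	for (;;) {
		if ((int64_t)--n < 0)		/* subs x2, x2, #1; b.mi 2f */
			return NULL;
		if (*p++ == (unsigned char)c)	/* ldrb; cmp; b.ne 1b */
			return (void *)(p - 1);	/* sub x0, x0, #1 */
	}
}
```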