
[4/5] arm64: atomics: lse: improve constraints for simple ops

Message ID 20211210151410.2782645-5-mark.rutland@arm.com
State New, archived
Series arm64: atomics: cleanups and codegen improvements

Commit Message

Mark Rutland Dec. 10, 2021, 3:14 p.m. UTC
We have overly conservative assembly constraints for the basic FEAT_LSE
atomic instructions, and using more accurate and permissive constraints
will allow for better code generation.

The FEAT_LSE basic atomic instructions come in two forms:

	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
	ST{op}{order}{size} <Rs>, [<Rn>]

The ST* forms are aliases of the LD* forms where:

	ST{op}{order}{size} <Rs>, [<Rn>]
Is:
	LD{op}{order}{size} <Rs>, XZR, [<Rn>]

For either form, both <Rs> and <Rn> are read but not written back to,
and <Rt> is written with the original value of the memory location.
Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
other register value(s) are consumed. There are no UNPREDICTABLE or
CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
<Rn> are the same register.
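
For example, for the doubleword ADD form:

	stadd	x0, [x1]

... is an alias of:

	ldadd	x0, xzr, [x1]

... which simply discards the loaded value.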

Our current inline assembly always uses <Rs> == <Rt>, treating this
register as both an input and an output (using a '+r' constraint). This
forces the compiler to do some unnecessary register shuffling and/or
redundant value generation.
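
For reference, today's ATOMIC_OP() macro expands __lse_atomic_add() to
roughly the following, with `i` treated as both input and output via the
single '+r' operand:

	static inline void __lse_atomic_add(int i, atomic_t *v)
	{
		asm volatile(
		__LSE_PREAMBLE
		"	stadd	%w[i], %[v]\n"
		: [i] "+r" (i), [v] "+Q" (v->counter)
		: "r" (v));
	}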

For example, the compiler cannot reuse the <Rs> value, and currently GCC
11.1.0 will compile:

	__lse_atomic_add(1, a);
	__lse_atomic_add(1, b);
	__lse_atomic_add(1, c);

As:

	mov     w3, #0x1
	mov     w4, w3
	stadd   w4, [x0]
	mov     w0, w3
	stadd   w0, [x1]
	stadd   w3, [x2]

We can improve this with more accurate constraints, separating <Rs> and
<Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
output-only value ('=r'). As <Rt> is written back after <Rs> is
consumed, it does not need to be earlyclobber ('=&r'), leaving the
compiler free to use the same register for both <Rs> and <Rt> where this
is desirable.

At the same time, the redundant 'r' constraint for `v` is removed, as
the `+Q` constraint is sufficient.

With this change, the above example becomes:

	mov     w3, #0x1
	stadd   w3, [x0]
	stadd   w3, [x1]
	stadd   w3, [x2]
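
Expanded by hand for the full-barrier fetch_add case, the new form looks
roughly like the following (see the ATOMIC_FETCH_OP() hunk in the patch
below for the real macro):

	static inline int __lse_atomic_fetch_add(int i, atomic_t *v)
	{
		int old;

		asm volatile(
		__LSE_PREAMBLE
		"	ldaddal	%w[i], %w[old], %[v]"
		: [v] "+Q" (v->counter), [old] "=r" (old)
		: [i] "r" (i)
		: "memory");

		return old;
	}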

I've made this change for the non-value-returning and FETCH ops. The
RETURN ops use a multi-instruction sequence for which we cannot use the
same constraints, and a subsequent patch will rewrite the RETURN ops in
terms of the FETCH ops, relying on the compiler's ability to reuse the
<Rs> value.
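
As a sketch of that follow-up (not part of this patch), an ADD RETURN op
can then be built along the lines of:

	static inline int __lse_atomic_add_return(int i, atomic_t *v)
	{
		return __lse_atomic_fetch_add(i, v) + i;
	}

... letting the compiler keep the <Rs> value live across the asm.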

This is intended as an optimization.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

Comments

Will Deacon Dec. 13, 2021, 7:40 p.m. UTC | #1
On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> We have overly conservative assembly constraints for the basic FEAT_LSE
> atomic instructions, and using more accurate and permissive constraints
> will allow for better code generation.
> 
> The FEAT_LSE basic atomic instructions come in two forms:
> 
> 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> 
> The ST* forms are aliases of the LD* forms where:
> 
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> Is:
> 	LD{op}{order}{size} <Rs>, XZR, [<Rn>]
> 
> For either form, both <Rs> and <Rn> are read but not written back to,
> and <Rt> is written with the original value of the memory location.
> Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> other register value(s) are consumed. There are no UNPREDICTABLE or
> CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> <Rn> are the same register.
> 
> Our current inline assembly always uses <Rs> == <Rt>, treating this
> register as both an input and an output (using a '+r' constraint). This
> forces the compiler to do some unnecessary register shuffling and/or
> redundant value generation.
> 
> For example, the compiler cannot reuse the <Rs> value, and currently GCC
> 11.1.0 will compile:
> 
> 	__lse_atomic_add(1, a);
> 	__lse_atomic_add(1, b);
> 	__lse_atomic_add(1, c);
> 
> As:
> 
> 	mov     w3, #0x1
> 	mov     w4, w3
> 	stadd   w4, [x0]
> 	mov     w0, w3
> 	stadd   w0, [x1]
> 	stadd   w3, [x2]
> 
> We can improve this with more accurate constraints, separating <Rs> and
> <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> output-only value ('=r'). As <Rt> is written back after <Rs> is
> consumed, it does not need to be earlyclobber ('=&r'), leaving the
> compiler free to use the same register for both <Rs> and <Rt> where this
> is desirable.
> 
> At the same time, the redundant 'r' constraint for `v` is removed, as
> the `+Q` constraint is sufficient.
> 
> With this change, the above example becomes:
> 
> 	mov     w3, #0x1
> 	stadd   w3, [x0]
> 	stadd   w3, [x1]
> 	stadd   w3, [x2]
> 
> I've made this change for the non-value-returning and FETCH ops. The
> RETURN ops use a multi-instruction sequence for which we cannot use the
> same constraints, and a subsequent patch will rewrite the RETURN ops in
> terms of the FETCH ops, relying on the compiler's ability to reuse the
> <Rs> value.
> 
> This is intended as an optimization.
> There should be no functional change as a result of this patch.
> 
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)

Makes sense to me. I'm not sure _why_ the current constraints are so weird;
maybe a hangover from when we patched them inline? Anywho:

Acked-by: Will Deacon <will@kernel.org>

Will
Mark Rutland Dec. 14, 2021, 1:04 p.m. UTC | #2
On Mon, Dec 13, 2021 at 07:40:23PM +0000, Will Deacon wrote:
> On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> > We have overly conservative assembly constraints for the basic FEAT_LSE
> > atomic instructions, and using more accurate and permissive constraints
> > will allow for better code generation.
> > 
> > The FEAT_LSE basic atomic instructions come in two forms:
> > 
> > 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> > 	ST{op}{order}{size} <Rs>, [<Rn>]

> > For either form, both <Rs> and <Rn> are read but not written back to,
> > and <Rt> is written with the original value of the memory location.
> > Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> > other register value(s) are consumed. There are no UNPREDICTABLE or
> > CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> > <Rn> are the same register.

> > Our current inline assembly always uses <Rs> == <Rt>, treating this
> > register as both an input and an output (using a '+r' constraint). This
> > forces the compiler to do some unnecessary register shuffling and/or
> > redundant value generation.

> > We can improve this with more accurate constraints, separating <Rs> and
> > <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> > output-only value ('=r'). As <Rt> is written back after <Rs> is
> > consumed, it does not need to be earlyclobber ('=&r'), leaving the
> > compiler free to use the same register for both <Rs> and <Rt> where this
> > is desirable.
> > 
> > At the same time, the redundant 'r' constraint for `v` is removed, as
> > the `+Q` constraint is sufficient.

> Makes sense to me. I'm not sure _why_ the current constraints are so weird;
> maybe a hangover from when we patched them inline?

Yup; the constraints used to make sense when we were patching the instructions,
and we just never cleaned them up. Details below -- I can add a link tag to
this mail if you'd like?

For example, back when we introduced the ops in commit:

  c0385b24af15020a ("arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics")

We had:

| #define ATOMIC_OP(op, asm_op)                                          \
| static inline void atomic_##op(int i, atomic_t *v)                     \
| {                                                                      \
|        register int w0 asm ("w0") = i;                                 \
|        register atomic_t *x1 asm ("x1") = v;                           \
|                                                                        \
|        asm volatile(                                                   \
|        __LL_SC_CALL(op)                                                \
|        : "+r" (w0), "+Q" (v->counter)                                  \
|        : "r" (x1)                                                      \
|        : "x30");                                                       \
| }                                                                      \

IIUC, for those constraints:

* "+r" (w0) -- necessary because the LL-SC forms might clobber w0 (e.g. if
  used for their `tmp` variable). For the value-returning ops we had to use
  w0/x0 for the return value as the out-of-line LL-SC ops followed the AAPCS
  parameter passing rules.

* "+Q" (v->counter) -- necessary to clobber the memory location, "Q"
  specifically to ensure the compiler didn't expect some write-back addressing
  mode to have adjusted a register.

* "r" (x1) -- necessary to ensure the address was available in X1 for the LL-SC
  code, since I believe the "+Q" constraint *could* have used a copy in another
  register. I could be mistaken on this one, since there's no rationale in the
  original commit message or code.

When that approach was thrown away in commits

  addfc38672c73efd ("arm64: atomics: avoid out-of-line ll/sc atomics")
  3337cb5aea594e40 ("arm64: avoid using hard-coded registers for LSE atomics")

... we stopped hard-coding the registers, and added a `tmp` register for ops
with a return value, but otherwise left the constraints as-is.

> Anywho:
> 
> Acked-by: Will Deacon <will@kernel.org>

Thanks!

Mark.

Patch

diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index d707eafb7677..e4c5c4c34ce6 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -16,8 +16,8 @@  static inline void __lse_atomic_##op(int i, atomic_t *v)		\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
 	"	" #asm_op "	%w[i], %[v]\n"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v));							\
+	: [v] "+Q" (v->counter)						\
+	: [i] "r" (i));							\
 }
 
 ATOMIC_OP(andnot, stclr)
@@ -35,14 +35,17 @@  static inline void __lse_atomic_sub(int i, atomic_t *v)
 #define ATOMIC_FETCH_OP(name, mb, op, asm_op, cl...)			\
 static inline int __lse_atomic_fetch_##op##name(int i, atomic_t *v)	\
 {									\
+	int old;							\
+									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-	"	" #asm_op #mb "	%w[i], %w[i], %[v]"			\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
+	"	" #asm_op #mb "	%w[i], %w[old], %[v]"			\
+	: [v] "+Q" (v->counter),					\
+	  [old] "=r" (old)						\
+	: [i] "r" (i)							\
 	: cl);								\
 									\
-	return i;							\
+	return old;							\
 }
 
 #define ATOMIC_FETCH_OPS(op, asm_op)					\
@@ -124,8 +127,8 @@  static inline void __lse_atomic64_##op(s64 i, atomic64_t *v)		\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
 	"	" #asm_op "	%[i], %[v]\n"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v));							\
+	: [v] "+Q" (v->counter)						\
+	: [i] "r" (i));							\
 }
 
 ATOMIC64_OP(andnot, stclr)
@@ -143,14 +146,17 @@  static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
 #define ATOMIC64_FETCH_OP(name, mb, op, asm_op, cl...)			\
 static inline long __lse_atomic64_fetch_##op##name(s64 i, atomic64_t *v)\
 {									\
+	s64 old;							\
+									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-	"	" #asm_op #mb "	%[i], %[i], %[v]"			\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
+	"	" #asm_op #mb "	%[i], %[old], %[v]"			\
+	: [v] "+Q" (v->counter),					\
+	  [old] "=r" (old)						\
+	: [i] "r" (i) 							\
 	: cl);								\
 									\
-	return i;							\
+	return old;							\
 }
 
 #define ATOMIC64_FETCH_OPS(op, asm_op)					\