Message ID | 20211210151410.2782645-5-mark.rutland@arm.com (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: atomics: cleanups and codegen improvements |
On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> We have overly conservative assembly constraints for the basic FEAT_LSE
> atomic instructions, and using more accurate and permissive constraints
> will allow for better code generation.
>
> The FEAT_LSE basic atomic instructions come in two forms:
>
> 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> 	ST{op}{order}{size} <Rs>, [<Rn>]
>
> The ST* forms are aliases of the LD* forms where:
>
> 	ST{op}{order}{size} <Rs>, [<Rn>]
> Is:
> 	LD{op}{order}{size} <Rs>, XZR, [<Rn>]
>
> For either form, both <Rs> and <Rn> are read but not written back to,
> and <Rt> is written with the original value of the memory location.
> Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> other register value(s) are consumed. There are no UNPREDICTABLE or
> CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> <Rn> are the same register.
>
> Our current inline assembly always uses <Rs> == <Rt>, treating this
> register as both an input and an output (using a '+r' constraint). This
> forces the compiler to do some unnecessary register shuffling and/or
> redundant value generation.
>
> For example, the compiler cannot reuse the <Rs> value, and currently
> GCC 11.1.0 will compile:
>
> 	__lse_atomic_add(1, a);
> 	__lse_atomic_add(1, b);
> 	__lse_atomic_add(1, c);
>
> As:
>
> 	mov     w3, #0x1
> 	mov     w4, w3
> 	stadd   w4, [x0]
> 	mov     w0, w3
> 	stadd   w0, [x1]
> 	stadd   w3, [x2]
>
> We can improve this with more accurate constraints, separating <Rs> and
> <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> output-only value ('=r'). As <Rt> is written back after <Rs> is
> consumed, it does not need to be earlyclobber ('=&r'), leaving the
> compiler free to use the same register for both <Rs> and <Rt> where
> this is desirable.
>
> At the same time, the redundant 'r' constraint for `v` is removed, as
> the `+Q` constraint is sufficient.
>
> With this change, the above example becomes:
>
> 	mov     w3, #0x1
> 	stadd   w3, [x0]
> 	stadd   w3, [x1]
> 	stadd   w3, [x2]
>
> I've made this change for the non-value-returning and FETCH ops. The
> RETURN ops have a multi-instruction sequence for which we cannot use
> the same constraints, and a subsequent patch will rewrite the RETURN
> ops in terms of the FETCH ops, relying on the ability for the compiler
> to reuse the <Rs> value.
>
> This is intended as an optimization.
> There should be no functional change as a result of this patch.
>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
>  1 file changed, 18 insertions(+), 12 deletions(-)

Makes sense to me. I'm not sure _why_ the current constraints are so
weird; maybe a hangover from when we patched them inline? Anywho:

Acked-by: Will Deacon <will@kernel.org>

Will
On Mon, Dec 13, 2021 at 07:40:23PM +0000, Will Deacon wrote:
> On Fri, Dec 10, 2021 at 03:14:09PM +0000, Mark Rutland wrote:
> > We have overly conservative assembly constraints for the basic FEAT_LSE
> > atomic instructions, and using more accurate and permissive constraints
> > will allow for better code generation.
> >
> > The FEAT_LSE basic atomic instructions come in two forms:
> >
> > 	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
> > 	ST{op}{order}{size} <Rs>, [<Rn>]
> >
> > For either form, both <Rs> and <Rn> are read but not written back to,
> > and <Rt> is written with the original value of the memory location.
> > Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
> > other register value(s) are consumed. There are no UNPREDICTABLE or
> > CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
> > <Rn> are the same register.
> >
> > Our current inline assembly always uses <Rs> == <Rt>, treating this
> > register as both an input and an output (using a '+r' constraint). This
> > forces the compiler to do some unnecessary register shuffling and/or
> > redundant value generation.
> >
> > We can improve this with more accurate constraints, separating <Rs> and
> > <Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
> > output-only value ('=r'). As <Rt> is written back after <Rs> is
> > consumed, it does not need to be earlyclobber ('=&r'), leaving the
> > compiler free to use the same register for both <Rs> and <Rt> where
> > this is desirable.
> >
> > At the same time, the redundant 'r' constraint for `v` is removed, as
> > the `+Q` constraint is sufficient.
>
> Makes sense to me. I'm not sure _why_ the current constraints are so
> weird; maybe a hangover from when we patched them inline?

Yup; the constraints used to make sense when we were patching the
instructions, and we just never cleaned them up. Details below -- I can
add a link tag to this mail if you'd like?

For example, back when we introduced the ops in commit:

  c0385b24af15020a ("arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics")

We had:

| #define ATOMIC_OP(op, asm_op)						\
| static inline void atomic_##op(int i, atomic_t *v)			\
| {									\
| 	register int w0 asm ("w0") = i;					\
| 	register atomic_t *x1 asm ("x1") = v;				\
| 									\
| 	asm volatile(							\
| 	__LL_SC_CALL(op)						\
| 	: "+r" (w0), "+Q" (v->counter)					\
| 	: "r" (x1)							\
| 	: "x30");							\
| }									\

IIUC, for those constraints:

* "+r" (w0) -- necessary because the LL-SC forms might clobber w0
  (e.g. if used for their `tmp` variable). For the value-returning ops
  we had to use w0/x0 for the return value as the out-of-line LL-SC ops
  followed the AAPCS parameter passing rules.

* "+Q" (v->counter) -- necessary to clobber the memory location, "Q"
  specifically to ensure the compiler didn't expect some write-back
  addressing mode to have adjusted a register.

* "r" (x1) -- necessary to ensure the address was available in X1 for
  the LL-SC code, since I believe the "+Q" constraint *could* have used
  a copy in another register. I could be mistaken on this one, since
  there's no rationale in the original commit message or code.

When that approach was thrown away in commits:

  addfc38672c73efd ("arm64: atomics: avoid out-of-line ll/sc atomics")
  3337cb5aea594e40 ("arm64: avoid using hard-coded registers for LSE atomics")

... we stopped hard-coding the registers, and added a `tmp` register
for ops with a return value, but otherwise left the constraints as-is.

> Anywho:
>
> Acked-by: Will Deacon <will@kernel.org>

Thanks!
Mark.
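For reference, the `tmp` register mentioned above is what the multi-instruction
RETURN ops use to hold the loaded value. Below is a sketch of roughly what an
LSE RETURN op looked like going into this series, reconstructed from the
description above and the surrounding code; details may differ from the tree:

	/* Reconstructed sketch -- not copied verbatim from the tree. */
	#define ATOMIC_OP_ADD_RETURN(name, mb, cl...)				\
	static inline int __lse_atomic_add_return##name(int i, atomic_t *v)	\
	{									\
		u32 tmp;							\
										\
		asm volatile(							\
		__LSE_PREAMBLE							\
		"	ldadd" #mb "	%w[i], %w[tmp], %[v]\n"			\
		"	add	%w[i], %w[i], %w[tmp]"				\
		: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)	\
		: "r" (v)							\
		: cl);								\
										\
		return i;							\
	}

Here `i` genuinely is read and then overwritten across the two instructions,
and `tmp` must not share a register with `i`, which is still live in the
second instruction (hence the earlyclobber). So the constraints cannot simply
be relaxed the way they can for the single-instruction ops, which is why the
plan is to rewrite the RETURN ops in terms of the FETCH ops instead.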
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index d707eafb7677..e4c5c4c34ce6 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -16,8 +16,8 @@ static inline void __lse_atomic_##op(int i, atomic_t *v)	\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
 	"	" #asm_op "	%w[i], %[v]\n"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v));							\
+	: [v] "+Q" (v->counter)						\
+	: [i] "r" (i));							\
 }
 
 ATOMIC_OP(andnot, stclr)
@@ -35,14 +35,17 @@ static inline void __lse_atomic_sub(int i, atomic_t *v)
 #define ATOMIC_FETCH_OP(name, mb, op, asm_op, cl...)			\
 static inline int __lse_atomic_fetch_##op##name(int i, atomic_t *v)	\
 {									\
+	int old;							\
+									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-	"	" #asm_op #mb "	%w[i], %w[i], %[v]"			\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
+	"	" #asm_op #mb "	%w[i], %w[old], %[v]"			\
+	: [v] "+Q" (v->counter),					\
+	  [old] "=r" (old)						\
+	: [i] "r" (i)							\
 	: cl);								\
 									\
-	return i;							\
+	return old;							\
 }
 
 #define ATOMIC_FETCH_OPS(op, asm_op)					\
@@ -124,8 +127,8 @@ static inline void __lse_atomic64_##op(s64 i, atomic64_t *v)	\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
 	"	" #asm_op "	%[i], %[v]\n"				\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v));							\
+	: [v] "+Q" (v->counter)						\
+	: [i] "r" (i));							\
 }
 
 ATOMIC64_OP(andnot, stclr)
@@ -143,14 +146,17 @@ static inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
 #define ATOMIC64_FETCH_OP(name, mb, op, asm_op, cl...)			\
 static inline long __lse_atomic64_fetch_##op##name(s64 i, atomic64_t *v)\
 {									\
+	s64 old;							\
+									\
 	asm volatile(							\
 	__LSE_PREAMBLE							\
-	"	" #asm_op #mb "	%[i], %[i], %[v]"			\
-	: [i] "+r" (i), [v] "+Q" (v->counter)				\
-	: "r" (v)							\
+	"	" #asm_op #mb "	%[i], %[old], %[v]"			\
+	: [v] "+Q" (v->counter),					\
+	  [old] "=r" (old)						\
+	: [i] "r" (i)							\
 	: cl);								\
 									\
-	return i;							\
+	return old;							\
 }
 
 #define ATOMIC64_FETCH_OPS(op, asm_op)					\
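To make the new constraints concrete, here is a hand-expanded sketch of one
instantiation of the patched ATOMIC_FETCH_OP macro (illustrative only, not
actual preprocessor output; the relaxed variant takes no extra clobbers, so
the `cl` list is omitted):

	static inline int __lse_atomic_fetch_add_relaxed(int i, atomic_t *v)
	{
		int old;

		asm volatile(
		__LSE_PREAMBLE
		"	ldadd	%w[i], %w[old], %[v]"
		: [v] "+Q" (v->counter),	/* memory location: read and written */
		  [old] "=r" (old)		/* <Rt>: written after <Rs> is consumed */
		: [i] "r" (i));			/* <Rs>: input-only, value stays usable */

		return old;
	}

Because [old] is not earlyclobber, the compiler remains free to allocate the
same register for [i] and [old] when `i` is dead afterwards, so nothing is
lost relative to the old single-register form.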
We have overly conservative assembly constraints for the basic FEAT_LSE
atomic instructions, and using more accurate and permissive constraints
will allow for better code generation.

The FEAT_LSE basic atomic instructions come in two forms:

	LD{op}{order}{size} <Rs>, <Rt>, [<Rn>]
	ST{op}{order}{size} <Rs>, [<Rn>]

The ST* forms are aliases of the LD* forms where:

	ST{op}{order}{size} <Rs>, [<Rn>]
Is:
	LD{op}{order}{size} <Rs>, XZR, [<Rn>]

For either form, both <Rs> and <Rn> are read but not written back to,
and <Rt> is written with the original value of the memory location.
Where (<Rt> == <Rs>) or (<Rt> == <Rn>), <Rt> is written *after* the
other register value(s) are consumed. There are no UNPREDICTABLE or
CONSTRAINED UNPREDICTABLE behaviours when any pair of <Rs>, <Rt>, or
<Rn> are the same register.

Our current inline assembly always uses <Rs> == <Rt>, treating this
register as both an input and an output (using a '+r' constraint). This
forces the compiler to do some unnecessary register shuffling and/or
redundant value generation.

For example, the compiler cannot reuse the <Rs> value, and currently
GCC 11.1.0 will compile:

	__lse_atomic_add(1, a);
	__lse_atomic_add(1, b);
	__lse_atomic_add(1, c);

As:

	mov     w3, #0x1
	mov     w4, w3
	stadd   w4, [x0]
	mov     w0, w3
	stadd   w0, [x1]
	stadd   w3, [x2]

We can improve this with more accurate constraints, separating <Rs> and
<Rt>, where <Rs> is an input-only register ('r'), and <Rt> is an
output-only value ('=r'). As <Rt> is written back after <Rs> is
consumed, it does not need to be earlyclobber ('=&r'), leaving the
compiler free to use the same register for both <Rs> and <Rt> where
this is desirable.

At the same time, the redundant 'r' constraint for `v` is removed, as
the `+Q` constraint is sufficient.

With this change, the above example becomes:

	mov     w3, #0x1
	stadd   w3, [x0]
	stadd   w3, [x1]
	stadd   w3, [x2]

I've made this change for the non-value-returning and FETCH ops. The
RETURN ops have a multi-instruction sequence for which we cannot use
the same constraints, and a subsequent patch will rewrite the RETURN
ops in terms of the FETCH ops, relying on the ability for the compiler
to reuse the <Rs> value.

This is intended as an optimization.
There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/atomic_lse.h | 30 +++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)
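The codegen difference can be reproduced outside the kernel with a minimal
self-contained sketch (hypothetical names, not kernel code; build for AArch64
with e.g. `gcc -O2 -march=armv8-a+lse`):

	/* Hypothetical standalone demo of the constraint change. */
	typedef struct { int counter; } demo_atomic_t;

	/* Old-style constraints: 'i' is read-write ('+r'), so the compiler
	 * must assume its register is clobbered and cannot reuse the value. */
	static inline void demo_stadd_old(int i, demo_atomic_t *v)
	{
		asm volatile("stadd	%w[i], %[v]"
			     : [i] "+r" (i), [v] "+Q" (v->counter)
			     : "r" (v));
	}

	/* New-style constraints: 'i' is input-only ('r'), and the redundant
	 * "r" (v) input is dropped, as '+Q' already covers the address. */
	static inline void demo_stadd_new(int i, demo_atomic_t *v)
	{
		asm volatile("stadd	%w[i], %[v]"
			     : [v] "+Q" (v->counter)
			     : [i] "r" (i));
	}

	void add_one_to_three(demo_atomic_t *a, demo_atomic_t *b, demo_atomic_t *c)
	{
		/* With demo_stadd_old this emits extra movs, as in the example
		 * above; with demo_stadd_new the compiler can keep #1 in one
		 * register across all three calls. */
		demo_stadd_new(1, a);
		demo_stadd_new(1, b);
		demo_stadd_new(1, c);
	}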