
[RESEND,2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations

Message ID 20250213191457.12377-2-ubizjak@gmail.com (mailing list archive)
State New
Series [RESEND,1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()

Commit Message

Uros Bizjak Feb. 13, 2025, 7:14 p.m. UTC
According to [1], the use of asm pseudo directives in an asm template
can confuse the compiler into wrongly estimating the size of the
generated code. The ALTERNATIVE macro expands to several asm pseudo
directives, so its use in {,try_}cmpxchg{64,128} causes the instruction
length estimate to fail by an order of magnitude (the compiler
estimates the length of an asm to be more than 20 instructions). This
wrong estimate further leads to suboptimal inlining decisions for
functions that use these locking primitives.

Use asm_inline instead of just asm. For inlining purposes, the size of
the asm is then taken as the minimum size, ignoring how many
instructions the compiler thinks it contains.
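
For illustration, a minimal sketch (not from this patch; the demo_*
names are made up) of what changes for the inlining heuristic:

	/* Multi-line template: GCC bills one "instruction" per line. */
	static inline void demo_asm(void)
	{
		asm volatile("# pseudo directive\n\t"
			     "# pseudo directive\n\t"
			     "nop" : : : "memory");	/* costed as 3 insns */
	}

	/* Same template, but sized as the minimum possible. */
	static inline void demo_asm_inline(void)
	{
		asm_inline volatile("# pseudo directive\n\t"
				    "# pseudo directive\n\t"
				    "nop" : : : "memory"); /* minimum size */
	}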

[1] https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
---
 arch/x86/include/asm/cmpxchg_32.h | 32 +++++++------
 arch/x86/include/asm/percpu.h     | 77 +++++++++++++++----------------
 2 files changed, 55 insertions(+), 54 deletions(-)

Comments

Dave Hansen Feb. 13, 2025, 8:48 p.m. UTC | #1
On 2/13/25 11:14, Uros Bizjak wrote:
> According to [1], the use of asm pseudo directives in an asm template
> can confuse the compiler into wrongly estimating the size of the
> generated code. The ALTERNATIVE macro expands to several asm pseudo
> directives, so its use in {,try_}cmpxchg{64,128} causes the instruction
> length estimate to fail by an order of magnitude (the compiler
> estimates the length of an asm to be more than 20 instructions).

Just curious, but how did you come up with the "20 instructions" number?

> This wrong estimate further leads to suboptimal inlining decisions for
> functions that use these locking primitives.
> 
> Use asm_inline instead of just asm. For inlining purposes, the size of
> the asm is then taken as the minimum size, ignoring how many
> instructions the compiler thinks it contains.

So, the compiler is trying to decide whether to inline a function or
not. The bigger it is, the less likely it is to be inlined. Since it is
over-estimating the size of {,try_}cmpxchg{64,128}, it will avoid
inlining it when it _should_ be inlining it.

Is that it?

Is any of this measurable? Is there any objective data to support that
this change is a good one?

It's quite possible that someone did the "asm" on purpose because
over-estimating the size was a good thing.
Uros Bizjak Feb. 13, 2025, 10:13 p.m. UTC | #2
On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 11:14, Uros Bizjak wrote:
> > According to [1], the use of asm pseudo directives in an asm template
> > can confuse the compiler into wrongly estimating the size of the
> > generated code. The ALTERNATIVE macro expands to several asm pseudo
> > directives, so its use in {,try_}cmpxchg{64,128} causes the instruction
> > length estimate to fail by an order of magnitude (the compiler
> > estimates the length of an asm to be more than 20 instructions).
>
> Just curious, but how did you come up with the "20 instructions" number?

Currently, a patched GCC compiler is needed to report the length
(please see the asm_insn_count() and asm_str_count() functions in
gcc/final.cc for how the asm length is calculated). For historical
reasons, the length of an asm is not printed in asm dumps, but a GCC
PR was recently filed with a request to change this.
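
Roughly (a paraphrase of asm_str_count(), not the exact GCC source),
the estimate just counts statement separators in the template string,
so every pseudo directive line that ALTERNATIVE expands to is billed
as a full instruction:

	static int asm_str_count_sketch(const char *templ)
	{
		int count = 1;

		if (!*templ)
			return 0;

		/* every ';' or newline starts another "instruction" */
		for (; *templ; templ++)
			if (*templ == ';' || *templ == '\n')
				count++;

		return count;
	}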

> > This wrong estimate further leads to suboptimal inlining decisions for
> > functions that use these locking primitives.
> >
> > Use asm_inline instead of just asm. For inlining purposes, the size of
> > the asm is then taken as the minimum size, ignoring how many
> > instructions the compiler thinks it contains.
>
> So, the compiler is trying to decide whether to inline a function or
> not. The bigger it is, the less likely it is to be inlined. Since it is
> over-estimating the size of {,try_}cmpxchg{64,128}, it will avoid
> inlining it when it _should_ be inlining it.
>
> Is that it?

Yes, because the calculated length of what is effectively one
instruction gets unreasonably high. The compiler counts 20
*instructions*, each estimated to be 16 bytes long, i.e. over 320
bytes for what assembles to a single instruction.

> Is any of this measurable? Is there any objective data to support that
> this change is a good one?

Actually, "asm inline" was added to the GCC compiler just for this
purpose by request from the linux community [1]. My patch follows the
example of other similar macros (e.g. arch/x86/include/alternative.h)
and adds the same cure to asms that will undoubtedly result in a
single instruction [*].  The benefit is much more precise length
estimation, so compiler heuristic is able to correctly estimate the
benefit of inlining, not being skewed by excessive use of
__always_inline directive. OTOH, it is hard to back up compiler
decisions by objective data, as inlining decisions depend on several
factors besides function size (e.g. how hot/cold is function), so a
simple comparison of kernel sizes does not present the full picture.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html
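
For reference, the wrappers in arch/x86/include/asm/alternative.h
already take this approach; a sketch from memory (please check the
tree for the exact definition):

	/* alternative() hands GCC the minimum size via asm_inline */
	#define alternative(oldinstr, newinstr, ft_flags)		\
		asm_inline volatile(ALTERNATIVE(oldinstr, newinstr,	\
						ft_flags) : : : "memory")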

[*] Please note that if an asm template uses CC_SET(), the compiler may
emit an additional SETcc asm insn. However, all GCC versions that
support "asm inline" also support flag outputs, so they are guaranteed
to emit only one asm insn.
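
The two CC_SET()/CC_OUT() variants look roughly like this (modeled on
arch/x86/include/asm/asm.h; please check the tree for the exact
macros):

	#ifdef __GCC_ASM_FLAG_OUTPUTS__
	/* flag output constraint: no extra instruction in the template */
	# define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
	# define CC_OUT(c) "=@cc" #c
	#else
	/* fallback: an extra SETcc instruction lands in the template */
	# define CC_SET(c) "\n\tset" #c " %[_cc_" #c "]\n"
	# define CC_OUT(c) [_cc_ ## c] "=qm"
	#endif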

> It's quite possible that someone did the "asm" on purpose because
> over-estimating the size was a good thing.

I doubt this would be the case, and I would consider the code that
depends on this detail defective. The code that results in one asm
instruction should be accounted for as such, no matter what internal
details are exposed in the instruction's asm template.

Uros.
Dave Hansen Feb. 13, 2025, 10:52 p.m. UTC | #3
On 2/13/25 14:13, Uros Bizjak wrote:
> On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 2/13/25 11:14, Uros Bizjak wrote:
>>> According to [1], the use of asm pseudo directives in an asm template
>>> can confuse the compiler into wrongly estimating the size of the
>>> generated code. The ALTERNATIVE macro expands to several asm pseudo
>>> directives, so its use in {,try_}cmpxchg{64,128} causes the instruction
>>> length estimate to fail by an order of magnitude (the compiler
>>> estimates the length of an asm to be more than 20 instructions).
>>
>> Just curious, but how did you come up with the "20 instructions" number?
> 
> Currently, a patched GCC compiler is needed to report the length
> (please see the asm_insn_count() and asm_str_count() functions in
> gcc/final.cc for how the asm length is calculated). For historical
> reasons, the length of an asm is not printed in asm dumps, but a GCC
> PR was recently filed with a request to change this.

So, that's also good info to add. You can even do it in the changelog
with little more space than the existing changelog:

	... fail by an order of magnitude (a hacked-up gcc shows that it
	estimates the length of an asm to be more than 20 instructions).

...
>> Is any of this measurable? Is there any objective data to support that
>> this change is a good one?
> 
> Actually, "asm inline" was added to the GCC compiler just for this
> purpose by request from the linux community [1].

Wow, that's really important information. Shouldn't the fact
that this is leveraging a new feature that we asked for specifically get
called out somewhere?

Who asked for it? Are they on cc? Do they agree that this feature fills
the gap they wanted filled?

> My patch follows the
> example of other similar macros (e.g. arch/x86/include/asm/alternative.h)
> and adds the same cure to asms that will undoubtedly result in a
> single instruction [*]. The benefit is much more precise length
> estimation, so the compiler's heuristic is able to correctly estimate
> the benefit of inlining without being skewed by excessive use of the
> __always_inline directive. OTOH, it is hard to back up compiler
> decisions with objective data, as inlining decisions depend on several
> factors besides function size (e.g. how hot or cold the function is),
> so a simple comparison of kernel sizes does not present the full picture.

Yes, the world is complicated. But, honestly, one data point is a
billion times better than zero. Right now, we're at zero.

>> It's quite possible that someone did the "asm" on purpose because
>> over-estimating the size was a good thing.
> 
> I doubt this would be the case, and I would consider the code that
> depends on this detail defective. The code that results in one asm
> instruction should be accounted for as such, no matter what internal
> details are exposed in the instruction's asm template.

Yeah, but defective or not, if this causes a regression, it's either not
getting applied or it gets reverted.

All that I'm asking here is that someone look at the kernel after the
patch gets applied and sanity check it. Absolutely basic scientific
method stuff. Make a hypothesis about what it will do:

	1. Inline these locking functions
	2. Make the kernel go faster for _something_

and if it doesn't match the hypothesis, then try and figure out why. You
don't have to do every config or every compiler. Just do one config and
one modern compiler.

Right now, this patch is saying:

	1. gcc appears to have done something that might be suboptimal
	2. gcc has a new feature that might make it less suboptimal
	3. here's a patch that should optimize things
	...

but then it leaves us hanging. There's a lot of "mights" and "shoulds"
in there, but nothing that shows that this actually does anything
positive in practice.

Maybe I'm just a dummy and this is just an obvious improvement that I
can't grasp. If so, sorry for being so dense, but I'm going to need a
little more education before this gets applied.
Uros Bizjak Feb. 14, 2025, 7:25 a.m. UTC | #4
On Thu, Feb 13, 2025 at 11:52 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 14:13, Uros Bizjak wrote:
> > On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 2/13/25 11:14, Uros Bizjak wrote:
> >>> According to [1], the use of asm pseudo directives in an asm template
> >>> can confuse the compiler into wrongly estimating the size of the
> >>> generated code. The ALTERNATIVE macro expands to several asm pseudo
> >>> directives, so its use in {,try_}cmpxchg{64,128} causes the instruction
> >>> length estimate to fail by an order of magnitude (the compiler
> >>> estimates the length of an asm to be more than 20 instructions).
> >>
> >> Just curious, but how did you come up with the "20 instructions" number?
> >
> > Currently, a patched GCC compiler is needed to report the length
> > (please see the asm_insn_count() and asm_str_count() functions in
> > gcc/final.cc for how the asm length is calculated). For historical
> > reasons, the length of an asm is not printed in asm dumps, but a GCC
> > PR was recently filed with a request to change this.
>
> So, that's also good info to add. You can even do it in the changelog
> with little more space than the existing changelog:
>
>         ... fail by an order of magnitude (a hacked-up gcc shows that it
>         estimates the length of an asm to be more than 20 instructions).
>
> ...
> >> Is any of this measurable? Is there any objective data to support that
> >> this change is a good one?
> >
> > Actually, "asm inline" was added to the GCC compiler just for this
> > purpose by request from the Linux community [1].
>
> Wow, that's really important information. Shouldn't the fact
> that this is leveraging a new feature that we asked for specifically get
> called out somewhere?
>
> Who asked for it? Are they on cc? Do they agree that this feature fills
> the gap they wanted filled?

asm_inline is already used in some 40-50 places throughout the tree,
but there still remain some places that could benefit from it.

> > My patch follows the
> > example of other similar macros (e.g. arch/x86/include/asm/alternative.h)
> > and adds the same cure to asms that will undoubtedly result in a
> > single instruction [*]. The benefit is much more precise length
> > estimation, so the compiler's heuristic is able to correctly estimate
> > the benefit of inlining without being skewed by excessive use of the
> > __always_inline directive. OTOH, it is hard to back up compiler
> > decisions with objective data, as inlining decisions depend on several
> > factors besides function size (e.g. how hot or cold the function is),
> > so a simple comparison of kernel sizes does not present the full picture.
>
> Yes, the world is complicated. But, honestly, one data point is a
> billion times better than zero. Right now, we're at zero.
>
> >> It's quite possible that someone did the "asm" on purpose because
> >> over-estimating the size was a good thing.
> >
> > I doubt this would be the case, and I would consider the code that
> > depends on this detail defective. The code that results in one asm
> > instruction should be accounted for as such, no matter what internal
> > details are exposed in the instruction's asm template.
>
> Yeah, but defective or not, if this causes a regression, it's either not
> getting applied or it gets reverted.
>
> All that I'm asking here is that someone look at the kernel after the
> patch gets applied and sanity check it. Absolutely basic scientific
> method stuff. Make a hypothesis about what it will do:
>
>         1. Inline these locking functions
>         2. Make the kernel go faster for _something_
>
> and if it doesn't match the hypothesis, then try and figure out why. You
> don't have to do every config or every compiler. Just do one config and
> one modern compiler.
>
> Right now, this patch is saying:
>
>         1. gcc appears to have done something that might be suboptimal
>         2. gcc has a new feature that might make it less suboptimal
>         3. here's a patch that should optimize things
>         ...
>
> but then it leaves us hanging. There's a lot of "mights" and "shoulds"
> in there, but nothing that shows that this actually does anything
> positive in practice.

Let me harvest some data and report the findings in a V2 ChangeLog.
However, these particular macros are rarely used, so I don't expect
any big changes in the generated asm code.

> Maybe I'm just a dummy and this is just an obvious improvement that I
> can't grasp. If so, sorry for being so dense, but I'm going to need a
> little more education before this gets applied.

Thanks,
Uros.

Patch

diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index fd1282a783dd..95b5f990ca88 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -91,12 +91,14 @@  static __always_inline bool __try_cmpxchg64_local(volatile u64 *ptr, u64 *oldp,
 	union __u64_halves o = { .full = (_old), },			\
 			   n = { .full = (_new), };			\
 									\
-	asm volatile(ALTERNATIVE(_lock_loc				\
-				 "call cmpxchg8b_emu",			\
-				 _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
-		     : ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high))	\
-		     : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr)	\
-		     : "memory");					\
+	asm_inline volatile(						\
+		ALTERNATIVE(_lock_loc					\
+			    "call cmpxchg8b_emu",			\
+			    _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8)	\
+		: ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high))		\
+		: "b" (n.low), "c" (n.high),				\
+		  [ptr] "S" (_ptr)					\
+		: "memory");						\
 									\
 	o.full;								\
 })
@@ -119,14 +121,16 @@  static __always_inline u64 arch_cmpxchg64_local(volatile u64 *ptr, u64 old, u64
 			   n = { .full = (_new), };			\
 	bool ret;							\
 									\
-	asm volatile(ALTERNATIVE(_lock_loc				\
-				 "call cmpxchg8b_emu",			\
-				 _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
-		     CC_SET(e)						\
-		     : ALT_OUTPUT_SP(CC_OUT(e) (ret),			\
-				     "+a" (o.low), "+d" (o.high))	\
-		     : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr)	\
-		     : "memory");					\
+	asm_inline volatile(						\
+		ALTERNATIVE(_lock_loc					\
+			    "call cmpxchg8b_emu",			\
+			    _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
+		CC_SET(e)						\
+		: ALT_OUTPUT_SP(CC_OUT(e) (ret),			\
+				"+a" (o.low), "+d" (o.high))		\
+		: "b" (n.low), "c" (n.high),				\
+		  [ptr] "S" (_ptr)					\
+		: "memory");						\
 									\
 	if (unlikely(!ret))						\
 		*(_oldp) = o.full;					\
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0ab991fba7de..08f5f61690b7 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -348,15 +348,14 @@  do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
-			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
+			    "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		: ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 									\
 	old__.var;							\
 })
@@ -378,17 +377,16 @@  do {									\
 	old__.var = *_oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
-			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  CC_SET(z)						\
-		  : ALT_OUTPUT_SP(CC_OUT(z) (success),			\
-				  [var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
+			    "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		CC_SET(z)						\
+		: ALT_OUTPUT_SP(CC_OUT(z) (success),			\
+				[var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 	if (unlikely(!success))						\
 		*_oval = old__.var;					\
 									\
@@ -419,15 +417,14 @@  do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
-			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
+			    "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		: ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 									\
 	old__.var;							\
 })
@@ -449,19 +446,19 @@  do {									\
 	old__.var = *_oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
-			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  CC_SET(z)						\
-		  : ALT_OUTPUT_SP(CC_OUT(z) (success),			\
-				  [var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
+			    "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		CC_SET(z)						\
+		: ALT_OUTPUT_SP(CC_OUT(z) (success),			\
+				[var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 	if (unlikely(!success))						\
 		*_oval = old__.var;					\
+									\
 	likely(success);						\
 })