Message ID | 20250213191457.12377-2-ubizjak@gmail.com (mailing list archive)
---|---
State | New
Series | [RESEND,1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
On 2/13/25 11:14, Uros Bizjak wrote:
> According to [1], the usage of asm pseudo directives in the asm template
> can confuse the compiler to wrongly estimate the size of the generated
> code. ALTERNATIVE macro expands to several asm pseudo directives, so
> its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
> to fail by an order of magnitude (the compiler estimates the length of
> an asm to be more than 20 instructions).

Just curious, but how did you come up with the "20 instructions" number?

> This wrong estimate further causes unoptimal inlining decisions for
> functions that use these locking primitives.
>
> Use asm_inline instead of just asm. For inlining purposes, the size of
> the asm is then taken as the minimum size, ignoring how many instructions
> compiler thinks it is.

So, the compiler is trying to decide whether to inline a function or
not. The bigger it is, the less likely it is to be inlined. Since it is
over-estimating the size of {,try_}cmpxchg{64,128}, it will avoid
inlining it when it _should_ be inlining it.

Is that it?

Is any of this measurable? Is there any objective data to support that
this change is a good one?

It's quite possible that someone did the "asm" on purpose because
over-estimating the size was a good thing.
On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 11:14, Uros Bizjak wrote:
> > According to [1], the usage of asm pseudo directives in the asm template
> > can confuse the compiler to wrongly estimate the size of the generated
> > code. ALTERNATIVE macro expands to several asm pseudo directives, so
> > its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
> > to fail by an order of magnitude (the compiler estimates the length of
> > an asm to be more than 20 instructions).
>
> Just curious, but how did you come up with the "20 instructions" number?

Currently, a patched GCC compiler is needed to report the length
(please see the asm_insn_count() and asm_str_count() functions in
gcc/final.cc for how the asm length is calculated). For historic
reasons, the length of an asm is not printed in asm dumps, but
recently a GCC PR was filed with a request to change this.

> > This wrong estimate further causes unoptimal inlining decisions for
> > functions that use these locking primitives.
> >
> > Use asm_inline instead of just asm. For inlining purposes, the size of
> > the asm is then taken as the minimum size, ignoring how many instructions
> > compiler thinks it is.
>
> So, the compiler is trying to decide whether to inline a function or
> not. The bigger it is, the less likely it is to be inlined. Since it is
> over-estimating the size of {,try_}cmpxchg{64,128}, it will avoid
> inlining it when it _should_ be inlining it.
>
> Is that it?

Yes, because the calculated length of what is effectively one
instruction gets unreasonably high. The compiler counts 20
*instructions*, each estimated to be 16 bytes long.

> Is any of this measurable? Is there any objective data to support that
> this change is a good one?

Actually, "asm inline" was added to the GCC compiler just for this
purpose by request from the linux community [1]. My patch follows the
example of other similar macros (e.g. arch/x86/include/asm/alternative.h)
and adds the same cure to asms that will undoubtedly result in a
single instruction [*]. The benefit is much more precise length
estimation, so the compiler heuristic is able to correctly estimate
the benefit of inlining, without being skewed by excessive use of the
__always_inline directive. OTOH, it is hard to back up compiler
decisions with objective data, as inlining decisions depend on several
factors besides function size (e.g. how hot/cold the function is), so
a simple comparison of kernel sizes does not present the full picture.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html

[*] Please note that if the asm template uses CC_SET, the compiler may
emit an additional SETx asm insn. However, all GCC versions that
support "asm inline" also support flag outputs, so they are guaranteed
to emit only one asm insn.

> It's quite possible that someone did the "asm" on purpose because
> over-estimating the size was a good thing.

I doubt this would be the case, and I would consider code that depends
on this detail defective. Code that results in one asm instruction
should be accounted as such, no matter what internal details are
exposed in the instruction asm template.

Uros.
On 2/13/25 14:13, Uros Bizjak wrote:
> On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 2/13/25 11:14, Uros Bizjak wrote:
>>> According to [1], the usage of asm pseudo directives in the asm template
>>> can confuse the compiler to wrongly estimate the size of the generated
>>> code. ALTERNATIVE macro expands to several asm pseudo directives, so
>>> its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
>>> to fail by an order of magnitude (the compiler estimates the length of
>>> an asm to be more than 20 instructions).
>>
>> Just curious, but how did you come up with the "20 instructions" number?
>
> Currently, a patched GCC compiler is needed (please see
> asm_insn_count() and asm_str_count() functions in gcc/final.cc on how
> the asm length is calculated) to report the length. For historic
> reasons, the length of asm is not printed in asm dumps, but recently a
> GCC PR was filed with a request to change this.

So, that's also good info to add. You can even do it in the changelog
with little more space than the existing changelog:

	... fail by an order of magnitude (a hacked-up gcc shows that it
	estimates the length of an asm to be more than 20 instructions).

...

>> Is any of this measurable? Is there any objective data to support that
>> this change is a good one?
>
> Actually, "asm inline" was added to the GCC compiler just for this
> purpose by request from the linux community [1].

Wow, that's really important information. Shouldn't the fact that this
is leveraging a new feature that we asked for specifically get called
out somewhere?

Who asked for it? Are they on cc? Do they agree that this feature fills
the gap they wanted filled?

> My patch follows the
> example of other similar macros (e.g. arch/x86/include/alternative.h)
> and adds the same cure to asms that will undoubtedly result in a
> single instruction [*]. The benefit is much more precise length
> estimation, so compiler heuristic is able to correctly estimate the
> benefit of inlining, not being skewed by excessive use of
> __always_inline directive. OTOH, it is hard to back up compiler
> decisions by objective data, as inlining decisions depend on several
> factors besides function size (e.g. how hot/cold is function), so a
> simple comparison of kernel sizes does not present the full picture.

Yes, the world is complicated. But, honestly, one data point is a
billion times better than zero. Right now, we're at zero.

>> It's quite possible that someone did the "asm" on purpose because
>> over-estimating the size was a good thing.
>
> I doubt this would be the case, and I would consider the code that
> depends on this detail defective. The code that results in one asm
> instruction should be accounted as such, no matter what internal
> details are exposed in the instruction asm template.

Yeah, but defective or not, if this causes a regression, it's either
not getting applied or gets reverted.

All that I'm asking here is that someone look at the kernel after the
patch gets applied and sanity check it. Absolutely basic scientific
method stuff. Make a hypothesis about what it will do:

1. Inline these locking functions
2. Make the kernel go faster for _something_

and if it doesn't match the hypothesis, then try and figure out why. You
don't have to do every config or every compiler. Just do one config and
one modern compiler.

Right now, this patch is saying:

1. gcc appears to have done something that might be suboptimal
2. gcc has a new feature that might make it less suboptimal
3. here's a patch that should optimize things

... but then it leaves us hanging. There's a lot of "mights" and
"shoulds" in there, but nothing that shows that this actually does
anything positive in practice.

Maybe I'm just a dummy and this is just an obvious improvement that I
can't grasp. If so, sorry for being so dense, but I'm going to need a
little more education before this gets applied.
On Thu, Feb 13, 2025 at 11:52 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 14:13, Uros Bizjak wrote:
> > On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 2/13/25 11:14, Uros Bizjak wrote:
> >>> According to [1], the usage of asm pseudo directives in the asm template
> >>> can confuse the compiler to wrongly estimate the size of the generated
> >>> code. ALTERNATIVE macro expands to several asm pseudo directives, so
> >>> its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
> >>> to fail by an order of magnitude (the compiler estimates the length of
> >>> an asm to be more than 20 instructions).
> >>
> >> Just curious, but how did you come up with the "20 instructions" number?
> >
> > Currently, a patched GCC compiler is needed (please see
> > asm_insn_count() and asm_str_count() functions in gcc/final.cc on how
> > the asm length is calculated) to report the length. For historic
> > reasons, the length of asm is not printed in asm dumps, but recently a
> > GCC PR was filed with a request to change this.
>
> So, that's also good info to add. You can even do it in the changelog
> with little more space than the existing changelog:
>
>	... fail by an order of magnitude (a hacked-up gcc shows that it
>	estimates the length of an asm to be more than 20 instructions).
>
> ...
>
> >> Is any of this measurable? Is there any objective data to support that
> >> this change is a good one?
> >
> > Actually, "asm inline" was added to the GCC compiler just for this
> > purpose by request from the linux community [1].
>
> Wow, that's really important information. Shouldn't the fact that this
> is leveraging a new feature that we asked for specifically get called
> out somewhere?
>
> Who asked for it? Are they on cc? Do they agree that this feature fills
> the gap they wanted filled?

asm_inline is already used in some 40-50 places throughout the tree,
but there still remain some places that could benefit from it.

> > My patch follows the
> > example of other similar macros (e.g. arch/x86/include/alternative.h)
> > and adds the same cure to asms that will undoubtedly result in a
> > single instruction [*]. The benefit is much more precise length
> > estimation, so compiler heuristic is able to correctly estimate the
> > benefit of inlining, not being skewed by excessive use of
> > __always_inline directive. OTOH, it is hard to back up compiler
> > decisions by objective data, as inlining decisions depend on several
> > factors besides function size (e.g. how hot/cold is function), so a
> > simple comparison of kernel sizes does not present the full picture.
>
> Yes, the world is complicated. But, honestly, one data point is a
> billion times better than zero. Right now, we're at zero.
>
> >> It's quite possible that someone did the "asm" on purpose because
> >> over-estimating the size was a good thing.
> >
> > I doubt this would be the case, and I would consider the code that
> > depends on this detail defective. The code that results in one asm
> > instruction should be accounted as such, no matter what internal
> > details are exposed in the instruction asm template.
>
> Yeah, but defective or not, if this causes a regression, it's either
> not getting applied or gets reverted.
>
> All that I'm asking here is that someone look at the kernel after the
> patch gets applied and sanity check it. Absolutely basic scientific
> method stuff. Make a hypothesis about what it will do:
>
> 1. Inline these locking functions
> 2. Make the kernel go faster for _something_
>
> and if it doesn't match the hypothesis, then try and figure out why. You
> don't have to do every config or every compiler. Just do one config and
> one modern compiler.
>
> Right now, this patch is saying:
>
> 1. gcc appears to have done something that might be suboptimal
> 2. gcc has a new feature that might make it less suboptimal
> 3. here's a patch that should optimize things
>
> ... but then it leaves us hanging. There's a lot of "mights" and
> "shoulds" in there, but nothing that shows that this actually does
> anything positive in practice.

Let me harvest some data and report the findings in a V2 ChangeLog.
However, these particular macros are rarely used, so I don't expect
some big changes in the generated asm code.

> Maybe I'm just a dummy and this is just an obvious improvement that I
> can't grasp. If so, sorry for being so dense, but I'm going to need a
> little more education before this gets applied.

Thanks,
Uros.
diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index fd1282a783dd..95b5f990ca88 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -91,12 +91,14 @@ static __always_inline bool __try_cmpxchg64_local(volatile u64 *ptr, u64 *oldp,
 	union __u64_halves o = { .full = (_old), }, \
 			   n = { .full = (_new), }; \
 	\
-	asm volatile(ALTERNATIVE(_lock_loc \
-		     "call cmpxchg8b_emu", \
-		     _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
-		     : ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high)) \
-		     : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \
-		     : "memory"); \
+	asm_inline volatile( \
+		ALTERNATIVE(_lock_loc \
+			    "call cmpxchg8b_emu", \
+			    _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
+		: ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high)) \
+		: "b" (n.low), "c" (n.high), \
+		  [ptr] "S" (_ptr) \
+		: "memory"); \
 	\
 	o.full; \
 })
@@ -119,14 +121,16 @@ static __always_inline u64 arch_cmpxchg64_local(volatile u64 *ptr, u64 old, u64
 		  n = { .full = (_new), }; \
 	bool ret; \
 	\
-	asm volatile(ALTERNATIVE(_lock_loc \
-		     "call cmpxchg8b_emu", \
-		     _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
-		     CC_SET(e) \
-		     : ALT_OUTPUT_SP(CC_OUT(e) (ret), \
-				     "+a" (o.low), "+d" (o.high)) \
-		     : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \
-		     : "memory"); \
+	asm_inline volatile( \
+		ALTERNATIVE(_lock_loc \
+			    "call cmpxchg8b_emu", \
+			    _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
+		CC_SET(e) \
+		: ALT_OUTPUT_SP(CC_OUT(e) (ret), \
+				"+a" (o.low), "+d" (o.high)) \
+		: "b" (n.low), "c" (n.high), \
+		  [ptr] "S" (_ptr) \
+		: "memory"); \
 	\
 	if (unlikely(!ret)) \
 		*(_oldp) = o.full; \
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0ab991fba7de..08f5f61690b7 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -348,15 +348,14 @@ do { \
 	old__.var = _oval; \
 	new__.var = _nval; \
 	\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
-		  "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)), \
-		    "+a" (old__.low), \
-		    "+d" (old__.high)) \
-		  : "b" (new__.low), \
-		    "c" (new__.high), \
-		    "S" (&(_var)) \
-		  : "memory"); \
+	asm_inline qual ( \
+		ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
+			    "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		: ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)), \
+				"+a" (old__.low), "+d" (old__.high)) \
+		: "b" (new__.low), "c" (new__.high), \
+		  "S" (&(_var)) \
+		: "memory"); \
 	\
 	old__.var; \
 })
@@ -378,17 +377,16 @@ do { \
 	old__.var = *_oval; \
 	new__.var = _nval; \
 	\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
-		  "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  CC_SET(z) \
-		  : ALT_OUTPUT_SP(CC_OUT(z) (success), \
-		    [var] "+m" (__my_cpu_var(_var)), \
-		    "+a" (old__.low), \
-		    "+d" (old__.high)) \
-		  : "b" (new__.low), \
-		    "c" (new__.high), \
-		    "S" (&(_var)) \
-		  : "memory"); \
+	asm_inline qual ( \
+		ALTERNATIVE("call this_cpu_cmpxchg8b_emu", \
+			    "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		CC_SET(z) \
+		: ALT_OUTPUT_SP(CC_OUT(z) (success), \
+				[var] "+m" (__my_cpu_var(_var)), \
+				"+a" (old__.low), "+d" (old__.high)) \
+		: "b" (new__.low), "c" (new__.high), \
+		  "S" (&(_var)) \
+		: "memory"); \
 	if (unlikely(!success)) \
 		*_oval = old__.var; \
 	\
@@ -419,15 +417,14 @@ do { \
 	old__.var = _oval; \
 	new__.var = _nval; \
 	\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
-		  "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)), \
-		    "+a" (old__.low), \
-		    "+d" (old__.high)) \
-		  : "b" (new__.low), \
-		    "c" (new__.high), \
-		    "S" (&(_var)) \
-		  : "memory"); \
+	asm_inline qual ( \
+		ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
+			    "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		: ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)), \
+				"+a" (old__.low), "+d" (old__.high)) \
+		: "b" (new__.low), "c" (new__.high), \
+		  "S" (&(_var)) \
+		: "memory"); \
 	\
 	old__.var; \
 })
@@ -449,19 +446,19 @@ do { \
 	old__.var = *_oval; \
 	new__.var = _nval; \
 	\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
-		  "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  CC_SET(z) \
-		  : ALT_OUTPUT_SP(CC_OUT(z) (success), \
-		    [var] "+m" (__my_cpu_var(_var)), \
-		    "+a" (old__.low), \
-		    "+d" (old__.high)) \
-		  : "b" (new__.low), \
-		    "c" (new__.high), \
-		    "S" (&(_var)) \
-		  : "memory"); \
+	asm_inline qual ( \
+		ALTERNATIVE("call this_cpu_cmpxchg16b_emu", \
+			    "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		CC_SET(z) \
+		: ALT_OUTPUT_SP(CC_OUT(z) (success), \
+				[var] "+m" (__my_cpu_var(_var)), \
+				"+a" (old__.low), "+d" (old__.high)) \
+		: "b" (new__.low), "c" (new__.high), \
+		  "S" (&(_var)) \
+		: "memory"); \
 	if (unlikely(!success)) \
 		*_oval = old__.var; \
+	\
 	likely(success); \
 })
According to [1], the usage of asm pseudo directives in the asm template
can confuse the compiler to wrongly estimate the size of the generated
code. ALTERNATIVE macro expands to several asm pseudo directives, so
its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
to fail by an order of magnitude (the compiler estimates the length of
an asm to be more than 20 instructions).

This wrong estimate further causes suboptimal inlining decisions for
functions that use these locking primitives.

Use asm_inline instead of just asm. For inlining purposes, the size of
the asm is then taken as the minimum size, ignoring how many instructions
the compiler thinks it is.

[1] https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
---
 arch/x86/include/asm/cmpxchg_32.h | 32 +++++++------
 arch/x86/include/asm/percpu.h     | 77 +++++++++++++++----------------
 2 files changed, 55 insertions(+), 54 deletions(-)