Message ID: 20230405141710.3551-1-ubizjak@gmail.com (mailing list archive)
Series: locking: Introduce local{,64}_try_cmpxchg
On 4/5/23 07:17, Uros Bizjak wrote:
> Add generic and target specific support for local{,64}_try_cmpxchg
> and wire up support for all targets that use local_t infrastructure.

I feel like I'm missing some context.

What are the actual end user visible effects of this series? Is there a
measurable decrease in perf overhead? Why go to all this trouble for
perf? Who else will use local_try_cmpxchg()?

I'm all for improving things, and perf is an important user. But, if
the goal here is improving performance, it would be nice to see at least
a stab at quantifying the performance delta.
On Wed, Apr 5, 2023 at 6:37 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
>
> I feel like I'm missing some context.
>
> What are the actual end user visible effects of this series? Is there a
> measurable decrease in perf overhead? Why go to all this trouble for
> perf? Who else will use local_try_cmpxchg()?

This functionality was requested by perf people [1], so perhaps Steven
can give us some concrete examples.

In general, apart from the removal of an unneeded compare instruction on
x86, usage of try_cmpxchg also results in slightly better code on non-x86
targets [2], since the code now correctly identifies the fast path through
the cmpxchg loop.

Also important is that try_cmpxchg code reuses the result of the cmpxchg
instruction in the loop, so a read from memory in the loop is eliminated.
When reviewing the cmpxchg usage sites, I found numerous places where an
unnecessary read from memory was present in the loop; two examples can be
seen in the last patch of this series.

Also, using try_cmpxchg prevents inconsistencies in the cmpxchg loop,
where the result of the cmpxchg is compared with the wrong "old" value -
one such bug is still lurking in x86 APIC code, please see [3].

Please note that apart from the perf subsystem, the events subsystem can
also be improved by using local_try_cmpxchg. This is the reason that the
last patch includes a change in events/core.c.

> I'm all for improving things, and perf is an important user. But, if
> the goal here is improving performance, it would be nice to see at least
> a stab at quantifying the performance delta.

[1] https://lore.kernel.org/lkml/20230301131831.6c8d4ff5@gandalf.local.home/
[2] https://lore.kernel.org/lkml/Yo91omfDZtTgXhyn@FVFF77S0Q05N.cambridge.arm.com/
[3] https://lore.kernel.org/lkml/20230227160917.107820-1-ubizjak@gmail.com/

Uros.
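To make the difference concrete, here is a minimal sketch (not code from the
series itself; the function and variable names are illustrative) of the same
update written first as a classic local_cmpxchg() loop and then with the
local_try_cmpxchg() form the series introduces. The first version re-reads the
counter and re-compares the returned value on every iteration; the second
reuses the value the instruction already returned, and on x86 the compiler can
branch directly on the flags set by cmpxchg.

	/* A sketch, not from the patch series: names are illustrative. */
	#include <asm/local.h>

	/* Before: extra read of *head and extra compare on each iteration. */
	static void bump_head_cmpxchg(local_t *head, long len)
	{
		long old, new;

		do {
			old = local_read(head);	/* re-read every iteration */
			new = old + len;
		} while (local_cmpxchg(head, old, new) != old);	/* re-compare */
	}

	/* After: a failing cmpxchg updates 'old' for us, and the loop
	 * condition maps directly onto the flags set by the instruction. */
	static void bump_head_try_cmpxchg(local_t *head, long len)
	{
		long old = local_read(head);
		long new;

		do {
			new = old + len;
		} while (!local_try_cmpxchg(head, &old, new));
	}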
From: Dave Hansen
> Sent: 05 April 2023 17:37
>
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
>
> I feel like I'm missing some context.
>
> What are the actual end user visible effects of this series? Is there a
> measurable decrease in perf overhead? Why go to all this trouble for
> perf? Who else will use local_try_cmpxchg()?

I'm assuming the local_xxx operations only have to be safe wrt interrupts?

On x86 it is possible that an alternate instruction sequence that doesn't
use a locked instruction may actually be faster!

Although, maybe, any kind of locked cmpxchg just needs to ensure the cache
line isn't 'stolen', so apart from possible slight delays on another cpu
that gets a cache miss for the line it all makes little difference.
The cache line miss costs a lot anyway, line bouncing more, and is best
avoided.
So is there actually much of a benefit at all?

Clearly the try_cmpxchg helps - but that is a different issue.

	David
On Thu, Apr 6, 2023 at 10:26 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Dave Hansen
> > Sent: 05 April 2023 17:37
> >
> > On 4/5/23 07:17, Uros Bizjak wrote:
> > > Add generic and target specific support for local{,64}_try_cmpxchg
> > > and wire up support for all targets that use local_t infrastructure.
> >
> > I feel like I'm missing some context.
> >
> > What are the actual end user visible effects of this series? Is there a
> > measurable decrease in perf overhead? Why go to all this trouble for
> > perf? Who else will use local_try_cmpxchg()?
>
> I'm assuming the local_xxx operations only have to be safe wrt interrupts?
> On x86 it is possible that an alternate instruction sequence
> that doesn't use a locked instruction may actually be faster!

Please note that the "local" functions do not use the lock prefix. Only
the atomic properties of the cmpxchg instruction are exploited, since it
only needs to be safe wrt interrupts.

Uros.

> Although, maybe, any kind of locked cmpxchg just needs to ensure
> the cache line isn't 'stolen', so apart from possible slight
> delays on another cpu that gets a cache miss for the line
> it all makes little difference.
> The cache line miss costs a lot anyway, line bouncing more,
> and is best avoided.
> So is there actually much of a benefit at all?
>
> Clearly the try_cmpxchg helps - but that is a different issue.
>
> David
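For readers who have not used local_t before, a short usage sketch may help
(the counter name and helpers below are made up for illustration, not taken
from the series). local_t is meant for per-CPU data that only the owning CPU
ever modifies, so its operations only need to be atomic with respect to
interrupts on that CPU - which is exactly why the x86 implementation can use
cmpxchg without the LOCK prefix.

	/* Illustrative sketch; the counter name and functions are made up. */
	#include <linux/percpu.h>
	#include <asm/local.h>

	static DEFINE_PER_CPU(local_t, dropped_events);

	/* May be called from process context and from IRQ context, but
	 * always for the local CPU's counter, so no LOCK prefix is needed. */
	static void note_dropped_event(void)
	{
		local_inc(this_cpu_ptr(&dropped_events));
	}

	/* Other CPUs may read the counter; a read is just a plain load. */
	static long read_dropped_events(int cpu)
	{
		return local_read(per_cpu_ptr(&dropped_events, cpu));
	}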
From: Uros Bizjak
> Sent: 06 April 2023 09:39
>
> On Thu, Apr 6, 2023 at 10:26 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Dave Hansen
> > > Sent: 05 April 2023 17:37
> > >
> > > On 4/5/23 07:17, Uros Bizjak wrote:
> > > > Add generic and target specific support for local{,64}_try_cmpxchg
> > > > and wire up support for all targets that use local_t infrastructure.
> > >
> > > I feel like I'm missing some context.
> > >
> > > What are the actual end user visible effects of this series? Is there a
> > > measurable decrease in perf overhead? Why go to all this trouble for
> > > perf? Who else will use local_try_cmpxchg()?
> >
> > I'm assuming the local_xxx operations only have to be safe wrt interrupts?
> > On x86 it is possible that an alternate instruction sequence
> > that doesn't use a locked instruction may actually be faster!
>
> Please note that the "local" functions do not use the lock prefix. Only
> the atomic properties of the cmpxchg instruction are exploited, since it
> only needs to be safe wrt interrupts.

Gah, I was assuming that LOCK was implied - like it is for xchg and all
the bit instructions.

In any case I suspect it makes little difference unless the locked
variant affects the instruction pipeline.

In fact, you may want to stop the cache line being invalidated between
the read and write in order to avoid an extra cache line bounce.

	David
On Wed, Apr 05, 2023 at 09:37:04AM -0700, Dave Hansen wrote:
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
>
> I feel like I'm missing some context.
>
> What are the actual end user visible effects of this series? Is there a
> measurable decrease in perf overhead? Why go to all this trouble for
> perf? Who else will use local_try_cmpxchg()?

Overall, the theory is that it can generate slightly better code (e.g. by
reusing the flags on x86). In practice, that might be in the noise, but as
demonstrated in prior postings the code generation is no worse than before.

From my perspective, the more important part is that this aligns local_t
with the other atomic*_t APIs, which all have ${atomictype}_try_cmpxchg(),
and for consistency/legibility/maintainability it's nice to be able to use
the same code patterns, e.g.

	${inttype} new, old = ${atomictype}_read(ptr);
	do {
		...
		new = do_something_with(old);
	} while (!${atomictype}_try_cmpxchg(ptr, &old, new));

> I'm all for improving things, and perf is an important user. But, if
> the goal here is improving performance, it would be nice to see at least
> a stab at quantifying the performance delta.

IIUC, Steve's original request for local_try_cmpxchg() was a combination
of a theoretical performance benefit and a more general preference to use
try_cmpxchg() for consistency / better structure of the source code:

  https://lore.kernel.org/lkml/20230301131831.6c8d4ff5@gandalf.local.home/

I agree it'd be nice to have performance figures, but I think those would
only need to demonstrate a lack of a regression rather than a performance
improvement, and I think it's fairly clear from eyeballing the generated
instructions that a regression isn't likely.

Thanks,
Mark.
On 4/11/23 04:35, Mark Rutland wrote:
> I agree it'd be nice to have performance figures, but I think those would only
> need to demonstrate a lack of a regression rather than a performance
> improvement, and I think it's fairly clear from eyeballing the generated
> instructions that a regression isn't likely.

Thanks for the additional context.

I totally agree that there's zero burden here to show a performance
increase. If anyone can think of a quick way to do _some_ kind of
benchmark on the code being changed and just show that it's free of
brown paper bags, it would be appreciated. Nothing crazy, just think of
one workload (synthetic or not) that will stress the paths being changed
and run it with and without these changes. Make sure there are no
surprises.

I also agree that it's unlikely to be brown paper bag material.
From: Dave Hansen
> Sent: 11 April 2023 14:44
>
> On 4/11/23 04:35, Mark Rutland wrote:
> > I agree it'd be nice to have performance figures, but I think those would only
> > need to demonstrate a lack of a regression rather than a performance
> > improvement, and I think it's fairly clear from eyeballing the generated
> > instructions that a regression isn't likely.
>
> Thanks for the additional context.
>
> I totally agree that there's zero burden here to show a performance
> increase. If anyone can think of a quick way to do _some_ kind of
> benchmark on the code being changed and just show that it's free of
> brown paper bags, it would be appreciated. Nothing crazy, just think of
> one workload (synthetic or not) that will stress the paths being changed
> and run it with and without these changes. Make sure there are no
> surprises.
>
> I also agree that it's unlikely to be brown paper bag material.

The only thing I can think of is that, on x86, the locked variant may
actually be faster!

Both require exclusive access to the cache line (the unlocked variant
always does the write! [1]).
So if the cache line is contended between cpus the unlocked variant might
ping-pong the cache line twice!
Of course, if the line is shared like that then performance is horrid.

[1] I checked on an uncached PCIe address on which I can monitor the TLPs.
The write always happens, so you can use cmpxchg16b with a 'known bad
value' to do a 16 byte read as a single TLP (without using an SSE
register).

	David
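The 16-byte read trick mentioned in [1] can be sketched in user space roughly
as follows (an illustration with made-up names, not kernel code; it assumes a
compiler and target where the 16-byte compare-and-swap is available, e.g. gcc
with -mcx16, and depending on the compiler version the call may be routed
through libatomic). The idea is to issue the compare-exchange with an expected
value the location is assumed never to hold, so the compare fails and the
current 16-byte contents come back atomically.

	/* Sketch only: read128() and the "impossible" expected value are
	 * assumptions made for this illustration. */
	static unsigned __int128 read128(volatile unsigned __int128 *p)
	{
		/* A value the location is assumed never to contain. */
		unsigned __int128 expected = ~(unsigned __int128)0;

		/* On failure the builtin copies the current contents of *p
		 * into 'expected'; on the unlikely success it stores back the
		 * same value, so memory is unchanged either way. */
		__atomic_compare_exchange_n(p, &expected, expected, false,
					    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
		return expected;
	}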