Message ID | 20210908025436.dvsgeCXAh%akpm@linux-foundation.org (mailing list archive)
---|---
State | New
Series | [001/147] mm, slub: don't call flush_all() from slab_debug_trace_open()
On 08/09/2021 04.54, Andrew Morton wrote: > From: Vlastimil Babka <vbabka@suse.cz> > Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg > > Jann Horn reported [1] the following theoretically possible race: > > task A: put_cpu_partial() calls preempt_disable() > task A: oldpage = this_cpu_read(s->cpu_slab->partial) > interrupt: kfree() reaches unfreeze_partials() and discards the page > task B (on another CPU): reallocates page as page cache > task A: reads page->pages and page->pobjects, which are actually > halves of the pointer page->lru.prev > task B (on another CPU): frees page > interrupt: allocates page as SLUB page and places it on the percpu partial list > task A: this_cpu_cmpxchg() succeeds > > which would cause page->pages and page->pobjects to end up containing > halves of pointers that would then influence when put_cpu_partial() > happens and show up in root-only sysfs files. Maybe that's acceptable, > I don't know. But there should probably at least be a comment for now > to point out that we're reading union fields of a page that might be > in a completely different state. > > Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only > safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the > latter disables irqs, otherwise a __slab_free() in an irq handler could > call put_cpu_partial() in the middle of ___slab_alloc() manipulating > ->partial and corrupt it. This becomes an issue on RT after a local_lock > is introduced in later patch. The fix means taking the local_lock also in > put_cpu_partial() on RT. > > After debugging this issue, Mike Galbraith suggested [2] that to avoid > different locking schemes on RT and !RT, we can just protect > put_cpu_partial() with disabled irqs (to be converted to > local_lock_irqsave() later) everywhere. This should be acceptable as it's > not a fast path, and moving the actual partial unfreezing outside of the > irq disabled section makes it short, and with the retry loop gone the code > can be also simplified. In addition, the race reported by Jann should no > longer be possible. Based on my microbench[0] measurement changing preempt_disable to local_irq_save will cost us 11 cycles (TSC). I'm not against the change, I just want people to keep this in mind. 
On my E5-1650 v4 @ 3.60GHz: - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns Notice the non-save/restore variant is superfast: - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns [0] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c > [1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/ > [2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb.camel@gmx.de/ > > Link: https://lkml.kernel.org/r/20210904105003.11688-32-vbabka@suse.cz > Reported-by: Jann Horn <jannh@google.com> > Suggested-by: Mike Galbraith <efault@gmx.de> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > Cc: Christoph Lameter <cl@linux.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Jesper Dangaard Brouer <brouer@redhat.com> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Cc: Mel Gorman <mgorman@techsingularity.net> > Cc: Pekka Enberg <penberg@kernel.org> > Cc: Qian Cai <quic_qiancai@quicinc.com> > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> > Cc: Thomas Gleixner <tglx@linutronix.de> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/slub.c | 83 ++++++++++++++++++++++++++++------------------------ > 1 file changed, 45 insertions(+), 38 deletions(-) > > --- a/mm/slub.c~mm-slub-protect-put_cpu_partial-with-disabled-irqs-instead-of-cmpxchg > +++ a/mm/slub.c > @@ -2025,7 +2025,12 @@ static inline void *acquire_slab(struct > return freelist; > } > > +#ifdef CONFIG_SLUB_CPU_PARTIAL > static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain); > +#else > +static inline void put_cpu_partial(struct kmem_cache *s, struct page *page, > + int drain) { } > +#endif > static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags); > > /* > @@ -2459,14 +2464,6 @@ static void unfreeze_partials_cpu(struct > __unfreeze_partials(s, partial_page); > } > > -#else /* CONFIG_SLUB_CPU_PARTIAL */ > - > -static inline void unfreeze_partials(struct kmem_cache *s) { } > -static inline void unfreeze_partials_cpu(struct kmem_cache *s, > - struct kmem_cache_cpu *c) { } > - > -#endif /* CONFIG_SLUB_CPU_PARTIAL */ > - > /* > * Put a page that was just frozen (in __slab_free|get_partial_node) into a > * partial page slot if available. > @@ -2476,46 +2473,56 @@ static inline void unfreeze_partials_cpu > */ > static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) > { > -#ifdef CONFIG_SLUB_CPU_PARTIAL > struct page *oldpage; > - int pages; > - int pobjects; > + struct page *page_to_unfreeze = NULL; > + unsigned long flags; > + int pages = 0; > + int pobjects = 0; > > - preempt_disable(); > - do { > - pages = 0; > - pobjects = 0; > - oldpage = this_cpu_read(s->cpu_slab->partial); > + local_irq_save(flags); > + > + oldpage = this_cpu_read(s->cpu_slab->partial); > > - if (oldpage) { > + if (oldpage) { > + if (drain && oldpage->pobjects > slub_cpu_partial(s)) { > + /* > + * Partial array is full. Move the existing set to the > + * per node partial list. Postpone the actual unfreezing > + * outside of the critical section. > + */ > + page_to_unfreeze = oldpage; > + oldpage = NULL; > + } else { > pobjects = oldpage->pobjects; > pages = oldpage->pages; > - if (drain && pobjects > slub_cpu_partial(s)) { > - /* > - * partial array is full. Move the existing > - * set to the per node partial list. 
> - */ > - unfreeze_partials(s); > - oldpage = NULL; > - pobjects = 0; > - pages = 0; > - stat(s, CPU_PARTIAL_DRAIN); > - } > } > + } > > - pages++; > - pobjects += page->objects - page->inuse; > + pages++; > + pobjects += page->objects - page->inuse; > > - page->pages = pages; > - page->pobjects = pobjects; > - page->next = oldpage; > - > - } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) > - != oldpage); > - preempt_enable(); > -#endif /* CONFIG_SLUB_CPU_PARTIAL */ > + page->pages = pages; > + page->pobjects = pobjects; > + page->next = oldpage; > + > + this_cpu_write(s->cpu_slab->partial, page); > + > + local_irq_restore(flags); > + > + if (page_to_unfreeze) { > + __unfreeze_partials(s, page_to_unfreeze); > + stat(s, CPU_PARTIAL_DRAIN); > + } > } > > +#else /* CONFIG_SLUB_CPU_PARTIAL */ > + > +static inline void unfreeze_partials(struct kmem_cache *s) { } > +static inline void unfreeze_partials_cpu(struct kmem_cache *s, > + struct kmem_cache_cpu *c) { } > + > +#endif /* CONFIG_SLUB_CPU_PARTIAL */ > + > static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) > { > unsigned long flags; > _ > $ uname -a Linux broadwell 5.14.0-net-next+ #612 SMP PREEMPT Wed Sep 8 10:10:04 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux My config: $ zcat /proc/config.gz | grep PREE # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_COUNT=y CONFIG_PREEMPTION=y CONFIG_PREEMPT_DYNAMIC=y CONFIG_PREEMPT_RCU=y CONFIG_HAVE_PREEMPT_DYNAMIC=y CONFIG_PREEMPT_NOTIFIERS=y # CONFIG_DEBUG_PREEMPT is not set # CONFIG_PREEMPT_TRACER is not set # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
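A minimal sketch of the kind of measurement loop behind these numbers, assuming a standalone out-of-tree module rather than the time_bench framework linked above; the module name, the LOOPS constant and the printed labels are made up for illustration:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/compiler.h>
#include <linux/irqflags.h>
#include <linux/preempt.h>
#include <linux/timex.h>	/* get_cycles() */

#define LOOPS 1000000UL

static int __init irqcost_init(void)
{
	unsigned long flags, i;
	cycles_t t0, t1, t2;

	/* Cost of a preempt_disable()/preempt_enable() pair */
	t0 = get_cycles();
	for (i = 0; i < LOOPS; i++) {
		preempt_disable();
		barrier();
		preempt_enable();
	}
	t1 = get_cycles();

	/* Cost of a local_irq_save()/local_irq_restore() pair */
	for (i = 0; i < LOOPS; i++) {
		local_irq_save(flags);
		barrier();
		local_irq_restore(flags);
	}
	t2 = get_cycles();

	pr_info("preempt_disable/enable:  %lu cycles/op\n",
		(unsigned long)(t1 - t0) / LOOPS);
	pr_info("local_irq_save/restore:  %lu cycles/op\n",
		(unsigned long)(t2 - t1) / LOOPS);
	return 0;
}

static void __exit irqcost_exit(void) { }

module_init(irqcost_init);
module_exit(irqcost_exit);
MODULE_LICENSE("GPL");

Loading such a module prints per-operation cost estimates to the kernel log; as discussed later in the thread, the numbers should be treated as rough, since a tight TSC loop hides pipeline and cache effects.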
On 9/8/21 15:05, Jesper Dangaard Brouer wrote: > > > On 08/09/2021 04.54, Andrew Morton wrote: >> From: Vlastimil Babka <vbabka@suse.cz> >> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg >> >> Jann Horn reported [1] the following theoretically possible race: >> >> task A: put_cpu_partial() calls preempt_disable() >> task A: oldpage = this_cpu_read(s->cpu_slab->partial) >> interrupt: kfree() reaches unfreeze_partials() and discards the page >> task B (on another CPU): reallocates page as page cache >> task A: reads page->pages and page->pobjects, which are actually >> halves of the pointer page->lru.prev >> task B (on another CPU): frees page >> interrupt: allocates page as SLUB page and places it on the percpu partial list >> task A: this_cpu_cmpxchg() succeeds >> >> which would cause page->pages and page->pobjects to end up containing >> halves of pointers that would then influence when put_cpu_partial() >> happens and show up in root-only sysfs files. Maybe that's acceptable, >> I don't know. But there should probably at least be a comment for now >> to point out that we're reading union fields of a page that might be >> in a completely different state. >> >> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only >> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the >> latter disables irqs, otherwise a __slab_free() in an irq handler could >> call put_cpu_partial() in the middle of ___slab_alloc() manipulating >> ->partial and corrupt it. This becomes an issue on RT after a local_lock >> is introduced in later patch. The fix means taking the local_lock also in >> put_cpu_partial() on RT. >> >> After debugging this issue, Mike Galbraith suggested [2] that to avoid >> different locking schemes on RT and !RT, we can just protect >> put_cpu_partial() with disabled irqs (to be converted to >> local_lock_irqsave() later) everywhere. This should be acceptable as it's >> not a fast path, and moving the actual partial unfreezing outside of the >> irq disabled section makes it short, and with the retry loop gone the code >> can be also simplified. In addition, the race reported by Jann should no >> longer be possible. > > Based on my microbench[0] measurement changing preempt_disable to > local_irq_save will cost us 11 cycles (TSC). I'm not against the > change, I just want people to keep this in mind. OK, but this is not a fast path for every allocation/free, so it gets amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect cmpxchg to be expensive too? > On my E5-1650 v4 @ 3.60GHz: > - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns > - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns > > Notice the non-save/restore variant is superfast: > - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns It actually surprises me that it's that cheap, and would have expected changing the irq state would be the costly part, not the saving/restoring. Incidentally, would you know what's the cost of save+restore when the irqs are already disabled, so it's effectively a no-op? Thanks, Vlastimil
On 08.09.21 15:58, Vlastimil Babka wrote: > On 9/8/21 15:05, Jesper Dangaard Brouer wrote: >> >> >> On 08/09/2021 04.54, Andrew Morton wrote: >>> From: Vlastimil Babka <vbabka@suse.cz> >>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg >>> >>> Jann Horn reported [1] the following theoretically possible race: >>> >>> task A: put_cpu_partial() calls preempt_disable() >>> task A: oldpage = this_cpu_read(s->cpu_slab->partial) >>> interrupt: kfree() reaches unfreeze_partials() and discards the page >>> task B (on another CPU): reallocates page as page cache >>> task A: reads page->pages and page->pobjects, which are actually >>> halves of the pointer page->lru.prev >>> task B (on another CPU): frees page >>> interrupt: allocates page as SLUB page and places it on the percpu partial list >>> task A: this_cpu_cmpxchg() succeeds >>> >>> which would cause page->pages and page->pobjects to end up containing >>> halves of pointers that would then influence when put_cpu_partial() >>> happens and show up in root-only sysfs files. Maybe that's acceptable, >>> I don't know. But there should probably at least be a comment for now >>> to point out that we're reading union fields of a page that might be >>> in a completely different state. >>> >>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only >>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the >>> latter disables irqs, otherwise a __slab_free() in an irq handler could >>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating >>> ->partial and corrupt it. This becomes an issue on RT after a local_lock >>> is introduced in later patch. The fix means taking the local_lock also in >>> put_cpu_partial() on RT. >>> >>> After debugging this issue, Mike Galbraith suggested [2] that to avoid >>> different locking schemes on RT and !RT, we can just protect >>> put_cpu_partial() with disabled irqs (to be converted to >>> local_lock_irqsave() later) everywhere. This should be acceptable as it's >>> not a fast path, and moving the actual partial unfreezing outside of the >>> irq disabled section makes it short, and with the retry loop gone the code >>> can be also simplified. In addition, the race reported by Jann should no >>> longer be possible. >> >> Based on my microbench[0] measurement changing preempt_disable to >> local_irq_save will cost us 11 cycles (TSC). I'm not against the >> change, I just want people to keep this in mind. > > OK, but this is not a fast path for every allocation/free, so it gets > amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect > cmpxchg to be expensive too? > >> On my E5-1650 v4 @ 3.60GHz: >> - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns >> - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns >> >> Notice the non-save/restore variant is superfast: >> - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns > > It actually surprises me that it's that cheap, and would have expected > changing the irq state would be the costly part, not the saving/restoring. > Incidentally, would you know what's the cost of save+restore when the > irqs are already disabled, so it's effectively a no-op? It surprises me as well. That would imply that protecting short RCU sections using local_irq_disable local_irq_enable instead of via preempt_disable preempt_enable would actually be very beneficial. Are the numbers trustworthy? :)
On 08.09.21 16:55, David Hildenbrand wrote: > On 08.09.21 15:58, Vlastimil Babka wrote: >> On 9/8/21 15:05, Jesper Dangaard Brouer wrote: >>> >>> >>> On 08/09/2021 04.54, Andrew Morton wrote: >>>> From: Vlastimil Babka <vbabka@suse.cz> >>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg >>>> >>>> Jann Horn reported [1] the following theoretically possible race: >>>> >>>> task A: put_cpu_partial() calls preempt_disable() >>>> task A: oldpage = this_cpu_read(s->cpu_slab->partial) >>>> interrupt: kfree() reaches unfreeze_partials() and discards the page >>>> task B (on another CPU): reallocates page as page cache >>>> task A: reads page->pages and page->pobjects, which are actually >>>> halves of the pointer page->lru.prev >>>> task B (on another CPU): frees page >>>> interrupt: allocates page as SLUB page and places it on the percpu partial list >>>> task A: this_cpu_cmpxchg() succeeds >>>> >>>> which would cause page->pages and page->pobjects to end up containing >>>> halves of pointers that would then influence when put_cpu_partial() >>>> happens and show up in root-only sysfs files. Maybe that's acceptable, >>>> I don't know. But there should probably at least be a comment for now >>>> to point out that we're reading union fields of a page that might be >>>> in a completely different state. >>>> >>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only >>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the >>>> latter disables irqs, otherwise a __slab_free() in an irq handler could >>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating >>>> ->partial and corrupt it. This becomes an issue on RT after a local_lock >>>> is introduced in later patch. The fix means taking the local_lock also in >>>> put_cpu_partial() on RT. >>>> >>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid >>>> different locking schemes on RT and !RT, we can just protect >>>> put_cpu_partial() with disabled irqs (to be converted to >>>> local_lock_irqsave() later) everywhere. This should be acceptable as it's >>>> not a fast path, and moving the actual partial unfreezing outside of the >>>> irq disabled section makes it short, and with the retry loop gone the code >>>> can be also simplified. In addition, the race reported by Jann should no >>>> longer be possible. >>> >>> Based on my microbench[0] measurement changing preempt_disable to >>> local_irq_save will cost us 11 cycles (TSC). I'm not against the >>> change, I just want people to keep this in mind. >> >> OK, but this is not a fast path for every allocation/free, so it gets >> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect >> cmpxchg to be expensive too? >> >>> On my E5-1650 v4 @ 3.60GHz: >>> - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns >>> - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns >>> >>> Notice the non-save/restore variant is superfast: >>> - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns >> >> It actually surprises me that it's that cheap, and would have expected >> changing the irq state would be the costly part, not the saving/restoring. >> Incidentally, would you know what's the cost of save+restore when the >> irqs are already disabled, so it's effectively a no-op? > > It surprises me as well. 
> That would imply that protecting short RCU
> sections using
>
> local_irq_disable
> local_irq_enable
>
> instead of via
>
> preempt_disable
> preempt_enable
>
> would actually be very beneficial.
>
> Are the numbers trustworthy? :)
>

.. and especially did the benchmark consider side effects of
enabling/disabling interrupts (pipeline flushes etc ..)?
On 08/09/2021 15.58, Vlastimil Babka wrote:
> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/09/2021 04.54, Andrew Morton wrote:
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>
>>> Jann Horn reported [1] the following theoretically possible race:
>>>
>>>   task A: put_cpu_partial() calls preempt_disable()
>>>   task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>   interrupt: kfree() reaches unfreeze_partials() and discards the page
>>>   task B (on another CPU): reallocates page as page cache
>>>   task A: reads page->pages and page->pobjects, which are actually
>>>     halves of the pointer page->lru.prev
>>>   task B (on another CPU): frees page
>>>   interrupt: allocates page as SLUB page and places it on the percpu partial list
>>>   task A: this_cpu_cmpxchg() succeeds
>>>
>>> which would cause page->pages and page->pobjects to end up containing
>>> halves of pointers that would then influence when put_cpu_partial()
>>> happens and show up in root-only sysfs files. Maybe that's acceptable,
>>> I don't know. But there should probably at least be a comment for now
>>> to point out that we're reading union fields of a page that might be
>>> in a completely different state.
>>>
>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
>>> latter disables irqs, otherwise a __slab_free() in an irq handler could
>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>>> ->partial and corrupt it. This becomes an issue on RT after a local_lock
>>> is introduced in later patch. The fix means taking the local_lock also in
>>> put_cpu_partial() on RT.
>>>
>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>>> different locking schemes on RT and !RT, we can just protect
>>> put_cpu_partial() with disabled irqs (to be converted to
>>> local_lock_irqsave() later) everywhere. This should be acceptable as it's
>>> not a fast path, and moving the actual partial unfreezing outside of the
>>> irq disabled section makes it short, and with the retry loop gone the code
>>> can be also simplified. In addition, the race reported by Jann should no
>>> longer be possible.
>>
>> Based on my microbench[0] measurement changing preempt_disable to
>> local_irq_save will cost us 11 cycles (TSC). I'm not against the
>> change, I just want people to keep this in mind.
>
> OK, but this is not a fast path for every allocation/free, so it gets
> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
> cmpxchg to be expensive too?

Added tests for this:
 - this_cpu_cmpxchg cost: 5 cycles(tsc) 1.581 ns
 - cmpxchg          cost: 18 cycles(tsc) 5.006 ns

>> On my E5-1650 v4 @ 3.60GHz:
>>  - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns
>>  - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>
>> Notice the non-save/restore variant is superfast:
>>  - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>
> It actually surprises me that it's that cheap, and would have expected
> changing the irq state would be the costly part, not the saving/restoring.
> Incidentally, would you know what's the cost of save+restore when the
> irqs are already disabled, so it's effectively a no-op?

The non-save variant simply translates into CLI and STI, which seems to
be very fast.

The cost of save+restore when the irqs are already disabled is the same
(did a quick test).

Cannot remember who told me, but (apparently) the expensive part is
reading the CPU FLAGS.

I did a quick test with:

	/** Loop to measure **/
	for (i = 0; i < rec->loops; i++) {
		local_irq_save(flags);
		loops_cnt++;
		barrier();
		//local_irq_restore(flags);
		local_irq_enable();
	}

Doing a save + enable: This cost 21 cycles(tsc) 6.015 ns.
(Cost before was 22 cycles)

This confirms reading the CPU FLAGS seems to be the expensive part.

--Jesper
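For context, the x86 primitives being compared reduce to roughly the following. This is a simplified sketch modeled on arch/x86/include/asm/irqflags.h of that era, with hypothetical my_* names; the pushf/pop and push/popf pairs are the FLAGS accesses identified above as the costly part, while the non-save variant is just cli/sti:

/* Read FLAGS: this is what local_irq_save() adds on top of cli */
static inline unsigned long my_save_fl(void)
{
	unsigned long flags;

	asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
	return flags;
}

/* Write FLAGS back: what local_irq_restore() does instead of a plain sti */
static inline void my_restore_fl(unsigned long flags)
{
	asm volatile("push %0 ; popf" : : "g" (flags) : "memory", "cc");
}

/* The non-save variants: just the interrupt-flag instructions */
static inline void my_irq_disable(void)
{
	asm volatile("cli" : : : "memory");
}

static inline void my_irq_enable(void)
{
	asm volatile("sti" : : : "memory");
}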
On Wed, Sep 8, 2021 at 9:11 AM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
>
> The non-save variant simply translates into CLI and STI, which seems to
> be very fast.

It will depend on the microarchitecture. Happily:

> The cost of save+restore when the irqs are already disabled is the same
> (did a quick test).

The really expensive part used to be P4. 'popf' was hundreds of cycles if
any of the non-arithmetic bits changed, iirc.

P4 used to be a big headache just because of things like that -
straightforward code ran very well, but anything a bit more special took
forever because it flushed the pipeline.

So some of our optimizations may be historic because of things like that.
We don't really need to worry about the P4 glass jaws any more, but it
*used* to be much quicker to do 'preempt_disable()' that just does an add
to a memory location than it was to disable interrupts.

> Cannot remember who told me, but (apparently) the expensive part is
> reading the CPU FLAGS.

Again, it ends up being very dependent on the uarch.

Reading and writing the flags register is somewhat expensive because it's
not really "one" register in hardware any more (even if that was obviously
the historical implementation).

These days, the arithmetic flags are generally multiple renamed registers,
and then the other flags are a separate system register (possibly multiple
bits spread out).

The cost of doing those flag reads and writes is hard to really specify,
because in an OoO architecture a lot of it ends up being "how much of that
can be done in parallel, and what's the pipeline serialization cost".

Doing a loop with rdtsc is not necessarily AT ALL indicative of the cost
when there is other real code around it. The cost _could_ be much smaller,
in case there is little serialization with normal other code. Or, it could
be much bigger than what a rdtsc shows, because if it's a hard pipeline
flush, then a tight loop with those things won't have any real work to
flush, while in "real code" there may be hundreds of instructions in
flight and doing the flush is very expensive.

The good news is that afaik, all the modern x86 CPU microarchitectures do
reasonably well. And while a "pushf/cli/popf" sequence is probably more
cycles than an add/subtract one in a benchmark, if the preempt counter is
not otherwise needed, and is cold in the cache, then the pushf/cli/popf
may be *much* cheaper than a cache miss.

So the only way to really tell would be to run real benchmarks of real
loads on multiple different microarchitectures.

I'm pretty sure the actual result is: "you can't measure the 10-cycle
difference on any modern core because it can actually go either way". But
"I'm pretty sure" and "reality" are not the same thing.

These days, pipeline flushes and cache misses (and then as a very
particularly bad case - cache line pingpong issues) are almost the only
thing that matters. And the most common reason by far for the pipeline
flushes are branch mispredicts, but see above: the system bits in the
flags register _have_ been cause of them in the past, so it's not entirely
impossible.

              Linus
On 08/09/2021 16.59, David Hildenbrand wrote: > On 08.09.21 16:55, David Hildenbrand wrote: >> On 08.09.21 15:58, Vlastimil Babka wrote: >>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote: >>>> >>>> >>>> On 08/09/2021 04.54, Andrew Morton wrote: >>>>> From: Vlastimil Babka <vbabka@suse.cz> >>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs >>>>> instead of cmpxchg >>>>> >>>>> Jann Horn reported [1] the following theoretically possible race: >>>>> >>>>> task A: put_cpu_partial() calls preempt_disable() >>>>> task A: oldpage = this_cpu_read(s->cpu_slab->partial) >>>>> interrupt: kfree() reaches unfreeze_partials() and discards >>>>> the page >>>>> task B (on another CPU): reallocates page as page cache >>>>> task A: reads page->pages and page->pobjects, which are actually >>>>> halves of the pointer page->lru.prev >>>>> task B (on another CPU): frees page >>>>> interrupt: allocates page as SLUB page and places it on the >>>>> percpu partial list >>>>> task A: this_cpu_cmpxchg() succeeds >>>>> >>>>> which would cause page->pages and page->pobjects to end up >>>>> containing >>>>> halves of pointers that would then influence when >>>>> put_cpu_partial() >>>>> happens and show up in root-only sysfs files. Maybe that's >>>>> acceptable, >>>>> I don't know. But there should probably at least be a comment >>>>> for now >>>>> to point out that we're reading union fields of a page that >>>>> might be >>>>> in a completely different state. >>>>> >>>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() >>>>> is only >>>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() >>>>> if the >>>>> latter disables irqs, otherwise a __slab_free() in an irq handler >>>>> could >>>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating >>>>> ->partial and corrupt it. This becomes an issue on RT after a >>>>> local_lock >>>>> is introduced in later patch. The fix means taking the local_lock >>>>> also in >>>>> put_cpu_partial() on RT. >>>>> >>>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid >>>>> different locking schemes on RT and !RT, we can just protect >>>>> put_cpu_partial() with disabled irqs (to be converted to >>>>> local_lock_irqsave() later) everywhere. This should be acceptable >>>>> as it's >>>>> not a fast path, and moving the actual partial unfreezing outside >>>>> of the >>>>> irq disabled section makes it short, and with the retry loop gone >>>>> the code >>>>> can be also simplified. In addition, the race reported by Jann >>>>> should no >>>>> longer be possible. >>>> >>>> Based on my microbench[0] measurement changing preempt_disable to >>>> local_irq_save will cost us 11 cycles (TSC). I'm not against the >>>> change, I just want people to keep this in mind. >>> >>> OK, but this is not a fast path for every allocation/free, so it gets >>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect >>> cmpxchg to be expensive too? >>> >>>> On my E5-1650 v4 @ 3.60GHz: >>>> - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns >>>> - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns >>>> >>>> Notice the non-save/restore variant is superfast: >>>> - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns >>> >>> It actually surprises me that it's that cheap, and would have expected >>> changing the irq state would be the costly part, not the >>> saving/restoring. 
>>> Incidentally, would you know what's the cost of save+restore when the
>>> irqs are already disabled, so it's effectively a no-op?
>>
>> It surprises me as well. That would imply that protecting short RCU
>> sections using
>>
>> local_irq_disable
>> local_irq_enable
>>
>> instead of via
>>
>> preempt_disable
>> preempt_enable
>>
>> would actually be very beneficial.

Please don't draw this as a general conclusion.
As Linus describes in detail, the IRQ disable/enable will be very
micro-arch specific.

The preempt_disable/enable will likely be more stable/consistent across
micro-archs.
Keep an eye out for kernel config options when judging
preempt_disable/enable performance [1]

[1]
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367

>>
>> Are the numbers trustworthy? :)
>>
>
> .. and especially did the benchmark consider side effects of
> enabling/disabling interrupts (pipeline flushes etc ..)?
>

Of course not, this is a microbenchmark... they are per definition not
trustworthy :-P

-Jesper
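The config dependence is visible directly in the definitions of the primitives; roughly (an abridged sketch of include/linux/preempt.h around this kernel version, not a verbatim copy):

#ifdef CONFIG_PREEMPT_COUNT

#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#ifdef CONFIG_PREEMPTION	/* e.g. CONFIG_PREEMPT=y */
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)
#else	/* count is kept, but no kernel preemption */
#define preempt_enable() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)
#endif

#else	/* !CONFIG_PREEMPT_COUNT: both collapse to a compiler barrier */
#define preempt_disable()	barrier()
#define preempt_enable()	barrier()
#endif

So what a "preempt_disable/enable" measurement means depends heavily on whether the kernel keeps a preempt count at all, and whether enable can end up calling into the scheduler.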
On 08.09.21 19:14, Jesper Dangaard Brouer wrote: > > > On 08/09/2021 16.59, David Hildenbrand wrote: >> On 08.09.21 16:55, David Hildenbrand wrote: >>> On 08.09.21 15:58, Vlastimil Babka wrote: >>>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote: >>>>> >>>>> >>>>> On 08/09/2021 04.54, Andrew Morton wrote: >>>>>> From: Vlastimil Babka <vbabka@suse.cz> >>>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs >>>>>> instead of cmpxchg >>>>>> >>>>>> Jann Horn reported [1] the following theoretically possible race: >>>>>> >>>>>> task A: put_cpu_partial() calls preempt_disable() >>>>>> task A: oldpage = this_cpu_read(s->cpu_slab->partial) >>>>>> interrupt: kfree() reaches unfreeze_partials() and discards >>>>>> the page >>>>>> task B (on another CPU): reallocates page as page cache >>>>>> task A: reads page->pages and page->pobjects, which are actually >>>>>> halves of the pointer page->lru.prev >>>>>> task B (on another CPU): frees page >>>>>> interrupt: allocates page as SLUB page and places it on the >>>>>> percpu partial list >>>>>> task A: this_cpu_cmpxchg() succeeds >>>>>> >>>>>> which would cause page->pages and page->pobjects to end up >>>>>> containing >>>>>> halves of pointers that would then influence when >>>>>> put_cpu_partial() >>>>>> happens and show up in root-only sysfs files. Maybe that's >>>>>> acceptable, >>>>>> I don't know. But there should probably at least be a comment >>>>>> for now >>>>>> to point out that we're reading union fields of a page that >>>>>> might be >>>>>> in a completely different state. >>>>>> >>>>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() >>>>>> is only >>>>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() >>>>>> if the >>>>>> latter disables irqs, otherwise a __slab_free() in an irq handler >>>>>> could >>>>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating >>>>>> ->partial and corrupt it. This becomes an issue on RT after a >>>>>> local_lock >>>>>> is introduced in later patch. The fix means taking the local_lock >>>>>> also in >>>>>> put_cpu_partial() on RT. >>>>>> >>>>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid >>>>>> different locking schemes on RT and !RT, we can just protect >>>>>> put_cpu_partial() with disabled irqs (to be converted to >>>>>> local_lock_irqsave() later) everywhere. This should be acceptable >>>>>> as it's >>>>>> not a fast path, and moving the actual partial unfreezing outside >>>>>> of the >>>>>> irq disabled section makes it short, and with the retry loop gone >>>>>> the code >>>>>> can be also simplified. In addition, the race reported by Jann >>>>>> should no >>>>>> longer be possible. >>>>> >>>>> Based on my microbench[0] measurement changing preempt_disable to >>>>> local_irq_save will cost us 11 cycles (TSC). I'm not against the >>>>> change, I just want people to keep this in mind. >>>> >>>> OK, but this is not a fast path for every allocation/free, so it gets >>>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect >>>> cmpxchg to be expensive too? 
>>>>
>>>>> On my E5-1650 v4 @ 3.60GHz:
>>>>>  - preempt_disable(+enable) cost: 11 cycles(tsc) 3.161 ns
>>>>>  - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>>>
>>>>> Notice the non-save/restore variant is superfast:
>>>>>  - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>>>
>>>> It actually surprises me that it's that cheap, and would have expected
>>>> changing the irq state would be the costly part, not the
>>>> saving/restoring.
>>>> Incidentally, would you know what's the cost of save+restore when the
>>>> irqs are already disabled, so it's effectively a no-op?
>>>
>>> It surprises me as well. That would imply that protecting short RCU
>>> sections using
>>>
>>> local_irq_disable
>>> local_irq_enable
>>>
>>> instead of via
>>>
>>> preempt_disable
>>> preempt_enable
>>>
>>> would actually be very beneficial.
>
> Please don't draw this as a general conclusion.
> As Linus describes in detail, the IRQ disable/enable will be very
> micro-arch specific.

Sure: but especially for modern micro-archs, this might be very relevant.

I actually stumbled over this exact question 1 month ago, that's why your
comment caught my attention. I looked for CLI/STI cycle numbers and didn't
really find a trusted source. I only found [1], which made it look like
incrementing/decrementing some counter would actually be much faster most
of the time.

[1] https://www.agner.org/optimize/instruction_tables.pdf

>
> The preempt_disable/enable will likely be more stable/consistent across
> micro-archs.
> Keep an eye out for kernel config options when judging
> preempt_disable/enable performance [1]
>
> [1]
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367
>
>
>>>
>>> Are the numbers trustworthy? :)
>>>
>>
>> .. and especially did the benchmark consider side effects of
>> enabling/disabling interrupts (pipeline flushes etc ..)?
>>
>
> Of course not, this is a microbenchmark... they are per definition not
> trustworthy :-P

:)
--- a/mm/slub.c~mm-slub-protect-put_cpu_partial-with-disabled-irqs-instead-of-cmpxchg
+++ a/mm/slub.c
@@ -2025,7 +2025,12 @@ static inline void *acquire_slab(struct
 	return freelist;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+#else
+static inline void put_cpu_partial(struct kmem_cache *s, struct page *page,
+				   int drain) { }
+#endif
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
 /*
@@ -2459,14 +2464,6 @@ static void unfreeze_partials_cpu(struct
 	__unfreeze_partials(s, partial_page);
 }
 
-#else	/* CONFIG_SLUB_CPU_PARTIAL */
-
-static inline void unfreeze_partials(struct kmem_cache *s) { }
-static inline void unfreeze_partials_cpu(struct kmem_cache *s,
-				  struct kmem_cache_cpu *c) { }
-
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
-
 /*
  * Put a page that was just frozen (in __slab_free|get_partial_node) into a
  * partial page slot if available.
@@ -2476,46 +2473,56 @@ static inline void unfreeze_partials_cpu
  */
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 {
-#ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *oldpage;
-	int pages;
-	int pobjects;
+	struct page *page_to_unfreeze = NULL;
+	unsigned long flags;
+	int pages = 0;
+	int pobjects = 0;
 
-	preempt_disable();
-	do {
-		pages = 0;
-		pobjects = 0;
-		oldpage = this_cpu_read(s->cpu_slab->partial);
+	local_irq_save(flags);
+
+	oldpage = this_cpu_read(s->cpu_slab->partial);
 
-		if (oldpage) {
+	if (oldpage) {
+		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
+			/*
+			 * Partial array is full. Move the existing set to the
+			 * per node partial list. Postpone the actual unfreezing
+			 * outside of the critical section.
+			 */
+			page_to_unfreeze = oldpage;
+			oldpage = NULL;
+		} else {
 			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
-			if (drain && pobjects > slub_cpu_partial(s)) {
-				/*
-				 * partial array is full. Move the existing
-				 * set to the per node partial list.
-				 */
-				unfreeze_partials(s);
-				oldpage = NULL;
-				pobjects = 0;
-				pages = 0;
-				stat(s, CPU_PARTIAL_DRAIN);
-			}
 		}
+	}
 
-		pages++;
-		pobjects += page->objects - page->inuse;
+	pages++;
+	pobjects += page->objects - page->inuse;
 
-		page->pages = pages;
-		page->pobjects = pobjects;
-		page->next = oldpage;
-
-	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
-								!= oldpage);
-	preempt_enable();
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+	page->pages = pages;
+	page->pobjects = pobjects;
+	page->next = oldpage;
+
+	this_cpu_write(s->cpu_slab->partial, page);
+
+	local_irq_restore(flags);
+
+	if (page_to_unfreeze) {
+		__unfreeze_partials(s, page_to_unfreeze);
+		stat(s, CPU_PARTIAL_DRAIN);
+	}
 }
 
+#else	/* CONFIG_SLUB_CPU_PARTIAL */
+
+static inline void unfreeze_partials(struct kmem_cache *s) { }
+static inline void unfreeze_partials_cpu(struct kmem_cache *s,
+				  struct kmem_cache_cpu *c) { }
+
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	unsigned long flags;