[031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

Message ID 20210908025436.dvsgeCXAh%akpm@linux-foundation.org (mailing list archive)
State New
Series [001/147] mm, slub: don't call flush_all() from slab_debug_trace_open()

Commit Message

Andrew Morton Sept. 8, 2021, 2:54 a.m. UTC
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

Jann Horn reported [1] the following theoretically possible race:

  task A: put_cpu_partial() calls preempt_disable()
  task A: oldpage = this_cpu_read(s->cpu_slab->partial)
  interrupt: kfree() reaches unfreeze_partials() and discards the page
  task B (on another CPU): reallocates page as page cache
  task A: reads page->pages and page->pobjects, which are actually
  halves of the pointer page->lru.prev
  task B (on another CPU): frees page
  interrupt: allocates page as SLUB page and places it on the percpu partial list
  task A: this_cpu_cmpxchg() succeeds

  which would cause page->pages and page->pobjects to end up containing
  halves of pointers that would then influence when put_cpu_partial()
  happens and show up in root-only sysfs files. Maybe that's acceptable,
  I don't know. But there should probably at least be a comment for now
  to point out that we're reading union fields of a page that might be
  in a completely different state.
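
The "halves of the pointer" mentioned above come from struct page's union:
in this kernel the SLUB pages/pobjects fields occupy the same bytes as
lru.prev, so reading them from a page that has been recycled as page cache
yields the two halves of that pointer.  A simplified sketch of the overlap
(not the actual <linux/mm_types.h> definition; 64-bit little-endian
assumed):

struct page_overlap_sketch {
	union {
		struct {			/* page cache / anon view */
			void *lru_next;
			void *lru_prev;
		};
		struct {			/* SLUB percpu partial view */
			struct page *next;	/* aliases lru_next */
			int pages;		/* low half of lru_prev */
			int pobjects;		/* high half of lru_prev */
		};
	};
};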

Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
latter disables irqs, otherwise a __slab_free() in an irq handler could
call put_cpu_partial() in the middle of ___slab_alloc() manipulating
->partial and corrupt it.  This becomes an issue on RT after a local_lock
is introduced in a later patch.  The fix means taking the local_lock also in
put_cpu_partial() on RT.

After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect
put_cpu_partial() with disabled irqs (to be converted to
local_lock_irqsave() later) everywhere.  This should be acceptable as it's
not a fast path, and moving the actual partial unfreezing outside of the
irq disabled section makes it short, and with the retry loop gone the code
can also be simplified.  In addition, the race reported by Jann should no
longer be possible.
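
In outline, the reworked function looks like this (condensed from the diff
below; the pages/pobjects bookkeeping and the CPU_PARTIAL_DRAIN accounting
are elided in this sketch):

static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
	struct page *oldpage, *page_to_unfreeze = NULL;
	unsigned long flags;

	local_irq_save(flags);

	oldpage = this_cpu_read(s->cpu_slab->partial);
	if (oldpage && drain && oldpage->pobjects > slub_cpu_partial(s)) {
		/* partial array is full: unfreeze it, but only after irqs are on again */
		page_to_unfreeze = oldpage;
		oldpage = NULL;
	}

	page->next = oldpage;
	this_cpu_write(s->cpu_slab->partial, page);

	local_irq_restore(flags);

	if (page_to_unfreeze)
		__unfreeze_partials(s, page_to_unfreeze);
}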

[1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/
[2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb.camel@gmx.de/

Link: https://lkml.kernel.org/r/20210904105003.11688-32-vbabka@suse.cz
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   83 ++++++++++++++++++++++++++++------------------------
 1 file changed, 45 insertions(+), 38 deletions(-)

Comments

Jesper Dangaard Brouer Sept. 8, 2021, 1:05 p.m. UTC | #1
On 08/09/2021 04.54, Andrew Morton wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
> 
> [...]

Based on my microbench[0] measurement changing preempt_disable to 
local_irq_save will cost us 11 cycles (TSC).  I'm not against the 
change, I just want people to keep this in mind.

On my E5-1650 v4 @ 3.60GHz:
  - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
  - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns

Notice the non-save/restore variant is superfast:
  - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns


[0] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
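
For reference, the shape of such a measurement loop; this is a hedged
sketch in the spirit of the time_bench code above, not its actual API (the
function name and the use of get_cycles()/div64_u64() here are illustrative,
not taken from that file):

/* needs <linux/timex.h> for get_cycles() and <linux/math64.h> for div64_u64() */
static u64 bench_irq_save_restore(u64 loops)
{
	unsigned long flags;
	u64 start, stop, i;

	start = get_cycles();			/* TSC on x86 */
	for (i = 0; i < loops; i++) {
		local_irq_save(flags);
		barrier();			/* keep the pair from being optimized away */
		local_irq_restore(flags);
	}
	stop = get_cycles();

	return div64_u64(stop - start, loops);	/* average cycles per save+restore */
}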

> [...]

$ uname -a
Linux broadwell 5.14.0-net-next+ #612 SMP PREEMPT Wed Sep 8 10:10:04 
CEST 2021 x86_64 x86_64 x86_64 GNU/Linux


My config:

$ zcat /proc/config.gz | grep PREE
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Vlastimil Babka Sept. 8, 2021, 1:58 p.m. UTC | #2
On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/09/2021 04.54, Andrew Morton wrote:
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>
>> [...]
> 
> Based on my microbench[0] measurement changing preempt_disable to 
> local_irq_save will cost us 11 cycles (TSC).  I'm not against the 
> change, I just want people to keep this in mind.

OK, but this is not a fast path for every allocation/free, so it gets
amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
cmpxchg to be expensive too?

> On my E5-1650 v4 @ 3.60GHz:
>   - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>   - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
> 
> Notice the non-save/restore variant is superfast:
>   - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns

It actually surprises me that it's that cheap, and would have expected
changing the irq state would be the costly part, not the saving/restoring.
Incidentally, would you know what's the cost of save+restore when the
irqs are already disabled, so it's effectively a no-op?

Thanks,
Vlastimil
David Hildenbrand Sept. 8, 2021, 2:55 p.m. UTC | #3
On 08.09.21 15:58, Vlastimil Babka wrote:
> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/09/2021 04.54, Andrew Morton wrote:
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>
>>> [...]
>>
>> Based on my microbench[0] measurement changing preempt_disable to
>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>> change, I just want people to keep this in mind.
> 
> OK, but this is not a fast path for every allocation/free, so it gets
> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
> cmpxchg to be expensive too?
> 
>> On my E5-1650 v4 @ 3.60GHz:
>>    - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>    - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>
>> Notice the non-save/restore variant is superfast:
>>    - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
> 
> It actually surprises me that it's that cheap, and would have expected
> changing the irq state would be the costly part, not the saving/restoring.
> Incidentally, would you know what's the cost of save+restore when the
> irqs are already disabled, so it's effectively a no-op?

It surprises me as well. That would imply that protecting short RCU 
sections using

local_irq_disable
local_irq_enable

instead of via

preempt_disable
preempt_enable

would actually be very beneficial.

Are the numbers trustworthy? :)
David Hildenbrand Sept. 8, 2021, 2:59 p.m. UTC | #4
On 08.09.21 16:55, David Hildenbrand wrote:
> On 08.09.21 15:58, Vlastimil Babka wrote:
>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>>
>>>> [...]
>>>
>>> Based on my microbench[0] measurement changing preempt_disable to
>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>> change, I just want people to keep this in mind.
>>
>> OK, but this is not a fast path for every allocation/free, so it gets
>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
>> cmpxchg to be expensive too?
>>
>>> On my E5-1650 v4 @ 3.60GHz:
>>>     - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>>     - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>
>>> Notice the non-save/restore variant is superfast:
>>>     - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>
>> It actually surprises me that it's that cheap, and would have expected
>> changing the irq state would be the costly part, not the saving/restoring.
>> Incidentally, would you know what's the cost of save+restore when the
>> irqs are already disabled, so it's effectively a no-op?
> 
> It surprises me as well. That would imply that protecting short RCU
> sections using
> 
> local_irq_disable
> local_irq_enable
> 
> instead of via
> 
> preempt_disable
> preempt_enable
> 
> would actually be very beneficial.
> 
> Are the numbers trustworthy? :)
> 

.. and especially did the benchmark consider side effects of 
enabling/disabling interrupts (pipeline flushes etc ..)?
Jesper Dangaard Brouer Sept. 8, 2021, 4:11 p.m. UTC | #5
On 08/09/2021 15.58, Vlastimil Babka wrote:
> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/09/2021 04.54, Andrew Morton wrote:
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>
>>> [...]
>>
>> Based on my microbench[0] measurement changing preempt_disable to
>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>> change, I just want people to keep this in mind.
> 
> OK, but this is not a fast path for every allocation/free, so it gets
> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
> cmpxchg to be expensive too?

Added tests for this:
  - this_cpu_cmpxchg cost: 5 cycles(tsc) 1.581 ns
  - cmpxchg          cost: 18 cycles(tsc) 5.006 ns

>> On my E5-1650 v4 @ 3.60GHz:
>>    - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>    - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>
>> Notice the non-save/restore variant is superfast:
>>    - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
> 
> It actually surprises me that it's that cheap, and would have expected
> changing the irq state would be the costly part, not the saving/restoring.
> Incidentally, would you know what's the cost of save+restore when the
> irqs are already disabled, so it's effectively a no-op?

The non-save variant simply translates into CLI and STI, which seems to 
be very fast.

The cost of save+restore when the irqs are already disabled is the same 
(did a quick test).
Cannot remember who told me, but (apparently) the expensive part is 
reading the CPU FLAGS.

I did a quick test with:

	/** Loop to measure **/
	for (i = 0; i < rec->loops; i++) {
		local_irq_save(flags);
		loops_cnt++;
		barrier();
		//local_irq_restore(flags);
		local_irq_enable();
	}

Doing a save + enable: This cost 21 cycles(tsc) 6.015 ns.
(Cost before was 22 cycles)

This confirms that reading the CPU FLAGS seems to be the expensive part.

--Jesper
Linus Torvalds Sept. 8, 2021, 4:31 p.m. UTC | #6
On Wed, Sep 8, 2021 at 9:11 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
> The non-save variant simply translates into CLI and STI, which seems to
> be very fast.

It will depend on the microarchitecture.

Happily:

> The cost of save+restore when the irqs are already disabled is the same
> (did a quick test).

The really expensive part used to be P4. 'popf' was hundreds of cycles
if any of the non-arithmetic bits changed, iirc.

P4 used to be a big headache just because of things like that -
straightforward code ran very well, but anything a bit more special
took forever because it flushed the pipeline.

So some of our optimizations may be historic because of things like
that. We don't really need to worry about the P4 glass jaws any more,
but it *used* to be much quicker to do 'preempt_disable()' that just
does an add to a memory location than it was to disable interrupts.

> Cannot remember who told me, but (apparently) the expensive part is
> reading the CPU FLAGS.

Again, it ends up being very dependent on the uarch.

Reading and writing the flags register is somewhat expensive because
it's not really "one" register in hardware any more (even if that was
obviously the historical implementation).

These days, the arithmetic flags are generally multiple renamed
registers, and then the other flags are a separate system register
(possibly multiple bits spread out).

The cost of doing those flag reads and writes are hard to really
specify, because in an OoO architecture a lot of it ends up being "how
much of that can be done in parallel, and what's the pipeline
serialization cost". Doing a loop with rdtsc is not necessarily AT ALL
indicative of the cost when there is other real code around it.

The cost _could_ be much smaller, in case there is little
serialization with normal other code. Or, it could be much bigger than
what a rdtsc shows, because if it's a hard pipeline flush, then a
tight loop with those things won't have any real work to flush, while
in "real code" there may be hundreds of instructions in flight and
doing the flush is very expensive.

The good news is that afaik, all the modern x86 CPU microarchitectures
do reasonably well. And while a "pushf/cli/popf" sequence is probably
more cycles than an add/subtract one in a benchmark, if the preempt
counter is not otherwise needed, and is cold in the cache, then the
pushf/cli/popf may be *much* cheaper than a cache miss.

So the only way to really tell would be to run real benchmarks of real
loads on multiple different microarchitectures.

I'm pretty sure the actual result is: "you can't measure the 10-cycle
difference on any modern core because it can actually go either way".

But "I'm pretty sure" and "reality" are not the same thing.

These days, pipeline flushes and cache misses (and then as a very
particularly bad case - cache line pingpong issues) are almost the
only thing that matters.

And the most common reason by far for the pipeline flushes are branch
mispredicts, but see above: the system bits in the flags register
_have_ been cause of them in the past, so it's not entirely
impossible.

               Linus
Jesper Dangaard Brouer Sept. 8, 2021, 5:14 p.m. UTC | #7
On 08/09/2021 16.59, David Hildenbrand wrote:
> On 08.09.21 16:55, David Hildenbrand wrote:
>> On 08.09.21 15:58, Vlastimil Babka wrote:
>>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs 
>>>>> instead of cmpxchg
>>>>>
>>>>> [...]
>>>>
>>>> Based on my microbench[0] measurement changing preempt_disable to
>>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>>> change, I just want people to keep this in mind.
>>>
>>> OK, but this is not a fast path for every allocation/free, so it gets
>>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
>>> cmpxchg to be expensive too?
>>>
>>>> On my E5-1650 v4 @ 3.60GHz:
>>>>     - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>>>     - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>>
>>>> Notice the non-save/restore variant is superfast:
>>>>     - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>>
>>> It actually surprises me that it's that cheap, and would have expected
>>> changing the irq state would be the costly part, not the 
>>> saving/restoring.
>>> Incidentally, would you know what's the cost of save+restore when the
>>> irqs are already disabled, so it's effectively a no-op?
>>
>> It surprises me as well. That would imply that protecting short RCU
>> sections using
>>
>> local_irq_disable
>> local_irq_enable
>>
>> instead of via
>>
>> preempt_disable
>> preempt_enable
>>
>> would actually be very beneficial.

Please don't draw this as a general conclusion.
As Linus describes in detail, the IRQ disable/enable will be very 
micro-arch specific.

The preempt_disable/enable will likely be more stable/consistent across 
micro-archs.
Keep an eye out for kernel config options when judging 
preempt_disable/enable performance [1]

[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367


>>
>> Are the numbers trustworthy? :)
>>
> 
> .. and especially did the benchmark consider side effects of 
> enabling/disabling interrupts (pipeline flushes etc ..)?
> 

Of course not, this is a microbenchmark... they are by definition not 
trustworthy :-P

-Jesper
David Hildenbrand Sept. 8, 2021, 5:24 p.m. UTC | #8
On 08.09.21 19:14, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/09/2021 16.59, David Hildenbrand wrote:
>> On 08.09.21 16:55, David Hildenbrand wrote:
>>> On 08.09.21 15:58, Vlastimil Babka wrote:
>>>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>>>
>>>>>
>>>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs
>>>>>> instead of cmpxchg
>>>>>>
>>>>>> [...]
>>>>>
>>>>> Based on my microbench[0] measurement changing preempt_disable to
>>>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>>>> change, I just want people to keep this in mind.
>>>>
>>>> OK, but this is not a fast path for every allocation/free, so it gets
>>>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
>>>> cmpxchg to be expensive too?
>>>>
>>>>> On my E5-1650 v4 @ 3.60GHz:
>>>>>      - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>>>>      - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>>>
>>>>> Notice the non-save/restore variant is superfast:
>>>>>      - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>>>
>>>> It actually surprises me that it's that cheap, and would have expected
>>>> changing the irq state would be the costly part, not the
>>>> saving/restoring.
>>>> Incidentally, would you know what's the cost of save+restore when the
>>>> irqs are already disabled, so it's effectively a no-op?
>>>
>>> It surprises me as well. That would imply that protecting short RCU
>>> sections using
>>>
>>> local_irq_disable
>>> local_irq_enable
>>>
>>> instead of via
>>>
>>> preempt_disable
>>> preempt_enable
>>>
>>> would actually be very beneficial.
> 
> Please don't draw this as a general conclusion.
> As Linus describes in detail, the IRQ disable/enable will be very
> micro-arch specific.

Sure, but especially for modern micro-archs, this might be very relevant.

I actually stumbled over this exact question a month ago, which is why 
your comment caught my attention. I looked for CLI/STI cycle numbers and 
didn't really find a trusted source. I only found [1], which made 
it look like incrementing/decrementing some counter would actually be 
much faster most of the time.

[1] https://www.agner.org/optimize/instruction_tables.pdf

> 
> The preempt_disable/enable will likely be more stable/consistent across
> micro-archs.
> Keep an eye out for kernel config options when judging
> preempt_disable/enable performance [1]
> 
> [1]
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367
> 
> 
>>>
>>> Are the numbers trustworthy? :)
>>>
>>
>> .. and especially did the benchmark consider side effects of
>> enabling/disabling interrupts (pipeline flushes etc ..)?
>>
> 
> Of course not, this is a microbenchmark... they are by definition not
> trustworthy :-P

:)

Patch

--- a/mm/slub.c~mm-slub-protect-put_cpu_partial-with-disabled-irqs-instead-of-cmpxchg
+++ a/mm/slub.c
@@ -2025,7 +2025,12 @@  static inline void *acquire_slab(struct
 	return freelist;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+#else
+static inline void put_cpu_partial(struct kmem_cache *s, struct page *page,
+				   int drain) { }
+#endif
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
 /*
@@ -2459,14 +2464,6 @@  static void unfreeze_partials_cpu(struct
 		__unfreeze_partials(s, partial_page);
 }
 
-#else	/* CONFIG_SLUB_CPU_PARTIAL */
-
-static inline void unfreeze_partials(struct kmem_cache *s) { }
-static inline void unfreeze_partials_cpu(struct kmem_cache *s,
-				  struct kmem_cache_cpu *c) { }
-
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
-
 /*
  * Put a page that was just frozen (in __slab_free|get_partial_node) into a
  * partial page slot if available.
@@ -2476,46 +2473,56 @@  static inline void unfreeze_partials_cpu
  */
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 {
-#ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *oldpage;
-	int pages;
-	int pobjects;
+	struct page *page_to_unfreeze = NULL;
+	unsigned long flags;
+	int pages = 0;
+	int pobjects = 0;
 
-	preempt_disable();
-	do {
-		pages = 0;
-		pobjects = 0;
-		oldpage = this_cpu_read(s->cpu_slab->partial);
+	local_irq_save(flags);
+
+	oldpage = this_cpu_read(s->cpu_slab->partial);
 
-		if (oldpage) {
+	if (oldpage) {
+		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
+			/*
+			 * Partial array is full. Move the existing set to the
+			 * per node partial list. Postpone the actual unfreezing
+			 * outside of the critical section.
+			 */
+			page_to_unfreeze = oldpage;
+			oldpage = NULL;
+		} else {
 			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
-			if (drain && pobjects > slub_cpu_partial(s)) {
-				/*
-				 * partial array is full. Move the existing
-				 * set to the per node partial list.
-				 */
-				unfreeze_partials(s);
-				oldpage = NULL;
-				pobjects = 0;
-				pages = 0;
-				stat(s, CPU_PARTIAL_DRAIN);
-			}
 		}
+	}
 
-		pages++;
-		pobjects += page->objects - page->inuse;
+	pages++;
+	pobjects += page->objects - page->inuse;
 
-		page->pages = pages;
-		page->pobjects = pobjects;
-		page->next = oldpage;
-
-	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
-								!= oldpage);
-	preempt_enable();
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+	page->pages = pages;
+	page->pobjects = pobjects;
+	page->next = oldpage;
+
+	this_cpu_write(s->cpu_slab->partial, page);
+
+	local_irq_restore(flags);
+
+	if (page_to_unfreeze) {
+		__unfreeze_partials(s, page_to_unfreeze);
+		stat(s, CPU_PARTIAL_DRAIN);
+	}
 }
 
+#else	/* CONFIG_SLUB_CPU_PARTIAL */
+
+static inline void unfreeze_partials(struct kmem_cache *s) { }
+static inline void unfreeze_partials_cpu(struct kmem_cache *s,
+				  struct kmem_cache_cpu *c) { }
+
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	unsigned long flags;