[v2,resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

Message ID 20210412120842.GY3697@techsingularity.net (mailing list archive)
State New, archived
Series [v2,resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

Commit Message

Mel Gorman April 12, 2021, 12:08 p.m. UTC
zone_pcp_reset allegedly protects against a race with drain_pages
using local_irq_save but this is bogus. local_irq_save only operates
on the local CPU. If memory hotplug is running on CPU A and drain_pages
is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
offers no protection.
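
To illustrate (a sketch of the interleaving, not kernel source):

/*
 * local_irq_save() masks interrupts on the calling CPU only, so it
 * cannot exclude another CPU from touching the pagesets:
 *
 *   CPU A (zone_pcp_reset)             CPU B (drain_pages)
 *   ----------------------             -------------------
 *   local_irq_save(flags);             IRQs on CPU B stay enabled
 *   free_percpu(zone->pageset);        pset = per_cpu_ptr(zone->pageset, cpu);
 *   zone->pageset = &boot_pageset;     ...drains via a freed pset...
 *   local_irq_restore(flags);          nothing ever excluded CPU B
 */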

This patch deletes IRQ disable/enable on the grounds that IRQs protect
nothing and assumes the existing hotplug paths guarantee the PCP cannot be
used after zone_pcp_enable(). That should be the case already because all
the pages have been freed and there is no page to put on the PCP lists.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
Resending for email address correction and adding lists

Changelog since v1
o Minimal fix

 mm/page_alloc.c | 4 ----
 1 file changed, 4 deletions(-)

Comments

David Hildenbrand April 12, 2021, 12:12 p.m. UTC | #1
On 12.04.21 14:08, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
> 
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> Resending for email address correction and adding lists
> 
> Changelog since v1
> o Minimal fix
> 
>   mm/page_alloc.c | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>   
>   void zone_pcp_reset(struct zone *zone)
>   {
> -	unsigned long flags;
>   	int cpu;
>   	struct per_cpu_pageset *pset;
>   
> -	/* avoid races with drain_pages()  */
> -	local_irq_save(flags);
>   	if (zone->pageset != &boot_pageset) {
>   		for_each_online_cpu(cpu) {
>   			pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>   		free_percpu(zone->pageset);
>   		zone->pageset = &boot_pageset;
>   	}
> -	local_irq_restore(flags);
>   }
>   
>   #ifdef CONFIG_MEMORY_HOTREMOVE
> 

This was originally introduced by 340175b7d14d ("mm/hotplug: free 
zone->pageset when a zone becomes empty").

I wonder why it was ever added. Could it be that drain_pages() could
have been called from an interrupt handler (e.g., on concurrent CPU
hotunplug of the current CPU?) back then, while we were just about to
remove it ourselves?

Anyhow, if we need some protection after setting the zone unpopulated 
(setting present_pages = 0), it doesn't seem like disabling local IRQs 
is the right thing to do.

Reviewed-by: David Hildenbrand <david@redhat.com>
Vlastimil Babka April 12, 2021, 12:40 p.m. UTC | #2
On 4/12/21 2:08 PM, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
> 
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Yeah the irq disabling here is clearly bogus, so:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

But I think Michal has a point that we might best leave the pagesets around,
via a future change. I have some doubts that even with your reordering of the
reset/destroy after zonelist rebuild in v1 they can't be reachable. We have no
protection between zonelist rebuild and zonelist traversal, and that's why we
just leave pgdats around.

So I can imagine a task racing with memory hotremove might see watermarks as ok
in get_page_from_freelist() for the zone and proceed to try_this_zone:, then
get stalled/scheduled out while hotremove rebuilds the zonelist and destroys
the pcplists; the first task is then resumed and proceeds with rmqueue_pcplist().
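
As a timeline (a sketch; labels as in current page_alloc.c):

/*
 *   allocating task                      memory hotremove
 *   ---------------                      ----------------
 *   get_page_from_freelist()
 *     zone watermark check passes
 *     goto try_this_zone
 *       <stalled / scheduled out>        rebuild zonelists
 *                                        zone_pcp_reset()
 *                                          free_percpu(zone->pageset)
 *   rmqueue_pcplist()
 *     this_cpu_ptr(zone->pageset)        <- pageset already freed
 */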

So that's very rare and thus not urgent, and this patch doesn't make it less
rare, so it's not a reason to block it.

> ---
> Resending for email address correction and adding lists
> 
> Changelog since v1
> o Minimal fix
> 
>  mm/page_alloc.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>  
>  void zone_pcp_reset(struct zone *zone)
>  {
> -	unsigned long flags;
>  	int cpu;
>  	struct per_cpu_pageset *pset;
>  
> -	/* avoid races with drain_pages()  */
> -	local_irq_save(flags);
>  	if (zone->pageset != &boot_pageset) {
>  		for_each_online_cpu(cpu) {
>  			pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>  		free_percpu(zone->pageset);
>  		zone->pageset = &boot_pageset;
>  	}
> -	local_irq_restore(flags);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>
Mel Gorman April 12, 2021, 2:08 p.m. UTC | #3
On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
> On 4/12/21 2:08 PM, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> > 
> > This patch deletes IRQ disable/enable on the grounds that IRQs protect
> > nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> > used after zone_pcp_enable(). That should be the case already because all
> > the pages have been freed and there is no page to put on the PCP lists.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Yeah the irq disabling here is clearly bogus, so:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 

Thanks!

> But I think Michal has a point that we might best leave the pagesets around,
> via a future change. I have some doubts that even with your reordering of the
> reset/destroy after zonelist rebuild in v1 they can't be reachable. We have no
> protection between zonelist rebuild and zonelist traversal, and that's why we
> just leave pgdats around.
> 
> So I can imagine a task racing with memory hotremove might see watermarks as ok
> in get_page_from_freelist() for the zone and proceed to try_this_zone:, then
> get stalled/scheduled out while hotremove rebuilds the zonelist and destroys
> the pcplists; the first task is then resumed and proceeds with rmqueue_pcplist().
> 
> So that's very rare and thus not urgent, and this patch doesn't make it less
> rare, so it's not a reason to block it.
> 

After v1 of the patch, the race was reduced to the point between the
zone watermark check and the rmqueue_pcplist but yes, it still existed.
Closing it completely was either complex or expensive. Setting
zone->pageset = &boot_pageset before the free would shrink the race
further but that still leaves a potential memory ordering issue.
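
Roughly this (a hypothetical variant for illustration, not what this
patch does):

/* Hypothetical: publish the always-valid boot pageset before freeing
 * the per-cpu pagesets. This shrinks the window but does not close it. */
void zone_pcp_reset(struct zone *zone)
{
        struct per_cpu_pageset *old = zone->pageset;
        int cpu;

        if (old == &boot_pageset)
                return;

        for_each_online_cpu(cpu)
                drain_zonestat(zone, per_cpu_ptr(old, cpu));

        /* New references now land on boot_pageset... */
        zone->pageset = &boot_pageset;

        /* ...but without a write barrier here and a matching read
         * barrier in the allocator fast path, another CPU may still
         * load and use the old pointer after free_percpu() below. */
        free_percpu(old);
}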

While fixable, it's either complex, expensive or both, so yes, just leaving
the pageset structures in place would be much more straightforward,
assuming the structures were not allocated in the zone that is being
hot-removed. As things stand, I had trouble even testing zone hot-remove
as there were always a few pages left behind and I did not chase down
why. The focus was getting rid of the bogus local_irq_save() because
it was clearly wrong, offered a false sense of safety, and was the last
problematic local_irq_save() user in page_alloc.c once local_lock is used
to protect the PCP structures.
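
For reference, the local_lock conversion has roughly this shape (a sketch
of the pattern only; the struct and function names are illustrative):

#include <linux/local_lock.h>

struct pagesets {
        local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
        .lock = INIT_LOCAL_LOCK(lock),
};

static void free_unref_page_sketch(struct page *page)
{
        unsigned long flags;

        /* On !PREEMPT_RT this still disables interrupts locally, but it
         * documents exactly what is protected and maps to a per-CPU
         * spinlock on PREEMPT_RT instead of disabling IRQs. */
        local_lock_irqsave(&pagesets.lock, flags);
        /* ... place the page on this CPU's PCP list ... */
        local_unlock_irqrestore(&pagesets.lock, flags);
}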
David Hildenbrand April 12, 2021, 2:12 p.m. UTC | #4
On 12.04.21 16:08, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
>> On 4/12/21 2:08 PM, Mel Gorman wrote:
>>> zone_pcp_reset allegedly protects against a race with drain_pages
>>> using local_irq_save but this is bogus. local_irq_save only operates
>>> on the local CPU. If memory hotplug is running on CPU A and drain_pages
>>> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
>>> offers no protection.
>>>
>>> This patch deletes IRQ disable/enable on the grounds that IRQs protect
>>> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
>>> used after zone_pcp_enable(). That should be the case already because all
>>> the pages have been freed and there is no page to put on the PCP lists.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>
>> Yeah the irq disabling here is clearly bogus, so:
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
> 
> Thanks!
> 
>> But I think Michal has a point that we might best leave the pagesets around,
>> via a future change. I have some doubts that even with your reordering of the
>> reset/destroy after zonelist rebuild in v1 they can't be reachable. We have no
>> protection between zonelist rebuild and zonelist traversal, and that's why we
>> just leave pgdats around.
>>
>> So I can imagine a task racing with memory hotremove might see watermarks as ok
>> in get_page_from_freelist() for the zone and proceed to try_this_zone:, then
>> get stalled/scheduled out while hotremove rebuilds the zonelist and destroys
>> the pcplists; the first task is then resumed and proceeds with rmqueue_pcplist().
>>
>> So that's very rare and thus not urgent, and this patch doesn't make it less
>> rare, so it's not a reason to block it.
>>
> 
> After v1 of the patch, the race was reduced to the point between the
> zone watermark check and the rmqueue_pcplist but yes, it still existed.
> Closing it completely was either complex or expensive. Setting
> zone->pageset = &boot_pageset before the free would shrink the race
> further but that still leaves a potential memory ordering issue.
> 
> While fixable, it's either complex, expensive or both, so yes, just leaving
> the pageset structures in place would be much more straightforward,
> assuming the structures were not allocated in the zone that is being
> hot-removed. As things stand, I had trouble even testing zone hot-remove
> as there were always a few pages left behind and I did not chase down
> why.

Can you elaborate? I can reliably trigger zone present pages going to 0
by just hotplugging a DIMM, onlining the memory block devices to the 
MOVABLE zone, followed by offlining the memory block again.
Mel Gorman April 12, 2021, 3:27 p.m. UTC | #5
On Mon, Apr 12, 2021 at 04:12:11PM +0200, David Hildenbrand wrote:
> > After v1 of the patch, the race was reduced to the point between the
> > zone watermark check and the rmqueue_pcplist but yes, it still existed.
> > Closing it completely was either complex or expensive. Setting
> > zone->pageset = &boot_pageset before the free would shrink the race
> > further but that still leaves a potential memory ordering issue.
> > 
> > While fixable, it's either complex, expensive or both, so yes, just leaving
> > the pageset structures in place would be much more straightforward,
> > assuming the structures were not allocated in the zone that is being
> > hot-removed. As things stand, I had trouble even testing zone hot-remove
> > as there were always a few pages left behind and I did not chase down
> > why.
>
> Can you elaborate? I can reliably trigger zone present pages going to 0 by
> just hotplugging a DIMM, onlining the memory block devices to the MOVABLE
> zone, followed by offlining the memory block again.
> 

For the machine I was testing on, I tried offlining all memory within
a zone on a NUMA machine. Even if I used movable_zone to create a zone
or numa=fake to create multiple fake nodes and zones, there were always
either reserved or pinned pages preventing the full zone from being removed.
David Hildenbrand April 12, 2021, 4:02 p.m. UTC | #6
On 12.04.21 17:27, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 04:12:11PM +0200, David Hildenbrand wrote:
>>> After v1 of the patch, the race was reduced to the point between the
>>> zone watermark check and the rmqueue_pcplist but yes, it still existed.
>>> Closing it completely was either complex or expensive. Setting
>>> zone->pageset = &boot_pageset before the free would shrink the race
>>> further but that still leaves a potential memory ordering issue.
>>>
>>> While fixable, it's either complex, expensive or both, so yes, just leaving
>>> the pageset structures in place would be much more straightforward,
>>> assuming the structures were not allocated in the zone that is being
>>> hot-removed. As things stand, I had trouble even testing zone hot-remove
>>> as there were always a few pages left behind and I did not chase down
>>> why.
>>
>> Can you elaborate? I can reliably trigger zone present pages going to 0 by
>> just hotplugging a DIMM, onlining the memory block devices to the MOVABLE
>> zone, followed by offlining the memory block again.
>>
> 
> For the machine I was testing on, I tried offlining all memory within
> a zone on a NUMA machine. Even if I used movable_zone to create a zone
> or numa=fake to create multiple fake nodes and zones, there were always
> either reserved or pinned pages preventing the full zone from being removed.

What can happen is that memblock allocations are still placed into the 
MOVABLE zone -- even with "movablenode" IIRC.

Memory hot(un)plug is usually best tested in QEMU via pc-dimm devices.
Michal Hocko April 13, 2021, 6:40 a.m. UTC | #7
On Mon 12-04-21 13:08:42, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
> 
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.

Yes, that is the case since ec6e8c7e0314 ("mm, page_alloc: disable
pcplists during memory offline"). Prior to this commit the behavior was
undefined but full zone/node hotremove is rare enough that an existing
race was likely never observed.
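
For completeness, the disable/enable pair from that commit looks roughly
like this (simplified from memory; see the commit for the real code):

void zone_pcp_disable(struct zone *zone)
{
        mutex_lock(&pcp_batch_high_lock);
        /* high = 0, batch = 1: freed pages are flushed to the buddy
         * allocator immediately instead of accumulating on PCP lists */
        __zone_set_pageset_high_and_batch(zone, 0, 1);
        /* force-drain whatever is still sitting on the per-cpu lists */
        __drain_all_pages(zone, true);
}

void zone_pcp_enable(struct zone *zone)
{
        __zone_set_pageset_high_and_batch(zone, zone->pageset_high,
                                          zone->pageset_batch);
        mutex_unlock(&pcp_batch_high_lock);
}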

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> Resending for email address correction and adding lists
> 
> Changelog since v1
> o Minimal fix
> 
>  mm/page_alloc.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5e8aedb64b57..9bf0db982f14 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8952,12 +8952,9 @@ void zone_pcp_enable(struct zone *zone)
>  
>  void zone_pcp_reset(struct zone *zone)
>  {
> -	unsigned long flags;
>  	int cpu;
>  	struct per_cpu_pageset *pset;
>  
> -	/* avoid races with drain_pages()  */
> -	local_irq_save(flags);
>  	if (zone->pageset != &boot_pageset) {
>  		for_each_online_cpu(cpu) {
>  			pset = per_cpu_ptr(zone->pageset, cpu);
> @@ -8966,7 +8963,6 @@ void zone_pcp_reset(struct zone *zone)
>  		free_percpu(zone->pageset);
>  		zone->pageset = &boot_pageset;
>  	}
> -	local_irq_restore(flags);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
Michal Hocko April 13, 2021, 6:44 a.m. UTC | #8
On Mon 12-04-21 14:40:18, Vlastimil Babka wrote:
> On 4/12/21 2:08 PM, Mel Gorman wrote:
> > zone_pcp_reset allegedly protects against a race with drain_pages
> > using local_irq_save but this is bogus. local_irq_save only operates
> > on the local CPU. If memory hotplug is running on CPU A and drain_pages
> > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> > offers no protection.
> > 
> > This patch deletes IRQ disable/enable on the grounds that IRQs protect
> > nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> > used after zone_pcp_enable(). That should be the case already because all
> > the pages have been freed and there is no page to put on the PCP lists.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Yeah the irq disabling here is clearly bogus, so:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> But I think Michal has a point that we might best leave the pagesets around,
> via a future change. I have some doubts that even with your reordering of the
> reset/destroy after zonelist rebuild in v1 they can't be reachable. We have no
> protection between zonelist rebuild and zonelist traversal, and that's why we
> just leave pgdats around.
> 
> So I can imagine a task racing with memory hotremove might see watermarks as ok
> in get_page_from_freelist() for the zone and proceed to try_this_zone:, then
> get stalled/scheduled out while hotremove rebuilds the zonelist and destroys
> the pcplists; the first task is then resumed and proceeds with rmqueue_pcplist().
> 
> So that's very rare and thus not urgent, and this patch doesn't make it less
> rare, so it's not a reason to block it.

Completely agreed here. Not an urgent thing to work on but something to
look into long term.
Vlastimil Babka April 13, 2021, 9:36 a.m. UTC | #9
On 4/12/21 4:08 PM, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
>> On 4/12/21 2:08 PM, Mel Gorman wrote:
>
> the pageset structures in place would be much more straightforward,
> assuming the structures were not allocated in the zone that is being
> hot-removed.

I would expect this is not possible, at least for ZONE_MOVABLE, as the percpu
allocations should be GFP_KERNEL. And it's not realistic to expect offlining to
succeed at all without using ZONE_MOVABLE.

AFAIK even Oscar's work on using the node to self-contain its own structures is
only applicable to struct pages, not percpu allocations?
Mel Gorman April 13, 2021, 9:49 a.m. UTC | #10
On Tue, Apr 13, 2021 at 11:36:08AM +0200, Vlastimil Babka wrote:
> On 4/12/21 4:08 PM, Mel Gorman wrote:
> > On Mon, Apr 12, 2021 at 02:40:18PM +0200, Vlastimil Babka wrote:
> >> On 4/12/21 2:08 PM, Mel Gorman wrote:
> >
> > the pageset structures in place would be much more straight-forward
> > assuming the structures were not allocated in the zone that is being
> > hot-removed.
> 
> I would expect this is not possible, at least for ZONE_MOVABLE, as the percpu
> allocations should be GFP_KERNEL.

True.

> And it's not realistic to expect offlining to
> succeed at all without using ZONE_MOVABLE.
> 
> AFAIK even Oscar's work on using the node to self-contain its own structures is
> only applicable to struct pages, not percpu allocations?

That I don't know as I didn't check, although in general it would be
somewhat unfortunate if per-cpu structures were remote. It wouldn't be
critical given that they'll be in cache assuming the per-cpu structures
are not straddling cache lines.
Michal Hocko April 13, 2021, 10:16 a.m. UTC | #11
On Tue 13-04-21 11:36:08, Vlastimil Babka wrote:
[...]
> AFAIK even Oscar's work on using the node to self-contain its own structures is
> only applicable to struct pages, not percpu allocations?

Correct. Teaching pcp storage to live in the movable zone sounds like a large
undertaking to me. Not sure this is worth it TBH. Even the idea of
synchronizing pcp access with memory hotplug makes for a decent headache.
Oscar Salvador April 14, 2021, 7:18 a.m. UTC | #12
On Mon, Apr 12, 2021 at 01:08:42PM +0100, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
> 
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantee the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5e8aedb64b57..9bf0db982f14 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8952,12 +8952,9 @@  void zone_pcp_enable(struct zone *zone)
 
 void zone_pcp_reset(struct zone *zone)
 {
-	unsigned long flags;
 	int cpu;
 	struct per_cpu_pageset *pset;
 
-	/* avoid races with drain_pages()  */
-	local_irq_save(flags);
 	if (zone->pageset != &boot_pageset) {
 		for_each_online_cpu(cpu) {
 			pset = per_cpu_ptr(zone->pageset, cpu);
@@ -8966,7 +8963,6 @@  void zone_pcp_reset(struct zone *zone)
 		free_percpu(zone->pageset);
 		zone->pageset = &boot_pageset;
 	}
-	local_irq_restore(flags);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE