Message ID: 20220407093221.1090-1-jgross@suse.com (mailing list archive)
State: New
Series: mm, page_alloc: fix build_zonerefs_node()
On 07.04.22 11:32, Juergen Gross wrote:
> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> initialization") only zones with free memory are included in a built
> zonelist. This is problematic when e.g. all memory of a zone has been
> ballooned out.
>
> Use populated_zone() when building a zonelist as it has been done
> before that commit.
>
> Cc: stable@vger.kernel.org
> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdc8f60ae462..3d0662af3289 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> +		if (populated_zone(zone)) {
>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>  			check_highest_zone(zone_type);
>  		}

Let's see if we have to find another way to properly handle fadump.

Acked-by: David Hildenbrand <david@redhat.com>
On 07.04.22 11:46, David Hildenbrand wrote:
> On 07.04.22 11:32, Juergen Gross wrote:
>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>> initialization") only zones with free memory are included in a built
>> zonelist. This is problematic when e.g. all memory of a zone has been
>> ballooned out.
>>
>> Use populated_zone() when building a zonelist as it has been done
>> before that commit.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>  mm/page_alloc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index bdc8f60ae462..3d0662af3289 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> +		if (populated_zone(zone)) {
>>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>  			check_highest_zone(zone_type);
>>  		}
>
> Let's see if we have to find another way to properly handle fadump.

TBH, I don't think this should matter here. A zone can always happen to
have no free memory left, so only handling this case when building the
zonelist can't be the full solution of that problem.

It might trigger a problem more easily, though.

> Acked-by: David Hildenbrand <david@redhat.com>

Thanks,

Juergen
Ccing Mel

On Thu 07-04-22 11:32:21, Juergen Gross wrote:
> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> initialization") only zones with free memory are included in a built
> zonelist. This is problematic when e.g. all memory of a zone has been
> ballooned out.

What is the actual problem there?

> Use populated_zone() when building a zonelist as it has been done
> before that commit.
>
> Cc: stable@vger.kernel.org
> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")

Did you mean to refer to
6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with
pages managed by the buddy allocator")?

> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdc8f60ae462..3d0662af3289 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> +		if (populated_zone(zone)) {
>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>  			check_highest_zone(zone_type);
>  		}
> --
> 2.34.1
On 07.04.22 12:34, Michal Hocko wrote:
> Ccing Mel
>
> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>> initialization") only zones with free memory are included in a built
>> zonelist. This is problematic when e.g. all memory of a zone has been
>> ballooned out.
>
> What is the actual problem there?

When running as Xen guest new hotplugged memory will not be onlined
automatically, but only on special request. This is done in order to
support adding e.g. the possibility to use another GB of memory, while
adding only a part of that memory initially.

In case adding that memory is populating a new zone, the page allocator
won't be able to use this memory when it is onlined, as the zone wasn't
added to the zonelist, due to managed_zone() returning 0.

Note that a similar problem could occur in case the zonelists are
rebuilt and a zone happens to have all memory in use. That zone will
then be dropped from the rebuilt zonelists.

>> Use populated_zone() when building a zonelist as it has been done
>> before that commit.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
>
> Did you mean to refer to
> 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with
> pages managed by the buddy allocator")?

Yes. I really can't explain where this other reference is coming from.
I must have selected the wrong line from a git annotate output.

Juergen
On Thu 07-04-22 12:45:41, Juergen Gross wrote:
> On 07.04.22 12:34, Michal Hocko wrote:
>> Ccing Mel
>>
>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>> initialization") only zones with free memory are included in a built
>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>> ballooned out.
>>
>> What is the actual problem there?
>
> When running as Xen guest new hotplugged memory will not be onlined
> automatically, but only on special request. This is done in order to
> support adding e.g. the possibility to use another GB of memory, while
> adding only a part of that memory initially.
>
> In case adding that memory is populating a new zone, the page allocator
> won't be able to use this memory when it is onlined, as the zone wasn't
> added to the zonelist, due to managed_zone() returning 0.

How is that memory onlined? Because "regular" onlining (online_pages())
does rebuild zonelists if their zone hasn't been populated before.
On 07.04.22 13:07, Michal Hocko wrote:
> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>> On 07.04.22 12:34, Michal Hocko wrote:
>>> Ccing Mel
>>>
>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>> initialization") only zones with free memory are included in a built
>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>> ballooned out.
>>>
>>> What is the actual problem there?
>>
>> When running as Xen guest new hotplugged memory will not be onlined
>> automatically, but only on special request. This is done in order to
>> support adding e.g. the possibility to use another GB of memory, while
>> adding only a part of that memory initially.
>>
>> In case adding that memory is populating a new zone, the page allocator
>> won't be able to use this memory when it is onlined, as the zone wasn't
>> added to the zonelist, due to managed_zone() returning 0.
>
> How is that memory onlined? Because "regular" onlining (online_pages())
> does rebuild zonelists if their zone hasn't been populated before.

The Xen balloon driver has its own callback for onlining pages. The pages
are just added to the ballooned-out page list without handing them to the
allocator. This is done only when the guest is ballooned up.

So the problem is that a new zone is being populated, but it won't have
free pages when the zonelists are rebuilt.

Juergen
On Thu 07-04-22 13:17:19, Juergen Gross wrote:
> On 07.04.22 13:07, Michal Hocko wrote:
>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>> Ccing Mel
>>>>
>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>> initialization") only zones with free memory are included in a built
>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>> ballooned out.
>>>>
>>>> What is the actual problem there?
>>>
>>> When running as Xen guest new hotplugged memory will not be onlined
>>> automatically, but only on special request. This is done in order to
>>> support adding e.g. the possibility to use another GB of memory, while
>>> adding only a part of that memory initially.
>>>
>>> In case adding that memory is populating a new zone, the page allocator
>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>> added to the zonelist, due to managed_zone() returning 0.
>>
>> How is that memory onlined? Because "regular" onlining (online_pages())
>> does rebuild zonelists if their zone hasn't been populated before.
>
> The Xen balloon driver has its own callback for onlining pages. The pages
> are just added to the ballooned-out page list without handing them to the
> allocator. This is done only when the guest is ballooned up.

OK, I see. Let me just rephrase to see whether we are on the same page.
Xen is overriding the online_page_callback to xen_online_page which
doesn't free pages to the page allocator which means that a zone might
remain unpopulated after onlining. This means that the default zone
lists rebuild is not done and later on when those pages are finally
released to the allocator there is no build_all_zonelists happening so
those freed pages are not really visible to the allocator via zonelists
fallback allocation.

Now to your patch. I suspect this is not sufficient for the full hotplug
situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
will call build_all_zonelists but the zone is not populated yet at that
moment unless I am missing something. We do rely on online_pages to
rebuild once pages are onlined - which usually means they are freed to
the page allocator.

The zonelists building is kinda messy TBH. I have to say that I am not
really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and
reclaim from zones with pages managed by the buddy allocator") because
as you have said an unpopulated zone is not (or shouldn't be) really all
that different from a depleted zone.

I think a better and more complete fix would be the following. In other
words the zonelists will be built for all present zones. Not sure
whether that is going to break 6aa303defb74 though.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..880c455e2557 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		       struct zone *zone, struct memory_group *group)
 {
 	unsigned long flags;
-	int need_zonelists_rebuild = 0;
 	const int nid = zone_to_nid(zone);
 	int ret;
 	struct memory_notify arg;
@@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * This means the page allocator ignores this zone.
 	 * So, zonelist must be updated after online.
 	 */
-	if (!populated_zone(zone)) {
-		need_zonelists_rebuild = 1;
+	if (!populated_zone(zone))
 		setup_zone_pageset(zone);
-	}
 
 	online_pages_range(pfn, nr_pages);
 	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
 	node_states_set_node(nid, &arg);
-	if (need_zonelists_rebuild)
-		build_all_zonelists(NULL);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
 	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* reinitialise watermarks and update pcp limits */
 	init_per_zone_wmark_min();
 
-	if (!populated_zone(zone)) {
+	if (!populated_zone(zone))
 		zone_pcp_reset(zone);
-		build_all_zonelists(NULL);
-	}
 
 	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..130a2feceddc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (managed_zone(zone)) {
-			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
-			check_highest_zone(zone_type);
-		}
+		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
+		check_highest_zone(zone_type);
 	} while (zone_type);
 
 	return nr_zones;
On Thu 07-04-22 13:40:25, Michal Hocko wrote:
[...]
> Now to your patch. I suspect this is not sufficient for the full hotplug
> situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
> will call build_all_zonelists but the zone is not populated yet at that
> moment unless I am missing something. We do rely on online_pages to
> rebuild once pages are onlined - which usually means they are freed to
> the page allocator.

OK, I've managed to get lost in the code and misread the onlining part.
After re-reading the code I have concluded that the patch is good as is.
online_pages relies on populated_zone() so it will pass and zonelists
will be regenerated even without any pages freed to the allocator.
Sorry for the confusion.

But I guess this still proves my other point that the code is really
subtle and messy so I guess the less rebuilding we do the better.
There are two ways, go with your patch and do the clean up on top or
merge the two. That being said

Acked-by: Michal Hocko <mhocko@suse.com>

to your patch with an improved changelog to be more specific about the
underlying problem.

Thanks and sorry for the confusion.
On 07.04.22 13:40, Michal Hocko wrote:
> On Thu 07-04-22 13:17:19, Juergen Gross wrote:
>> On 07.04.22 13:07, Michal Hocko wrote:
>>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>>> Ccing Mel
>>>>>
>>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>>> initialization") only zones with free memory are included in a built
>>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>>> ballooned out.
>>>>>
>>>>> What is the actual problem there?
>>>>
>>>> When running as Xen guest new hotplugged memory will not be onlined
>>>> automatically, but only on special request. This is done in order to
>>>> support adding e.g. the possibility to use another GB of memory, while
>>>> adding only a part of that memory initially.
>>>>
>>>> In case adding that memory is populating a new zone, the page allocator
>>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>>> added to the zonelist, due to managed_zone() returning 0.
>>>
>>> How is that memory onlined? Because "regular" onlining (online_pages())
>>> does rebuild zonelists if their zone hasn't been populated before.
>>
>> The Xen balloon driver has its own callback for onlining pages. The pages
>> are just added to the ballooned-out page list without handing them to the
>> allocator. This is done only when the guest is ballooned up.
>
> OK, I see. Let me just rephrase to see whether we are on the same page.
> Xen is overriding the online_page_callback to xen_online_page which
> doesn't free pages to the page allocator which means that a zone might
> remain unpopulated after onlining. This means that the default zone
> lists rebuild is not done and later on when those pages are finally
> released to the allocator there is no build_all_zonelists happening so
> those freed pages are not really visible to the allocator via zonelists
> fallback allocation.
>
> Now to your patch. I suspect this is not sufficient for the full hotplug
> situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
> will call build_all_zonelists but the zone is not populated yet at that
> moment unless I am missing something. We do rely on online_pages to
> rebuild once pages are onlined - which usually means they are freed to
> the page allocator.
>
> The zonelists building is kinda messy TBH. I have to say that I am not
> really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and
> reclaim from zones with pages managed by the buddy allocator") because
> as you have said an unpopulated zone is not (or shouldn't be) really all
> that different from a depleted zone.
>
> I think a better and more complete fix would be the following. In other
> words the zonelists will be built for all present zones. Not sure
> whether that is going to break 6aa303defb74 though.
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..880c455e2557 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  		       struct zone *zone, struct memory_group *group)
>  {
>  	unsigned long flags;
> -	int need_zonelists_rebuild = 0;
>  	const int nid = zone_to_nid(zone);
>  	int ret;
>  	struct memory_notify arg;
> @@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	 * This means the page allocator ignores this zone.
>  	 * So, zonelist must be updated after online.
>  	 */
> -	if (!populated_zone(zone)) {
> -		need_zonelists_rebuild = 1;
> +	if (!populated_zone(zone))
>  		setup_zone_pageset(zone);
> -	}
>
>  	online_pages_range(pfn, nr_pages);
>  	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
>
>  	node_states_set_node(nid, &arg);
> -	if (need_zonelists_rebuild)
> -		build_all_zonelists(NULL);
>
>  	/* Basic onlining is complete, allow allocation of onlined pages. */
>  	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
> @@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	/* reinitialise watermarks and update pcp limits */
>  	init_per_zone_wmark_min();
>
> -	if (!populated_zone(zone)) {
> +	if (!populated_zone(zone))
>  		zone_pcp_reset(zone);
> -		build_all_zonelists(NULL);
> -	}
>
>  	node_states_clear_node(node, &arg);
>  	if (arg.status_change_nid >= 0) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..130a2feceddc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> -			check_highest_zone(zone_type);
> -		}
> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> +		check_highest_zone(zone_type);
>  	} while (zone_type);
>
>  	return nr_zones;

I don't think having !populated zones in the zonelist is a particularly
good idea. Populated vs !populated changes only during page
onlining/offlining.

If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...

I'd vote for going with the simple fix first, which should be good
enough AFAICT.
On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
[...]
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3589febc6d31..130a2feceddc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>> -			check_highest_zone(zone_type);
>> -		}
>> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>> +		check_highest_zone(zone_type);
>>  	} while (zone_type);
>>
>>  	return nr_zones;
>
> I don't think having !populated zones in the zonelist is a particularly
> good idea. Populated vs !populated changes only during page
> onlining/offlining.
>
> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...

What kind of problem would that cause? The allocator wouldn't see any
pages at all so it would fall back to the next one. Maybe kswapd would
need some tweak to have a bail out condition but as mentioned in the
thread already, !populated or !managed for that matter are not all that
much different from completely depleted zones. The fact that we are
making that distinction has led to some bugs and I suspect it makes the
code more complex without a very good reason.

> I'd vote for going with the simple fix first, which should be good
> enough AFAICT.

yes, see the other reply
On 07.04.22 14:04, Michal Hocko wrote:
> On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
> [...]
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3589febc6d31..130a2feceddc 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>>  	do {
>>>  		zone_type--;
>>>  		zone = pgdat->node_zones + zone_type;
>>> -		if (managed_zone(zone)) {
>>> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>> -			check_highest_zone(zone_type);
>>> -		}
>>> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>> +		check_highest_zone(zone_type);
>>>  	} while (zone_type);
>>>
>>>  	return nr_zones;
>>
>> I don't think having !populated zones in the zonelist is a particularly
>> good idea. Populated vs !populated changes only during page
>> onlining/offlining.
>>
>> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
>
> What kind of problem would that cause? The allocator wouldn't see any
> pages at all so it would fall back to the next one. Maybe kswapd would
> need some tweak to have a bail out condition but as mentioned in the
> thread already, !populated or !managed for that matter are not all that
> much different from completely depleted zones. The fact that we are
> making that distinction has led to some bugs and I suspect it makes the
> code more complex without a very good reason.

I assume performance problems. Assume you have an ordinary system with
multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
always remain empty to be traversed on each and every allocation
fallback. Of course, we could measure, but IMHO at least *that* part of
memory onlining/offlining is not the complicated part :D

Populated vs. !populated is under pretty good control via page
onlining/offlining. We have to be careful with "managed pages", because
that's a moving target, especially with memory ballooning. And I assume
that's the bigger source of bugs.

>> I'd vote for going with the simple fix first, which should be good
>> enough AFAICT.
>
> yes, see the other reply

I think we were composing almost simultaneously :)
On Thu, Apr 07, 2022 at 01:17:19PM +0200, Juergen Gross wrote:
> On 07.04.22 13:07, Michal Hocko wrote:
>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>> Ccing Mel
>>>>
>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>> initialization") only zones with free memory are included in a built
>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>> ballooned out.
>>>>
>>>> What is the actual problem there?
>>>
>>> When running as Xen guest new hotplugged memory will not be onlined
>>> automatically, but only on special request. This is done in order to
>>> support adding e.g. the possibility to use another GB of memory, while
>>> adding only a part of that memory initially.
>>>
>>> In case adding that memory is populating a new zone, the page allocator
>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>> added to the zonelist, due to managed_zone() returning 0.
>>
>> How is that memory onlined? Because "regular" onlining (online_pages())
>> does rebuild zonelists if their zone hasn't been populated before.
>
> The Xen balloon driver has its own callback for onlining pages. The pages
> are just added to the ballooned-out page list without handing them to the
> allocator. This is done only when the guest is ballooned up.

Is this new behaviour? I ask because keeping !managed_zones out of the
zonelist and reclaim paths and the behaviour makes sense. Elsewhere you
state "zone can always happen to have no free memory left" and this is true
but it's usually a transient event. The difference between a populated
vs managed zone is usually a permanent event where no memory will ever be
placed on the buddy lists because the memory was reserved early in boot
or a similar reason. The patch is probably harmless but it has the
potential to waste CPUs allocating or reclaiming from zones that will
never succeed.
On 07.04.22 14:32, Mel Gorman wrote:
> On Thu, Apr 07, 2022 at 01:17:19PM +0200, Juergen Gross wrote:
>> On 07.04.22 13:07, Michal Hocko wrote:
>>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>>> Ccing Mel
>>>>>
>>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>>> initialization") only zones with free memory are included in a built
>>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>>> ballooned out.
>>>>>
>>>>> What is the actual problem there?
>>>>
>>>> When running as Xen guest new hotplugged memory will not be onlined
>>>> automatically, but only on special request. This is done in order to
>>>> support adding e.g. the possibility to use another GB of memory, while
>>>> adding only a part of that memory initially.
>>>>
>>>> In case adding that memory is populating a new zone, the page allocator
>>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>>> added to the zonelist, due to managed_zone() returning 0.
>>>
>>> How is that memory onlined? Because "regular" onlining (online_pages())
>>> does rebuild zonelists if their zone hasn't been populated before.
>>
>> The Xen balloon driver has its own callback for onlining pages. The pages
>> are just added to the ballooned-out page list without handing them to the
>> allocator. This is done only when the guest is ballooned up.
>
> Is this new behaviour? I ask because keeping !managed_zones out of the

For some time (since kernel 5.9) Xen is using the zone device functionality
with memremap_pages() and pgmap->type = MEMORY_DEVICE_GENERIC.

> zonelist and reclaim paths and the behaviour makes sense. Elsewhere you
> state "zone can always happen to have no free memory left" and this is true
> but it's usually a transient event. The difference between a populated

And if this "transient event" is just happening when the zonelists are
being rebuilt the zone will be off the lists maybe forever.

> vs managed zone is usually a permanent event where no memory will ever be
> placed on the buddy lists because the memory was reserved early in boot
> or a similar reason. The patch is probably harmless but it has the
> potential to waste CPUs allocating or reclaiming from zones that will
> never succeed.

I'd recommend having an explicit per-zone flag for this case if you
really care about that. This would be much cleaner than inferring, from
no free page being present at a specific point in time, that the zone
will never be subject to memory allocation.

Juergen
On Thu 07-04-22 14:12:38, David Hildenbrand wrote:
> On 07.04.22 14:04, Michal Hocko wrote:
>> On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
>> [...]
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 3589febc6d31..130a2feceddc 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>>>  	do {
>>>>  		zone_type--;
>>>>  		zone = pgdat->node_zones + zone_type;
>>>> -		if (managed_zone(zone)) {
>>>> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>>> -			check_highest_zone(zone_type);
>>>> -		}
>>>> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>>> +		check_highest_zone(zone_type);
>>>>  	} while (zone_type);
>>>>
>>>>  	return nr_zones;
>>>
>>> I don't think having !populated zones in the zonelist is a particularly
>>> good idea. Populated vs !populated changes only during page
>>> onlining/offlining.
>>>
>>> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
>>
>> What kind of problem would that cause? The allocator wouldn't see any
>> pages at all so it would fall back to the next one. Maybe kswapd would
>> need some tweak to have a bail out condition but as mentioned in the
>> thread already, !populated or !managed for that matter are not all that
>> much different from completely depleted zones. The fact that we are
>> making that distinction has led to some bugs and I suspect it makes the
>> code more complex without a very good reason.
>
> I assume performance problems. Assume you have an ordinary system with
> multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
> ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
> always remain empty to be traversed on each and every allocation
> fallback. Of course, we could measure, but IMHO at least *that* part of
> memory onlining/offlining is not the complicated part :D

You've got a good point here. I guess there will be use cases that
really benefit from every single CPU cycle spared in the allocator hot
path which really depends on the zonelists traversing.

I have very briefly had a look at our remaining usage of managed_zone()
and there are not that many left.

We have 2 in page_alloc.c via has_managed_dma(). I guess we could drop
that one and use __GFP_NOWARN in dma_atomic_pool_init but this is
nothing really earth shattering.

The remaining occurrences are in vmscan.c:
- should_continue_reclaim, pgdat_balanced - required because they
  iterate all zones within the zoneindex and need to decide whether
  they are balanced or not. We can make empty zones special case and
  make them always balanced
- kswapd_shrink_node - required because we would be increasing reclaim
  target for empty zones
- update_reclaim_active - required because we do not want to alter zone
  state if it is not a subject of the reclaim which empty zones are not
  by definition.
- balance_pgdat - first check is likely a micro-optimization,
  reclaim_idx is needed to have a populated zone there
- wakeup_kswapd - I dunno
- shrink_node, allow_direct_reclaim, lruvec_lru_size - micro-optimizations
- pgdat_watermark_boosted - micro-optimization I suspect as empty zone
  shouldn't ever get watermark_boost
- pgdat_balanced - functionally dd

So we can get rid of quite some but we will still need some of them.
On Thu, Apr 07, 2022 at 11:46:13AM +0200, David Hildenbrand wrote:
> On 07.04.22 11:32, Juergen Gross wrote:
>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>> initialization") only zones with free memory are included in a built
>> zonelist. This is problematic when e.g. all memory of a zone has been
>> ballooned out.
>>
>> Use populated_zone() when building a zonelist as it has been done
>> before that commit.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>  mm/page_alloc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index bdc8f60ae462..3d0662af3289 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> +		if (populated_zone(zone)) {
>>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>  			check_highest_zone(zone_type);
>>  		}
>
> Let's see if we have to find another way to properly handle fadump.
>
> Acked-by: David Hildenbrand <david@redhat.com>

Ok, I see the point.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

> --
> Thanks,
>
> David / dhildenb
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bdc8f60ae462..3d0662af3289 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (managed_zone(zone)) {
+		if (populated_zone(zone)) {
 			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
initialization") only zones with free memory are included in a built
zonelist. This is problematic when e.g. all memory of a zone has been
ballooned out.

Use populated_zone() when building a zonelist as it has been done
before that commit.

Cc: stable@vger.kernel.org
Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)