Message ID: 20220407093221.1090-1-jgross@suse.com (mailing list archive)
State: New
Series: mm, page_alloc: fix build_zonerefs_node()
On 07.04.22 11:32, Juergen Gross wrote:
> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> initialization") only zones with free memory are included in a built
> zonelist. This is problematic when e.g. all memory of a zone has been
> ballooned out.
>
> Use populated_zone() when building a zonelist as it has been done
> before that commit.
>
> Cc: stable@vger.kernel.org
> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdc8f60ae462..3d0662af3289 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> +		if (populated_zone(zone)) {
>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>  			check_highest_zone(zone_type);
>  		}

Let's see if we have to find another way to properly handle fadump.

Acked-by: David Hildenbrand <david@redhat.com>
On 07.04.22 11:46, David Hildenbrand wrote:
> On 07.04.22 11:32, Juergen Gross wrote:
>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>> initialization") only zones with free memory are included in a built
>> zonelist. This is problematic when e.g. all memory of a zone has been
>> ballooned out.
>>
>> Use populated_zone() when building a zonelist as it has been done
>> before that commit.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>  mm/page_alloc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index bdc8f60ae462..3d0662af3289 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> +		if (populated_zone(zone)) {
>>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>  			check_highest_zone(zone_type);
>>  		}
>
> Let's see if we have to find another way to properly handle fadump.

TBH, I don't think this should matter here. A zone can always happen to
have no free memory left, so only handling this case when building the
zonelist can't be the full solution of that problem.

It might trigger a problem more easily, though.

> Acked-by: David Hildenbrand <david@redhat.com>

Thanks,

Juergen
Ccing Mel

On Thu 07-04-22 11:32:21, Juergen Gross wrote:
> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> initialization") only zones with free memory are included in a built
> zonelist. This is problematic when e.g. all memory of a zone has been
> ballooned out.

What is the actual problem there?

> Use populated_zone() when building a zonelist as it has been done
> before that commit.
>
> Cc: stable@vger.kernel.org
> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")

Did you mean to refer to
6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with
pages managed by the buddy allocator")?

> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdc8f60ae462..3d0662af3289 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> +		if (populated_zone(zone)) {
>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>  			check_highest_zone(zone_type);
>  		}
> --
> 2.34.1
On 07.04.22 12:34, Michal Hocko wrote:
> Ccing Mel
>
> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>> initialization") only zones with free memory are included in a built
>> zonelist. This is problematic when e.g. all memory of a zone has been
>> ballooned out.
>
> What is the actual problem there?

When running as Xen guest new hotplugged memory will not be onlined
automatically, but only on special request. This is done in order to
support adding e.g. the possibility to use another GB of memory, while
adding only a part of that memory initially.

In case adding that memory is populating a new zone, the page allocator
won't be able to use this memory when it is onlined, as the zone wasn't
added to the zonelist, due to managed_zone() returning 0.

Note that a similar problem could occur in case the zonelists are
rebuilt and a zone happens to have all memory in use. That zone will
then be dropped from the rebuilt zonelists.

>> Use populated_zone() when building a zonelist as it has been done
>> before that commit.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
>
> Did you mean to refer to
> 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with
> pages managed by the buddy allocator")?

Yes. I really can't explain where this other reference is coming from.
I must have selected the wrong line from a git annotate output.

Juergen
On Thu 07-04-22 12:45:41, Juergen Gross wrote:
> On 07.04.22 12:34, Michal Hocko wrote:
>> Ccing Mel
>>
>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>> initialization") only zones with free memory are included in a built
>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>> ballooned out.
>>
>> What is the actual problem there?
>
> When running as Xen guest new hotplugged memory will not be onlined
> automatically, but only on special request. This is done in order to
> support adding e.g. the possibility to use another GB of memory, while
> adding only a part of that memory initially.
>
> In case adding that memory is populating a new zone, the page allocator
> won't be able to use this memory when it is onlined, as the zone wasn't
> added to the zonelist, due to managed_zone() returning 0.

How is that memory onlined? Because "regular" onlining (online_pages())
does rebuild zonelists if their zone hasn't been populated before.
On 07.04.22 13:07, Michal Hocko wrote:
> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>> On 07.04.22 12:34, Michal Hocko wrote:
>>> Ccing Mel
>>>
>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>> initialization") only zones with free memory are included in a built
>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>> ballooned out.
>>>
>>> What is the actual problem there?
>>
>> When running as Xen guest new hotplugged memory will not be onlined
>> automatically, but only on special request. This is done in order to
>> support adding e.g. the possibility to use another GB of memory, while
>> adding only a part of that memory initially.
>>
>> In case adding that memory is populating a new zone, the page allocator
>> won't be able to use this memory when it is onlined, as the zone wasn't
>> added to the zonelist, due to managed_zone() returning 0.
>
> How is that memory onlined? Because "regular" onlining (online_pages())
> does rebuild zonelists if their zone hasn't been populated before.

The Xen balloon driver has its own callback for onlining pages. The pages
are just added to the ballooned-out page list without handing them to the
allocator. This is done only when the guest is ballooned up.

So the problem is that a new zone is being populated, but it won't have
free pages when the zonelists are rebuilt.

Juergen
On Thu 07-04-22 13:17:19, Juergen Gross wrote:
> On 07.04.22 13:07, Michal Hocko wrote:
>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>> Ccing Mel
>>>>
>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>> initialization") only zones with free memory are included in a built
>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>> ballooned out.
>>>>
>>>> What is the actual problem there?
>>>
>>> When running as Xen guest new hotplugged memory will not be onlined
>>> automatically, but only on special request. This is done in order to
>>> support adding e.g. the possibility to use another GB of memory, while
>>> adding only a part of that memory initially.
>>>
>>> In case adding that memory is populating a new zone, the page allocator
>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>> added to the zonelist, due to managed_zone() returning 0.
>>
>> How is that memory onlined? Because "regular" onlining (online_pages())
>> does rebuild zonelists if their zone hasn't been populated before.
>
> The Xen balloon driver has its own callback for onlining pages. The pages
> are just added to the ballooned-out page list without handing them to the
> allocator. This is done only when the guest is ballooned up.

OK, I see. Let me just rephrase to see whether we are on the same page.
Xen is overriding the online_page_callback to xen_online_page which
doesn't free pages to the page allocator which means that a zone might
remain unpopulated after onlining. This means that the default zone
lists rebuild is not done and later on when those pages are finally
released to the allocator there is no build_all_zonelists happening so
those freed pages are not really visible to the allocator via zonelists
fallback allocation.

Now to your patch. I suspect this is not sufficient for the full hotplug
situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
will call build_all_zonelists but the zone is not populated yet at that
moment unless I am missing something. We do rely on online_pages to
rebuild once pages are onlined - which usually means they are freed to
the page allocator.

The zonelists building is kinda messy TBH. I have to say that I am not
really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and
reclaim from zones with pages managed by the buddy allocator") because
as you have said an unpopulated zone is not (or shouldn't be) really all
that different from a depleted zone.

I think a better and more complete fix would be the following. In other
words the zonelists will be built for all present zones. Not sure
whether that is going to break 6aa303defb74 though.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..880c455e2557 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 		       struct zone *zone, struct memory_group *group)
 {
 	unsigned long flags;
-	int need_zonelists_rebuild = 0;
 	const int nid = zone_to_nid(zone);
 	int ret;
 	struct memory_notify arg;
@@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	 * This means the page allocator ignores this zone.
 	 * So, zonelist must be updated after online.
 	 */
-	if (!populated_zone(zone)) {
-		need_zonelists_rebuild = 1;
+	if (!populated_zone(zone))
 		setup_zone_pageset(zone);
-	}
 
 	online_pages_range(pfn, nr_pages);
 	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
 	node_states_set_node(nid, &arg);
-	if (need_zonelists_rebuild)
-		build_all_zonelists(NULL);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
 	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	/* reinitialise watermarks and update pcp limits */
 	init_per_zone_wmark_min();
 
-	if (!populated_zone(zone)) {
+	if (!populated_zone(zone))
 		zone_pcp_reset(zone);
-		build_all_zonelists(NULL);
-	}
 
 	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..130a2feceddc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (managed_zone(zone)) {
-			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
-			check_highest_zone(zone_type);
-		}
+		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
+		check_highest_zone(zone_type);
 	} while (zone_type);
 
 	return nr_zones;
On Thu 07-04-22 13:40:25, Michal Hocko wrote:
[...]
> Now to your patch. I suspect this is not sufficient for the full hotplug
> situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
> will call build_all_zonelists but the zone is not populated yet at that
> moment unless I am missing something. We do rely on online_pages to
> rebuild once pages are onlined - which usually means they are freed to
> the page allocator.

OK, I've managed to get lost in the code and misread the onlining part.
After re-reading the code I have concluded that the patch is good as is.
online_pages relies on populated_zone() so it will pass and zonelists
will be regenerated even without any pages freed to the allocator.
Sorry for the confusion.

But I guess this still proves my other point that the code is really
subtle and messy so I guess the less rebuilding we do the better.
There are two ways, go with your patch and do the clean up on top or
merge the two. That being said

Acked-by: Michal Hocko <mhocko@suse.com>

to your patch with an improved changelog to be more specific about the
underlying problem.

Thanks and sorry for the confusion.
On 07.04.22 13:40, Michal Hocko wrote:
> On Thu 07-04-22 13:17:19, Juergen Gross wrote:
>> On 07.04.22 13:07, Michal Hocko wrote:
>>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>>> Ccing Mel
>>>>>
>>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>>> initialization") only zones with free memory are included in a built
>>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>>> ballooned out.
>>>>>
>>>>> What is the actual problem there?
>>>>
>>>> When running as Xen guest new hotplugged memory will not be onlined
>>>> automatically, but only on special request. This is done in order to
>>>> support adding e.g. the possibility to use another GB of memory, while
>>>> adding only a part of that memory initially.
>>>>
>>>> In case adding that memory is populating a new zone, the page allocator
>>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>>> added to the zonelist, due to managed_zone() returning 0.
>>>
>>> How is that memory onlined? Because "regular" onlining (online_pages())
>>> does rebuild zonelists if their zone hasn't been populated before.
>>
>> The Xen balloon driver has its own callback for onlining pages. The pages
>> are just added to the ballooned-out page list without handing them to the
>> allocator. This is done only when the guest is ballooned up.
>
> OK, I see. Let me just rephrase to see whether we are on the same page.
> Xen is overriding the online_page_callback to xen_online_page which
> doesn't free pages to the page allocator which means that a zone might
> remain unpopulated after onlining. This means that the default zone
> lists rebuild is not done and later on when those pages are finally
> released to the allocator there is no build_all_zonelists happening so
> those freed pages are not really visible to the allocator via zonelists
> fallback allocation.
>
> Now to your patch. I suspect this is not sufficient for the full hotplug
> situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
> will call build_all_zonelists but the zone is not populated yet at that
> moment unless I am missing something. We do rely on online_pages to
> rebuild once pages are onlined - which usually means they are freed to
> the page allocator.
>
> The zonelists building is kinda messy TBH. I have to say that I am not
> really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and
> reclaim from zones with pages managed by the buddy allocator") because
> as you have said an unpopulated zone is not (or shouldn't be) really all
> that different from a depleted zone.
>
> I think a better and more complete fix would be the following. In other
> words the zonelists will be built for all present zones. Not sure
> whether that is going to break 6aa303defb74 though.
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a9627dc784c..880c455e2557 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  		       struct zone *zone, struct memory_group *group)
>  {
>  	unsigned long flags;
> -	int need_zonelists_rebuild = 0;
>  	const int nid = zone_to_nid(zone);
>  	int ret;
>  	struct memory_notify arg;
> @@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>  	 * This means the page allocator ignores this zone.
>  	 * So, zonelist must be updated after online.
>  	 */
> -	if (!populated_zone(zone)) {
> -		need_zonelists_rebuild = 1;
> +	if (!populated_zone(zone))
>  		setup_zone_pageset(zone);
> -	}
>
>  	online_pages_range(pfn, nr_pages);
>  	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
>
>  	node_states_set_node(nid, &arg);
> -	if (need_zonelists_rebuild)
> -		build_all_zonelists(NULL);
>
>  	/* Basic onlining is complete, allow allocation of onlined pages. */
>  	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
> @@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	/* reinitialise watermarks and update pcp limits */
>  	init_per_zone_wmark_min();
>
> -	if (!populated_zone(zone)) {
> +	if (!populated_zone(zone))
>  		zone_pcp_reset(zone);
> -		build_all_zonelists(NULL);
> -	}
>
>  	node_states_clear_node(node, &arg);
>  	if (arg.status_change_nid >= 0) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..130a2feceddc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> -			check_highest_zone(zone_type);
> -		}
> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> +		check_highest_zone(zone_type);
>  	} while (zone_type);
>
>  	return nr_zones;

I don't think having !populated zones in the zonelist is a particularly
good idea. Populated vs !populated changes only during page
onlining/offlining.

If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...

I'd vote for going with the simple fix first, which should be good
enough AFAICT.
On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
[...]
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3589febc6d31..130a2feceddc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>> -			check_highest_zone(zone_type);
>> -		}
>> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>> +		check_highest_zone(zone_type);
>>  	} while (zone_type);
>>
>>  	return nr_zones;
>
> I don't think having !populated zones in the zonelist is a particularly
> good idea. Populated vs !populated changes only during page
> onlining/offlining.
>
> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...

What kind of problem would that cause? The allocator wouldn't see any
pages at all so it would fall back to the next one. Maybe kswapd would
need some tweak to have a bail out condition but as mentioned in the
thread already, !populated or !managed for that matter are not all that
much different from completely depleted zones. The fact that we are
making that distinction has led to some bugs and I suspect it makes the
code more complex without a very good reason.

> I'd vote for going with the simple fix first, which should be good
> enough AFAICT.

yes, see the other reply
On 07.04.22 14:04, Michal Hocko wrote:
> On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
> [...]
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3589febc6d31..130a2feceddc 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>>  	do {
>>>  		zone_type--;
>>>  		zone = pgdat->node_zones + zone_type;
>>> -		if (managed_zone(zone)) {
>>> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>> -			check_highest_zone(zone_type);
>>> -		}
>>> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>> +		check_highest_zone(zone_type);
>>>  	} while (zone_type);
>>>
>>>  	return nr_zones;
>>
>> I don't think having !populated zones in the zonelist is a particularly
>> good idea. Populated vs !populated changes only during page
>> onlining/offlining.
>>
>> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
>
> What kind of problem would that cause? The allocator wouldn't see any
> pages at all so it would fall back to the next one. Maybe kswapd would
> need some tweak to have a bail out condition but as mentioned in the
> thread already, !populated or !managed for that matter are not all that
> much different from completely depleted zones. The fact that we are
> making that distinction has led to some bugs and I suspect it makes the
> code more complex without a very good reason.

I assume performance problems. Assume you have an ordinary system with
multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
always remain empty to be traversed on each and every allocation
fallback. Of course, we could measure, but IMHO at least *that* part of
memory onlining/offlining is not the complicated part :D

Populated vs. !populated is under pretty good control via page
onlining/offlining. We have to be careful with "managed pages", because
that's a moving target, especially with memory ballooning. And I assume
that's the bigger source of bugs.

>> I'd vote for going with the simple fix first, which should be good
>> enough AFAICT.
>
> yes, see the other reply

I think we were composing almost simultaneously :)
On Thu, Apr 07, 2022 at 01:17:19PM +0200, Juergen Gross wrote:
> On 07.04.22 13:07, Michal Hocko wrote:
>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>> Ccing Mel
>>>>
>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>> initialization") only zones with free memory are included in a built
>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>> ballooned out.
>>>>
>>>> What is the actual problem there?
>>>
>>> When running as Xen guest new hotplugged memory will not be onlined
>>> automatically, but only on special request. This is done in order to
>>> support adding e.g. the possibility to use another GB of memory, while
>>> adding only a part of that memory initially.
>>>
>>> In case adding that memory is populating a new zone, the page allocator
>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>> added to the zonelist, due to managed_zone() returning 0.
>>
>> How is that memory onlined? Because "regular" onlining (online_pages())
>> does rebuild zonelists if their zone hasn't been populated before.
>
> The Xen balloon driver has its own callback for onlining pages. The pages
> are just added to the ballooned-out page list without handing them to the
> allocator. This is done only when the guest is ballooned up.

Is this new behaviour? I ask because keeping !managed_zones out of the
zonelist and reclaim paths and the behaviour makes sense. Elsewhere you
state "zone can always happen to have no free memory left" and this is true
but it's usually a transient event. The difference between a populated
vs managed zone is usually a permanent event where no memory will ever be
placed on the buddy lists because the memory was reserved early in boot
or a similar reason. The patch is probably harmless but it has the
potential to waste CPUs allocating or reclaiming from zones that will
never succeed.
On 07.04.22 14:32, Mel Gorman wrote:
> On Thu, Apr 07, 2022 at 01:17:19PM +0200, Juergen Gross wrote:
>> On 07.04.22 13:07, Michal Hocko wrote:
>>> On Thu 07-04-22 12:45:41, Juergen Gross wrote:
>>>> On 07.04.22 12:34, Michal Hocko wrote:
>>>>> Ccing Mel
>>>>>
>>>>> On Thu 07-04-22 11:32:21, Juergen Gross wrote:
>>>>>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>>>>>> initialization") only zones with free memory are included in a built
>>>>>> zonelist. This is problematic when e.g. all memory of a zone has been
>>>>>> ballooned out.
>>>>>
>>>>> What is the actual problem there?
>>>>
>>>> When running as Xen guest new hotplugged memory will not be onlined
>>>> automatically, but only on special request. This is done in order to
>>>> support adding e.g. the possibility to use another GB of memory, while
>>>> adding only a part of that memory initially.
>>>>
>>>> In case adding that memory is populating a new zone, the page allocator
>>>> won't be able to use this memory when it is onlined, as the zone wasn't
>>>> added to the zonelist, due to managed_zone() returning 0.
>>>
>>> How is that memory onlined? Because "regular" onlining (online_pages())
>>> does rebuild zonelists if their zone hasn't been populated before.
>>
>> The Xen balloon driver has its own callback for onlining pages. The pages
>> are just added to the ballooned-out page list without handing them to the
>> allocator. This is done only when the guest is ballooned up.
>
> Is this new behaviour? I ask because keeping !managed_zones out of the

For some time (since kernel 5.9) Xen is using the zone device functionality
with memremap_pages() and pgmap->type = MEMORY_DEVICE_GENERIC.

> zonelist and reclaim paths and the behaviour makes sense. Elsewhere you
> state "zone can always happen to have no free memory left" and this is true
> but it's usually a transient event. The difference between a populated

And if this "transient event" is just happening when the zonelists are
being rebuilt the zone will be off the lists maybe forever.

> vs managed zone is usually a permanent event where no memory will ever be
> placed on the buddy lists because the memory was reserved early in boot
> or a similar reason. The patch is probably harmless but it has the
> potential to waste CPUs allocating or reclaiming from zones that will
> never succeed.

I'd recommend having an explicit per-zone flag for this case if you
really care about that. This would be much cleaner than inferring, from
no free page being present at a specific point in time, that the zone
will never be subject to memory allocation.

Juergen
On Thu 07-04-22 14:12:38, David Hildenbrand wrote:
> On 07.04.22 14:04, Michal Hocko wrote:
>> On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
>> [...]
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 3589febc6d31..130a2feceddc 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>>>  	do {
>>>>  		zone_type--;
>>>>  		zone = pgdat->node_zones + zone_type;
>>>> -		if (managed_zone(zone)) {
>>>> -			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>>> -			check_highest_zone(zone_type);
>>>> -		}
>>>> +		zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>>> +		check_highest_zone(zone_type);
>>>>  	} while (zone_type);
>>>>
>>>>  	return nr_zones;
>>>
>>> I don't think having !populated zones in the zonelist is a particularly
>>> good idea. Populated vs !populated changes only during page
>>> onlining/offlining.
>>>
>>> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
>>
>> What kind of problem would that cause? The allocator wouldn't see any
>> pages at all so it would fall back to the next one. Maybe kswapd would
>> need some tweak to have a bail out condition but as mentioned in the
>> thread already, !populated or !managed for that matter are not all that
>> much different from completely depleted zones. The fact that we are
>> making that distinction has led to some bugs and I suspect it makes the
>> code more complex without a very good reason.
>
> I assume performance problems. Assume you have an ordinary system with
> multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
> ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
> always remain empty to be traversed on each and every allocation
> fallback. Of course, we could measure, but IMHO at least *that* part of
> memory onlining/offlining is not the complicated part :D

You've got a good point here. I guess there will be use cases that
really benefit from every single CPU cycle spared in the allocator hot
path which really depends on the zonelists traversing.

I have very briefly had a look at our remaining usage of managed_zone()
and there are not that many left.

We have 2 in page_alloc.c via has_managed_dma(). I guess we could drop
that one and use __GFP_NOWARN in dma_atomic_pool_init but this is
nothing really earth shattering.

The remaining occurrences are in vmscan.c:
- should_continue_reclaim, pgdat_balanced - required because they
  iterate all zones within the zoneindex and need to decide whether
  they are balanced or not. We can make empty zones special case and
  make them always balanced
- kswapd_shrink_node - required because we would be increasing reclaim
  target for empty zones
- update_reclaim_active - required because we do not want to alter zone
  state if it is not a subject of the reclaim which empty zones are not
  by definition.
- balance_pgdat - first check is likely a micro-optimization,
  reclaim_idx is needed to have a populated zone there
- wakeup_kswapd - I dunno
- shrink_node, allow_direct_reclaim, lruvec_lru_size - micro-optimizations
- pgdat_watermark_boosted - micro-optimization I suspect as empty zone
  shouldn't ever get watermark_boost
- pgdat_balanced - functionally dd

So we can get rid of quite some but we will still need some of them.
On Thu, Apr 07, 2022 at 11:46:13AM +0200, David Hildenbrand wrote:
> On 07.04.22 11:32, Juergen Gross wrote:
>> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
>> initialization") only zones with free memory are included in a built
>> zonelist. This is problematic when e.g. all memory of a zone has been
>> ballooned out.
>>
>> Use populated_zone() when building a zonelist as it has been done
>> before that commit.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>  mm/page_alloc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index bdc8f60ae462..3d0662af3289 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> +		if (populated_zone(zone)) {
>>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>  			check_highest_zone(zone_type);
>>  		}
>
> Let's see if we have to find another way to properly handle fadump.
>
> Acked-by: David Hildenbrand <david@redhat.com>

Ok, I see the point.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

> --
> Thanks,
>
> David / dhildenb
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bdc8f60ae462..3d0662af3289 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (managed_zone(zone)) {
+		if (populated_zone(zone)) {
 			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
initialization") only zones with free memory are included in a built
zonelist. This is problematic when e.g. all memory of a zone has been
ballooned out.

Use populated_zone() when building a zonelist as it has been done
before that commit.

Cc: stable@vger.kernel.org
Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)