Message ID | 20211206033338.743270-3-npache@redhat.com
---|---
State | New
Series | mm: Dont allocate pages on a offline node
On Sun 05-12-21 22:33:38, Nico Pache wrote: > We have run into a panic caused by a shrinker allocation being attempted > on an offlined node. > > Our crash analysis has determined that the issue originates from trying > to allocate pages on an offlined node in expand_one_shrinker_info. This > function makes the incorrect assumption that we can allocate on any node. > To correct this we make sure we only itterate over online nodes. > > This assumption can lead to an incorrect address being assigned to ac->zonelist > in the following callchain: > __alloc_pages > -> prepare_alloc_pages > -> node_zonelist > > static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > { > return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > } > if the node is not online the return of node_zonelist will evaluate to a > invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0) > > This address is then dereferenced further down the callchain in: > prepare_alloc_pages > -> first_zones_zonelist > -> next_zones_zonelist > -> zonelist_zone_idx > > static inline int zonelist_zone_idx(struct zoneref *zoneref) > { > return zoneref->zone_idx; > } > > Leading the system to panic. Thanks for the analysis! Please also add an oops report so that this is easier to search for. It would be also interesting to see specifics about the issue. Why was the specific node !online in the first place? What architecture was this on? > We also correct this behavior in alloc_shrinker_info, free_shrinker_info, > and reparent_shrinker_deferred. > > Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code") > Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg") Normally I would split the fix as it is fixing two issues one introduced in 4.19 the other in 5.13. 
> Signed-off-by: Nico Pache <npache@redhat.com> > --- > mm/vmscan.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index fb9584641ac7..731564b61e3f 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, > int nid; > int size = map_size + defer_size; > > - for_each_node(nid) { > + for_each_online_node(nid) { > pn = memcg->nodeinfo[nid]; > old = shrinker_info_protected(memcg, nid); > /* Not yet online memcg */ > @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg) > struct shrinker_info *info; > int nid; > > - for_each_node(nid) { > + for_each_online_node(nid) { > pn = memcg->nodeinfo[nid]; > info = rcu_dereference_protected(pn->shrinker_info, true); > kvfree(info); > @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg) > map_size = shrinker_map_size(shrinker_nr_max); > defer_size = shrinker_defer_size(shrinker_nr_max); > size = map_size + defer_size; > - for_each_node(nid) { > + for_each_online_node(nid) { > info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid); > if (!info) { > free_shrinker_info(memcg); > @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg) > > /* Prevent from concurrent shrinker_info expand */ > down_read(&shrinker_rwsem); > - for_each_node(nid) { > + for_each_online_node(nid) { > child_info = shrinker_info_protected(memcg, nid); > parent_info = shrinker_info_protected(parent, nid); > for (i = 0; i < shrinker_nr_max; i++) { > -- > 2.33.1 This doesn't seem complete. Slab shrinkers are used in the reclaim context. Previously offline nodes could be onlined later and this would lead to NULL ptr because there is no hook to allocate new shrinker infos. This would be also really impractical because this would have to update all existing memcgs... 
To be completely honest I am not really sure this is a practical problem because some architectures allocate (aka make online) all possible nodes reported by the platform. There are major inconsistencies there. Maybe that should be unified, so that problems like this one do not really have to add complexity to the code.
[Cc David. I have only now noticed he has replied to this thread already pointing out the offline->online case] On Mon 06-12-21 10:23:00, Michal Hocko wrote: > On Sun 05-12-21 22:33:38, Nico Pache wrote: > > We have run into a panic caused by a shrinker allocation being attempted > > on an offlined node. > > > > Our crash analysis has determined that the issue originates from trying > > to allocate pages on an offlined node in expand_one_shrinker_info. This > > function makes the incorrect assumption that we can allocate on any node. > > To correct this we make sure we only itterate over online nodes. > > > > This assumption can lead to an incorrect address being assigned to ac->zonelist > > in the following callchain: > > __alloc_pages > > -> prepare_alloc_pages > > -> node_zonelist > > > > static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > > { > > return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > > } > > if the node is not online the return of node_zonelist will evaluate to a > > invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0) > > > > This address is then dereferenced further down the callchain in: > > prepare_alloc_pages > > -> first_zones_zonelist > > -> next_zones_zonelist > > -> zonelist_zone_idx > > > > static inline int zonelist_zone_idx(struct zoneref *zoneref) > > { > > return zoneref->zone_idx; > > } > > > > Leading the system to panic. > > Thanks for the analysis! Please also add an oops report so that this is > easier to search for. It would be also interesting to see specifics > about the issue. Why was the specific node !online in the first place? > What architecture was this on? > > > We also correct this behavior in alloc_shrinker_info, free_shrinker_info, > > and reparent_shrinker_deferred. 
> > > > Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code") > > Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg") > > Normally I would split the fix as it is fixing two issues one introduced > in 4.19 the other in 5.13. > > > Signed-off-by: Nico Pache <npache@redhat.com> > > --- > > mm/vmscan.c | 8 ++++---- > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index fb9584641ac7..731564b61e3f 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, > > int nid; > > int size = map_size + defer_size; > > > > - for_each_node(nid) { > > + for_each_online_node(nid) { > > pn = memcg->nodeinfo[nid]; > > old = shrinker_info_protected(memcg, nid); > > /* Not yet online memcg */ > > @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg) > > struct shrinker_info *info; > > int nid; > > > > - for_each_node(nid) { > > + for_each_online_node(nid) { > > pn = memcg->nodeinfo[nid]; > > info = rcu_dereference_protected(pn->shrinker_info, true); > > kvfree(info); > > @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg) > > map_size = shrinker_map_size(shrinker_nr_max); > > defer_size = shrinker_defer_size(shrinker_nr_max); > > size = map_size + defer_size; > > - for_each_node(nid) { > > + for_each_online_node(nid) { > > info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid); > > if (!info) { > > free_shrinker_info(memcg); > > @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg) > > > > /* Prevent from concurrent shrinker_info expand */ > > down_read(&shrinker_rwsem); > > - for_each_node(nid) { > > + for_each_online_node(nid) { > > child_info = shrinker_info_protected(memcg, nid); > > parent_info = shrinker_info_protected(parent, nid); > > for (i = 0; i < shrinker_nr_max; i++) { > > -- > > 2.33.1 > > This doesn't seen complete. 
Slab shrinkers are used in the reclaim > context. Previously offline nodes could be onlined later and this would > lead to NULL ptr because there is no hook to allocate new shrinker > infos. This would be also really impractical because this would have to > update all existing memcgs... > > To be completely honest I am not really sure this is a practical problem > because some architectures allocate (aka make online) all possible nodes > reported by the platform. There are major inconsistencies there. Maybe > that should be unified, so that problems like this one do not really > have to add a complexity to the code. > > -- > Michal Hocko > SUSE Labs
On Mon 06-12-21 11:45:54, David Hildenbrand wrote: > > This doesn't seen complete. Slab shrinkers are used in the reclaim > > context. Previously offline nodes could be onlined later and this would > > lead to NULL ptr because there is no hook to allocate new shrinker > > infos. This would be also really impractical because this would have to > > update all existing memcgs... > > Instead of going through the trouble of updating... > > ... maybe just keep for_each_node() and check if the target node is > offline. If it's offline, just allocate from the first online node. > After all, we're not using __GFP_THISNODE, so there are no guarantees > either way ... This looks like another way to paper over a deeper underlying problem IMHO. Fundamentally we have a problem that some pgdata are not allocated and that causes a lot of headache. Not to mention that node_online is just adding to a confusion because it doesn't really tell anything about the logical state of the node. I think we really should get rid of this approach rather than play a whack-a-mole. We should really drop all notion of node_online and instead allocate pgdat for each possible node. Arch specific code should make sure that zone lists are properly initialized.
On Mon 06-12-21 12:00:50, David Hildenbrand wrote: > On 06.12.21 11:54, Michal Hocko wrote: > > On Mon 06-12-21 11:45:54, David Hildenbrand wrote: > >>> This doesn't seen complete. Slab shrinkers are used in the reclaim > >>> context. Previously offline nodes could be onlined later and this would > >>> lead to NULL ptr because there is no hook to allocate new shrinker > >>> infos. This would be also really impractical because this would have to > >>> update all existing memcgs... > >> > >> Instead of going through the trouble of updating... > >> > >> ... maybe just keep for_each_node() and check if the target node is > >> offline. If it's offline, just allocate from the first online node. > >> After all, we're not using __GFP_THISNODE, so there are no guarantees > >> either way ... > > > > This looks like another way to paper over a deeper underlying problem > > IMHO. Fundamentally we have a problem that some pgdata are not allocated > > and that causes a lot of headache. Not to mention that node_online > > is just adding to a confusion because it doesn't really tell anything > > about the logical state of the node. > > > > I think we really should get rid of this approach rather than play a > > whack-a-mole. We should really drop all notion of node_online and > > instead allocate pgdat for each possible node. Arch specific code should > > make sure that zone lists are properly initialized. > > > > I'm not sure if it's rally whack-a-mole really applies. It's just the > for_each_node_* calls that need love. In other cases, we shouldn't > really stumble over an offline node. > > Someone deliberately decided to use "for_each_node()" instead of > for_each_online_node() without taking care of online vs. offline > semantics. That's just a BUG and needs fixing IMHO. Well, I would argue this is too much to ask for the API users. It is also a trap that is just too easy to fall into. Turning for_each_node into for_each_online_node will not solve the problem just by itself. 
As you have pointed out, an offline node can become online, and without a hotplug notifier there is no way for_each_online_node would be a proper fit for anything that allocates memory only for online nodes. See the trap? People think they are doing the right thing but in fact they are not. Now practically speaking, !node_online (note I am intentionally avoiding saying offline and online as that has a completely different semantic) shouldn't really happen for some architectures. x86 should allocate pgdat for each possible node. I do not know what the architecture was in this case but we already have another report for x86 that remains unexplained. My existing experience tells me that we have two major problems in this area: arch-inconsistent behavior and really confusing APIs. We should be addressing those rather than adding more hacks. > After all, we do need patches to backport, reworking pgdat init isn't > really something feasible for that I think. And I heard of PPC that can > hotplug thousands of nodes ... If this turns out to be a big problem then we can think of putting pgdat on a diet.
On Mon 06-12-21 13:43:27, David Hildenbrand wrote: [...] > > Now practically speaking !node_online should not apear node_online (note > > I am attentionally avoiding to say offline and online as that has a > > completely different semantic) shouldn't really happen for some > > architectures. x86 should allocate pgdat for each possible node. I do > > not know what was the architecture in this case but we already have > > another report for x86 that remains unexplained. > > So we'd allocate the pgdat although all we want is just a zonelist. The > obvious alternative is to implement the fallback where reasonable -- for > example, in the page allocator. It knows the fallback order: > build_zonelists(). That's pretty much all we need the preferred_nid for. > > So just making prepare_alloc_pages()/node_zonelist() deal with a missing > pgdat could make sense as well. Something like: > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index b976c4177299..2d2649e78766 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags) > * > * For the case of non-NUMA systems the NODE_DATA() gets optimized to > * &contig_page_data at compile-time. > + * > + * If the node does not have a pgdat yet, returns the zonelist of the > + * first online node. > */ > static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > { > + if (unlikely(!NODE_DATA(nid))) > + nid = first_online_node; > return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > } This is certainly possible. But it a) adds a branch to the hotpath and b) it doesn't solve any other potential dereference of garbage node. > But of course, there might be value in a proper node-aware fallback list > as we have in build_zonelists() -- but it remains questionable if the > difference for these corner cases would be relevant in practice. Only the platform knows the proper node topology and that includes memory less nodes. 
So they should be setting up a node properly and we shouldn't be dealing with this at the allocator nor any other code. > Further, if we could have thousands of nodes, we'd have to update each > and every one when building zone lists ... Why would that be a practical problem? > Removing the hotadd_new_pgdat() stuff does sound appealing, though. Yes, our hotplug code could be simplified as well. All we really need is for arch code to initialize all the possible nodes.
On Mon 06-12-21 16:19:12, Kirill Tkhai wrote: > On 06.12.2021 13:45, David Hildenbrand wrote: > >> This doesn't seen complete. Slab shrinkers are used in the reclaim > >> context. Previously offline nodes could be onlined later and this would > >> lead to NULL ptr because there is no hook to allocate new shrinker > >> infos. This would be also really impractical because this would have to > >> update all existing memcgs... > > > > Instead of going through the trouble of updating... > > > > ... maybe just keep for_each_node() and check if the target node is > > offline. If it's offline, just allocate from the first online node. > > After all, we're not using __GFP_THISNODE, so there are no guarantees > > either way ... > > Hm, can't we add shrinker maps allocation to __try_online_node() in addition > to this patch? Either that or through hotplug notifier (which would be a better solution). But allocating a new shrinker map for each memcg would have to be done as has been mentioned earlier.
On Mon 06-12-21 14:47:13, David Hildenbrand wrote: > On 06.12.21 14:06, Michal Hocko wrote: > > On Mon 06-12-21 13:43:27, David Hildenbrand wrote: > > [...] > >>> Now practically speaking !node_online should not apear node_online (note > >>> I am attentionally avoiding to say offline and online as that has a > >>> completely different semantic) shouldn't really happen for some > >>> architectures. x86 should allocate pgdat for each possible node. I do > >>> not know what was the architecture in this case but we already have > >>> another report for x86 that remains unexplained. > >> > >> So we'd allocate the pgdat although all we want is just a zonelist. The > >> obvious alternative is to implement the fallback where reasonable -- for > >> example, in the page allocator. It knows the fallback order: > >> build_zonelists(). That's pretty much all we need the preferred_nid for. > >> > >> So just making prepare_alloc_pages()/node_zonelist() deal with a missing > >> pgdat could make sense as well. Something like: > >> > >> > >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h > >> index b976c4177299..2d2649e78766 100644 > >> --- a/include/linux/gfp.h > >> +++ b/include/linux/gfp.h > >> @@ -508,9 +508,14 @@ static inline int gfp_zonelist(gfp_t flags) > >> * > >> * For the case of non-NUMA systems the NODE_DATA() gets optimized to > >> * &contig_page_data at compile-time. > >> + * > >> + * If the node does not have a pgdat yet, returns the zonelist of the > >> + * first online node. > >> */ > >> static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > >> { > >> + if (unlikely(!NODE_DATA(nid))) > >> + nid = first_online_node; > >> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > >> } > > > > This is certainly possible. But it a) adds a branch to the hotpath and > > b) it doesn't solve any other potential dereference of garbage node. > > I don't think a) is a problem but it's easy to measure. 
Agreed to b), > however, the page allocator has been the most prominent source of error > reports for this. > > > > >> But of course, there might be value in a proper node-aware fallback list > >> as we have in build_zonelists() -- but it remains questionable if the > >> difference for these corner cases would be relevant in practice. > > > > Only the platform knows the proper node topology and that includes > > memory less nodes. So they should be setting up a node properly and we > > shouldn't be dealing with this at the allocator nor any other code. > > I *think* there are cases where the topology of a new node is only know > once it actually gets used. For example, I remember talking to CXL and > there are ideas to have a pool of possible nodes, which can get used > dynamically for CXL memory. Of course, some kind of reconfiguration > could be imaginable. I am not really familiar with CXL but if a dynamic reconfiguration is needed then this would only require to update the zonelist which should be a trivial thing to do. The code using the node doesn't really have to be aware of that as the pgdat will stay unchanged. > >> Further, if we could have thousands of nodes, we'd have to update each > >> and every one when building zone lists ... > > > > Why would that be a practical problem? > > We'll need at least > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c5952749ad40..e5d958abc7cc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6382,7 +6382,7 @@ static void __build_all_zonelists(void *data) > if (self && !node_online(self->node_id)) { > build_zonelists(self); > } else { > - for_each_online_node(nid) { > + for_each_node(nid) { > pg_data_t *pgdat = NODE_DATA(nid); > > build_zonelists(pgdat); If we do have preallocated pgdat for all possible nodes that for_each_online_node doesn't really need to exist. > But there might be more missing. 
Onlining a new zone will get more > expensive in setups with a lot of possible nodes (x86-64 shouldn't > really be an issue in that regard). Honestly, I am not really concerned by platforms with too many nodes without any memory. If they want to shoot their feet then that's their choice. We can optimize for those if they ever prove to be standard. > If we want stable backports, we'll want something simple upfront. For stable backports I would be fine by doing your NODE_DATA check in the allocator. In upstream I think we should be aiming for a more robust solution that is also easier to maintain further down the line. Even if that is an investment at this moment because the initialization code is a mess.
On Mon 06-12-21 14:47:13, David Hildenbrand wrote: > On 06.12.21 14:06, Michal Hocko wrote: [...] > > This is certainly possible. But it a) adds a branch to the hotpath and > > b) it doesn't solve any other potential dereference of garbage node. > > I don't think a) is a problem but it's easy to measure. Let me just clarify on this one. This single particular branch will indeed be hard to tie to any performance changes. On a single call of the allocator it is nothing. But it is a condition that will be executed by each caller while it won't ever trigger in most systems and workloads. In other words it will cause a lot of wasted cpu cycles long term. A more expensive one-time initialization is worth it in order to keep the allocator fast path lighter IMHO.
On Mon 06-12-21 15:08:10, David Hildenbrand wrote: > > >> But there might be more missing. Onlining a new zone will get more > >> expensive in setups with a lot of possible nodes (x86-64 shouldn't > >> really be an issue in that regard). > > > > Honestly, I am not really concerned by platforms with too many nodes > > without any memory. If they want to shoot their feet then that's their > > choice. We can optimize for those if they ever prove to be standar. > > > >> If we want stable backports, we'll want something simple upfront. > > > > For stable backports I would be fine by doing your NODE_DATA check in > > the allocator. In upstream I think we should be aiming for a more robust > > solution that is also easier to maintain further down the line. Even if > > that is an investment at this momemnt because the initialization code is > > a mess. > > > > Agreed. I would be curious *why* we decided to dynamically allocate the > pgdat. is this just a historical coincidence or was there real reason to > not allocate it for all possible nodes during boot? I don't know but if I was to guess the most likely explanation would be that the numa init code was in a similar order as now and it was easier to simply allocate a pgdat when a new one was onlined. 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)") doesn't really tell much.
On 12/6/21 15:21, Michal Hocko wrote: > On Mon 06-12-21 15:08:10, David Hildenbrand wrote: >> >> >> But there might be more missing. Onlining a new zone will get more >> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't >> >> really be an issue in that regard). >> > >> > Honestly, I am not really concerned by platforms with too many nodes >> > without any memory. If they want to shoot their feet then that's their >> > choice. We can optimize for those if they ever prove to be standar. >> > >> >> If we want stable backports, we'll want something simple upfront. >> > >> > For stable backports I would be fine by doing your NODE_DATA check in >> > the allocator. In upstream I think we should be aiming for a more robust >> > solution that is also easier to maintain further down the line. Even if >> > that is an investment at this momemnt because the initialization code is >> > a mess. >> > >> >> Agreed. I would be curious *why* we decided to dynamically allocate the >> pgdat. is this just a historical coincidence or was there real reason to >> not allocate it for all possible nodes during boot? > > I don't know but if I was to guess the most likely explanation would be > that the numa init code was in a similar order as now and it was easier > to simply allocate a pgdat when a new one was onlined. > 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)") > doesn't really tell much. 
I don't know if that's true for pgdat specifically, but generally IMHO the advantages of allocating during/after online instead of for each possible node are:
- memory savings when some possible node is actually never onlined
- at least in some cases, the allocations can be local to the node in question, where the advantages are:
  - faster access
  - less memory occupied on nodes that are onlined earlier, especially node 0

So while the approach of allocating on boot for all possible nodes instead of just online nodes has the advantages of being generally safer and simpler (no memory hotplug callbacks etc.), we should also be careful not to overdo this approach so we don't end up with node 0 memory filled with structures used for nodes 1-X that are just onlined later. I imagine that could be a problem even for "sane" archs that don't have tons of possible, but offline, nodes. Concretely, pgdat should probably be fine, but things like all shrinkers? Maybe less so.
On Mon 06-12-21 15:30:37, Vlastimil Babka wrote: > On 12/6/21 15:21, Michal Hocko wrote: > > On Mon 06-12-21 15:08:10, David Hildenbrand wrote: > >> > >> >> But there might be more missing. Onlining a new zone will get more > >> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't > >> >> really be an issue in that regard). > >> > > >> > Honestly, I am not really concerned by platforms with too many nodes > >> > without any memory. If they want to shoot their feet then that's their > >> > choice. We can optimize for those if they ever prove to be standar. > >> > > >> >> If we want stable backports, we'll want something simple upfront. > >> > > >> > For stable backports I would be fine by doing your NODE_DATA check in > >> > the allocator. In upstream I think we should be aiming for a more robust > >> > solution that is also easier to maintain further down the line. Even if > >> > that is an investment at this momemnt because the initialization code is > >> > a mess. > >> > > >> > >> Agreed. I would be curious *why* we decided to dynamically allocate the > >> pgdat. is this just a historical coincidence or was there real reason to > >> not allocate it for all possible nodes during boot? > > > > I don't know but if I was to guess the most likely explanation would be > > that the numa init code was in a similar order as now and it was easier > > to simply allocate a pgdat when a new one was onlined. > > 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)") > > doesn't really tell much. 
> > I don't know if that's true for pgdat specifically, but generally IMHO the > advantages of allocating during/after online instead for each possible is > - memory savings when some possible node is actually never online > - at least in some cases, the allocations can be local to the node in > question where the advantages is > - faster access > - less memory occupied on nodes that are earlier online, especially node 0 > > So while the approach of allocate on boot for all possible nodes instead of > just online nodes has advantages of being generally safer and simpler (no > memory hotplug callbacks etc), we should also be careful not to overdo this > approach so we don't end up with Node 0 memory filled with structures used > for nodes 1-X that are just onlined later. I imagine that could be a problem > even for "sane" archs that don't have tons of possible, but offline nodes. Yes, this can indeed turn out to be a problem as the memory allocations scale not only with numa nodes but memcgs as well. The latter being the more visible one. > Concretely, pgdat should probably be fine, but things like all shrinkers? > Maybe less so. Yeah, right. But for that purpose the concept of online_node is just misleading. You would need a check whether the node is populated with memory and implement hotplug notifiers.
On Mon, Dec 6, 2021 at 6:53 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 06-12-21 15:30:37, Vlastimil Babka wrote: > > On 12/6/21 15:21, Michal Hocko wrote: > > > On Mon 06-12-21 15:08:10, David Hildenbrand wrote: > > >> > > >> >> But there might be more missing. Onlining a new zone will get more > > >> >> expensive in setups with a lot of possible nodes (x86-64 shouldn't > > >> >> really be an issue in that regard). > > >> > > > >> > Honestly, I am not really concerned by platforms with too many nodes > > >> > without any memory. If they want to shoot their feet then that's their > > >> > choice. We can optimize for those if they ever prove to be standar. > > >> > > > >> >> If we want stable backports, we'll want something simple upfront. > > >> > > > >> > For stable backports I would be fine by doing your NODE_DATA check in > > >> > the allocator. In upstream I think we should be aiming for a more robust > > >> > solution that is also easier to maintain further down the line. Even if > > >> > that is an investment at this momemnt because the initialization code is > > >> > a mess. > > >> > > > >> > > >> Agreed. I would be curious *why* we decided to dynamically allocate the > > >> pgdat. is this just a historical coincidence or was there real reason to > > >> not allocate it for all possible nodes during boot? > > > > > > I don't know but if I was to guess the most likely explanation would be > > > that the numa init code was in a similar order as now and it was easier > > > to simply allocate a pgdat when a new one was onlined. > > > 9af3c2dea3a3 ("[PATCH] pgdat allocation for new node add (call pgdat allocation)") > > > doesn't really tell much. 
> > > > I don't know if that's true for pgdat specifically, but generally IMHO the > > advantages of allocating during/after online instead for each possible is > > - memory savings when some possible node is actually never online > > - at least in some cases, the allocations can be local to the node in > > question where the advantages is > > - faster access > > - less memory occupied on nodes that are earlier online, especially node 0 > > > > So while the approach of allocate on boot for all possible nodes instead of > > just online nodes has advantages of being generally safer and simpler (no > > memory hotplug callbacks etc), we should also be careful not to overdo this > > approach so we don't end up with Node 0 memory filled with structures used > > for nodes 1-X that are just onlined later. I imagine that could be a problem > > even for "sane" archs that don't have tons of possible, but offline nodes. > > Yes this can indeed turn out to be a problem as the memory allocations > scales not only with numa nodes but memcgs as well. The later one being > a more visible one. > > > Concretely, pgdat should probably be fine, but things like all shrinkers? > > Maybe less so. > > Yeah, right. But for that purpose the concept of online_node is just > misleading. You would need a check whether the node is populated with > memory and implement hotplug notifiers. Yes, the cons is memory waste. I think it is a known problem since memcg has per node data (a.k.a. mem_cgroup_per_node_info) which holds lruvec and shrinker infos. And the comment in the code of alloc_mem_cgroup_per_node_info() does say: "TODO: this routine can waste much memory for nodes which will never be onlined. It's better to use memory hotplug callback function." But IMHO actually the memory usage should be not that bad for memcg heavy usecases since there should be not too many "never onlined" nodes for such workloads? > > -- > Michal Hocko > SUSE Labs
On Mon, Dec 6, 2021 at 5:19 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.12.2021 13:45, David Hildenbrand wrote:
> >> This doesn't seen complete. Slab shrinkers are used in the reclaim
> >> context. Previously offline nodes could be onlined later and this would
> >> lead to NULL ptr because there is no hook to allocate new shrinker
> >> infos. This would be also really impractical because this would have to
> >> update all existing memcgs...
> >
> > Instead of going through the trouble of updating...
> >
> > ... maybe just keep for_each_node() and check if the target node is
> > offline. If it's offline, just allocate from the first online node.
> > After all, we're not using __GFP_THISNODE, so there are no guarantees
> > either way ...
>
> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
> to this patch?

I think the below fix (an example, doesn't cover all affected callsites) should be good enough for now? It doesn't touch the hot path of the page allocator.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..1252a33f7c28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -222,13 +222,15 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 	int size = map_size + defer_size;
 
 	for_each_node(nid) {
+		int tmp = nid;
 		pn = memcg->nodeinfo[nid];
 		old = shrinker_info_protected(memcg, nid);
 		/* Not yet online memcg */
 		if (!old)
 			return 0;
-
-		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if (!node_online(nid))
+			tmp = -1;
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp);
 		if (!new)
 			return -ENOMEM;

It used to use kvmalloc instead of kvmalloc_node(). The commit 86daf94efb11d7319fbef5e480018c4807add6ef ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node") changed it to use the *_node() version. The justification was that "kswapd is always bound to specific node. So allocate shrinker_map from the related NUMA node to respect its NUMA locality."
There is no kswapd for an offlined node, so just allocate the shrinker info on node 0. This is also what alloc_mem_cgroup_per_node_info() does. Making the memcg per-node data allocation memory-hotplug aware should be solved in a separate patchset IMHO.
On Sun, Dec 5, 2021 at 7:34 PM Nico Pache <npache@redhat.com> wrote: > > We have run into a panic caused by a shrinker allocation being attempted > on an offlined node. > > Our crash analysis has determined that the issue originates from trying > to allocate pages on an offlined node in expand_one_shrinker_info. This > function makes the incorrect assumption that we can allocate on any node. > To correct this we make sure we only itterate over online nodes. > > This assumption can lead to an incorrect address being assigned to ac->zonelist > in the following callchain: > __alloc_pages > -> prepare_alloc_pages > -> node_zonelist > > static inline struct zonelist *node_zonelist(int nid, gfp_t flags) > { > return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); > } > if the node is not online the return of node_zonelist will evaluate to a > invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0) > > This address is then dereferenced further down the callchain in: > prepare_alloc_pages > -> first_zones_zonelist > -> next_zones_zonelist > -> zonelist_zone_idx > > static inline int zonelist_zone_idx(struct zoneref *zoneref) > { > return zoneref->zone_idx; > } > > Leading the system to panic. > > We also correct this behavior in alloc_shrinker_info, free_shrinker_info, > and reparent_shrinker_deferred. > > Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code") > Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg") I think the correct fix tag should be: 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node") regardless of how we will fix this problem. 
> Signed-off-by: Nico Pache <npache@redhat.com> > --- > mm/vmscan.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index fb9584641ac7..731564b61e3f 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, > int nid; > int size = map_size + defer_size; > > - for_each_node(nid) { > + for_each_online_node(nid) { > pn = memcg->nodeinfo[nid]; > old = shrinker_info_protected(memcg, nid); > /* Not yet online memcg */ > @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg) > struct shrinker_info *info; > int nid; > > - for_each_node(nid) { > + for_each_online_node(nid) { > pn = memcg->nodeinfo[nid]; > info = rcu_dereference_protected(pn->shrinker_info, true); > kvfree(info); > @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg) > map_size = shrinker_map_size(shrinker_nr_max); > defer_size = shrinker_defer_size(shrinker_nr_max); > size = map_size + defer_size; > - for_each_node(nid) { > + for_each_online_node(nid) { > info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid); > if (!info) { > free_shrinker_info(memcg); > @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg) > > /* Prevent from concurrent shrinker_info expand */ > down_read(&shrinker_rwsem); > - for_each_node(nid) { > + for_each_online_node(nid) { > child_info = shrinker_info_protected(memcg, nid); > parent_info = shrinker_info_protected(parent, nid); > for (i = 0; i < shrinker_nr_max; i++) { > -- > 2.33.1 >
On Mon, Dec 6, 2021 at 11:01 AM David Hildenbrand <david@redhat.com> wrote: > > On 06.12.21 19:42, Yang Shi wrote: > > On Mon, Dec 6, 2021 at 5:19 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote: > >> > >> On 06.12.2021 13:45, David Hildenbrand wrote: > >>>> This doesn't seen complete. Slab shrinkers are used in the reclaim > >>>> context. Previously offline nodes could be onlined later and this would > >>>> lead to NULL ptr because there is no hook to allocate new shrinker > >>>> infos. This would be also really impractical because this would have to > >>>> update all existing memcgs... > >>> > >>> Instead of going through the trouble of updating... > >>> > >>> ... maybe just keep for_each_node() and check if the target node is > >>> offline. If it's offline, just allocate from the first online node. > >>> After all, we're not using __GFP_THISNODE, so there are no guarantees > >>> either way ... > >> > >> Hm, can't we add shrinker maps allocation to __try_online_node() in addition > >> to this patch? > > > > I think the below fix (an example, doesn't cover all affected > > callsites) should be good enough for now? It doesn't touch the hot > > path of the page allocator. > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index fb9584641ac7..1252a33f7c28 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -222,13 +222,15 @@ static int expand_one_shrinker_info(struct > > mem_cgroup *memcg, > > int size = map_size + defer_size; > > > > for_each_node(nid) { > > + int tmp = nid; > > pn = memcg->nodeinfo[nid]; > > old = shrinker_info_protected(memcg, nid); > > /* Not yet online memcg */ > > if (!old) > > return 0; > > - > > - new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid); > > + if (!node_online(nid)) > > + tmp = -1; > > + new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, tmp); > > if (!new) > > return -ENOMEM; > > > > It used to use kvmalloc instead of kvmalloc_node(). 
The commit
> > 86daf94efb11d7319fbef5e480018c4807add6ef ("mm/memcontrol.c: allocate
> > shrinker_map on appropriate NUMA node") changed to use *_node()
> > version. The justification was that "kswapd is always bound to
> > specific node. So allocate shrinker_map from the related NUMA node to
> > respect its NUMA locality." There is no kswapd for offlined node, so
> > just allocate shrinker info on node 0. This is also what
> > alloc_mem_cgroup_per_node_info() does.
>
> Yes, that's what I refer to as fixing it in the caller -- similar to
> [1]. Michals point is to not require such node_online() checks at all,
> neither in the caller nor in the buddy.
>
> I see 2 options short-term
>
> 1) What we have in [1].
> 2) What I proposed in [2], fixing it for all such instances until we
> have something better.
>
> Long term I tend to agree that what Michal proposes is better.
>
> Short term I tend to like [2], because it avoids having to mess with all
> such instances to eventually get it right and the temporary overhead
> until we have the code reworked should be really negligible ...

Thanks, David. Basically either option looks fine to me. But I'm a little bit concerned about [2]. It silently changes the node requested by the callers. It actually papers over potential bugs? And what if the callers specify __GFP_THISNODE (I didn't search if such callers really exist in the current code)?

How about a helper function, for example, called kvmalloc_best_node()? It does:

void *kvmalloc_best_node(unsigned long size, int flag, int nid)
{
	bool onlined = node_online(nid);

	WARN_ON_ONCE((flag & __GFP_THISNODE) && !onlined);

	if (!onlined)
		nid = -1;

	return kvmalloc_node(size, GFP_xxx, nid);
}

> [1] https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
> [2] https://lkml.kernel.org/r/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com
>
> --
> Thanks,
>
> David / dhildenb
>
On Mon 06-12-21 10:26:32, Yang Shi wrote:
[...]
> But IMHO actually the memory usage should be not that bad for memcg
> heavy usecases since there should be not too many "never onlined"
> nodes for such workloads?

Hardware with a very sparse node topology is really scarce. More likely on ppc (LPARs), but even then we are talking about a really low number of nodes. At least this is my experience. So while the memory wasting is possible, it doesn't seem to be a really pressing problem. I would be more careful about code which scales with MAX_NUMNODES, because that can be really large.
>> >> Short term I tend to like [2], because it avoids having to mess with all
>> >> such instances to eventually get it right and the temporary overhead
>> >> until we have the code reworked should be really negligible ...
>>
>> Thanks, David. Basically either option looks fine to me. But I'm a
>> little bit concerned about [2]. It silently changes the node requested
>> by the callers. It actually papers over potential bugs? And what if

Hi,

It's just a preferred node, so we do have a node fallback already via the zonelist to other nodes for proper online nodes -- and would have the proper node fallback when preallocating all pgdat. *Not* doing the fallback with a preferred node that is not online could be considered a BUG instead.

Note that [2] was just a quick draft. We might have to do some minor adjustments to handle __GFP_THISNODE properly. But after all, we have:

	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

in __alloc_pages_node() and __folio_alloc_node(). So it might not be worth adjusting at all.

> the callers specify __GFP_THISNODE (I didn't search if such callers
> really exist in the current code)?
>
> How about a helper function, for example, called
> kvmalloc_best_node()? It does:
>
> void *kvmalloc_best_node(unsigned long size, int flag, int nid)
> {
> 	bool onlined = node_online(nid);
>
> 	WARN_ON_ONCE((flag & __GFP_THISNODE) && !onlined);
>
> 	if (!onlined)
> 		nid = -1;
>
> 	return kvmalloc_node(size, GFP_xxx, nid);
> }

We still have to "fix" each and every affected for_each_node() ... code until we have preallocation of pgdat for all possible nodes.
On Mon 06-12-21 20:01:34, David Hildenbrand wrote:
[...]
> Yes, that's what I refer to as fixing it in the caller -- similar to
> [1]. Michals point is to not require such node_online() checks at all,
> neither in the caller nor in the buddy.
>
> I see 2 options short-term
>
> 1) What we have in [1].
> 2) What I proposed in [2], fixing it for all such instances until we
> have something better.
>
> Long term I tend to agree that what Michal proposes is better.
>
> Short term I tend to like [2], because it avoids having to mess with all
> such instances to eventually get it right and the temporary overhead
> until we have the code reworked should be really negligible ...

I do dislike both, but if I were to choose between the two then 2 is surely more targeted. We really do not want to spread this into bulk/pcp or whatever other allocator there is. The problem is that somebody might still try to access NODE_DATA (e.g. via a helper that hides that fact).

Anyway, I am not sure whether the authors of the patch can reproduce the problem and whether they can run a testing code on their machine. If yes, it would be great to try with http://lkml.kernel.org/r/Ya89aqij6nMwJrIZ@dhcp22.suse.cz that I have just sent.

> [1] https://lkml.kernel.org/r/20211108202325.20304-1-amakhalov@vmware.com
> [2] https://lkml.kernel.org/r/51c65635-1dae-6ba4-daf9-db9df0ec35d8@redhat.com
>
> --
> Thanks,
>
> David / dhildenb
On 12/6/21 04:22, Michal Hocko wrote: > On Sun 05-12-21 22:33:38, Nico Pache wrote: >> We have run into a panic caused by a shrinker allocation being attempted >> on an offlined node. >> >> Our crash analysis has determined that the issue originates from trying >> to allocate pages on an offlined node in expand_one_shrinker_info. This >> function makes the incorrect assumption that we can allocate on any node. >> To correct this we make sure we only itterate over online nodes. >> >> This assumption can lead to an incorrect address being assigned to ac->zonelist >> in the following callchain: >> __alloc_pages >> -> prepare_alloc_pages >> -> node_zonelist >> >> static inline struct zonelist *node_zonelist(int nid, gfp_t flags) >> { >> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); >> } >> if the node is not online the return of node_zonelist will evaluate to a >> invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0) >> >> This address is then dereferenced further down the callchain in: >> prepare_alloc_pages >> -> first_zones_zonelist >> -> next_zones_zonelist >> -> zonelist_zone_idx >> >> static inline int zonelist_zone_idx(struct zoneref *zoneref) >> { >> return zoneref->zone_idx; >> } >> >> Leading the system to panic. > > Thanks for the analysis! Please also add an oops report so that this is > easier to search for. It would be also interesting to see specifics > about the issue. Why was the specific node !online in the first place? > What architecture was this on? Here is the Oops report. I will also add it to my commit message on the second posting! This was x86 btw, but it has also been hit on PPC64. 
[ 362.179917] RIP: 0010:prepare_alloc_pages.constprop.0+0xc6/0x150
[ 362.186639] Code: 03 80 c9 80 83 e2 03 83 fa 01 0f 44 c1 41 89 04 24 c1 eb 0c 48 8b 55 08 83 e3 01 8b 75 1c 48 8b 7d 00 88 5d 20 48 85 d2 75 6b <3b> 77 08 72 66 48 89 7d 10 b8 01 00 00 00 48 83 c4 08 5b 5d 41 5c
[ 362.207604] RSP: 0018:ffffb4ba31427bc0 EFLAGS: 00010246
[ 362.213443] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000081
[ 362.221412] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[ 362.229380] RBP: ffffb4ba31427bf8 R08: 0000000000000000 R09: ffffb4ba31427bf4
[ 362.237347] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb4ba31427bf0
[ 362.245316] R13: 0000000000000002 R14: ffff9f2fb3788000 R15: 0000000000000078
[ 362.253285] FS:  00007fbc57bfd740(0000) GS:ffff9f4c7d780000(0000) knlGS:0000000000000000
[ 362.262322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 362.268739] CR2: 0000000000002088 CR3: 000002004cb58002 CR4: 00000000007706e0
[ 362.276707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 362.284675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 362.292644] PKRU: 55555554
[ 362.295669] Call Trace:
[ 362.298402]  __alloc_pages+0x9d/0x210
[ 362.302501]  kmalloc_large_node+0x40/0x90
[ 362.306988]  __kmalloc_node+0x3ac/0x480
[ 362.311279]  kvmalloc_node+0x46/0x80
[ 362.315276]  expand_one_shrinker_info+0x84/0x190
[ 362.320437]  prealloc_shrinker+0x166/0x1c0
[ 362.325015]  alloc_super+0x2ba/0x330
[ 362.329011]  ? __fput_sync+0x30/0x30
[ 362.333003]  ? set_anon_super+0x40/0x40
[ 362.337288]  sget_fc+0x6c/0x2f0
[ 362.340798]  ? mqueue_create+0x20/0x20
[ 362.344992]  get_tree_keyed+0x2f/0xc0
[ 362.349086]  vfs_get_tree+0x25/0xb0
[ 362.352982]  fc_mount+0xe/0x30
[ 362.356397]  mq_init_ns+0x105/0x1a0
[ 362.360291]  copy_ipcs+0x129/0x220
[ 362.364093]  create_new_namespaces+0xa1/0x2e0
[ 362.368960]  unshare_nsproxy_namespaces+0x55/0xa0
[ 362.374216]  ksys_unshare+0x198/0x380
[ 362.378310]  __x64_sys_unshare+0xe/0x20
[ 362.382595]  do_syscall_64+0x3b/0x90
[ 362.386597]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 362.392247] RIP: 0033:0x7fbc57d14d7b
[ 362.396242] Code: 73 01 c3 48 8b 0d ad 70 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 70 0e 00 f7 d8 64 89 01 48
[ 362.417204] RSP: 002b:00007fbc4dc73f88 EFLAGS: 00000206 ORIG_RAX: 0000000000000110
[ 362.425664] RAX: ffffffffffffffda RBX: 0000560144b09578 RCX: 00007fbc57d14d7b
[ 362.433634] RDX: 0000000000000010 RSI: 00007fbc4dc73f90 RDI: 0000000008000000
[ 362.441602] RBP: 0000560144b095a0 R08: 0000000000000000 R09: 0000000000000000
[ 362.449573] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fbc4dc73f90
[ 362.457542] R13: 000000000000006a R14: 0000560144e7e5a0 R15: 00007fff5dec8e10

>
>> We also correct this behavior in alloc_shrinker_info, free_shrinker_info,
>> and reparent_shrinker_deferred.
>>
>> Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
>> Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
>
> Normally I would split the fix as it is fixing two issues one introduced
> in 4.19 the other in 5.13.

These are both commits that introduced the function: one introduced it, while the other moved it to a separate file. But as Yang Shi pointed out, the better commit to blame is 86daf94efb11 ("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node"), which actually made the change that would have caused the allocator to go for an !online node.
> >> Signed-off-by: Nico Pache <npache@redhat.com> >> --- >> mm/vmscan.c | 8 ++++---- >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index fb9584641ac7..731564b61e3f 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg, >> int nid; >> int size = map_size + defer_size; >> >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> pn = memcg->nodeinfo[nid]; >> old = shrinker_info_protected(memcg, nid); >> /* Not yet online memcg */ >> @@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg) >> struct shrinker_info *info; >> int nid; >> >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> pn = memcg->nodeinfo[nid]; >> info = rcu_dereference_protected(pn->shrinker_info, true); >> kvfree(info); >> @@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg) >> map_size = shrinker_map_size(shrinker_nr_max); >> defer_size = shrinker_defer_size(shrinker_nr_max); >> size = map_size + defer_size; >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid); >> if (!info) { >> free_shrinker_info(memcg); >> @@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg) >> >> /* Prevent from concurrent shrinker_info expand */ >> down_read(&shrinker_rwsem); >> - for_each_node(nid) { >> + for_each_online_node(nid) { >> child_info = shrinker_info_protected(memcg, nid); >> parent_info = shrinker_info_protected(parent, nid); >> for (i = 0; i < shrinker_nr_max; i++) { >> -- >> 2.33.1 > > This doesn't seen complete. Slab shrinkers are used in the reclaim > context. Previously offline nodes could be onlined later and this would > lead to NULL ptr because there is no hook to allocate new shrinker > infos. This would be also really impractical because this would have to > update all existing memcgs... 
>
> To be completely honest I am not really sure this is a practical problem
> because some architectures allocate (aka make online) all possible nodes
> reported by the platform. There are major inconsistencies there. Maybe
> that should be unified, so that problems like this one do not really
> have to add a complexity to the code.

I'm currently working on a solution that will use register_one_node to perform the memcg node allocation for all memcgs. I will post that once I've verified all potential corner cases.

Cheers,
-- Nico
On 12/6/21 05:45, David Hildenbrand wrote:
>> This doesn't seen complete. Slab shrinkers are used in the reclaim
>> context. Previously offline nodes could be onlined later and this would
>> lead to NULL ptr because there is no hook to allocate new shrinker
>> infos. This would be also really impractical because this would have to
>> update all existing memcgs...
>
> Instead of going through the trouble of updating...
>
> ... maybe just keep for_each_node() and check if the target node is
> offline. If it's offline, just allocate from the first online node.
> After all, we're not using __GFP_THISNODE, so there are no guarantees
> either way ...

Thanks for your reviews David :)

Yes, I believe this would be a viable solution. At this point it looks like our short-term solution will be something along this line. Long term we would want to either allocate all pgdats (which will waste memory) or update the memcgs with the appropriate node information once a node is onlined. I'm currently working on a solution that does the latter.
On 12/6/21 08:19, Kirill Tkhai wrote:
> On 06.12.2021 13:45, David Hildenbrand wrote:
>>> This doesn't seen complete. Slab shrinkers are used in the reclaim
>>> context. Previously offline nodes could be onlined later and this would
>>> lead to NULL ptr because there is no hook to allocate new shrinker
>>> infos. This would be also really impractical because this would have to
>>> update all existing memcgs...
>>
>> Instead of going through the trouble of updating...
>>
>> ... maybe just keep for_each_node() and check if the target node is
>> offline. If it's offline, just allocate from the first online node.
>> After all, we're not using __GFP_THISNODE, so there are no guarantees
>> either way ...
>
> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
> to this patch?

Thanks for the feedback :) I am currently working on a solution similar to this.
On 12/6/21 08:24, Michal Hocko wrote:
> On Mon 06-12-21 16:19:12, Kirill Tkhai wrote:
>> On 06.12.2021 13:45, David Hildenbrand wrote:
>>>> This doesn't seen complete. Slab shrinkers are used in the reclaim
>>>> context. Previously offline nodes could be onlined later and this would
>>>> lead to NULL ptr because there is no hook to allocate new shrinker
>>>> infos. This would be also really impractical because this would have to
>>>> update all existing memcgs...
>>>
>>> Instead of going through the trouble of updating...
>>>
>>> ... maybe just keep for_each_node() and check if the target node is
>>> offline. If it's offline, just allocate from the first online node.
>>> After all, we're not using __GFP_THISNODE, so there are no guarantees
>>> either way ...
>>
>> Hm, can't we add shrinker maps allocation to __try_online_node() in addition
>> to this patch?
>
> Either that or through hotplug notifier (which would be a better
> solution). But allocating a new shrinker map for each memcg would have
> to be done as has been mentioned earlier.

I took a stab at this approach. It may be incomplete but please let me know what you think. This would go on top of this series.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6..6c842382fa73 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -520,6 +520,7 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
 	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
+int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node);
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set.
diff --git a/include/linux/node.h b/include/linux/node.h
index bb21fd631b16..5e8c737ea751 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -19,7 +19,7 @@
 #include <linux/cpumask.h>
 #include <linux/list.h>
 #include <linux/workqueue.h>
-
+#include <linux/memcontrol.h>
 /**
  * struct node_hmem_attrs - heterogeneous memory performance attributes
  *
@@ -118,6 +118,7 @@ extern int __register_one_node(int nid);
 /* Registers an online node */
 static inline int register_one_node(int nid)
 {
+	struct mem_cgroup *memcg;
 	int error = 0;
 
 	if (node_online(nid)) {
@@ -130,6 +131,14 @@ static inline int register_one_node(int nid)
 			return error;
 		/* link memory sections under this node */
 		link_mem_sections(nid, start_pfn, end_pfn, MEMINIT_EARLY);
+		/* Iterate over memcgs and update nodeinfo */
+		memcg = mem_cgroup_iter(NULL, NULL, NULL);
+		do {
+			if (alloc_mem_cgroup_per_node_info(memcg, nid)) {
+				mem_cgroup_iter_break(NULL, memcg);
+				return error;
+			}
+		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 	}
 
 	return error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6863a834ed42..2d55fad3229b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5041,18 +5041,11 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return idr_find(&mem_cgroup_idr, id);
 }
 
-static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
+int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 	int tmp = node;
-	/*
-	 * This routine is called against possible nodes.
-	 * But it's BUG to call kmalloc() against offline node.
-	 *
-	 * TODO: this routine can waste much memory for nodes which will
-	 * never be onlined. It's better to use memory hotplug callback
-	 * function.
-	 */
+
 	if (!node_state(node, N_NORMAL_MEMORY))
 		tmp = -1;
 	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp);
@@ -5130,7 +5123,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
-	for_each_node(node)
+	for_each_online_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..731564b61e3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -221,7 +221,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 	int nid;
 	int size = map_size + defer_size;
 
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		pn = memcg->nodeinfo[nid];
 		old = shrinker_info_protected(memcg, nid);
 		/* Not yet online memcg */
@@ -256,7 +256,7 @@ void free_shrinker_info(struct mem_cgroup *memcg)
 	struct shrinker_info *info;
 	int nid;
 
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		pn = memcg->nodeinfo[nid];
 		info = rcu_dereference_protected(pn->shrinker_info, true);
 		kvfree(info);
@@ -274,7 +274,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	map_size = shrinker_map_size(shrinker_nr_max);
 	defer_size = shrinker_defer_size(shrinker_nr_max);
 	size = map_size + defer_size;
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
 		if (!info) {
 			free_shrinker_info(memcg);
@@ -417,7 +417,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 
 	/* Prevent from concurrent shrinker_info expand */
 	down_read(&shrinker_rwsem);
-	for_each_node(nid) {
+	for_each_online_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
 		for (i = 0; i < shrinker_nr_max; i++) {
We have run into a panic caused by a shrinker allocation being attempted on an offlined node.

Our crash analysis has determined that the issue originates from trying to allocate pages on an offlined node in expand_one_shrinker_info. This function makes the incorrect assumption that we can allocate on any node. To correct this we make sure we only iterate over online nodes.

This assumption can lead to an incorrect address being assigned to ac->zonelist in the following callchain:
	__alloc_pages
	-> prepare_alloc_pages
	-> node_zonelist

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

If the node is not online, the return of node_zonelist will evaluate to an invalid pointer of 0x00000 + offset_of(node_zonelists) + (1|0)

This address is then dereferenced further down the callchain in:
	prepare_alloc_pages
	-> first_zones_zonelist
	-> next_zones_zonelist
	-> zonelist_zone_idx

static inline int zonelist_zone_idx(struct zoneref *zoneref)
{
	return zoneref->zone_idx;
}

leading the system to panic.

We also correct this behavior in alloc_shrinker_info, free_shrinker_info, and reparent_shrinker_deferred.

Fixes: 2bfd36374edd ("mm: vmscan: consolidate shrinker_maps handling code")
Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/vmscan.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)