Message ID | 20180504085311.1240-1-Jonathan.Cameron@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Jonathan,
Thank you for the fix:
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
On Fri 04-05-18 09:53:11, Jonathan Cameron wrote: > The case of a new numa node got missed in avoiding using > the node info from page_struct during hotplug. In this > path we have a call to register_mem_sect_under_node (which allows > us to specify it is hotplug so don't change the node), > via link_mem_sections which unfortunately does not. I have hard time to parse the problem description. Could you be more specific and describe the user visible effect along with steps to trigger the issue?
On Fri, 4 May 2018 18:08:45 +0200 Michal Hocko <mhocko@kernel.org> wrote: > On Fri 04-05-18 09:53:11, Jonathan Cameron wrote: > > The case of a new numa node got missed in avoiding using > > the node info from page_struct during hotplug. In this > > path we have a call to register_mem_sect_under_node (which allows > > us to specify it is hotplug so don't change the node), > > via link_mem_sections which unfortunately does not. > > I have hard time to parse the problem description. Could you be more > specific and describe the user visible effect along with steps to > trigger the issue? Hi Michal, Sure, the result is that (with a new memory only node) we never successfully call register_mem_sect_under_node so don't get the memory associated with the node in sysfs and meminfo for the node doesn't report it. It came up whilst testing some arm64 hotplug patches, but appears to be universal. Whilst I'm triggering it by removing then reinserting memory to a node with no other elements (thus making the node disappear then appear again), it appears it would happen on hotplugging memory where there was none before and it doesn't seem to be related the arm64 patches. These patches call __add_pages (where most of the issue was fixed by Pavel's patch). If there is a node at the time of the __add_pages call then all is well as it calls register_mem_sect_under_node from there with check_nid set to false. Without a node that function returns having not done the sysfs related stuff as there is no node to use. This is expected but it is the resulting path that fails... Exact path to the problem is as follows: mm/memory_hotplug.c : add_memory_resource The node is not online so we enter the if (new_node) twice, on the second such block there is a call to link_mem_sections which calls into drivers/node.c: link_mem_sections which calls drivers/node.c: register_mem_sect_under_node which calls get_nid_for_pfn and keeps trying until the output of that matches the expected node (passed all the way down from add_memory_resource) It is effectively the same fix as the one referred to in the fixes tag just in the code path for a new node where the comments point out we have to rerun the link creation because it will have failed in register_new_memory (as there was no node at the time). (actually that comment is wrong now as we don't have register_new_memory any more it got renamed to hotplug_memory_register in Pavel's patch). Jonathan
On Fri 04-05-18 17:50:51, Jonathan Cameron wrote: [...] > Exact path to the problem is as follows: > > mm/memory_hotplug.c : add_memory_resource > The node is not online so we enter the > if (new_node) twice, on the second such block there is a call to > link_mem_sections which calls into > drivers/node.c: link_mem_sections which calls > drivers/node.c: register_mem_sect_under_node which calls > get_nid_for_pfn and keeps trying until the output of that matches > the expected node (passed all the way down from add_memory_resource) I am sorry but I am still confused. Why don't we create sysfs files from __add_pages __add_section hotplug_memory_register register_mem_sect_under_node The whole sysfs mess just deserves to die and be reworked completely. Creating different pieces here and there is just a recipe for bugs and unreviewable code </rant>
On Thu 10-05-18 14:02:00, Michal Hocko wrote: > On Fri 04-05-18 17:50:51, Jonathan Cameron wrote: > [...] > > Exact path to the problem is as follows: > > > > mm/memory_hotplug.c : add_memory_resource > > The node is not online so we enter the > > if (new_node) twice, on the second such block there is a call to > > link_mem_sections which calls into > > drivers/node.c: link_mem_sections which calls > > drivers/node.c: register_mem_sect_under_node which calls > > get_nid_for_pfn and keeps trying until the output of that matches > > the expected node (passed all the way down from add_memory_resource) > > I am sorry but I am still confused. Why don't we create sysfs files from > __add_pages > __add_section > hotplug_memory_register > register_mem_sect_under_node > ping?
On Thu, May 10, 2018 at 02:02:00PM +0200, Michal Hocko wrote: > On Fri 04-05-18 17:50:51, Jonathan Cameron wrote: > [...] > > Exact path to the problem is as follows: > > > > mm/memory_hotplug.c : add_memory_resource > > The node is not online so we enter the > > if (new_node) twice, on the second such block there is a call to > > link_mem_sections which calls into > > drivers/node.c: link_mem_sections which calls > > drivers/node.c: register_mem_sect_under_node which calls > > get_nid_for_pfn and keeps trying until the output of that matches > > the expected node (passed all the way down from add_memory_resource) > > I am sorry but I am still confused. Why don't we create sysfs files from > __add_pages > __add_section > hotplug_memory_register > register_mem_sect_under_node IIUC the problem is that at the point we are calling register_mem_sect_under_node(), pages are not initialized yet. While walking the pfns in register_mem_sect_under_node(), we might check for the node-id of the pfn if check_nid is true. if (check_nid) { page_nid = get_nid_for_pfn(pfn); if (page_nid < 0) continue; if (page_nid != nid) continue; } I think the problem is in: get_nid_for_pfn()->pfn_to_nid()->page_to_nid() static inline int page_to_nid(const struct page *page) { struct page *p = (struct page *)page; return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK; } We access a field of the page, but these are not initialiazed, so it can contain anything. Because of that we can just get a wrong id, making the loop to not pass the below check. if (check_nid) { page_nid = get_nid_for_pfn(pfn); if (page_nid < 0) continue; if (page_nid != nid) continue; } create_sys_fs ... and we do not carry on creating the sysfs. Oscar Salvador
On Wed 23-05-18 15:54:03, Oscar Salvador wrote: > On Thu, May 10, 2018 at 02:02:00PM +0200, Michal Hocko wrote: > > On Fri 04-05-18 17:50:51, Jonathan Cameron wrote: > > [...] > > > Exact path to the problem is as follows: > > > > > > mm/memory_hotplug.c : add_memory_resource > > > The node is not online so we enter the > > > if (new_node) twice, on the second such block there is a call to > > > link_mem_sections which calls into > > > drivers/node.c: link_mem_sections which calls > > > drivers/node.c: register_mem_sect_under_node which calls > > > get_nid_for_pfn and keeps trying until the output of that matches > > > the expected node (passed all the way down from add_memory_resource) > > > > I am sorry but I am still confused. Why don't we create sysfs files from > > __add_pages > > __add_section > > hotplug_memory_register > > register_mem_sect_under_node > > IIUC the problem is that at the point we are calling register_mem_sect_under_node(), > pages are not initialized yet. Ahh, of course. I keep forgetting the latest hotplug optimizations that we do not initialize even nid for struct pages. Which is the whole point of this patch... Sigh. I think the whole sysfs initialization needs to be refactored to be more sane. The way how we depend on things silently is just not maintainable. Thanks!
On Fri 04-05-18 09:53:11, Jonathan Cameron wrote: > The case of a new numa node got missed in avoiding using > the node info from page_struct during hotplug. In this > path we have a call to register_mem_sect_under_node (which allows > us to specify it is hotplug so don't change the node), > via link_mem_sections which unfortunately does not. > > Fix is to pass check_nid through link_mem_sections as well and > disable it in the new numa node path. > > Note the bug only 'sometimes' manifests depending on what happens to > be in the struct page structures - there are lots of them and it only > needs to match one of them. > > Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug") > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> I believe the proper fix is to refactor the sysfs code because it is just ridiculous. But this should work good enough as the stop gap Acked-by: Michal Hocko <mhocko@suse.com> > --- > drivers/base/node.c | 5 +++-- > include/linux/node.h | 8 +++++--- > mm/memory_hotplug.c | 2 +- > 3 files changed, 9 insertions(+), 6 deletions(-) > > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 7a3a580821e0..a5e821d09656 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -490,7 +490,8 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, > return 0; > } > > -int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) > +int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages, > + bool check_nid) > { > unsigned long end_pfn = start_pfn + nr_pages; > unsigned long pfn; > @@ -514,7 +515,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) > > mem_blk = find_memory_block_hinted(mem_sect, mem_blk); > > - ret = register_mem_sect_under_node(mem_blk, nid, true); > + ret = register_mem_sect_under_node(mem_blk, nid, check_nid); > if (!err) > err = ret; > > diff --git a/include/linux/node.h b/include/linux/node.h > index 41f171861dcc..6d336e38d155 100644 > --- a/include/linux/node.h > +++ b/include/linux/node.h > @@ -32,9 +32,11 @@ extern struct node *node_devices[]; > typedef void (*node_registration_func_t)(struct node *); > > #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA) > -extern int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages); > +extern int link_mem_sections(int nid, unsigned long start_pfn, > + unsigned long nr_pages, bool check_nid); > #else > -static inline int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) > +static inline int link_mem_sections(int nid, unsigned long start_pfn, > + unsigned long nr_pages, bool check_nid) > { > return 0; > } > @@ -57,7 +59,7 @@ static inline int register_one_node(int nid) > if (error) > return error; > /* link memory sections under this node */ > - error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages); > + error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages, true); > } > > return error; > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index f74826cdceea..25982467800b 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1158,7 +1158,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online) > * nodes have to go through register_node. > * TODO clean up this mess. > */ > - ret = link_mem_sections(nid, start_pfn, nr_pages); > + ret = link_mem_sections(nid, start_pfn, nr_pages, false); > register_fail: > /* > * If sysfs file of new node can't create, cpu on the node > -- > 2.16.2 >
On Wed, May 23, 2018 at 04:16:08PM +0200, Michal Hocko wrote: > On Wed 23-05-18 15:54:03, Oscar Salvador wrote: > > On Thu, May 10, 2018 at 02:02:00PM +0200, Michal Hocko wrote: > > > On Fri 04-05-18 17:50:51, Jonathan Cameron wrote: > > > [...] > > > > Exact path to the problem is as follows: > > > > > > > > mm/memory_hotplug.c : add_memory_resource > > > > The node is not online so we enter the > > > > if (new_node) twice, on the second such block there is a call to > > > > link_mem_sections which calls into > > > > drivers/node.c: link_mem_sections which calls > > > > drivers/node.c: register_mem_sect_under_node which calls > > > > get_nid_for_pfn and keeps trying until the output of that matches > > > > the expected node (passed all the way down from add_memory_resource) > > > > > > I am sorry but I am still confused. Why don't we create sysfs files from > > > __add_pages > > > __add_section > > > hotplug_memory_register > > > register_mem_sect_under_node > > > > IIUC the problem is that at the point we are calling register_mem_sect_under_node(), > > pages are not initialized yet. > > Ahh, of course. I keep forgetting the latest hotplug optimizations that > we do not initialize even nid for struct pages. Which is the whole point > of this patch... Sigh. > > I think the whole sysfs initialization needs to be refactored to be more > sane. The way how we depend on things silently is just not maintainable. I will try to work that out. I also want to see if we can get rid of link_mem_sections() since it shares almost all the code with walk_memory_range(). Maybe we can pass register_mem_sect_under_node() as a callback of walk_memory_range(). Oscar Salvador
diff --git a/drivers/base/node.c b/drivers/base/node.c index 7a3a580821e0..a5e821d09656 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -490,7 +490,8 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, return 0; } -int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) +int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages, + bool check_nid) { unsigned long end_pfn = start_pfn + nr_pages; unsigned long pfn; @@ -514,7 +515,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) mem_blk = find_memory_block_hinted(mem_sect, mem_blk); - ret = register_mem_sect_under_node(mem_blk, nid, true); + ret = register_mem_sect_under_node(mem_blk, nid, check_nid); if (!err) err = ret; diff --git a/include/linux/node.h b/include/linux/node.h index 41f171861dcc..6d336e38d155 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -32,9 +32,11 @@ extern struct node *node_devices[]; typedef void (*node_registration_func_t)(struct node *); #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA) -extern int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages); +extern int link_mem_sections(int nid, unsigned long start_pfn, + unsigned long nr_pages, bool check_nid); #else -static inline int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) +static inline int link_mem_sections(int nid, unsigned long start_pfn, + unsigned long nr_pages, bool check_nid) { return 0; } @@ -57,7 +59,7 @@ static inline int register_one_node(int nid) if (error) return error; /* link memory sections under this node */ - error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages); + error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages, true); } return error; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index f74826cdceea..25982467800b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1158,7 +1158,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online) * nodes have to go through register_node. * TODO clean up this mess. */ - ret = link_mem_sections(nid, start_pfn, nr_pages); + ret = link_mem_sections(nid, start_pfn, nr_pages, false); register_fail: /* * If sysfs file of new node can't create, cpu on the node
The case of a new numa node got missed in avoiding using the node info from page_struct during hotplug. In this path we have a call to register_mem_sect_under_node (which allows us to specify it is hotplug so don't change the node), via link_mem_sections which unfortunately does not. Fix is to pass check_nid through link_mem_sections as well and disable it in the new numa node path. Note the bug only 'sometimes' manifests depending on what happens to be in the struct page structures - there are lots of them and it only needs to match one of them. Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug") Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> --- drivers/base/node.c | 5 +++-- include/linux/node.h | 8 +++++--- mm/memory_hotplug.c | 2 +- 3 files changed, 9 insertions(+), 6 deletions(-)