diff mbox

mm/memory_hotplug: Fix leftover use of struct page during hotplug

Message ID 20180504085311.1240-1-Jonathan.Cameron@huawei.com
State New
Headers show

Commit Message

Jonathan Cameron May 4, 2018, 8:53 a.m. UTC
The case of a new numa node got missed in avoiding using
the node info from page_struct during hotplug.  In this
path we have a call to register_mem_sect_under_node (which allows
us to specify it is hotplug so don't change the node),
via link_mem_sections which unfortunately does not.

Fix is to pass check_nid through link_mem_sections as well and
disable it in the new numa node path.

Note the bug only 'sometimes' manifests depending on what happens to
be in the struct page structures - there are lots of them and it only
needs to match one of them.

Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug")
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/base/node.c  | 5 +++--
 include/linux/node.h | 8 +++++---
 mm/memory_hotplug.c  | 2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

Comments

Pavel Tatashin May 4, 2018, 1 p.m. UTC | #1
Hi Jonathan,

Thank you for the fix:
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Michal Hocko May 4, 2018, 4:08 p.m. UTC | #2
On Fri 04-05-18 09:53:11, Jonathan Cameron wrote:
> The case of a new numa node got missed in avoiding using
> the node info from page_struct during hotplug.  In this
> path we have a call to register_mem_sect_under_node (which allows
> us to specify it is hotplug so don't change the node),
> via link_mem_sections which unfortunately does not.

I have hard time to parse the problem description. Could you be more
specific and describe the user visible effect along with steps to
trigger the issue?
Jonathan Cameron May 4, 2018, 4:50 p.m. UTC | #3
On Fri, 4 May 2018 18:08:45 +0200
Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 04-05-18 09:53:11, Jonathan Cameron wrote:
> > The case of a new numa node got missed in avoiding using
> > the node info from page_struct during hotplug.  In this
> > path we have a call to register_mem_sect_under_node (which allows
> > us to specify it is hotplug so don't change the node),
> > via link_mem_sections which unfortunately does not.  
> 
> I have hard time to parse the problem description. Could you be more
> specific and describe the user visible effect along with steps to
> trigger the issue?

Hi Michal,

Sure, the result is that (with a new memory only node) we never
successfully call register_mem_sect_under_node so don't get the
memory associated with the node in sysfs and meminfo for the
node doesn't report it.

It came up whilst testing some arm64 hotplug patches, but appears
to be universal.  Whilst I'm triggering it by removing then reinserting
memory to a node with no other elements (thus making the node disappear
then appear again), it appears it would happen on hotplugging memory
where there was none before and it doesn't seem to be related the
arm64 patches.  These patches call __add_pages (where most of the issue was
fixed by Pavel's patch). If there is a node at the time of the __add_pages
call then all is well as it calls register_mem_sect_under_node from
there with check_nid set to false.  Without a node that function returns
having not done the sysfs related stuff as there is no node to use.
This is expected but it is the resulting path that fails...

Exact path to the problem is as follows:

mm/memory_hotplug.c : add_memory_resource
The node is not online so we enter the
if (new_node) twice, on the second such block there is a call to
link_mem_sections which calls into
drivers/node.c: link_mem_sections which calls
drivers/node.c: register_mem_sect_under_node which calls
get_nid_for_pfn and keeps trying until the output of that matches
the expected node (passed all the way down from add_memory_resource)

It is effectively the same fix as the one referred to in the fixes
tag just in the code path for a new node where the comments point
out we have to rerun the link creation because it will have failed
in register_new_memory (as there was no node at the time).
(actually that comment is wrong now as we don't have register_new_memory
any more it got renamed to hotplug_memory_register in Pavel's patch).

Jonathan
Michal Hocko May 10, 2018, 12:02 p.m. UTC | #4
On Fri 04-05-18 17:50:51, Jonathan Cameron wrote:
[...]
> Exact path to the problem is as follows:
> 
> mm/memory_hotplug.c : add_memory_resource
> The node is not online so we enter the
> if (new_node) twice, on the second such block there is a call to
> link_mem_sections which calls into
> drivers/node.c: link_mem_sections which calls
> drivers/node.c: register_mem_sect_under_node which calls
> get_nid_for_pfn and keeps trying until the output of that matches
> the expected node (passed all the way down from add_memory_resource)

I am sorry but I am still confused. Why don't we create sysfs files from
__add_pages
  __add_section
    hotplug_memory_register
      register_mem_sect_under_node

The whole sysfs mess just deserves to die and be reworked completely.
Creating different pieces here and there is just a recipe for bugs
and unreviewable code </rant>
Michal Hocko May 22, 2018, 12:56 p.m. UTC | #5
On Thu 10-05-18 14:02:00, Michal Hocko wrote:
> On Fri 04-05-18 17:50:51, Jonathan Cameron wrote:
> [...]
> > Exact path to the problem is as follows:
> > 
> > mm/memory_hotplug.c : add_memory_resource
> > The node is not online so we enter the
> > if (new_node) twice, on the second such block there is a call to
> > link_mem_sections which calls into
> > drivers/node.c: link_mem_sections which calls
> > drivers/node.c: register_mem_sect_under_node which calls
> > get_nid_for_pfn and keeps trying until the output of that matches
> > the expected node (passed all the way down from add_memory_resource)
> 
> I am sorry but I am still confused. Why don't we create sysfs files from
> __add_pages
>   __add_section
>     hotplug_memory_register
>       register_mem_sect_under_node
> 

ping?
Oscar Salvador May 23, 2018, 1:54 p.m. UTC | #6
On Thu, May 10, 2018 at 02:02:00PM +0200, Michal Hocko wrote:
> On Fri 04-05-18 17:50:51, Jonathan Cameron wrote:
> [...]
> > Exact path to the problem is as follows:
> > 
> > mm/memory_hotplug.c : add_memory_resource
> > The node is not online so we enter the
> > if (new_node) twice, on the second such block there is a call to
> > link_mem_sections which calls into
> > drivers/node.c: link_mem_sections which calls
> > drivers/node.c: register_mem_sect_under_node which calls
> > get_nid_for_pfn and keeps trying until the output of that matches
> > the expected node (passed all the way down from add_memory_resource)
> 
> I am sorry but I am still confused. Why don't we create sysfs files from
> __add_pages
>   __add_section
>     hotplug_memory_register
>       register_mem_sect_under_node

IIUC the problem is that at the point we are calling register_mem_sect_under_node(),
pages are not initialized yet.

While walking the pfns in register_mem_sect_under_node(),
we might check for the node-id of the pfn if check_nid is true.

if (check_nid) {
	page_nid = get_nid_for_pfn(pfn);
	if (page_nid < 0)
		continue;
	if (page_nid != nid)
		continue;
}

I think the problem is in:

get_nid_for_pfn()->pfn_to_nid()->page_to_nid()

static inline int page_to_nid(const struct page *page)
{
	struct page *p = (struct page *)page;

	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
}

We access a field of the page, but these are not initialiazed, so it can
contain anything.
Because of that we can just get a wrong id, making the loop to not pass the
below check.

if (check_nid) {
        page_nid = get_nid_for_pfn(pfn);
        if (page_nid < 0)
                continue;
        if (page_nid != nid)
                continue;
}

create_sys_fs ...

and we do not carry on creating the sysfs.


Oscar Salvador
Michal Hocko May 23, 2018, 2:16 p.m. UTC | #7
On Wed 23-05-18 15:54:03, Oscar Salvador wrote:
> On Thu, May 10, 2018 at 02:02:00PM +0200, Michal Hocko wrote:
> > On Fri 04-05-18 17:50:51, Jonathan Cameron wrote:
> > [...]
> > > Exact path to the problem is as follows:
> > > 
> > > mm/memory_hotplug.c : add_memory_resource
> > > The node is not online so we enter the
> > > if (new_node) twice, on the second such block there is a call to
> > > link_mem_sections which calls into
> > > drivers/node.c: link_mem_sections which calls
> > > drivers/node.c: register_mem_sect_under_node which calls
> > > get_nid_for_pfn and keeps trying until the output of that matches
> > > the expected node (passed all the way down from add_memory_resource)
> > 
> > I am sorry but I am still confused. Why don't we create sysfs files from
> > __add_pages
> >   __add_section
> >     hotplug_memory_register
> >       register_mem_sect_under_node
> 
> IIUC the problem is that at the point we are calling register_mem_sect_under_node(),
> pages are not initialized yet.

Ahh, of course. I keep forgetting the latest hotplug optimizations that
we do not initialize even nid for struct pages. Which is the whole point
of this patch... Sigh.

I think the whole sysfs initialization needs to be refactored to be more
sane. The way how we depend on things silently is just not maintainable.

Thanks!
Michal Hocko May 23, 2018, 2:17 p.m. UTC | #8
On Fri 04-05-18 09:53:11, Jonathan Cameron wrote:
> The case of a new numa node got missed in avoiding using
> the node info from page_struct during hotplug.  In this
> path we have a call to register_mem_sect_under_node (which allows
> us to specify it is hotplug so don't change the node),
> via link_mem_sections which unfortunately does not.
> 
> Fix is to pass check_nid through link_mem_sections as well and
> disable it in the new numa node path.
> 
> Note the bug only 'sometimes' manifests depending on what happens to
> be in the struct page structures - there are lots of them and it only
> needs to match one of them.
> 
> Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug")
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

I believe the proper fix is to refactor the sysfs code because it is
just ridiculous. But this should work good enough as the stop gap

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  drivers/base/node.c  | 5 +++--
>  include/linux/node.h | 8 +++++---
>  mm/memory_hotplug.c  | 2 +-
>  3 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 7a3a580821e0..a5e821d09656 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -490,7 +490,8 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  	return 0;
>  }
>  
> -int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
> +int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages,
> +		      bool check_nid)
>  {
>  	unsigned long end_pfn = start_pfn + nr_pages;
>  	unsigned long pfn;
> @@ -514,7 +515,7 @@ int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
>  
>  		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
>  
> -		ret = register_mem_sect_under_node(mem_blk, nid, true);
> +		ret = register_mem_sect_under_node(mem_blk, nid, check_nid);
>  		if (!err)
>  			err = ret;
>  
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 41f171861dcc..6d336e38d155 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -32,9 +32,11 @@ extern struct node *node_devices[];
>  typedef  void (*node_registration_func_t)(struct node *);
>  
>  #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA)
> -extern int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages);
> +extern int link_mem_sections(int nid, unsigned long start_pfn,
> +			     unsigned long nr_pages, bool check_nid);
>  #else
> -static inline int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
> +static inline int link_mem_sections(int nid, unsigned long start_pfn,
> +				    unsigned long nr_pages, bool check_nid)
>  {
>  	return 0;
>  }
> @@ -57,7 +59,7 @@ static inline int register_one_node(int nid)
>  		if (error)
>  			return error;
>  		/* link memory sections under this node */
> -		error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages);
> +		error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages, true);
>  	}
>  
>  	return error;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f74826cdceea..25982467800b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1158,7 +1158,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
>  		 * nodes have to go through register_node.
>  		 * TODO clean up this mess.
>  		 */
> -		ret = link_mem_sections(nid, start_pfn, nr_pages);
> +		ret = link_mem_sections(nid, start_pfn, nr_pages, false);
>  register_fail:
>  		/*
>  		 * If sysfs file of new node can't create, cpu on the node
> -- 
> 2.16.2
>
Oscar Salvador May 23, 2018, 2:31 p.m. UTC | #9
On Wed, May 23, 2018 at 04:16:08PM +0200, Michal Hocko wrote:
> On Wed 23-05-18 15:54:03, Oscar Salvador wrote:
> > On Thu, May 10, 2018 at 02:02:00PM +0200, Michal Hocko wrote:
> > > On Fri 04-05-18 17:50:51, Jonathan Cameron wrote:
> > > [...]
> > > > Exact path to the problem is as follows:
> > > > 
> > > > mm/memory_hotplug.c : add_memory_resource
> > > > The node is not online so we enter the
> > > > if (new_node) twice, on the second such block there is a call to
> > > > link_mem_sections which calls into
> > > > drivers/node.c: link_mem_sections which calls
> > > > drivers/node.c: register_mem_sect_under_node which calls
> > > > get_nid_for_pfn and keeps trying until the output of that matches
> > > > the expected node (passed all the way down from add_memory_resource)
> > > 
> > > I am sorry but I am still confused. Why don't we create sysfs files from
> > > __add_pages
> > >   __add_section
> > >     hotplug_memory_register
> > >       register_mem_sect_under_node
> > 
> > IIUC the problem is that at the point we are calling register_mem_sect_under_node(),
> > pages are not initialized yet.
> 
> Ahh, of course. I keep forgetting the latest hotplug optimizations that
> we do not initialize even nid for struct pages. Which is the whole point
> of this patch... Sigh.
> 
> I think the whole sysfs initialization needs to be refactored to be more
> sane. The way how we depend on things silently is just not maintainable.

I will try to work that out.
I also want to see if we can get rid of link_mem_sections() since it shares
almost all the code with walk_memory_range().
Maybe we can pass register_mem_sect_under_node() as a callback of walk_memory_range().

Oscar Salvador
diff mbox

Patch

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7a3a580821e0..a5e821d09656 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -490,7 +490,8 @@  int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 	return 0;
 }
 
-int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
+int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages,
+		      bool check_nid)
 {
 	unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long pfn;
@@ -514,7 +515,7 @@  int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
 
 		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
 
-		ret = register_mem_sect_under_node(mem_blk, nid, true);
+		ret = register_mem_sect_under_node(mem_blk, nid, check_nid);
 		if (!err)
 			err = ret;
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 41f171861dcc..6d336e38d155 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -32,9 +32,11 @@  extern struct node *node_devices[];
 typedef  void (*node_registration_func_t)(struct node *);
 
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA)
-extern int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages);
+extern int link_mem_sections(int nid, unsigned long start_pfn,
+			     unsigned long nr_pages, bool check_nid);
 #else
-static inline int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages)
+static inline int link_mem_sections(int nid, unsigned long start_pfn,
+				    unsigned long nr_pages, bool check_nid)
 {
 	return 0;
 }
@@ -57,7 +59,7 @@  static inline int register_one_node(int nid)
 		if (error)
 			return error;
 		/* link memory sections under this node */
-		error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages);
+		error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages, true);
 	}
 
 	return error;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f74826cdceea..25982467800b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1158,7 +1158,7 @@  int __ref add_memory_resource(int nid, struct resource *res, bool online)
 		 * nodes have to go through register_node.
 		 * TODO clean up this mess.
 		 */
-		ret = link_mem_sections(nid, start_pfn, nr_pages);
+		ret = link_mem_sections(nid, start_pfn, nr_pages, false);
 register_fail:
 		/*
 		 * If sysfs file of new node can't create, cpu on the node