diff mbox series

[v3,3/4] mm/memory_hotplug: Refactor unregister_mem_sect_under_nodes

Message ID 20180815144219.6014-4-osalvador@techadventures.net (mailing list archive)
State New, archived
Headers show
Series Refactoring for remove_memory_section/unregister_mem_sect_under_nodes | expand

Commit Message

Oscar Salvador Aug. 15, 2018, 2:42 p.m. UTC
From: Oscar Salvador <osalvador@suse.de>

unregister_mem_sect_under_nodes() tries to allocate a nodemask_t
in order to check whithin the loop which nodes have already been unlinked,
so we do not repeat the operation on them.

NODEMASK_ALLOC calls kmalloc() if NODES_SHIFT > 8, otherwise
it just declares a nodemask_t variable whithin the stack.

Since kamlloc() can fail, we actually check whether NODEMASK_ALLOC failed
or not, and we return -ENOMEM accordingly.
remove_memory_section() does not check for the return value though.

The problem with this is that if we return -ENOMEM, it means that
unregister_mem_sect_under_nodes will not be able to remove the symlinks,
but since we do not check the return value, we go ahead and we call
unregister_memory(), which will remove all the mem_blks directories.

This will leave us with dangled symlinks.

The easiest way to overcome this is to fallback by calling
sysfs_remove_link() unconditionally in case NODEMASK_ALLOC failed.
This means that we will call sysfs_remove_link on nodes that have been
already unlinked, but nothing wrong happens as sysfs_remove_link()
backs off somewhere down the chain in case the link has already been
removed.

I think that this is better than

a) dangled symlinks
b) having to recovery from such error in remove_memory_section

Since from now on we will not need to care about return values, we can make
the function void.

As we have a safe fallback, one thing that could also be done is to add
__GFP_NORETRY in the flags when calling NODEMASK_ALLOC, so we do not retry.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/node.c  | 23 +++++++++++++++--------
 include/linux/node.h |  5 ++---
 2 files changed, 17 insertions(+), 11 deletions(-)

Comments

Andrew Morton Aug. 15, 2018, 10:01 p.m. UTC | #1
On Wed, 15 Aug 2018 16:42:18 +0200 Oscar Salvador <osalvador@techadventures.net> wrote:

> From: Oscar Salvador <osalvador@suse.de>
> 
> unregister_mem_sect_under_nodes() tries to allocate a nodemask_t
> in order to check whithin the loop which nodes have already been unlinked,
> so we do not repeat the operation on them.
> 
> NODEMASK_ALLOC calls kmalloc() if NODES_SHIFT > 8, otherwise
> it just declares a nodemask_t variable whithin the stack.
> 
> Since kamlloc() can fail, we actually check whether NODEMASK_ALLOC failed
> or not, and we return -ENOMEM accordingly.
> remove_memory_section() does not check for the return value though.
> 
> The problem with this is that if we return -ENOMEM, it means that
> unregister_mem_sect_under_nodes will not be able to remove the symlinks,
> but since we do not check the return value, we go ahead and we call
> unregister_memory(), which will remove all the mem_blks directories.
> 
> This will leave us with dangled symlinks.
> 
> The easiest way to overcome this is to fallback by calling
> sysfs_remove_link() unconditionally in case NODEMASK_ALLOC failed.
> This means that we will call sysfs_remove_link on nodes that have been
> already unlinked, but nothing wrong happens as sysfs_remove_link()
> backs off somewhere down the chain in case the link has already been
> removed.
> 
> I think that this is better than
> 
> a) dangled symlinks
> b) having to recovery from such error in remove_memory_section
> 
> Since from now on we will not need to care about return values, we can make
> the function void.
> 
> As we have a safe fallback, one thing that could also be done is to add
> __GFP_NORETRY in the flags when calling NODEMASK_ALLOC, so we do not retry.
> 

Oh boy, lots of things.

That small GFP_KERNEL allocation will basically never fail.  In the
exceedingly-rare-basically-never-happens case, simply bailing out of
unregister_mem_sect_under_nodes() seems acceptable.  But I guess that
addressing it is a reasonable thing to do, if it can be done sanely.

But given that your new unregister_mem_sect_under_nodes() can proceed
happily even if the allocation failed, what's the point in allocating
the nodemask at all?  Why not just go ahead and run sysfs_remove_link()
against possibly-absent sysfs objects every time?  That produces
simpler code and avoids having this basically-never-executed-or-tested
code path in there.

Incidentally, do we have locking in place to prevent
unregister_mem_sect_under_nodes() from accidentally removing sysfs
nodes which were added 2 nanoseconds ago by a concurrent thread?

Also, this stuff in nodemask.h:

: /*
:  * For nodemask scrach area.
:  * NODEMASK_ALLOC(type, name) allocates an object with a specified type and
:  * name.
:  */
: #if NODES_SHIFT > 8 /* nodemask_t > 256 bytes */
: #define NODEMASK_ALLOC(type, name, gfp_flags)	\
:			type *name = kmalloc(sizeof(*name), gfp_flags)
: #define NODEMASK_FREE(m)			kfree(m)
: #else
: #define NODEMASK_ALLOC(type, name, gfp_flags)	type _##name, *name = &_##name
: #define NODEMASK_FREE(m)			do {} while (0)
: #endif

a) s/scrach/scratch/

b) The comment is wrong, isn't it?  "NODES_SHIFT > 8" means
   "nodemask_t > 32 bytes"?

c) If "yes" then we can surely bump that up a bit - "NODES_SHIFT >
   11", say.

d) What's the maximum number of nodes, ever?  Perhaps we can always
   fit a nodemask_t onto the stack, dunno.
Oscar Salvador Aug. 16, 2018, 7:48 a.m. UTC | #2
On Wed, Aug 15, 2018 at 03:01:21PM -0700, Andrew Morton wrote:
> Oh boy, lots of things.
> 
> That small GFP_KERNEL allocation will basically never fail.  In the
> exceedingly-rare-basically-never-happens case, simply bailing out of
> unregister_mem_sect_under_nodes() seems acceptable.  But I guess that
> addressing it is a reasonable thing to do, if it can be done sanely.

Yes, I think this can be fixed as the patch showed.
Currently, if we bail out, we will have dangled symlinks, but we do not
really need to bail out, as we can proceed anyway.

> But given that your new unregister_mem_sect_under_nodes() can proceed
> happily even if the allocation failed, what's the point in allocating
> the nodemask at all?  Why not just go ahead and run sysfs_remove_link()
> against possibly-absent sysfs objects every time?  That produces
> simpler code and avoids having this basically-never-executed-or-tested
> code path in there.

Unless I am mistaken, the whole point to allocate a nodemask_t there is to
try to perform as less operations as possible.
If we already unlinked a link, let us not call syfs_remove_link again,
although doing it more than once on the same node is not harmful.
I will have a small impact in performance though, as we will repeat
operations.

Of course we can get rid of the nodemask_t and just call syfs_remove_link,
but I wonder if that is a bit suboptimal.

> Incidentally, do we have locking in place to prevent
> unregister_mem_sect_under_nodes() from accidentally removing sysfs
> nodes which were added 2 nanoseconds ago by a concurrent thread?

Well, remove_memory() and  add_memory_resource() is being serialized with
mem_hotplug_begin()/mem_hotplug_done().

Since registering node's on mem_blk's is done in add_memory_resource(),
and unregistering them it is done in remove_memory() patch, I think they
cannot step on each other's feet.

Although, I saw that remove_memory_section() takes mem_sysfs_mutex.
I wonder if we should do the same in link_mem_sections(), just to be on the
safe side.

> Also, this stuff in nodemask.h:
> 
> : /*
> :  * For nodemask scrach area.
> :  * NODEMASK_ALLOC(type, name) allocates an object with a specified type and
> :  * name.
> :  */
> : #if NODES_SHIFT > 8 /* nodemask_t > 256 bytes */
> : #define NODEMASK_ALLOC(type, name, gfp_flags)	\
> :			type *name = kmalloc(sizeof(*name), gfp_flags)
> : #define NODEMASK_FREE(m)			kfree(m)
> : #else
> : #define NODEMASK_ALLOC(type, name, gfp_flags)	type _##name, *name = &_##name
> : #define NODEMASK_FREE(m)			do {} while (0)
> : #endif
> 
> a) s/scrach/scratch/
> 
> b) The comment is wrong, isn't it?  "NODES_SHIFT > 8" means
>    "nodemask_t > 32 bytes"?

It is wrong yes.
For example, if NODES_SHIFT = 9, that makes 64 bytes.

> c) If "yes" then we can surely bump that up a bit - "NODES_SHIFT >
>    11", say.

I checked all architectures that define NODES_SHIFT in Kconfig.
The maximum we can get is NODES_SHIFT = 10, and this makes 128 bytes.

> d) What's the maximum number of nodes, ever?  Perhaps we can always
>    fit a nodemask_t onto the stack, dunno.

Right now, we define the maximum as NODES_SHIFT = 10, so:

1 << 10 = 1024 Maximum nodes.

Since this makes only 128 bytes, I wonder if we can just go ahead and define a nodemask_t
whithin the stack.
128 bytes is not that much, is it?

Thanks
Pasha Tatashin Aug. 16, 2018, 6:22 p.m. UTC | #3
> > d) What's the maximum number of nodes, ever?  Perhaps we can always
> >    fit a nodemask_t onto the stack, dunno.
> 
> Right now, we define the maximum as NODES_SHIFT = 10, so:
> 
> 1 << 10 = 1024 Maximum nodes.
> 
> Since this makes only 128 bytes, I wonder if we can just go ahead and define a nodemask_t
> whithin the stack.
> 128 bytes is not that much, is it?

Yeah, sue stack here, 128b is tiny. This also will solve Andrew's point of having an untested path when alloc fails, and simplify the patch overall.

Thank you,
Pavel
diff mbox series

Patch

diff --git a/drivers/base/node.c b/drivers/base/node.c
index dd3bdab230b2..81b27b5b1f15 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -449,35 +449,42 @@  int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 }
 
 /* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 				    unsigned long phys_index)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
 
-	if (!unlinked_nodes)
-		return -ENOMEM;
-	nodes_clear(*unlinked_nodes);
+	if (unlinked_nodes)
+		nodes_clear(*unlinked_nodes);
 
 	sect_start_pfn = section_nr_to_pfn(phys_index);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
+		int nid = get_nid_for_pfn(pfn);
 
-		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)
 			continue;
 		if (!node_online(nid))
 			continue;
-		if (node_test_and_set(nid, *unlinked_nodes))
+		/*
+		 * It is possible that NODEMASK_ALLOC fails due to memory
+		 * pressure.
+		 * If that happens, we fallback to call sysfs_remove_link
+		 * unconditionally.
+		 * Nothing wrong will happen as sysfs_remove_link will back off
+		 * somewhere down the chain in case the link has already been
+		 * removed.
+		 */
+		if (unlinked_nodes && node_test_and_set(nid, *unlinked_nodes))
 			continue;
+
 		sysfs_remove_link(&node_devices[nid]->dev.kobj,
 			 kobject_name(&mem_blk->dev.kobj));
 		sysfs_remove_link(&mem_blk->dev.kobj,
 			 kobject_name(&node_devices[nid]->dev.kobj));
 	}
 	NODEMASK_FREE(unlinked_nodes);
-	return 0;
 }
 
 int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..1203378e596a 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -72,7 +72,7 @@  extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+extern void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 					   unsigned long phys_index);
 
 #ifdef CONFIG_HUGETLBFS
@@ -105,10 +105,9 @@  static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+static inline void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 						  unsigned long phys_index)
 {
-	return 0;
 }
 
 static inline void register_hugetlbfs_with_node(node_registration_func_t reg,