diff mbox series

[1/1] arch/x86/mm/numa: Do not initialize nodes twice

Message ID 20220218224302.5282-2-osalvador@suse.de (mailing list archive)
State New
Headers show
Series Fix allocating nodes twice on x86 | expand

Commit Message

Oscar Salvador Feb. 18, 2022, 10:43 p.m. UTC
On x86, prior to ("mm: handle uninitialized numa nodes gracecully"),
NUMA nodes could be allocated at three different places.

- numa_register_memblks
- init_cpu_to_node
- init_gi_nodes

All these calls happen at setup_arch, and have the following order:

setup_arch
  ...
  x86_numa_init
   numa_init
    numa_register_memblks
  ...
  init_cpu_to_node
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node
  init_gi_nodes
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have memory,
so it skips over any memoryless node it founds.
Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node()
and init_gi_nodes(), which initialize any memoryless node we might have
that have either CPU or Initiator affinity, meaning we allocate pg_data_t
struct for them and we mark them as ONLINE.

So far so good, but the thing is that after ("mm: handle uninitialized numa
nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(),
meaning we have a picture like the following:

setup_arch
  x86_numa_init
   numa_init
    numa_register_memblks  <-- allocate non-memoryless node
  x86_init.paging.pagetable_init
   ...
    free_area_init
     free_area_init_memoryless <-- allocate memoryless node
  init_cpu_to_node
   alloc_node_data             <-- allocate memoryless node with CPU
   free_area_init_memoryless_node
  init_gi_nodes
   alloc_node_data             <-- allocate memoryless node with Initiator
   free_area_init_memoryless_node

free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that,
so they go ahead and allocate a new pg_data_t struct without
checking anything, meaning we end up allocating twice.

It should be mad clear that this only happens in the case where
memoryless NUMA node happens to have a CPU/Initiator affinity.

So get rid of init_memory_less_node() and just set the node online.

Note that setting the node online is needed, otherwise we choke
down the chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->... and we blow up in
bus_add_device(). Like can be seen here:

=========
[    0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
[    0.586091] #PF: supervisor read access in kernel mode
[    0.586831] #PF: error_code(0x0000) - not-present page
[    0.586930] PGD 0 P4D 0
[    0.586930] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[    0.586930] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
[    0.586930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
[    0.586930] RIP: 0010:bus_add_device+0x5a/0x140
[    0.586930] Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
[    0.586930] RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
[    0.586930] RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
[    0.586930] RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
[    0.586930] RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
[    0.586930] R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
[    0.586930] R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
[    0.586930] FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
[    0.586930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.586930] CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
[    0.586930] Call Trace:
[    0.586930]  <TASK>
[    0.586930]  device_add+0x4c0/0x910
[    0.586930]  __register_one_node+0x97/0x2d0
[    0.586930]  __try_online_node+0x85/0xc0
[    0.586930]  try_online_node+0x25/0x40
[    0.586930]  cpu_up+0x4f/0x100
[    0.586930]  bringup_nonboot_cpus+0x4f/0x60
[    0.586930]  smp_init+0x26/0x79
[    0.586930]  kernel_init_freeable+0x130/0x2f1
[    0.586930]  ? rest_init+0x100/0x100
[    0.586930]  kernel_init+0x17/0x150
[    0.586930]  ? rest_init+0x100/0x100
[    0.586930]  ret_from_fork+0x22/0x30
[    0.586930]  </TASK>
[    0.586930] Modules linked in:
[    0.586930] CR2: 0000000000000060
[    0.586930] ---[ end trace 0000000000000000 ]---
=========

The reason is simple, by the time bringup_nonboot_cpus() gets called,
we did not register the node_subsys bus yet, so we crash when bus_add_device()
tries to dereference bus()->p.

The following shows the order of the calls:

kernel_init_freeable
 smp_init
  bringup_nonboot_cpus
   ...
     bus_add_device()      <- we did not register node_subsys yet
 do_basic_setup
  do_initcalls
   postcore_initcall(register_node_type);
    register_node_type
     subsys_system_register
      subsys_register
       bus_register         <- register node_subsys bus

Why setting the node online saves us then? Well, simply because
__try_online_node() backs off when the node is online, meaning
we do not end up calling register_one_node() in the first place.

This is subtle, broken and deserves a deep analysis and thought
about how to put this into shape, but for now let us have this
easy fix for the leaking memory issue.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
---
 arch/x86/mm/numa.c | 15 ++-------------
 include/linux/mm.h |  1 -
 mm/page_alloc.c    |  2 +-
 3 files changed, 3 insertions(+), 15 deletions(-)

Comments

Michal Hocko Feb. 21, 2022, 9:20 a.m. UTC | #1
On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> On x86, prior to ("mm: handle uninitialized numa nodes gracecully"),
> NUMA nodes could be allocated at three different places.
> 
> - numa_register_memblks
> - init_cpu_to_node
> - init_gi_nodes
> 
> All these calls happen at setup_arch, and have the following order:
> 
> setup_arch
>   ...
>   x86_numa_init
>    numa_init
>     numa_register_memblks
>   ...
>   init_cpu_to_node
>    init_memory_less_node
>     alloc_node_data
>     free_area_init_memoryless_node
>   init_gi_nodes
>    init_memory_less_node
>     alloc_node_data
>     free_area_init_memoryless_node
> 
> numa_register_memblks() is only interested in those nodes which have memory,
> so it skips over any memoryless node it founds.
> Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node()
> and init_gi_nodes(), which initialize any memoryless node we might have
> that have either CPU or Initiator affinity, meaning we allocate pg_data_t
> struct for them and we mark them as ONLINE.
> 
> So far so good, but the thing is that after ("mm: handle uninitialized numa
> nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(),
> meaning we have a picture like the following:
> 
> setup_arch
>   x86_numa_init
>    numa_init
>     numa_register_memblks  <-- allocate non-memoryless node
>   x86_init.paging.pagetable_init
>    ...
>     free_area_init
>      free_area_init_memoryless <-- allocate memoryless node
>   init_cpu_to_node
>    alloc_node_data             <-- allocate memoryless node with CPU
>    free_area_init_memoryless_node
>   init_gi_nodes
>    alloc_node_data             <-- allocate memoryless node with Initiator
>    free_area_init_memoryless_node

Thanks for diving into this and double checking after me. I misread the
ordering and thought free_area_init is the last one to be called.

> free_area_init() already allocates all possible NUMA nodes, but
> init_cpu_to_node() and init_gi_nodes() are clueless about that,
> so they go ahead and allocate a new pg_data_t struct without
> checking anything, meaning we end up allocating twice.
> 
> It should be mad clear that this only happens in the case where
> memoryless NUMA node happens to have a CPU/Initiator affinity.
> 
> So get rid of init_memory_less_node() and just set the node online.
> 
> Note that setting the node online is needed, otherwise we choke
> down the chain when bringup_nonboot_cpus() ends up calling
> __try_online_node()->register_one_node()->... and we blow up in
> bus_add_device(). Like can be seen here:
> 
> =========
> [    0.585060] BUG: kernel NULL pointer dereference, address: 0000000000000060
> [    0.586091] #PF: supervisor read access in kernel mode
> [    0.586831] #PF: error_code(0x0000) - not-present page
> [    0.586930] PGD 0 P4D 0
> [    0.586930] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
> [    0.586930] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
> [    0.586930] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
> [    0.586930] RIP: 0010:bus_add_device+0x5a/0x140
> [    0.586930] Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
> [    0.586930] RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
> [    0.586930] RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
> [    0.586930] RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
> [    0.586930] RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
> [    0.586930] R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
> [    0.586930] R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
> [    0.586930] FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
> [    0.586930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.586930] CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
> [    0.586930] Call Trace:
> [    0.586930]  <TASK>
> [    0.586930]  device_add+0x4c0/0x910
> [    0.586930]  __register_one_node+0x97/0x2d0
> [    0.586930]  __try_online_node+0x85/0xc0
> [    0.586930]  try_online_node+0x25/0x40
> [    0.586930]  cpu_up+0x4f/0x100
> [    0.586930]  bringup_nonboot_cpus+0x4f/0x60
> [    0.586930]  smp_init+0x26/0x79
> [    0.586930]  kernel_init_freeable+0x130/0x2f1
> [    0.586930]  ? rest_init+0x100/0x100
> [    0.586930]  kernel_init+0x17/0x150
> [    0.586930]  ? rest_init+0x100/0x100
> [    0.586930]  ret_from_fork+0x22/0x30
> [    0.586930]  </TASK>
> [    0.586930] Modules linked in:
> [    0.586930] CR2: 0000000000000060
> [    0.586930] ---[ end trace 0000000000000000 ]---
> =========
> 
> The reason is simple, by the time bringup_nonboot_cpus() gets called,
> we did not register the node_subsys bus yet, so we crash when bus_add_device()
> tries to dereference bus()->p.
> 
> The following shows the order of the calls:
> 
> kernel_init_freeable
>  smp_init
>   bringup_nonboot_cpus
>    ...
>      bus_add_device()      <- we did not register node_subsys yet
>  do_basic_setup
>   do_initcalls
>    postcore_initcall(register_node_type);
>     register_node_type
>      subsys_system_register
>       subsys_register
>        bus_register         <- register node_subsys bus
> 
> Why setting the node online saves us then? Well, simply because
> __try_online_node() backs off when the node is online, meaning
> we do not end up calling register_one_node() in the first place.

This is really a mess and a house built on sand. Thanks for looking into
it and hopefully this can get cleaned up to a saner state.

> This is subtle, broken and deserves a deep analysis and thought
> about how to put this into shape, but for now let us have this
> easy fix for the leaking memory issue.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")

This sha1 is from linux-next very likely so it won't be persistent.
Please drop it.

Other than that
Acked-by: Michal Hocko <mhocko@suse.com>

One nit below

Thanks!

> ---
>  arch/x86/mm/numa.c | 15 ++-------------
>  include/linux/mm.h |  1 -
>  mm/page_alloc.c    |  2 +-
>  3 files changed, 3 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index c6b1213086d6..37039a6af8ae 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -738,17 +738,6 @@ void __init x86_numa_init(void)
>  	numa_init(dummy_numa_init);
>  }
>  
> -static void __init init_memory_less_node(int nid)
> -{
> -	/* Allocate and initialize node data. Memory-less node is now online.*/
> -	alloc_node_data(nid);
> -	free_area_init_memoryless_node(nid);
> -
> -	/*
> -	 * All zonelists will be built later in start_kernel() after per cpu
> -	 * areas are initialized.
> -	 */
> -}
>  
>  /*
>   * A node may exist which has one or more Generic Initiators but no CPUs and no
> @@ -768,7 +757,7 @@ void __init init_gi_nodes(void)
>  
>  	for_each_node_state(nid, N_GENERIC_INITIATOR)
>  		if (!node_online(nid))
> -			init_memory_less_node(nid);
> +			node_set_online(nid);

I would stick a TODO here.
			/*
			 * Exclude this node from 
			 * bringup_nonboot_cpus
			 *  cpu_up
			 *    __try_online_node
			 *      register_one_node
			 * because node_subsys is not initialized yet
			 * TODO remove dependency on node_online()
			 */
>  }
>  
>  /*
> @@ -799,7 +788,7 @@ void __init init_cpu_to_node(void)
>  			continue;
>  
>  		if (!node_online(node))
> -			init_memory_less_node(node);
> +			node_set_online(node);

and here as well.
Oscar Salvador Feb. 21, 2022, 9:47 a.m. UTC | #2
On Mon, Feb 21, 2022 at 10:20:02AM +0100, Michal Hocko wrote:
> On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> > Why setting the node online saves us then? Well, simply because
> > __try_online_node() backs off when the node is online, meaning
> > we do not end up calling register_one_node() in the first place.
> 
> This is really a mess and a house built on sand. Thanks for looking into
> it and hopefully this can get cleaned up to a saner state.

Yes, I am willing to have a deep look into that and see how we can
improve the situation.

> This sha1 is from linux-next very likely so it won't be persistent.
> Please drop it.

Yes, it is. I guess it is fine to not have a "Fixes" tag here, so I will
remove it then.

> I would stick a TODO here.
> 			/*
> 			 * Exclude this node from 
> 			 * bringup_nonboot_cpus
> 			 *  cpu_up
> 			 *    __try_online_node
> 			 *      register_one_node
> 			 * because node_subsys is not initialized yet
> 			 * TODO remove dependency on node_online()
> 			 */

Sure, will do.

Thanks!
Michal Hocko Feb. 21, 2022, 1:32 p.m. UTC | #3
On Mon 21-02-22 10:47:44, Oscar Salvador wrote:
> On Mon, Feb 21, 2022 at 10:20:02AM +0100, Michal Hocko wrote:
> > On Fri 18-02-22 23:43:02, Oscar Salvador wrote:
> > > Why setting the node online saves us then? Well, simply because
> > > __try_online_node() backs off when the node is online, meaning
> > > we do not end up calling register_one_node() in the first place.
> > 
> > This is really a mess and a house built on sand. Thanks for looking into
> > it and hopefully this can get cleaned up to a saner state.
> 
> Yes, I am willing to have a deep look into that and see how we can
> improve the situation.
> 
> > This sha1 is from linux-next very likely so it won't be persistent.
> > Please drop it.
> 
> Yes, it is. I guess it is fine to not have a "Fixes" tag here, so I will
> remove it then.

Normally we use sha in Fixes tag and I am not sure how many scripts we
would confuse if there was no but I guess it is good enough to mention
the patch name in the description. Theoretically we could have folded it
to my patch but I think it would be better to have it separate because
a) it gives a nice overview of the mess we should be dealing with and b)
the original patch would likely be more convoluted than necessary.
diff mbox series

Patch

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c6b1213086d6..37039a6af8ae 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -738,17 +738,6 @@  void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_memoryless_node(nid);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
 
 /*
  * A node may exist which has one or more Generic Initiators but no CPUs and no
@@ -768,7 +757,7 @@  void __init init_gi_nodes(void)
 
 	for_each_node_state(nid, N_GENERIC_INITIATOR)
 		if (!node_online(nid))
-			init_memory_less_node(nid);
+			node_set_online(nid);
 }
 
 /*
@@ -799,7 +788,7 @@  void __init init_cpu_to_node(void)
 			continue;
 
 		if (!node_online(node))
-			init_memory_less_node(node);
+			node_set_online(node);
 
 		numa_set_node(cpu, node);
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..9ff1c4c8449e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2453,7 +2453,6 @@  static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 }
 
 extern void __init pagecache_init(void);
-extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 83da2279be72..967085c1c78a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7698,7 +7698,7 @@  static void __init free_area_init_node(int nid)
 	free_area_init_core(pgdat);
 }
 
-void __init free_area_init_memoryless_node(int nid)
+static void __init free_area_init_memoryless_node(int nid)
 {
 	free_area_init_node(nid);
 }