diff mbox series

powerpc/numa: Handle partially initialized numa nodes

Message ID 20220330135123.1868197-1-srikar@linux.vnet.ibm.com (mailing list archive)
State New
Headers show
Series powerpc/numa: Handle partially initialized numa nodes | expand

Commit Message

Srikar Dronamraju March 30, 2022, 1:51 p.m. UTC
With commit 09f49dca570a ("mm: handle uninitialized numa nodes
gracefully") NODE_DATA for even a memoryless/cpuless node is partially
initialized at boot time.

Before onlining the node, current Powerpc code checks for NODE_DATA to
be NULL. However since NODE_DATA is partially initialized, this check
will end up always being false.

This causes hotplugging a CPU to a memoryless/cpuless node to fail.

Before adding CPUs
$ numactl -H
available: 1 nodes (4)
node 4 cpus: 0 1 2 3 4 5 6 7
node 4 size: 97372 MB
node 4 free: 95545 MB
node distances:
node   4
4:  10

$ lparstat
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00

%user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
----- ----- -----    -----    ----- ----- ----- ----- ----- -----
2.66  2.67  0.16    94.51     0.00  0.00  5.33  0.00 67749     0

After hotplugging 32 cores
$ numactl -H
node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130
131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
167 168 169 170 171 172 173 174 175
node 4 size: 97372 MB
node 4 free: 93636 MB
node distances:
node   4
4:  10

$ lparstat
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00

%user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
----- ----- -----    -----    ----- ----- ----- ----- ----- -----
0.04  0.02  0.00    99.94     0.00  0.00  0.06  0.00 1128751     3

As we can see numactl is listing only 8 cores while lparstat is showing
33 cores.

Also dmesg is showing messages like:
[ 2261.318350 ] BUG: arch topology borken
[ 2261.318357 ]      the DIE domain not a subset of the NODE domain

Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Michal Hocko March 30, 2022, 1:59 p.m. UTC | #1
On Wed 30-03-22 19:21:23, Srikar Dronamraju wrote:
> With commit 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully") NODE_DATA for even a memoryless/cpuless node is partially
> initialized at boot time.
> 
> Before onlining the node, current Powerpc code checks for NODE_DATA to
> be NULL. However since NODE_DATA is partially initialized, this check
> will end up always being false.
> 
> This causes hotplugging a CPU to a memoryless/cpuless node to fail.
> 
> Before adding CPUs
> $ numactl -H
> available: 1 nodes (4)
> node 4 cpus: 0 1 2 3 4 5 6 7
> node 4 size: 97372 MB
> node 4 free: 95545 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00
> 
> %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
> ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
> 2.66  2.67  0.16    94.51     0.00  0.00  5.33  0.00 67749     0
> 
> After hotplugging 32 cores
> $ numactl -H
> node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130
> 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
> 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
> 167 168 169 170 171 172 173 174 175
> node 4 size: 97372 MB
> node 4 free: 93636 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00
> 
> %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
> ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
> 0.04  0.02  0.00    99.94     0.00  0.00  0.06  0.00 1128751     3
> 
> As we can see numactl is listing only 8 cores while lparstat is showing
> 33 cores.
> 
> Also dmesg is showing messages like:
> [ 2261.318350 ] BUG: arch topology borken
> [ 2261.318357 ]      the DIE domain not a subset of the NODE domain
> 
> Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..13022d734951 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>  	if (new_nid < 0 || !node_possible(new_nid))
>  		new_nid = first_online_node;
>  
> -	if (NODE_DATA(new_nid) == NULL) {
> +	if (!node_online(new_nid)) {
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  		/*
>  		 * Need to ensure that NODE_DATA is initialized for a node from
> -- 
> 2.27.0
Oscar Salvador March 30, 2022, 2:04 p.m. UTC | #2
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> With commit 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully") NODE_DATA for even a memoryless/cpuless node is partially
> initialized at boot time.
> 
> Before onlining the node, current Powerpc code checks for NODE_DATA to
> be NULL. However since NODE_DATA is partially initialized, this check
> will end up always being false.
> 
> This causes hotplugging a CPU to a memoryless/cpuless node to fail.
> 
> Before adding CPUs
> $ numactl -H
> available: 1 nodes (4)
> node 4 cpus: 0 1 2 3 4 5 6 7
> node 4 size: 97372 MB
> node 4 free: 95545 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00
> 
> %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
> ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
> 2.66  2.67  0.16    94.51     0.00  0.00  5.33  0.00 67749     0
> 
> After hotplugging 32 cores
> $ numactl -H
> node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130
> 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
> 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
> 167 168 169 170 171 172 173 174 175
> node 4 size: 97372 MB
> node 4 free: 93636 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00
> 
> %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
> ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
> 0.04  0.02  0.00    99.94     0.00  0.00  0.06  0.00 1128751     3
> 
> As we can see numactl is listing only 8 cores while lparstat is showing
> 33 cores.
> 
> Also dmesg is showing messages like:
> [ 2261.318350 ] BUG: arch topology borken
> [ 2261.318357 ]      the DIE domain not a subset of the NODE domain
> 
> Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks Srikar!

> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..13022d734951 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>  	if (new_nid < 0 || !node_possible(new_nid))
>  		new_nid = first_online_node;
>  
> -	if (NODE_DATA(new_nid) == NULL) {
> +	if (!node_online(new_nid)) {
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  		/*
>  		 * Need to ensure that NODE_DATA is initialized for a node from
> -- 
> 2.27.0
> 
>
Mike Rapoport April 5, 2022, 4:41 a.m. UTC | #3
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> With commit 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully") NODE_DATA for even a memoryless/cpuless node is partially
> initialized at boot time.
> 
> Before onlining the node, current Powerpc code checks for NODE_DATA to
> be NULL. However since NODE_DATA is partially initialized, this check
> will end up always being false.
> 
> This causes hotplugging a CPU to a memoryless/cpuless node to fail.
> 
> Before adding CPUs
> $ numactl -H
> available: 1 nodes (4)
> node 4 cpus: 0 1 2 3 4 5 6 7
> node 4 size: 97372 MB
> node 4 free: 95545 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00
> 
> %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
> ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
> 2.66  2.67  0.16    94.51     0.00  0.00  5.33  0.00 67749     0
> 
> After hotplugging 32 cores
> $ numactl -H
> node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130
> 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
> 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
> 167 168 169 170 171 172 173 174 175
> node 4 size: 97372 MB
> node 4 free: 93636 MB
> node distances:
> node   4
> 4:  10
> 
> $ lparstat
> System Configuration
> type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00
> 
> %user  %sys %wait    %idle    physc %entc lbusy   app  vcsw phint
> ----- ----- -----    -----    ----- ----- ----- ----- ----- -----
> 0.04  0.02  0.00    99.94     0.00  0.00  0.06  0.00 1128751     3
> 
> As we can see numactl is listing only 8 cores while lparstat is showing
> 33 cores.
> 
> Also dmesg is showing messages like:
> [ 2261.318350 ] BUG: arch topology borken
> [ 2261.318357 ]      the DIE domain not a subset of the NODE domain
> 
> Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully")
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..13022d734951 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>  	if (new_nid < 0 || !node_possible(new_nid))
>  		new_nid = first_online_node;
>  
> -	if (NODE_DATA(new_nid) == NULL) {
> +	if (!node_online(new_nid)) {
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  		/*
>  		 * Need to ensure that NODE_DATA is initialized for a node from
> -- 
> 2.27.0
> 
>
Oscar Salvador April 6, 2022, 4:19 p.m. UTC | #4
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
>  arch/powerpc/mm/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..13022d734951 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>  	if (new_nid < 0 || !node_possible(new_nid))
>  		new_nid = first_online_node;
>  
> -	if (NODE_DATA(new_nid) == NULL) {
> +	if (!node_online(new_nid)) {
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  		/*
>  		 * Need to ensure that NODE_DATA is initialized for a node from

Because of this fix, I wanted to check whether we might have any more fallouts due
to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
as to why powerpc is the only platform that special cases try_online_node(),
while all others rely on cpu_up()->try_online_node() to do the right thing.

So, I had a look.
Let us rewind a bit:

The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").

In there, it says that we have the following picture:

partition_sched_domains
 arch_update_cpu_topology
  numa_update_cpu_topology
   find_and_online_cpu_nid

and by the time find_and_online_cpu_nid() gets called to online the node, it might be
too late as we might have referenced some NODE_DATA() fields.
Note that this happens at a much later stage in cpuhp.

Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
which should allocate-and-online the node and prevent accessing garbage.
But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
(see arch/powerpc/mm/numa.c:mem_topology_setup),
so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.

(e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
 tries to online node0, or whatever old mapping numa<->cpu is there).

So, we have something like:

dlpar_online_cpu
 device_online
  dev->bus->online
   cpu_subsys_online
    cpu_device_up
     cpu_up
      try_online_node (old mapping nid <-> cpu )
      ...
      ...
      cphp_callbacks
       sched_cpu_activate
        cpuset_update_active_cpus
         schedule_work(&cpuset_hotplug_work)
          cpuset_hotplug_work
           partition_sched_domains
            arch_update_cpu_topology
             numa_update_cpu_topology
              find_and_online_cpu_nid (online new_nid)


So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
fixed the issue, but we should not need this special casing at all.

We do already know the numa<->cpu associativity in
dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
update the numa<->cpu mapping, and let the try_online_node() do the right thing.

With this in mind, I came up with the following patch, where I carried out a battery
of tests for CPU hotplug stuff and it worked as expected.
But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
I would like to hear from more familiar people.

The patch is:

From: Oscar Salvador <osalvador@suse.de>
Date: Wed, 6 Apr 2022 14:39:15 +0200
Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier

powerpc is the only platform that do not rely on
cpu_up()->try_online_node() to bring up a numa node,
and special cases it, instead, deep in its own machinery:

dlpar_online_cpu
 find_and_online_cpu_nid
  try_online_node

This should not be needed, but the thing is that the try_online_node()
from cpu_up() will not apply on the right node, because cpu_to_node()
will return the old mapping numa<->cpu that gets set on boot stage
for all possible cpus.

That can be seen easily if we try to print out the numa node passed
to try_online_node() in cpu_up().

The thing is that the numa<->cpu mapping does not get updated till a much
later stage in start_secondary:

start_secondary:
 set_numa_node(numa_cpu_lookup_table[cpu])

But we do not really care, as we already now the
CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
so let us make use of that and set the proper numa<->cpu mapping,
so cpu_to_node() in cpu_up() returns the right node and
try_online_node() can do its work.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/powerpc/include/asm/topology.h          |  8 ++-----
 arch/powerpc/mm/numa.c                       | 31 +++++++---------------------
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 +-
 3 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 36fcafb1fd6d..6ae1b2dce83e 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {}
 #endif /* CONFIG_NUMA */

 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
-extern int find_and_online_cpu_nid(int cpu);
+extern void find_and_update_cpu_nid(int cpu);
 extern int cpu_to_coregroup_id(int cpu);
 #else
-static inline int find_and_online_cpu_nid(int cpu)
-{
-	return 0;
-}
-
+static inline void find_and_update_cpu_nid(int cpu) {}
 static inline int cpu_to_coregroup_id(int cpu)
 {
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b9b7fefbb64b..b5bc8b1a833d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu,
 	return rc;
 }

-int find_and_online_cpu_nid(int cpu)
+void find_and_update_cpu_nid(int cpu)
 {
 	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
 	int new_nid;

 	/* Use associativity from first thread for all siblings */
 	if (vphn_get_associativity(cpu, associativity))
-		return cpu_to_node(cpu);
+		return;

+	/* Do not have previous associativity, so find it now. */
 	new_nid = associativity_to_nid(associativity);
+
 	if (new_nid < 0 || !node_possible(new_nid))
 		new_nid = first_online_node;
-
-	if (NODE_DATA(new_nid) == NULL) {
-#ifdef CONFIG_MEMORY_HOTPLUG
-		/*
-		 * Need to ensure that NODE_DATA is initialized for a node from
-		 * available memory (see memblock_alloc_try_nid). If unable to
-		 * init the node, then default to nearest node that has memory
-		 * installed. Skip onlining a node if the subsystems are not
-		 * yet initialized.
-		 */
-		if (!topology_inited || try_online_node(new_nid))
-			new_nid = first_online_node;
-#else
-		/*
-		 * Default to using the nearest node that has memory installed.
-		 * Otherwise, it would be necessary to patch the kernel MM code
-		 * to deal with more memoryless-node error conditions.
+	else
+		/* Associate node <-> cpu, so cpu_up() calls
+		 * try_online_node() on the right node.
 		 */
-		new_nid = first_online_node;
-#endif
-	}
+		set_cpu_numa_node(cpu, new_nid);

 	pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__,
 		cpu, new_nid);
-	return new_nid;
 }

 int cpu_to_coregroup_id(int cpu)
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index b81fc846d99c..0f8cd8b06432 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn)
 			if (get_hard_smp_processor_id(cpu) != thread)
 				continue;
 			cpu_maps_update_done();
-			find_and_online_cpu_nid(cpu);
+			find_and_update_cpu_nid(cpu);
 			rc = device_online(get_cpu_device(cpu));
 			if (rc) {
 				dlpar_offline_cpu(dn);
--
2.16.4
Srikar Dronamraju April 8, 2022, 12:25 p.m. UTC | #5
* Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]:

> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
> >  arch/powerpc/mm/numa.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index b9b7fefbb64b..13022d734951 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
> >  	if (new_nid < 0 || !node_possible(new_nid))
> >  		new_nid = first_online_node;
> >  
> > -	if (NODE_DATA(new_nid) == NULL) {
> > +	if (!node_online(new_nid)) {
> >  #ifdef CONFIG_MEMORY_HOTPLUG
> >  		/*
> >  		 * Need to ensure that NODE_DATA is initialized for a node from
> 
> Because of this fix, I wanted to check whether we might have any more fallouts due
> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
> as to why powerpc is the only platform that special cases try_online_node(),
> while all others rely on cpu_up()->try_online_node() to do the right thing.
> 
> So, I had a look.
> Let us rewind a bit:
> 
> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
> 
> In there, it says that we have the following picture:
> 
> partition_sched_domains
>  arch_update_cpu_topology
>   numa_update_cpu_topology
>    find_and_online_cpu_nid
> 
> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
> too late as we might have referenced some NODE_DATA() fields.
> Note that this happens at a much later stage in cpuhp.
> 
> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
> which should allocate-and-online the node and prevent accessing garbage.
> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
> (see arch/powerpc/mm/numa.c:mem_topology_setup),
> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
> 
> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
>  tries to online node0, or whatever old mapping numa<->cpu is there).
> 
> So, we have something like:
> 
> dlpar_online_cpu
>  device_online
>   dev->bus->online
>    cpu_subsys_online
>     cpu_device_up
>      cpu_up
>       try_online_node (old mapping nid <-> cpu )
>       ...
>       ...
>       cphp_callbacks
>        sched_cpu_activate
>         cpuset_update_active_cpus
>          schedule_work(&cpuset_hotplug_work)
>           cpuset_hotplug_work
>            partition_sched_domains
>             arch_update_cpu_topology
>              numa_update_cpu_topology
>               find_and_online_cpu_nid (online new_nid)
> 
> 
> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
> fixed the issue, but we should not need this special casing at all.
> 
> We do already know the numa<->cpu associativity in
> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
> 
> With this in mind, I came up with the following patch, where I carried out a battery
> of tests for CPU hotplug stuff and it worked as expected.
> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
> I would like to hear from more familiar people.
> 
> The patch is:
> 
> From: Oscar Salvador <osalvador@suse.de>
> Date: Wed, 6 Apr 2022 14:39:15 +0200
> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
> 
> powerpc is the only platform that do not rely on
> cpu_up()->try_online_node() to bring up a numa node,
> and special cases it, instead, deep in its own machinery:
> 
> dlpar_online_cpu
>  find_and_online_cpu_nid
>   try_online_node
> 
> This should not be needed, but the thing is that the try_online_node()
> from cpu_up() will not apply on the right node, because cpu_to_node()
> will return the old mapping numa<->cpu that gets set on boot stage
> for all possible cpus.
> 
> That can be seen easily if we try to print out the numa node passed
> to try_online_node() in cpu_up().
> 
> The thing is that the numa<->cpu mapping does not get updated till a much
> later stage in start_secondary:
> 
> start_secondary:
>  set_numa_node(numa_cpu_lookup_table[cpu])
> 
> But we do not really care, as we already now the
> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
> so let us make use of that and set the proper numa<->cpu mapping,
> so cpu_to_node() in cpu_up() returns the right node and
> try_online_node() can do its work.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>


> ---
>  arch/powerpc/include/asm/topology.h          |  8 ++-----
>  arch/powerpc/mm/numa.c                       | 31 +++++++---------------------
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 +-
>  3 files changed, 11 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 36fcafb1fd6d..6ae1b2dce83e 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {}
>  #endif /* CONFIG_NUMA */
> 
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
> -extern int find_and_online_cpu_nid(int cpu);
> +extern void find_and_update_cpu_nid(int cpu);
>  extern int cpu_to_coregroup_id(int cpu);
>  #else
> -static inline int find_and_online_cpu_nid(int cpu)
> -{
> -	return 0;
> -}
> -
> +static inline void find_and_update_cpu_nid(int cpu) {}
>  static inline int cpu_to_coregroup_id(int cpu)
>  {
>  #ifdef CONFIG_SMP
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b9b7fefbb64b..b5bc8b1a833d 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu,
>  	return rc;
>  }
> 
> -int find_and_online_cpu_nid(int cpu)
> +void find_and_update_cpu_nid(int cpu)
>  {
>  	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
>  	int new_nid;
> 
>  	/* Use associativity from first thread for all siblings */
>  	if (vphn_get_associativity(cpu, associativity))
> -		return cpu_to_node(cpu);
> +		return;
> 
> +	/* Do not have previous associativity, so find it now. */
>  	new_nid = associativity_to_nid(associativity);
> +
>  	if (new_nid < 0 || !node_possible(new_nid))
>  		new_nid = first_online_node;
> -
> -	if (NODE_DATA(new_nid) == NULL) {
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -		/*
> -		 * Need to ensure that NODE_DATA is initialized for a node from
> -		 * available memory (see memblock_alloc_try_nid). If unable to
> -		 * init the node, then default to nearest node that has memory
> -		 * installed. Skip onlining a node if the subsystems are not
> -		 * yet initialized.
> -		 */
> -		if (!topology_inited || try_online_node(new_nid))
> -			new_nid = first_online_node;
> -#else
> -		/*
> -		 * Default to using the nearest node that has memory installed.
> -		 * Otherwise, it would be necessary to patch the kernel MM code
> -		 * to deal with more memoryless-node error conditions.
> +	else
> +		/* Associate node <-> cpu, so cpu_up() calls
> +		 * try_online_node() on the right node.
>  		 */
> -		new_nid = first_online_node;
> -#endif
> -	}
> +		set_cpu_numa_node(cpu, new_nid);
> 
>  	pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__,
>  		cpu, new_nid);
> -	return new_nid;
>  }
> 
>  int cpu_to_coregroup_id(int cpu)
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index b81fc846d99c..0f8cd8b06432 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn)
>  			if (get_hard_smp_processor_id(cpu) != thread)
>  				continue;
>  			cpu_maps_update_done();
> -			find_and_online_cpu_nid(cpu);
> +			find_and_update_cpu_nid(cpu);
>  			rc = device_online(get_cpu_device(cpu));
>  			if (rc) {
>  				dlpar_offline_cpu(dn);
> --
> 2.16.4
> 
> -- 
> Oscar Salvador
> SUSE Labs
Michael Ellerman April 10, 2022, 11:28 a.m. UTC | #6
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]:
>
>> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote:
>> >  arch/powerpc/mm/numa.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> > index b9b7fefbb64b..13022d734951 100644
>> > --- a/arch/powerpc/mm/numa.c
>> > +++ b/arch/powerpc/mm/numa.c
>> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu)
>> >  	if (new_nid < 0 || !node_possible(new_nid))
>> >  		new_nid = first_online_node;
>> >  
>> > -	if (NODE_DATA(new_nid) == NULL) {
>> > +	if (!node_online(new_nid)) {
>> >  #ifdef CONFIG_MEMORY_HOTPLUG
>> >  		/*
>> >  		 * Need to ensure that NODE_DATA is initialized for a node from
>> 
>> Because of this fix, I wanted to check whether we might have any more fallouts due
>> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer
>> as to why powerpc is the only platform that special cases try_online_node(),
>> while all others rely on cpu_up()->try_online_node() to do the right thing.
>> 
>> So, I had a look.
>> Let us rewind a bit:
>> 
>> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was
>> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes").
>> 
>> In there, it says that we have the following picture:
>> 
>> partition_sched_domains
>>  arch_update_cpu_topology
>>   numa_update_cpu_topology
>>    find_and_online_cpu_nid
>> 
>> and by the time find_and_online_cpu_nid() gets called to online the node, it might be
>> too late as we might have referenced some NODE_DATA() fields.
>> Note that this happens at a much later stage in cpuhp.
>> 
>> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(),
>> which should allocate-and-online the node and prevent accessing garbage.
>> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage
>> (see arch/powerpc/mm/numa.c:mem_topology_setup),
>> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in
>> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not
>> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu.
>> 
>> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node()
>>  tries to online node0, or whatever old mapping numa<->cpu is there).
>> 
>> So, we have something like:
>> 
>> dlpar_online_cpu
>>  device_online
>>   dev->bus->online
>>    cpu_subsys_online
>>     cpu_device_up
>>      cpu_up
>>       try_online_node (old mapping nid <-> cpu )
>>       ...
>>       ...
>>       cphp_callbacks
>>        sched_cpu_activate
>>         cpuset_update_active_cpus
>>          schedule_work(&cpuset_hotplug_work)
>>           cpuset_hotplug_work
>>            partition_sched_domains
>>             arch_update_cpu_topology
>>              numa_update_cpu_topology
>>               find_and_online_cpu_nid (online new_nid)
>> 
>> 
>> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid()
>> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu()
>> fixed the issue, but we should not need this special casing at all.
>> 
>> We do already know the numa<->cpu associativity in
>> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to
>> update the numa<->cpu mapping, and let the try_online_node() do the right thing.
>> 
>> With this in mind, I came up with the following patch, where I carried out a battery
>> of tests for CPU hotplug stuff and it worked as expected.
>> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so
>> I would like to hear from more familiar people.
>> 
>> The patch is:
>> 
>> From: Oscar Salvador <osalvador@suse.de>
>> Date: Wed, 6 Apr 2022 14:39:15 +0200
>> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
>> 
>> powerpc is the only platform that do not rely on
>> cpu_up()->try_online_node() to bring up a numa node,
>> and special cases it, instead, deep in its own machinery:
>> 
>> dlpar_online_cpu
>>  find_and_online_cpu_nid
>>   try_online_node
>> 
>> This should not be needed, but the thing is that the try_online_node()
>> from cpu_up() will not apply on the right node, because cpu_to_node()
>> will return the old mapping numa<->cpu that gets set on boot stage
>> for all possible cpus.
>> 
>> That can be seen easily if we try to print out the numa node passed
>> to try_online_node() in cpu_up().
>> 
>> The thing is that the numa<->cpu mapping does not get updated till a much
>> later stage in start_secondary:
>> 
>> start_secondary:
>>  set_numa_node(numa_cpu_lookup_table[cpu])
>> 
>> But we do not really care, as we already now the
>> CPU <-> NUMA associativity back in find_and_online_cpu_nid(),
>> so let us make use of that and set the proper numa<->cpu mapping,
>> so cpu_to_node() in cpu_up() returns the right node and
>> try_online_node() can do its work.
>> 
>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>
> Looks good to me.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Yeah agreed, thanks for getting to the root of the problem.

Can you resend as a standalone patch. Because you sent it as a reply it
won't be recognised by patchwork[1] which means it risks getting lost.

cheers

1: http://patchwork.ozlabs.org/project/linuxppc-dev/list/
Michael Ellerman April 10, 2022, 12:27 p.m. UTC | #7
On Wed, 30 Mar 2022 19:21:23 +0530, Srikar Dronamraju wrote:
> With commit 09f49dca570a ("mm: handle uninitialized numa nodes
> gracefully") NODE_DATA for even a memoryless/cpuless node is partially
> initialized at boot time.
> 
> Before onlining the node, current Powerpc code checks for NODE_DATA to
> be NULL. However since NODE_DATA is partially initialized, this check
> will end up always being false.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/numa: Handle partially initialized numa nodes
      https://git.kernel.org/powerpc/c/e4ff77598a109bd36789ad5e80aba66fc53d0ffb

cheers
Oscar Salvador April 11, 2022, 8 a.m. UTC | #8
On Sun, Apr 10, 2022 at 09:28:38PM +1000, Michael Ellerman wrote:
> Yeah agreed, thanks for getting to the root of the problem.
> 
> Can you resend as a standalone patch. Because you sent it as a reply it
> won't be recognised by patchwork[1] which means it risks getting lost.

Hi Michael,

It's done [1].

thanks!

[1] http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220411074934.4632-1-osalvador@suse.de/
diff mbox series

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b9b7fefbb64b..13022d734951 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1436,7 +1436,7 @@  int find_and_online_cpu_nid(int cpu)
 	if (new_nid < 0 || !node_possible(new_nid))
 		new_nid = first_online_node;
 
-	if (NODE_DATA(new_nid) == NULL) {
+	if (!node_online(new_nid)) {
 #ifdef CONFIG_MEMORY_HOTPLUG
 		/*
 		 * Need to ensure that NODE_DATA is initialized for a node from