Message ID | 20220330135123.1868197-1-srikar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | powerpc/numa: Handle partially initialized numa nodes | expand |
On Wed 30-03-22 19:21:23, Srikar Dronamraju wrote: > With commit 09f49dca570a ("mm: handle uninitialized numa nodes > gracefully") NODE_DATA for even a memoryless/cpuless node is partially > initialized at boot time. > > Before onlining the node, current Powerpc code checks for NODE_DATA to > be NULL. However since NODE_DATA is partially initialized, this check > will end up always being false. > > This causes hotplugging a CPU to a memoryless/cpuless node to fail. > > Before adding CPUs > $ numactl -H > available: 1 nodes (4) > node 4 cpus: 0 1 2 3 4 5 6 7 > node 4 size: 97372 MB > node 4 free: 95545 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00 > > %user %sys %wait %idle physc %entc lbusy app vcsw phint > ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- > 2.66 2.67 0.16 94.51 0.00 0.00 5.33 0.00 67749 0 > > After hotplugging 32 cores > $ numactl -H > node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 > 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 > 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 > 167 168 169 170 171 172 173 174 175 > node 4 size: 97372 MB > node 4 free: 93636 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00 > > %user %sys %wait %idle physc %entc lbusy app vcsw phint > ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- > 0.04 0.02 0.00 99.94 0.00 0.00 0.06 0.00 1128751 3 > > As we can see numactl is listing only 8 cores while lparstat is showing > 33 cores. > > Also dmesg is showing messages like: > [ 2261.318350 ] BUG: arch topology borken > [ 2261.318357 ] the DIE domain not a subset of the NODE domain > > Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-mm@kvack.org > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > --- > arch/powerpc/mm/numa.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..13022d734951 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > > - if (NODE_DATA(new_nid) == NULL) { > + if (!node_online(new_nid)) { > #ifdef CONFIG_MEMORY_HOTPLUG > /* > * Need to ensure that NODE_DATA is initialized for a node from > -- > 2.27.0
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: > With commit 09f49dca570a ("mm: handle uninitialized numa nodes > gracefully") NODE_DATA for even a memoryless/cpuless node is partially > initialized at boot time. > > Before onlining the node, current Powerpc code checks for NODE_DATA to > be NULL. However since NODE_DATA is partially initialized, this check > will end up always being false. > > This causes hotplugging a CPU to a memoryless/cpuless node to fail. > > Before adding CPUs > $ numactl -H > available: 1 nodes (4) > node 4 cpus: 0 1 2 3 4 5 6 7 > node 4 size: 97372 MB > node 4 free: 95545 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00 > > %user %sys %wait %idle physc %entc lbusy app vcsw phint > ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- > 2.66 2.67 0.16 94.51 0.00 0.00 5.33 0.00 67749 0 > > After hotplugging 32 cores > $ numactl -H > node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 > 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 > 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 > 167 168 169 170 171 172 173 174 175 > node 4 size: 97372 MB > node 4 free: 93636 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00 > > %user %sys %wait %idle physc %entc lbusy app vcsw phint > ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- > 0.04 0.02 0.00 99.94 0.00 0.00 0.06 0.00 1128751 3 > > As we can see numactl is listing only 8 cores while lparstat is showing > 33 cores. > > Also dmesg is showing messages like: > [ 2261.318350 ] BUG: arch topology borken > [ 2261.318357 ] the DIE domain not a subset of the NODE domain > > Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-mm@kvack.org > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Thanks Srikar! > --- > arch/powerpc/mm/numa.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..13022d734951 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > > - if (NODE_DATA(new_nid) == NULL) { > + if (!node_online(new_nid)) { > #ifdef CONFIG_MEMORY_HOTPLUG > /* > * Need to ensure that NODE_DATA is initialized for a node from > -- > 2.27.0 > >
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: > With commit 09f49dca570a ("mm: handle uninitialized numa nodes > gracefully") NODE_DATA for even a memoryless/cpuless node is partially > initialized at boot time. > > Before onlining the node, current Powerpc code checks for NODE_DATA to > be NULL. However since NODE_DATA is partially initialized, this check > will end up always being false. > > This causes hotplugging a CPU to a memoryless/cpuless node to fail. > > Before adding CPUs > $ numactl -H > available: 1 nodes (4) > node 4 cpus: 0 1 2 3 4 5 6 7 > node 4 size: 97372 MB > node 4 free: 95545 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00 > > %user %sys %wait %idle physc %entc lbusy app vcsw phint > ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- > 2.66 2.67 0.16 94.51 0.00 0.00 5.33 0.00 67749 0 > > After hotplugging 32 cores > $ numactl -H > node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 > 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 > 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 > 167 168 169 170 171 172 173 174 175 > node 4 size: 97372 MB > node 4 free: 93636 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00 > > %user %sys %wait %idle physc %entc lbusy app vcsw phint > ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- > 0.04 0.02 0.00 99.94 0.00 0.00 0.06 0.00 1128751 3 > > As we can see numactl is listing only 8 cores while lparstat is showing > 33 cores. > > Also dmesg is showing messages like: > [ 2261.318350 ] BUG: arch topology borken > [ 2261.318357 ] the DIE domain not a subset of the NODE domain > > Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-mm@kvack.org > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> > --- > arch/powerpc/mm/numa.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..13022d734951 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > > - if (NODE_DATA(new_nid) == NULL) { > + if (!node_online(new_nid)) { > #ifdef CONFIG_MEMORY_HOTPLUG > /* > * Need to ensure that NODE_DATA is initialized for a node from > -- > 2.27.0 > >
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: > arch/powerpc/mm/numa.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..13022d734951 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > > - if (NODE_DATA(new_nid) == NULL) { > + if (!node_online(new_nid)) { > #ifdef CONFIG_MEMORY_HOTPLUG > /* > * Need to ensure that NODE_DATA is initialized for a node from Because of this fix, I wanted to check whether we might have any more fallouts due to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer as to why powerpc is the only platform that special cases try_online_node(), while all others rely on cpu_up()->try_online_node() to do the right thing. So, I had a look. Let us rewind a bit: The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes"). In there, it says that we have the following picture: partition_sched_domains arch_update_cpu_topology numa_update_cpu_topology find_and_online_cpu_nid and by the time find_and_online_cpu_nid() gets called to online the node, it might be too late as we might have referenced some NODE_DATA() fields. Note that this happens at a much later stage in cpuhp. Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(), which should allocate-and-online the node and prevent accessing garbage. But the problem is that, on powerpc, all possible cpus have the same node set at boot stage (see arch/powerpc/mm/numa.c:mem_topology_setup), so cpu_to_node() returns the same thing until it the mapping gets update (which happens in start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu. (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node() tries to online node0, or whatever old mapping numa<->cpu is there). So, we have something like: dlpar_online_cpu device_online dev->bus->online cpu_subsys_online cpu_device_up cpu_up try_online_node (old mapping nid <-> cpu ) ... ... cphp_callbacks sched_cpu_activate cpuset_update_active_cpus schedule_work(&cpuset_hotplug_work) cpuset_hotplug_work partition_sched_domains arch_update_cpu_topology numa_update_cpu_topology find_and_online_cpu_nid (online new_nid) So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid() happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu() fixed the issue, but we should not need this special casing at all. We do already know the numa<->cpu associativity in dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to update the numa<->cpu mapping, and let the try_online_node() do the right thing. With this in mind, I came up with the following patch, where I carried out a battery of tests for CPU hotplug stuff and it worked as expected. But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so I would like to hear from more familiar people. The patch is: From: Oscar Salvador <osalvador@suse.de> Date: Wed, 6 Apr 2022 14:39:15 +0200 Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier powerpc is the only platform that do not rely on cpu_up()->try_online_node() to bring up a numa node, and special cases it, instead, deep in its own machinery: dlpar_online_cpu find_and_online_cpu_nid try_online_node This should not be needed, but the thing is that the try_online_node() from cpu_up() will not apply on the right node, because cpu_to_node() will return the old mapping numa<->cpu that gets set on boot stage for all possible cpus. That can be seen easily if we try to print out the numa node passed to try_online_node() in cpu_up(). The thing is that the numa<->cpu mapping does not get updated till a much later stage in start_secondary: start_secondary: set_numa_node(numa_cpu_lookup_table[cpu]) But we do not really care, as we already now the CPU <-> NUMA associativity back in find_and_online_cpu_nid(), so let us make use of that and set the proper numa<->cpu mapping, so cpu_to_node() in cpu_up() returns the right node and try_online_node() can do its work. Signed-off-by: Oscar Salvador <osalvador@suse.de> --- arch/powerpc/include/asm/topology.h | 8 ++----- arch/powerpc/mm/numa.c | 31 +++++++--------------------- arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +- 3 files changed, 11 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 36fcafb1fd6d..6ae1b2dce83e 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {} #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) -extern int find_and_online_cpu_nid(int cpu); +extern void find_and_update_cpu_nid(int cpu); extern int cpu_to_coregroup_id(int cpu); #else -static inline int find_and_online_cpu_nid(int cpu) -{ - return 0; -} - +static inline void find_and_update_cpu_nid(int cpu) {} static inline int cpu_to_coregroup_id(int cpu) { #ifdef CONFIG_SMP diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index b9b7fefbb64b..b5bc8b1a833d 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu, return rc; } -int find_and_online_cpu_nid(int cpu) +void find_and_update_cpu_nid(int cpu) { __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0}; int new_nid; /* Use associativity from first thread for all siblings */ if (vphn_get_associativity(cpu, associativity)) - return cpu_to_node(cpu); + return; + /* Do not have previous associativity, so find it now. */ new_nid = associativity_to_nid(associativity); + if (new_nid < 0 || !node_possible(new_nid)) new_nid = first_online_node; - - if (NODE_DATA(new_nid) == NULL) { -#ifdef CONFIG_MEMORY_HOTPLUG - /* - * Need to ensure that NODE_DATA is initialized for a node from - * available memory (see memblock_alloc_try_nid). If unable to - * init the node, then default to nearest node that has memory - * installed. Skip onlining a node if the subsystems are not - * yet initialized. - */ - if (!topology_inited || try_online_node(new_nid)) - new_nid = first_online_node; -#else - /* - * Default to using the nearest node that has memory installed. - * Otherwise, it would be necessary to patch the kernel MM code - * to deal with more memoryless-node error conditions. + else + /* Associate node <-> cpu, so cpu_up() calls + * try_online_node() on the right node. */ - new_nid = first_online_node; -#endif - } + set_cpu_numa_node(cpu, new_nid); pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__, cpu, new_nid); - return new_nid; } int cpu_to_coregroup_id(int cpu) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index b81fc846d99c..0f8cd8b06432 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn) if (get_hard_smp_processor_id(cpu) != thread) continue; cpu_maps_update_done(); - find_and_online_cpu_nid(cpu); + find_and_update_cpu_nid(cpu); rc = device_online(get_cpu_device(cpu)); if (rc) { dlpar_offline_cpu(dn); -- 2.16.4
* Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]: > On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: > > arch/powerpc/mm/numa.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > > index b9b7fefbb64b..13022d734951 100644 > > --- a/arch/powerpc/mm/numa.c > > +++ b/arch/powerpc/mm/numa.c > > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > > if (new_nid < 0 || !node_possible(new_nid)) > > new_nid = first_online_node; > > > > - if (NODE_DATA(new_nid) == NULL) { > > + if (!node_online(new_nid)) { > > #ifdef CONFIG_MEMORY_HOTPLUG > > /* > > * Need to ensure that NODE_DATA is initialized for a node from > > Because of this fix, I wanted to check whether we might have any more fallouts due > to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer > as to why powerpc is the only platform that special cases try_online_node(), > while all others rely on cpu_up()->try_online_node() to do the right thing. > > So, I had a look. > Let us rewind a bit: > > The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was > e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes"). > > In there, it says that we have the following picture: > > partition_sched_domains > arch_update_cpu_topology > numa_update_cpu_topology > find_and_online_cpu_nid > > and by the time find_and_online_cpu_nid() gets called to online the node, it might be > too late as we might have referenced some NODE_DATA() fields. > Note that this happens at a much later stage in cpuhp. > > Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(), > which should allocate-and-online the node and prevent accessing garbage. > But the problem is that, on powerpc, all possible cpus have the same node set at boot stage > (see arch/powerpc/mm/numa.c:mem_topology_setup), > so cpu_to_node() returns the same thing until it the mapping gets update (which happens in > start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not > apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu. > > (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node() > tries to online node0, or whatever old mapping numa<->cpu is there). > > So, we have something like: > > dlpar_online_cpu > device_online > dev->bus->online > cpu_subsys_online > cpu_device_up > cpu_up > try_online_node (old mapping nid <-> cpu ) > ... > ... > cphp_callbacks > sched_cpu_activate > cpuset_update_active_cpus > schedule_work(&cpuset_hotplug_work) > cpuset_hotplug_work > partition_sched_domains > arch_update_cpu_topology > numa_update_cpu_topology > find_and_online_cpu_nid (online new_nid) > > > So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid() > happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu() > fixed the issue, but we should not need this special casing at all. > > We do already know the numa<->cpu associativity in > dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to > update the numa<->cpu mapping, and let the try_online_node() do the right thing. > > With this in mind, I came up with the following patch, where I carried out a battery > of tests for CPU hotplug stuff and it worked as expected. > But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so > I would like to hear from more familiar people. > > The patch is: > > From: Oscar Salvador <osalvador@suse.de> > Date: Wed, 6 Apr 2022 14:39:15 +0200 > Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier > > powerpc is the only platform that do not rely on > cpu_up()->try_online_node() to bring up a numa node, > and special cases it, instead, deep in its own machinery: > > dlpar_online_cpu > find_and_online_cpu_nid > try_online_node > > This should not be needed, but the thing is that the try_online_node() > from cpu_up() will not apply on the right node, because cpu_to_node() > will return the old mapping numa<->cpu that gets set on boot stage > for all possible cpus. > > That can be seen easily if we try to print out the numa node passed > to try_online_node() in cpu_up(). > > The thing is that the numa<->cpu mapping does not get updated till a much > later stage in start_secondary: > > start_secondary: > set_numa_node(numa_cpu_lookup_table[cpu]) > > But we do not really care, as we already now the > CPU <-> NUMA associativity back in find_and_online_cpu_nid(), > so let us make use of that and set the proper numa<->cpu mapping, > so cpu_to_node() in cpu_up() returns the right node and > try_online_node() can do its work. > > Signed-off-by: Oscar Salvador <osalvador@suse.de> Looks good to me. Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > --- > arch/powerpc/include/asm/topology.h | 8 ++----- > arch/powerpc/mm/numa.c | 31 +++++++--------------------- > arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +- > 3 files changed, 11 insertions(+), 30 deletions(-) > > diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h > index 36fcafb1fd6d..6ae1b2dce83e 100644 > --- a/arch/powerpc/include/asm/topology.h > +++ b/arch/powerpc/include/asm/topology.h > @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {} > #endif /* CONFIG_NUMA */ > > #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) > -extern int find_and_online_cpu_nid(int cpu); > +extern void find_and_update_cpu_nid(int cpu); > extern int cpu_to_coregroup_id(int cpu); > #else > -static inline int find_and_online_cpu_nid(int cpu) > -{ > - return 0; > -} > - > +static inline void find_and_update_cpu_nid(int cpu) {} > static inline int cpu_to_coregroup_id(int cpu) > { > #ifdef CONFIG_SMP > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..b5bc8b1a833d 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu, > return rc; > } > > -int find_and_online_cpu_nid(int cpu) > +void find_and_update_cpu_nid(int cpu) > { > __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0}; > int new_nid; > > /* Use associativity from first thread for all siblings */ > if (vphn_get_associativity(cpu, associativity)) > - return cpu_to_node(cpu); > + return; > > + /* Do not have previous associativity, so find it now. */ > new_nid = associativity_to_nid(associativity); > + > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > - > - if (NODE_DATA(new_nid) == NULL) { > -#ifdef CONFIG_MEMORY_HOTPLUG > - /* > - * Need to ensure that NODE_DATA is initialized for a node from > - * available memory (see memblock_alloc_try_nid). If unable to > - * init the node, then default to nearest node that has memory > - * installed. Skip onlining a node if the subsystems are not > - * yet initialized. > - */ > - if (!topology_inited || try_online_node(new_nid)) > - new_nid = first_online_node; > -#else > - /* > - * Default to using the nearest node that has memory installed. > - * Otherwise, it would be necessary to patch the kernel MM code > - * to deal with more memoryless-node error conditions. > + else > + /* Associate node <-> cpu, so cpu_up() calls > + * try_online_node() on the right node. > */ > - new_nid = first_online_node; > -#endif > - } > + set_cpu_numa_node(cpu, new_nid); > > pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__, > cpu, new_nid); > - return new_nid; > } > > int cpu_to_coregroup_id(int cpu) > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c > index b81fc846d99c..0f8cd8b06432 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c > @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn) > if (get_hard_smp_processor_id(cpu) != thread) > continue; > cpu_maps_update_done(); > - find_and_online_cpu_nid(cpu); > + find_and_update_cpu_nid(cpu); > rc = device_online(get_cpu_device(cpu)); > if (rc) { > dlpar_offline_cpu(dn); > -- > 2.16.4 > > -- > Oscar Salvador > SUSE Labs
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes: > * Oscar Salvador <osalvador@suse.de> [2022-04-06 18:19:00]: > >> On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: >> > arch/powerpc/mm/numa.c | 2 +- >> > 1 file changed, 1 insertion(+), 1 deletion(-) >> > >> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c >> > index b9b7fefbb64b..13022d734951 100644 >> > --- a/arch/powerpc/mm/numa.c >> > +++ b/arch/powerpc/mm/numa.c >> > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) >> > if (new_nid < 0 || !node_possible(new_nid)) >> > new_nid = first_online_node; >> > >> > - if (NODE_DATA(new_nid) == NULL) { >> > + if (!node_online(new_nid)) { >> > #ifdef CONFIG_MEMORY_HOTPLUG >> > /* >> > * Need to ensure that NODE_DATA is initialized for a node from >> >> Because of this fix, I wanted to check whether we might have any more fallouts due >> to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer >> as to why powerpc is the only platform that special cases try_online_node(), >> while all others rely on cpu_up()->try_online_node() to do the right thing. >> >> So, I had a look. >> Let us rewind a bit: >> >> The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was >> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes"). >> >> In there, it says that we have the following picture: >> >> partition_sched_domains >> arch_update_cpu_topology >> numa_update_cpu_topology >> find_and_online_cpu_nid >> >> and by the time find_and_online_cpu_nid() gets called to online the node, it might be >> too late as we might have referenced some NODE_DATA() fields. >> Note that this happens at a much later stage in cpuhp. >> >> Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(), >> which should allocate-and-online the node and prevent accessing garbage. >> But the problem is that, on powerpc, all possible cpus have the same node set at boot stage >> (see arch/powerpc/mm/numa.c:mem_topology_setup), >> so cpu_to_node() returns the same thing until it the mapping gets update (which happens in >> start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not >> apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu. >> >> (e.g: in my test case, when onlining a CPU belongin to node1, cpu_up()->try_online_node() >> tries to online node0, or whatever old mapping numa<->cpu is there). >> >> So, we have something like: >> >> dlpar_online_cpu >> device_online >> dev->bus->online >> cpu_subsys_online >> cpu_device_up >> cpu_up >> try_online_node (old mapping nid <-> cpu ) >> ... >> ... >> cphp_callbacks >> sched_cpu_activate >> cpuset_update_active_cpus >> schedule_work(&cpuset_hotplug_work) >> cpuset_hotplug_work >> partition_sched_domains >> arch_update_cpu_topology >> numa_update_cpu_topology >> find_and_online_cpu_nid (online new_nid) >> >> >> So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid() >> happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu() >> fixed the issue, but we should not need this special casing at all. >> >> We do already know the numa<->cpu associativity in >> dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to >> update the numa<->cpu mapping, and let the try_online_node() do the right thing. >> >> With this in mind, I came up with the following patch, where I carried out a battery >> of tests for CPU hotplug stuff and it worked as expected. >> But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so >> I would like to hear from more familiar people. >> >> The patch is: >> >> From: Oscar Salvador <osalvador@suse.de> >> Date: Wed, 6 Apr 2022 14:39:15 +0200 >> Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier >> >> powerpc is the only platform that do not rely on >> cpu_up()->try_online_node() to bring up a numa node, >> and special cases it, instead, deep in its own machinery: >> >> dlpar_online_cpu >> find_and_online_cpu_nid >> try_online_node >> >> This should not be needed, but the thing is that the try_online_node() >> from cpu_up() will not apply on the right node, because cpu_to_node() >> will return the old mapping numa<->cpu that gets set on boot stage >> for all possible cpus. >> >> That can be seen easily if we try to print out the numa node passed >> to try_online_node() in cpu_up(). >> >> The thing is that the numa<->cpu mapping does not get updated till a much >> later stage in start_secondary: >> >> start_secondary: >> set_numa_node(numa_cpu_lookup_table[cpu]) >> >> But we do not really care, as we already now the >> CPU <-> NUMA associativity back in find_and_online_cpu_nid(), >> so let us make use of that and set the proper numa<->cpu mapping, >> so cpu_to_node() in cpu_up() returns the right node and >> try_online_node() can do its work. >> >> Signed-off-by: Oscar Salvador <osalvador@suse.de> > > Looks good to me. > > Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Yeah agreed, thanks for getting to the root of the problem. Can you resend as a standalone patch. Because you sent it as a reply it won't be recognised by patchwork[1] which means it risks getting lost. cheers 1: http://patchwork.ozlabs.org/project/linuxppc-dev/list/
On Wed, 30 Mar 2022 19:21:23 +0530, Srikar Dronamraju wrote: > With commit 09f49dca570a ("mm: handle uninitialized numa nodes > gracefully") NODE_DATA for even a memoryless/cpuless node is partially > initialized at boot time. > > Before onlining the node, current Powerpc code checks for NODE_DATA to > be NULL. However since NODE_DATA is partially initialized, this check > will end up always being false. > > [...] Applied to powerpc/fixes. [1/1] powerpc/numa: Handle partially initialized numa nodes https://git.kernel.org/powerpc/c/e4ff77598a109bd36789ad5e80aba66fc53d0ffb cheers
On Sun, Apr 10, 2022 at 09:28:38PM +1000, Michael Ellerman wrote: > Yeah agreed, thanks for getting to the root of the problem. > > Can you resend as a standalone patch. Because you sent it as a reply it > won't be recognised by patchwork[1] which means it risks getting lost. Hi Michael, It's done [1]. thanks! [1] http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220411074934.4632-1-osalvador@suse.de/
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index b9b7fefbb64b..13022d734951 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) if (new_nid < 0 || !node_possible(new_nid)) new_nid = first_online_node; - if (NODE_DATA(new_nid) == NULL) { + if (!node_online(new_nid)) { #ifdef CONFIG_MEMORY_HOTPLUG /* * Need to ensure that NODE_DATA is initialized for a node from
With commit 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") NODE_DATA for even a memoryless/cpuless node is partially initialized at boot time. Before onlining the node, current Powerpc code checks for NODE_DATA to be NULL. However since NODE_DATA is partially initialized, this check will end up always being false. This causes hotplugging a CPU to a memoryless/cpuless node to fail. Before adding CPUs $ numactl -H available: 1 nodes (4) node 4 cpus: 0 1 2 3 4 5 6 7 node 4 size: 97372 MB node 4 free: 95545 MB node distances: node 4 4: 10 $ lparstat System Configuration type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00 %user %sys %wait %idle physc %entc lbusy app vcsw phint ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 2.66 2.67 0.16 94.51 0.00 0.00 5.33 0.00 67749 0 After hotplugging 32 cores $ numactl -H node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 node 4 size: 97372 MB node 4 free: 93636 MB node distances: node 4 4: 10 $ lparstat System Configuration type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00 %user %sys %wait %idle physc %entc lbusy app vcsw phint ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0.04 0.02 0.00 99.94 0.00 0.00 0.06 0.00 1128751 3 As we can see numactl is listing only 8 cores while lparstat is showing 33 cores. Also dmesg is showing messages like: [ 2261.318350 ] BUG: arch topology borken [ 2261.318357 ] the DIE domain not a subset of the NODE domain Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-mm@kvack.org Cc: Michal Hocko <mhocko@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)