diff mbox series

[v4,3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

Message ID 20200512132937.19295-4-srikar@linux.vnet.ibm.com (mailing list archive)
State New, archived
Headers show
Series Offline memoryless cpuless node 0 | expand

Commit Message

Srikar Dronamraju May 12, 2020, 1:29 p.m. UTC
Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.

By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
always online.

v5.7-rc3
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.7-rc3 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Christoph Lameter (Ampere) May 12, 2020, 4:31 p.m. UTC | #1
On Tue, 12 May 2020, Srikar Dronamraju wrote:

> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,

Again. Same issue as before. If you do this then you do a global change
for all architectures. You need to put something in the early boot
sequence (in a non architecture specific way) that sets the first node
online by default.

You have fixed the issue in your earlier patches for the powerpc
archicture. What about the other architectures?

Or did I miss something?
Srikar Dronamraju May 13, 2020, 7:46 a.m. UTC | #2
* Christopher Lameter <cl@linux.com> [2020-05-12 16:31:26]:

> On Tue, 12 May 2020, Srikar Dronamraju wrote:
> 
> > +#ifdef CONFIG_NUMA
> > +	[N_ONLINE] = NODE_MASK_NONE,
> 
> Again. Same issue as before. If you do this then you do a global change
> for all architectures. You need to put something in the early boot
> sequence (in a non architecture specific way) that sets the first node
> online by default.
> 

I did respond to that earlier.

> You have fixed the issue in your earlier patches for the powerpc
> archicture. What about the other architectures?
> 
> Or did I miss something?
> 

Here are my assumptions, please do correct me if any of them are wrong.
1. My other patches for Powerpc, don't change when the nodes are being
onlined. They only change how the cpu_to_node numbering of the offline cpus.
In this respect Powerpc due to its PAPR compliance may be slightly unique
from other archs where the cpu binding of the node is not known till CPUs
are onlined.

2. Currently the nodes are onlined (in all arch specific code) as soon as
they are detected. This is unconditional onlining as in there are no checks
to see the node number is 0. i.e I don't see any special checks that
restrict or allow node 0 from being onlined / offlined. Its considered no
special than any other online node.

3. If we were to expect node 0 to be always online, then why do we have
first_online_node. We could always hard code it to 0.

4. I tried enabling CONFIG_MEMORYLESS_NODE on x86, but that's seems to be
not possible. And it looks to me that something like that is only possible
on powerpc and IA64.

5. Without my patch on a regular numa system, node 0 would be onlined by
default during structure initialization. When the nodes get detected, node 0
and other nodes would again be onlined. The only drawback being if node 0
wasn't suppose to be online, it will still end up being marked online.
With the proposed patch, when the nodes get detected, any nodes detected
would be onlined.

I think the node onlining is already pretty early in boot. I don't know of
any other mechanism to move the onlining further up and in a non
architecture specific way. However if you have ideas, please do let me know.
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4..03b8959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@  struct pcpu_drain {
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },