[v6,0/3] Offline memoryless cpuless node 0

Message ID 20200818081104.57888-1-srikar@linux.vnet.ibm.com (mailing list archive)

Message

Srikar Dronamraju Aug. 18, 2020, 8:11 a.m. UTC
Changelog v5:->v6:
- The fix is now powerpc specific.
	(David Hildenbrand, Michal Hocko, Christopher Lameter)
- rebased to v5.8
Link v5: https://lore.kernel.org/linuxppc-dev/20200624092846.9194-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v4:->v5:
- rebased to v5.8
Link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v3:->v4:
- Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v2:->v3:
- Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
- Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#u

A Linux kernel configured with CONFIG_NUMA, running on a system with multiple
possible nodes, marks node 0 as online at boot. In practice, however,
some systems have a node 0 that is both memoryless and cpuless.

This can cause:
1. numa_balancing to be enabled on systems with only one real online node.
2. A dummy (cpuless and memoryless) node to appear, which can confuse
users/scripts looking at the output of lscpu / numactl.

This patchset corrects this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently only 2 architectures, ia64 and powerpc, have this config.

Note: Patch 3 in this patch series depends on patches 1 and 2.
Without patches 1 and 2, patch 3 might crash powerpc.

v5.8
------------------
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8 + patches
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
 /sys/devices/system/node/online:            2
 /proc/sys/kernel/numa_balancing:            0
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

1. User space applications that parse sysfs, such as numactl and lscpu, tend
to believe there is an extra online node. This confuses users and
applications. Other user space applications conclude that the system was
not able to use all its resources (i.e. resources are missing) or that the
system was not set up correctly.

2. The dummy node also leads to inconsistent information: the number of
online nodes does not match the information in the device-tree and
resource-dump.

3. When the dummy node is present, single node non-NUMA systems show up
as NUMA systems and numa_balancing gets enabled. This means we take
the hit from unnecessary NUMA hinting faults.

On a machine with just one node whose node number is not 0,
the current setup ends up showing 2 online nodes. And when there is
more than one online node, numa_balancing gets enabled.
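As an illustration only (not part of the patch series), the Python sketch below models the situation described above. It takes the contents of the sysfs node files as strings, reports online nodes that have neither CPUs nor memory, and flags whether more than one node is online, which is the condition the text gives for numa_balancing being enabled. The function names parse_nodelist and summarize are hypothetical, and the sample strings are copied from the listings above.

```python
def parse_nodelist(text):
    """Parse a sysfs node list such as "0,2" or "0-31" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return nodes


def summarize(online, has_cpu, has_memory):
    """Summarize node state from the contents of three sysfs files.

    A node is treated as a dummy node here if it is online but appears
    in neither has_cpu nor has_memory (an assumption of this sketch).
    """
    online_nodes = parse_nodelist(online)
    populated = parse_nodelist(has_cpu) | parse_nodelist(has_memory)
    return {
        "online": sorted(online_nodes),
        "dummy": sorted(online_nodes - populated),
        # More than one online node is the condition the cover letter
        # gives for numa_balancing being enabled.
        "multi_node": len(online_nodes) > 1,
    }


if __name__ == "__main__":
    # Values taken from the v5.8 and v5.8 + patches listings.
    print("v5.8:          ", summarize("0,2", "2", "2"))
    print("v5.8 + patches:", summarize("2", "2", "2"))
```

On a live system the three strings could be read from /sys/devices/system/node/online, has_cpu and has_memory instead of being hard-coded.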

Without patch
$ grep numa /proc/vmstat
numa_hit 3864714
numa_miss 0
numa_foreign 0
numa_interleave 2872
numa_local 3864714
numa_other 0
numa_pte_updates 13739278               <----------
numa_huge_pte_updates 0                 <----------
numa_hint_faults 13717222               <----------
numa_hint_faults_local 13717222         <----------
numa_pages_migrated 0

With patch
$ grep numa /proc/vmstat
numa_hit 6633324
numa_miss 0
numa_foreign 0
numa_interleave 2864
numa_local 6633324
numa_other 0
numa_pte_updates 0                 <----------
numa_huge_pte_updates 0                 <----------
numa_hint_faults 0                 <----------
numa_hint_faults_local 0                 <----------
numa_pages_migrated 0
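The comparison above can also be made by script. The following is a minimal Python sketch, assuming the `grep numa /proc/vmstat` output is available as text. The helper names parse_vmstat and hinting_active, and the choice of the four marked counters as the indicator of NUMA hinting activity, are assumptions of this example and not something the patches provide.

```python
# Counters marked with arrows in the listings above.
HINTING_KEYS = (
    "numa_pte_updates",
    "numa_huge_pte_updates",
    "numa_hint_faults",
    "numa_hint_faults_local",
)


def parse_vmstat(text):
    """Turn "name value" lines into a dict, ignoring anything else.

    Trailing annotations such as "<----------" are tolerated because
    only the first two whitespace-separated fields are used.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def hinting_active(counters):
    """Return True if any NUMA hinting counter is non-zero."""
    return any(counters.get(key, 0) > 0 for key in HINTING_KEYS)


WITHOUT_PATCH = """\
numa_pte_updates 13739278           <----------
numa_huge_pte_updates 0               <----------
numa_hint_faults 13717222         <----------
numa_hint_faults_local 13717222         <----------
numa_pages_migrated 0
"""

WITH_PATCH = """\
numa_pte_updates 0                 <----------
numa_huge_pte_updates 0                 <----------
numa_hint_faults 0                 <----------
numa_hint_faults_local 0                 <----------
numa_pages_migrated 0
"""

if __name__ == "__main__":
    print("without patch:", hinting_active(parse_vmstat(WITHOUT_PATCH)))
    print("with patch:   ", hinting_active(parse_vmstat(WITH_PATCH)))
```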

Here are 2 sample NUMA programs.

numa01.sh is a set of 2 processes, each running as many threads as there are
cpus; each thread does 50 loops of operations on 3GB of process-shared memory.

numa02.sh is a single process with as many threads as there are cpus;
each thread does 800 loops of operations on 32MB of thread-local memory.

Without patch
-------------
Testcase         Time:  Min      Max      Avg      StdDev
./numa01.sh      Real:  164.67   164.89   164.76   0.07
./numa01.sh      Sys:   2.88     3.38     3.05     0.17
./numa01.sh      User:  1297.85  1301.82  1300.86  1.51
./numa02.sh      Real:  27.44    27.46    27.45    0.01
./numa02.sh      Sys:   0.15     0.25     0.21     0.03
./numa02.sh      User:  216.65   216.93   216.80   0.09

With patch
-----------
Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa01.sh      Real:  164.20   164.38   164.28   0.08    0.292184%
./numa01.sh      Sys:   0.72     0.90     0.82     0.06    271.951%
./numa01.sh      User:  1300.39  1301.97  1300.94  0.56    -0.0061494%
./numa02.sh      Real:  27.41    27.51    27.45    0.03    0%
./numa02.sh      Sys:   0.09     0.16     0.13     0.03    61.5385%
./numa02.sh      User:  216.38   216.91   216.64   0.21    0.0738552%

numa01.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        2946055     0           -100%
numa_hint_faults_local  2946055     0           -100%
numa_hit                700617      681234      -2.76656%
numa_local              700617      681234      -2.76656%
numa_pte_updates        2947175     0           -100%
pgfault                 4125926     1120053     -72.8533%
pgmajfault              269         181         -32.7138%

numa02.sh
param                   no_patch    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        137623      0           -100%
numa_hint_faults_local  137623      0           -100%
numa_hit                51332       54645       6.45406%
numa_local              51332       54645       6.45406%
numa_pte_updates        138903      0           -100%
pgfault                 247058      116743      -52.7467%
pgmajfault              154         157         1.94805%
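The %Change columns in the timing table and in the counter tables appear to use two different conventions. The cover letter does not state the formulas; the ones below are inferred because they reproduce the published figures. The timing table matches (no_patch - with_patch) / with_patch, while the counter tables match (with_patch - no_patch) / no_patch. A small Python sketch, with hypothetical function names, checks this against a few rows:

```python
def timing_change(no_patch, with_patch):
    """%Change as it appears in the timing table (inferred, not documented):
    improvement relative to the patched value."""
    return (no_patch - with_patch) / with_patch * 100.0


def counter_change(no_patch, with_patch):
    """%Change as it appears in the counter tables (inferred, not documented):
    change relative to the unpatched value."""
    return (with_patch - no_patch) / no_patch * 100.0


if __name__ == "__main__":
    # Rows copied from the tables above (averages for the timing rows).
    print("numa01.sh Sys  :", timing_change(3.05, 0.82))
    print("numa02.sh Sys  :", timing_change(0.21, 0.13))
    print("numa01 pgfault :", counter_change(4125926, 1120053))
    print("numa02 numa_hit:", counter_change(51332, 54645))
```

Read this way, a positive value in the timing table means the patched kernel spent less time, whereas a negative value in the counter tables means the counter went down with the patches.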

Observations:
Real time and user time barely change. System time, however, drops
noticeably (for example, the numa01.sh average goes from 3.05 to 0.82).
The reason is the number of NUMA hinting faults: with the patch we no
longer see any NUMA hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  powerpc/numa: Offline memoryless cpuless node 0

 arch/powerpc/mm/numa.c | 45 ++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

Comments

Michael Ellerman Sept. 17, 2020, 11:27 a.m. UTC | #1
On Tue, 18 Aug 2020 13:41:01 +0530, Srikar Dronamraju wrote:
> Changelog v5:->v6:
> - Now the fix is Powerpc specific.
> 	(David Hildenbrand, Michal Hocko, Christopher Lamater)
> - rebased to v5.8
> link v5: https://lore.kernel.org/linuxppc-dev/20200624092846.9194-1-srikar@linux.vnet.ibm.com/t/#u
> 
> Changelog v4:->v5:
> - rebased to v5.8
> link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u
> 
> [...]

Applied to powerpc/next.

[1/3] powerpc/numa: Set numa_node for all possible cpus
      https://git.kernel.org/powerpc/c/a874f1005ef5dfe53dfd8cda59a6600e89986ecd
[2/3] powerpc/numa: Prefer node id queried from vphn
      https://git.kernel.org/powerpc/c/6398eaa268168b528dd1d3d0e70e61e9c13bea23
[3/3] powerpc/numa: Offline memoryless cpuless node 0
      https://git.kernel.org/powerpc/c/e75130f20b1f48e04ccc806aea01f0a361f9cb6b

cheers