
[-V10,RESEND,0/6] NUMA balancing: optimize memory placement for memory tiering system

Message ID 20211207022757.2523359-1-ying.huang@intel.com (mailing list archive)

Message

Huang, Ying Dec. 7, 2021, 2:27 a.m. UTC
The changes since the last post are as follows,

- Rebased on v5.16-rc1

- Revise error processing for [1/6] (promotion counter) per Yang's comments

- Add sysctl document for [2/6] (optimize page placement)

- Reset threshold adjustment state when disabling/enabling tiering mode

- Reset threshold when a workload transition is detected.

--

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
memory subsystem of such machines can be called a memory tiering
system, because the performance of the different types of memory
differs.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), PMEM can be used as cost-effective volatile
memory in separate NUMA nodes.  In a typical memory tiering system,
there are CPUs, DRAM and PMEM in each physical NUMA node.  The CPUs
and the DRAM are put in one logical node, while the PMEM is put in
another (fake) logical node.

To optimize the overall system performance, the hot pages should be
placed in the DRAM node.  To do that, we need to identify the hot
pages in the PMEM node and migrate them to the DRAM node via NUMA
migration.

The original NUMA balancing already has a set of mechanisms to
identify the pages recently accessed by the CPUs of a node and to
migrate those pages to that node.  We can reuse these mechanisms to
optimize the page placement in a memory tiering system.  This is
implemented in this patchset.
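
For illustration, here is a minimal user-space model of the kind of
promotion decision made in the NUMA hint page fault path.  The struct
fields, helper names, and thresholds below are assumptions for the
sketch, not the exact code in the patches.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative model only: decide whether a page that just took a NUMA
 * hint fault on a slow (PMEM) node should be promoted to a fast (DRAM)
 * node.  All names and numbers here are assumptions for this sketch.
 */
struct fault_info {
        bool     src_is_toptier;      /* page currently sits on a DRAM node */
        uint64_t scan_to_fault_ns;    /* time from PTE scan to hint fault */
        uint64_t promoted_bytes_sec;  /* recent promotion bandwidth */
};

static bool should_promote(const struct fault_info *f,
                           uint64_t hot_threshold_ns,
                           uint64_t rate_limit_bytes_sec)
{
        /* Pages already on DRAM are handled by ordinary NUMA balancing. */
        if (f->src_is_toptier)
                return false;
        /* Hot page selection: a short scan-to-fault latency means the
           page was re-accessed soon after its PTE was made to fault
           (the idea behind patch 4). */
        if (f->scan_to_fault_ns > hot_threshold_ns)
                return false;
        /* Rate limiting keeps promotion from saturating DRAM bandwidth
           (the idea behind patch 5). */
        if (f->promoted_bytes_sec > rate_limit_bytes_sec)
                return false;
        return true;
}

int main(void)
{
        struct fault_info f = {
                .src_is_toptier     = false,
                .scan_to_fault_ns   = 500ULL * 1000 * 1000,  /* 0.5 s */
                .promoted_bytes_sec = 50ULL << 20,           /* ~50 MB/s */
        };

        /* Assumed knobs: 1 s hot threshold, 100 MB/s promotion rate limit. */
        if (should_promote(&f, 1000ULL * 1000 * 1000, 100ULL << 20))
                printf("promote this page to the DRAM node\n");
        return 0;
}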

On the other hand, the cold pages should be placed in the PMEM node.
So we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

In commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim"), a
mechanism to demote cold DRAM pages to the PMEM node under memory
pressure was implemented.  Based on that, cold DRAM pages can be
demoted to the PMEM node proactively, to free some memory space on
the DRAM node to accommodate the promoted hot PMEM pages.  This is
implemented in this patchset too.
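
A minimal sketch of the free-space idea, assuming an illustrative
"promotion" watermark on the DRAM node; the names and numbers are made
up for the example, and the actual mechanism lives in the kernel's
reclaim and watermark code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative model only: if free memory on the DRAM node drops below
 * an assumed "promotion" watermark, background reclaim is woken, and
 * reclaim demotes cold DRAM pages to the PMEM node (26aa2d199d6f)
 * instead of discarding them.
 */
struct dram_node {
        uint64_t free_pages;
        uint64_t promo_watermark;
};

static bool need_background_demotion(const struct dram_node *n)
{
        return n->free_pages < n->promo_watermark;
}

int main(void)
{
        struct dram_node node0 = { .free_pages = 1000, .promo_watermark = 4096 };

        if (need_background_demotion(&node0))
                printf("wake reclaim on the DRAM node; cold pages are demoted to PMEM\n");
        return 0;
}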

We have tested the solution with the pmbench memory accessing
benchmark, using an 80:20 read/write ratio and a normal access
address distribution, on a 2-socket Intel server with Optane DC
Persistent Memory Modules.  The test results for the base kernel and
for the step-by-step optimizations are as follows:

                Throughput     Promotion    DRAM bandwidth
                  access/s          MB/s              MB/s
               -----------    ----------    --------------
Base            69263986.8                          1830.2
Patch 2        135691921.4         385.6           11315.9
Patch 3        133239016.8         384.7           11065.2
Patch 4        151310868.9         197.6           11397.0
Patch 5        142311252.8          99.3            9580.8
Patch 6        149044263.9          65.5            9922.8

The whole patchset improves the benchmark score by up to 115.2%.  The
basic NUMA balancing based optimization (patch 2), the hot page
selection algorithm (patch 4), and the automatic threshold adjustment
algorithm (patch 6) improve the performance or reduce the overhead
(promotion MB/s) greatly.

Changelog:

v10:

- Rebased on v5.16-rc1

- Revise error processing for [1/6] (promotion counter) per Yang's comments

- Add sysctl document for [2/6] (optimize page placement)

- Reset threshold adjustment state when disabling/enabling tiering mode

- Reset threshold when a workload transition is detected.

v9:

- Rebased on v5.15-rc4

- Make "add promotion counter" the first patch per Yang's comments

v8:

- Rebased on v5.15-rc1

- Make user-specified threshold take effect sooner

v7:

- Rebased on the mmots tree of 2021-07-15.

- Some minor fixes.

v6:

- Rebased on the latest page demotion patchset (which is based on v5.11).

v5:

- Rebased on the latest page demotion patchset (which is based on v5.10).

v4:

- Rebased on the latest page demotion patchset (which is based on v5.9-rc6).

- Add page promotion counter.

v3:

- Move the rate limit control as late as possible per Mel Gorman's
  comments.

- Revise the hot page selection implementation to store page scan time
  in struct page.

- Code cleanup.

- Rebased on the latest page demotion patchset.

v2:

- Addressed comments for V1.

- Rebased on v5.5.

Best Regards,
Huang, Ying

Comments

Peter Zijlstra Jan. 12, 2022, 4:10 p.m. UTC | #1
On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:
> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
> for use like normal RAM"), the PMEM could be used as the
> cost-effective volatile memory in separate NUMA nodes.  In a typical
> memory tiering system, there are CPUs, DRAM and PMEM in each physical
> NUMA node.  The CPUs and the DRAM will be put in one logical node,
> while the PMEM will be put in another (faked) logical node.

So what does a system like that actually look like, SLIT table wise, and
how does that affect init_numa_topology_type() ?
Huang, Ying Jan. 13, 2022, 7:19 a.m. UTC | #2
Hi, Peter,

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:
>> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
>> for use like normal RAM"), the PMEM could be used as the
>> cost-effective volatile memory in separate NUMA nodes.  In a typical
>> memory tiering system, there are CPUs, DRAM and PMEM in each physical
>> NUMA node.  The CPUs and the DRAM will be put in one logical node,
>> while the PMEM will be put in another (faked) logical node.
>
> So what does a system like that actually look like, SLIT table wise, and
> how does that affect init_numa_topology_type() ?

The SLIT table is as follows,

[000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
[004h 0004   4]                 Table Length : 0000042C
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : 59
[00Ah 0010   6]                       Oem ID : "INTEL "
[010h 0016   8]                 Oem Table ID : "S2600WF "
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "INTL"
[020h 0032   4]        Asl Compiler Revision : 20091013

[024h 0036   8]                   Localities : 0000000000000004
[02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
[030h 0048   4]                 Locality   1 : 15 0A 1C 11
[034h 0052   4]                 Locality   2 : 11 1C 0A 1C
[038h 0056   4]                 Locality   3 : 1C 11 1C 0A

The `numactl -H` output is as follows,

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3 
  0:  10  21  17  28 
  1:  21  10  28  17 
  2:  17  28  10  28 
  3:  28  17  28  10 

init_numa_topology_type() set sched_numa_topology_type to NUMA_DIRECT.

Node 0 and node 1 are onlined during boot, while the PMEM nodes, that
is, node 2 and node 3, are onlined later, as in the following dmesg
snippet.

[    2.252573][    T0] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    2.259224][    T0] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x107fffffff]
[    2.266139][    T0] ACPI: SRAT: Node 2 PXM 2 [mem 0x1080000000-0x4f7fffffff] non-volatile
[    2.274267][    T0] ACPI: SRAT: Node 1 PXM 1 [mem 0x4f80000000-0x5f7fffffff]
[    2.281271][    T0] ACPI: SRAT: Node 3 PXM 3 [mem 0x5f80000000-0x9e7fffffff] non-volatile
[    2.289403][    T0] NUMA: Initialized distance table, cnt=4
[    2.294934][    T0] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x107fffffff] -> [mem 0x00000000-0x107fffffff]
[    2.306266][    T0] NODE_DATA(0) allocated [mem 0x107ffd5000-0x107fffffff]
[    2.313115][    T0] NODE_DATA(1) allocated [mem 0x5f7ffd0000-0x5f7fffafff]

[    5.391151][    T1] smp: Brought up 2 nodes, 96 CPUs

Full dmesg is attached.

Best Regards,
Huang, Ying
Peter Zijlstra Jan. 13, 2022, 9:49 a.m. UTC | #3
On Thu, Jan 13, 2022 at 03:19:06PM +0800, Huang, Ying wrote:
> Hi, Peter,
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:
> >> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
> >> for use like normal RAM"), the PMEM could be used as the
> >> cost-effective volatile memory in separate NUMA nodes.  In a typical
> >> memory tiering system, there are CPUs, DRAM and PMEM in each physical
> >> NUMA node.  The CPUs and the DRAM will be put in one logical node,
> >> while the PMEM will be put in another (faked) logical node.
> >
> > So what does a system like that actually look like, SLIT table wise, and
> > how does that affect init_numa_topology_type() ?
> 
> The SLIT table is as follows,
> 
> [000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
> [004h 0004   4]                 Table Length : 0000042C
> [008h 0008   1]                     Revision : 01
> [009h 0009   1]                     Checksum : 59
> [00Ah 0010   6]                       Oem ID : "INTEL "
> [010h 0016   8]                 Oem Table ID : "S2600WF "
> [018h 0024   4]                 Oem Revision : 00000001
> [01Ch 0028   4]              Asl Compiler ID : "INTL"
> [020h 0032   4]        Asl Compiler Revision : 20091013
> 
> [024h 0036   8]                   Localities : 0000000000000004
> [02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
> [030h 0048   4]                 Locality   1 : 15 0A 1C 11
> [034h 0052   4]                 Locality   2 : 11 1C 0A 1C
> [038h 0056   4]                 Locality   3 : 1C 11 1C 0A
> 
> The `numactl -H` output is as follows,
> 
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 0 size: 64136 MB
> node 0 free: 5981 MB
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> node 1 size: 64466 MB
> node 1 free: 10415 MB
> node 2 cpus:
> node 2 size: 253952 MB
> node 2 free: 253920 MB
> node 3 cpus:
> node 3 size: 253952 MB
> node 3 free: 253951 MB
> node distances:
> node   0   1   2   3 
>   0:  10  21  17  28 
>   1:  21  10  28  17 
>   2:  17  28  10  28 
>   3:  28  17  28  10 
> 
> init_numa_topology_type() set sched_numa_topology_type to NUMA_DIRECT.
> 
> The node 0 and node 1 are onlined during boot.  While the PMEM node,
> that is, node 2 and node 3 are onlined later.  As in the following dmesg
> snippet.

But how? sched_init_numa() scans the *whole* SLIT table to determine
nr_levels / sched_domains_numa_levels, even offline nodes. Therefore it
should find 4 distinct distance values and end up not selecting
NUMA_DIRECT.

Similarly for the other types it uses for_each_online_node(), which
would include the pmem nodes once they've been onlined, but I'm thinking
we explicitly want to skip CPU-less nodes in that iteration.
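
To illustrate the point being made here, below is a small user-space
sketch that mimics this style of online-node iteration over the
distance table quoted above.  The enum names echo the scheduler's, but
the loop is a simplified assumption rather than the kernel source:
with only the DRAM nodes online, no pair of online nodes reaches the
maximum distance (28), so the type stays at the zero-initialized
NUMA_DIRECT default.

#include <stdbool.h>
#include <stdio.h>

enum numa_topology_type { NUMA_DIRECT, NUMA_GLUELESS_MESH, NUMA_BACKPLANE };

static const char *const type_name[] = {
        "NUMA_DIRECT", "NUMA_GLUELESS_MESH", "NUMA_BACKPLANE",
};

/* Node distance table from the SLIT / numactl output above. */
static const int dist[4][4] = {
        { 10, 21, 17, 28 },
        { 21, 10, 28, 17 },
        { 17, 28, 10, 28 },
        { 28, 17, 28, 10 },
};

static enum numa_topology_type classify(const bool online[4], int max_dist)
{
        enum numa_topology_type type = NUMA_DIRECT; /* zero-initialized default */

        for (int a = 0; a < 4; a++) {
                if (!online[a])
                        continue;
                for (int b = 0; b < 4; b++) {
                        if (!online[b] || dist[a][b] < max_dist)
                                continue;
                        /* a and b are maximally distant; is there a middleman c? */
                        for (int c = 0; c < 4; c++) {
                                if (online[c] && dist[a][c] < max_dist &&
                                    dist[b][c] < max_dist)
                                        return NUMA_GLUELESS_MESH;
                        }
                        return NUMA_BACKPLANE;
                }
        }
        return type; /* no online pair at max_dist: stays NUMA_DIRECT */
}

int main(void)
{
        const bool boot_online[4] = { true, true, false, false };
        const bool all_online[4]  = { true, true, true, true };

        /* max_dist = 28 comes from scanning the whole distance table. */
        printf("DRAM nodes only online: %s\n", type_name[classify(boot_online, 28)]);
        printf("all four nodes online:  %s\n", type_name[classify(all_online, 28)]);
        return 0;
}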
Peter Zijlstra Jan. 13, 2022, 1 p.m. UTC | #4
On Thu, Jan 13, 2022 at 08:06:40PM +0800, Huang, Ying wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > On Thu, Jan 13, 2022 at 03:19:06PM +0800, Huang, Ying wrote:
> >> Peter Zijlstra <peterz@infradead.org> writes:
> >> > On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:

> >> >> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
> >> >> for use like normal RAM"), the PMEM could be used as the
> >> >> cost-effective volatile memory in separate NUMA nodes.  In a typical
> >> >> memory tiering system, there are CPUs, DRAM and PMEM in each physical
> >> >> NUMA node.  The CPUs and the DRAM will be put in one logical node,
> >> >> while the PMEM will be put in another (faked) logical node.
> >> >
> >> > So what does a system like that actually look like, SLIT table wise, and
> >> > how does that affect init_numa_topology_type() ?
> >> 
> >> The SLIT table is as follows,

<snip>

> >> node distances:
> >> node   0   1   2   3 
> >>   0:  10  21  17  28 
> >>   1:  21  10  28  17 
> >>   2:  17  28  10  28 
> >>   3:  28  17  28  10 
> >> 
> >> init_numa_topology_type() set sched_numa_topology_type to NUMA_DIRECT.
> >> 
> >> The node 0 and node 1 are onlined during boot.  While the PMEM node,
> >> that is, node 2 and node 3 are onlined later.  As in the following dmesg
> >> snippet.
> >
> > But how? sched_init_numa() scans the *whole* SLIT table to determine
> > nr_levels / sched_domains_numa_levels, even offline nodes. Therefore it
> > should find 4 distinct distance values and end up not selecting
> > NUMA_DIRECT.
> >
> > Similarly for the other types it uses for_each_online_node(), which
> > would include the pmem nodes once they've been onlined, but I'm thinking
> > we explicitly want to skip CPU-less nodes in that iteration.
> 
> I used the debug patch as below, and get the log in dmesg as follows,
> 
> [    5.394577][    T1] sched_numa_topology_type: 0, levels: 4, max_distance: 28
> 
> I found that I forget another caller of init_numa_topology_type() run
> during hotplug.  I will add another printk() to show it.  Sorry about
> that.

Can you try with this on?

I'm suspecting there's a problem with init_numa_topology_type(); it will
never find the max distance due to the _online_ clause in the iteration,
since you said the pmem nodes are not online yet.

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..53ab9c63c185 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1756,6 +1756,8 @@ static void init_numa_topology_type(void)
 			return;
 		}
 	}
+
+	WARN(1, "no NUMA type determined");
 }
Huang, Ying Jan. 13, 2022, 1:13 p.m. UTC | #5
Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Jan 13, 2022 at 08:06:40PM +0800, Huang, Ying wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> > On Thu, Jan 13, 2022 at 03:19:06PM +0800, Huang, Ying wrote:
>> >> Peter Zijlstra <peterz@infradead.org> writes:
>> >> > On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:
>
>> >> >> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
>> >> >> for use like normal RAM"), the PMEM could be used as the
>> >> >> cost-effective volatile memory in separate NUMA nodes.  In a typical
>> >> >> memory tiering system, there are CPUs, DRAM and PMEM in each physical
>> >> >> NUMA node.  The CPUs and the DRAM will be put in one logical node,
>> >> >> while the PMEM will be put in another (faked) logical node.
>> >> >
>> >> > So what does a system like that actually look like, SLIT table wise, and
>> >> > how does that affect init_numa_topology_type() ?
>> >> 
>> >> The SLIT table is as follows,
>
> <snip>
>
>> >> node distances:
>> >> node   0   1   2   3 
>> >>   0:  10  21  17  28 
>> >>   1:  21  10  28  17 
>> >>   2:  17  28  10  28 
>> >>   3:  28  17  28  10 
>> >> 
>> >> init_numa_topology_type() set sched_numa_topology_type to NUMA_DIRECT.
>> >> 
>> >> The node 0 and node 1 are onlined during boot.  While the PMEM node,
>> >> that is, node 2 and node 3 are onlined later.  As in the following dmesg
>> >> snippet.
>> >
>> > But how? sched_init_numa() scans the *whole* SLIT table to determine
>> > nr_levels / sched_domains_numa_levels, even offline nodes. Therefore it
>> > should find 4 distinct distance values and end up not selecting
>> > NUMA_DIRECT.
>> >
>> > Similarly for the other types it uses for_each_online_node(), which
>> > would include the pmem nodes once they've been onlined, but I'm thinking
>> > we explicitly want to skip CPU-less nodes in that iteration.
>> 
>> I used the debug patch as below, and get the log in dmesg as follows,
>> 
>> [    5.394577][    T1] sched_numa_topology_type: 0, levels: 4, max_distance: 28
>> 
>> I found that I forget another caller of init_numa_topology_type() run
>> during hotplug.  I will add another printk() to show it.  Sorry about
>> that.
>
> Can you try with this on?
>
> I'm suspecting there's a problem with init_numa_topology_type(); it will
> never find the max distance due to the _online_ clause in the iteration,
> since you said the pmem nodes are not online yet.
>
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..53ab9c63c185 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1756,6 +1756,8 @@ static void init_numa_topology_type(void)
>  			return;
>  		}
>  	}
> +
> +	WARN(1, "no NUMA type determined");
>  }

Sure.  Will do this.

Best Regards,
Huang, Ying
Huang, Ying Jan. 13, 2022, 2:24 p.m. UTC | #6
Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Jan 13, 2022 at 08:06:40PM +0800, Huang, Ying wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> > On Thu, Jan 13, 2022 at 03:19:06PM +0800, Huang, Ying wrote:
>> >> Peter Zijlstra <peterz@infradead.org> writes:
>> >> > On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:
>
>> >> >> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
>> >> >> for use like normal RAM"), the PMEM could be used as the
>> >> >> cost-effective volatile memory in separate NUMA nodes.  In a typical
>> >> >> memory tiering system, there are CPUs, DRAM and PMEM in each physical
>> >> >> NUMA node.  The CPUs and the DRAM will be put in one logical node,
>> >> >> while the PMEM will be put in another (faked) logical node.
>> >> >
>> >> > So what does a system like that actually look like, SLIT table wise, and
>> >> > how does that affect init_numa_topology_type() ?
>> >> 
>> >> The SLIT table is as follows,
>
> <snip>
>
>> >> node distances:
>> >> node   0   1   2   3 
>> >>   0:  10  21  17  28 
>> >>   1:  21  10  28  17 
>> >>   2:  17  28  10  28 
>> >>   3:  28  17  28  10 
>> >> 
>> >> init_numa_topology_type() set sched_numa_topology_type to NUMA_DIRECT.
>> >> 
>> >> The node 0 and node 1 are onlined during boot.  While the PMEM node,
>> >> that is, node 2 and node 3 are onlined later.  As in the following dmesg
>> >> snippet.
>> >
>> > But how? sched_init_numa() scans the *whole* SLIT table to determine
>> > nr_levels / sched_domains_numa_levels, even offline nodes. Therefore it
>> > should find 4 distinct distance values and end up not selecting
>> > NUMA_DIRECT.
>> >
>> > Similarly for the other types it uses for_each_online_node(), which
>> > would include the pmem nodes once they've been onlined, but I'm thinking
>> > we explicitly want to skip CPU-less nodes in that iteration.
>> 
>> I used the debug patch as below, and get the log in dmesg as follows,
>> 
>> [    5.394577][    T1] sched_numa_topology_type: 0, levels: 4, max_distance: 28
>> 
>> I found that I forget another caller of init_numa_topology_type() run
>> during hotplug.  I will add another printk() to show it.  Sorry about
>> that.
>
> Can you try with this on?
>
> I'm suspecting there's a problem with init_numa_topology_type(); it will
> never find the max distance due to the _online_ clause in the iteration,
> since you said the pmem nodes are not online yet.
>
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..53ab9c63c185 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1756,6 +1756,8 @@ static void init_numa_topology_type(void)
>  			return;
>  		}
>  	}
> +
> +	WARN(1, "no NUMA type determined");
>  }

Hi, Peter,

I have run the test; the warning is triggered in dmesg as follows.
I will continue to debug the hotplug path tomorrow.

[    5.400923][    T1] ------------[ cut here ]------------
[    5.401917][    T1] no NUMA type determined
[    5.401921][    T1] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:1760 init_numa_topology_type+0x199/0x1c0
[    5.403918][    T1] Modules linked in:
[    5.404917][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.16.0-rc8-00053-gbe30433a13c0 #1
[    5.405917][    T1] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0286.011120190816 01/11/2019
[    5.406917][    T1] RIP: 0010:init_numa_topology_type+0x199/0x1c0
[    5.407917][    T1] Code: de 82 41 89 dc e8 07 4f 4e 00 3d 00 04 00 00 44 0f 4e e0 3d ff 03 00 00 0f 8e ca fe ff ff 48 c7 c7 a7 88 55 82 e8 0c e5 b3 00 <0f> 0b e9 74 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f
[    5.408917][    T1] RSP: 0000:ffffc900000b7e00 EFLAGS: 00010286
[    5.409917][    T1] RAX: 0000000000000000 RBX: 0000000000000400 RCX: c0000000ffff7fff
[    5.410917][    T1] RDX: ffffc900000b7c28 RSI: 00000000ffff7fff RDI: 0000000000000000
[    5.411917][    T1] RBP: 000000000000001c R08: 0000000000000000 R09: ffffc900000b7c20
[    5.412917][    T1] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000400
[    5.413917][    T1] R13: 0000000000000400 R14: 0000000000000400 R15: 000000000000000c
[    5.414917][    T1] FS:  0000000000000000(0000) GS:ffff88903f600000(0000) knlGS:0000000000000000
[    5.415917][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.416917][    T1] CR2: ffff88df7fc01000 CR3: 0000005f7ec0a001 CR4: 00000000007706f0
[    5.417917][    T1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    5.418917][    T1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    5.419917][    T1] PKRU: 55555554
[    5.420917][    T1] Call Trace:
[    5.421919][    T1]  <TASK>
[    5.422919][    T1]  sched_init_numa+0x4a7/0x5c0
[    5.423918][    T1]  sched_init_smp+0x18/0x79
[    5.424918][    T1]  kernel_init_freeable+0x136/0x276
[    5.425918][    T1]  ? rest_init+0x100/0x100
[    5.426917][    T1]  kernel_init+0x16/0x140
[    5.427917][    T1]  ret_from_fork+0x1f/0x30
[    5.428918][    T1]  </TASK>
[    5.429919][    T1] ---[ end trace aa5563c4363f1ba3 ]---
[    5.430917][    T1] sched_numa_topology_type: 0, levels: 4, max_distance: 28

Best Regards,
Huang, Ying
Huang, Ying Jan. 14, 2022, 5:24 a.m. UTC | #7
"Huang, Ying" <ying.huang@intel.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Thu, Jan 13, 2022 at 08:06:40PM +0800, Huang, Ying wrote:
>>> Peter Zijlstra <peterz@infradead.org> writes:
>>> > On Thu, Jan 13, 2022 at 03:19:06PM +0800, Huang, Ying wrote:
>>> >> Peter Zijlstra <peterz@infradead.org> writes:
>>> >> > On Tue, Dec 07, 2021 at 10:27:51AM +0800, Huang Ying wrote:
>>
>>> >> >> After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
>>> >> >> for use like normal RAM"), the PMEM could be used as the
>>> >> >> cost-effective volatile memory in separate NUMA nodes.  In a typical
>>> >> >> memory tiering system, there are CPUs, DRAM and PMEM in each physical
>>> >> >> NUMA node.  The CPUs and the DRAM will be put in one logical node,
>>> >> >> while the PMEM will be put in another (faked) logical node.
>>> >> >
>>> >> > So what does a system like that actually look like, SLIT table wise, and
>>> >> > how does that affect init_numa_topology_type() ?
>>> >> 
>>> >> The SLIT table is as follows,
>>
>> <snip>
>>
>>> >> node distances:
>>> >> node   0   1   2   3 
>>> >>   0:  10  21  17  28 
>>> >>   1:  21  10  28  17 
>>> >>   2:  17  28  10  28 
>>> >>   3:  28  17  28  10 
>>> >> 
>>> >> init_numa_topology_type() set sched_numa_topology_type to NUMA_DIRECT.
>>> >> 
>>> >> The node 0 and node 1 are onlined during boot.  While the PMEM node,
>>> >> that is, node 2 and node 3 are onlined later.  As in the following dmesg
>>> >> snippet.
>>> >
>>> > But how? sched_init_numa() scans the *whole* SLIT table to determine
>>> > nr_levels / sched_domains_numa_levels, even offline nodes. Therefore it
>>> > should find 4 distinct distance values and end up not selecting
>>> > NUMA_DIRECT.
>>> >
>>> > Similarly for the other types it uses for_each_online_node(), which
>>> > would include the pmem nodes once they've been onlined, but I'm thinking
>>> > we explicitly want to skip CPU-less nodes in that iteration.
>>> 
>>> I used the debug patch as below, and get the log in dmesg as follows,
>>> 
>>> [    5.394577][    T1] sched_numa_topology_type: 0, levels: 4, max_distance: 28
>>> 
>>> I found that I forget another caller of init_numa_topology_type() run
>>> during hotplug.  I will add another printk() to show it.  Sorry about
>>> that.
>>
>> Can you try with this on?
>>
>> I'm suspecting there's a problem with init_numa_topology_type(); it will
>> never find the max distance due to the _online_ clause in the iteration,
>> since you said the pmem nodes are not online yet.
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d201a7052a29..53ab9c63c185 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1756,6 +1756,8 @@ static void init_numa_topology_type(void)
>>  			return;
>>  		}
>>  	}
>> +
>> +	WARN(1, "no NUMA type determined");
>>  }
>
> Hi, Peter,
>
> I have run the test, the warning is triggered in the dmesg as follows.
> I will continue to debug hotplug tomorrow.

I did more experiments and found that init_numa_topology_type() is
not called when the PMEM nodes are plugged in, because it is only
called when a CPU of a never-onlined-before node is onlined.  There is
no CPU on the PMEM nodes (2/3), so when a PMEM node is onlined,
init_numa_topology_type() is not called, and sched_numa_topology_type
is not changed.

Best Regards,
Huang, Ying