[RFC,0/5] support NUMA emulation for arm64

Message ID 20231012024842.99703-1-rongwei.wang@linux.alibaba.com (mailing list archive)

Message

Rongwei Wang Oct. 12, 2023, 2:48 a.m. UTC
A brief introduction
====================

NUMA emulation can fake more nodes based on a single-node
system, e.g.

one node system:

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31788 MB
node 0 free: 31446 MB
node distances:
node   0
  0:  10

after adding numa=fake=2 (fakes 2 nodes on each original node):

[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15806 MB
node 0 free: 15451 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16029 MB
node 1 free: 15989 MB
node distances:
node   0   1
  0:  10  10
  1:  10  10

As shown above, a new node has been faked. For CPUs, the behaviour of
the x86 NUMA emulation is kept (every fake node lists all CPUs). Maybe
giving each node 4 cores would be better (not sure; to do next if so).
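
For reference, the parameter is just a kernel command-line option; on a
typical distro it can be added roughly like this (a sketch only: the GRUB
file paths are assumptions, only numa=fake=2 itself comes from this
series):

  # append numa=fake=2 to the kernel command line (assuming GRUB)
  sed -i 's/^GRUB_CMDLINE_LINUX="/&numa=fake=2 /' /etc/default/grub
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot

  # after reboot, verify the parameter and the fake nodes
  cat /proc/cmdline
  numactl -H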

Why do this
===========

There are mainly two reasons:
  (1) On an x86 host, NUMA emulation can fake a multi-node environment
      to test or verify performance behaviour, but on arm64 the only
      way to do this is to modify the ACPI tables, which is more or
      less troublesome.
  (2) Reduce contention on some locks. Here is an example we found:
      will-it-scale/tlb_flush1_processes -t 96 -s 10 shows an obvious
      hotspot on lruvec->lock when run on a single-node system. What's
      more, performance improves greatly when run on a system with two
      or more nodes. The data is shown below (higher is better); a
      rough sketch of the measurement commands follows the table:

      ---------------------------------------------------------------------
      threads/process |   1     |     12   |     24   |   48     |   96
      ---------------------------------------------------------------------
      one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  | 72 4516
      ---------------------------------------------------------------------
      numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
      ---------------------------------------------------------------------
                      | For concurrency 12, no lruvec->lock hotspot. For 24,
      hotspot         | one node has 24% hotspot on lruvec->lock, but
                      | two nodes env hasn't.
      ---------------------------------------------------------------------
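
      A rough sketch of the measurement above (hedged: the perf options
      and the report step are assumptions; only the tlb_flush1_processes
      invocation comes from this cover letter):

        # from a will-it-scale checkout
        ./tlb_flush1_processes -t 96 -s 10

        # profile a run to look for the lruvec->lock contention
        perf record -a -g -- ./tlb_flush1_processes -t 24 -s 10
        perf report --sort symbol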

As for the risks (e.g. NUMA balancing...), they still need to be discussed here.

Lastly, this is just a draft; I can improve it next if the approach is acceptable.

Thanks!

Rongwei Wang (5):
  mm/numa: move numa emulation APIs into generic files
  mm: percpu: fix variable type of cpu
  arch_numa: remove __init in early_cpu_to_node()
  mm/numa: support CONFIG_NUMA_EMU for arm64
  mm/numa: migrate leftover numa emulation into mm/numa.c

 arch/x86/Kconfig                          |   8 -
 arch/x86/include/asm/numa.h               |   3 -
 arch/x86/mm/Makefile                      |   1 -
 arch/x86/mm/numa.c                        | 216 +-------------
 arch/x86/mm/numa_internal.h               |  14 +-
 drivers/base/arch_numa.c                  |   7 +-
 include/asm-generic/numa.h                |  33 +++
 include/linux/percpu.h                    |   2 +-
 mm/Kconfig                                |   8 +
 mm/Makefile                               |   1 +
 arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
 11 files changed, 373 insertions(+), 253 deletions(-)
 rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)

Comments

Pierre Gondois Oct. 12, 2023, 12:37 p.m. UTC | #1
Hello Rongwei,

On 10/12/23 04:48, Rongwei Wang wrote:
> A brief introduction
> ====================
> 
> The NUMA emulation can fake more node base on a single
> node system, e.g.
> 
> one node system:
> 
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 31788 MB
> node 0 free: 31446 MB
> node distances:
> node   0
>    0:  10
> 
> add numa=fake=2 (fake 2 node on each origin node):
> 
> [root@localhost ~]# numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 15806 MB
> node 0 free: 15451 MB
> node 1 cpus: 0 1 2 3 4 5 6 7
> node 1 size: 16029 MB
> node 1 free: 15989 MB
> node distances:
> node   0   1
>    0:  10  10
>    1:  10  10
> 
> As above shown, a new node has been faked. As cpus, the realization
> of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
> better (not sure, next to do if so).
> 
> Why do this
> ===========
> 
> It seems has following reasons:
>    (1) In x86 host, apply NUMA emulation can fake more nodes environment
>        to test or verify some performance stuff, but arm64 only has
>        one method that modify ACPI table to do this. It's troublesome
>        more or less.
>    (2) Reduce competition for some locks. Here an example we found:
>        will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
>        hotspot on lruvec->lock when test in single environment. What's
>        more, The performance improved greatly if test in two more nodes
>        system. The data shows below (more is better):
> 
>        ---------------------------------------------------------------------
>        threads/process |   1     |     12   |     24   |   48     |   96
>        ---------------------------------------------------------------------
>        one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  | 72 4516
>        ---------------------------------------------------------------------
>        numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
>        ---------------------------------------------------------------------
>                        | For concurrency 12, no lruvec->lock hotspot. For 24,
>        hotspot         | one node has 24% hotspot on lruvec->lock, but
>                        | two nodes env hasn't.
>        ---------------------------------------------------------------------
> 
> As for risks (e.g. numa balance...), they need to be discussed here.
> 
> Lastly, this just is a draft, I can improve next if it's acceptable.

I'm not engaging on the utility/relevance of the patch-set, but I tried
them on an arm64 system with the 'numa=fake=2' parameter and could not
see 2 nodes being created under:
   /sys/devices/system/node/
Indeed it seems that even though numa_emulation() is moved to a generic
mm/numa.c file, the function is only called from:
   arch/x86/mm/numa.c:numa_init()
(or maybe I'm misinterpreting the intent of the patches).

Also I had the following errors when building (still for arm64):
mm/numa.c:862:8: error: implicit declaration of function 'early_cpu_to_node' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
         nid = early_cpu_to_node(cpu);
               ^
mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node' declared here
void __init early_map_cpu_to_node(unsigned int cpu, int nid);
             ^
mm/numa.c:874:3: error: implicit declaration of function 'debug_cpumask_set_cpu' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                 debug_cpumask_set_cpu(cpu, nid, enable);
                 ^
mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
                             ^
2 errors generated.

Regards,
Pierre

> 
> Thanks!
> 
> Rongwei Wang (5):
>    mm/numa: move numa emulation APIs into generic files
>    mm: percpu: fix variable type of cpu
>    arch_numa: remove __init in early_cpu_to_node()
>    mm/numa: support CONFIG_NUMA_EMU for arm64
>    mm/numa: migrate leftover numa emulation into mm/numa.c
> 
>   arch/x86/Kconfig                          |   8 -
>   arch/x86/include/asm/numa.h               |   3 -
>   arch/x86/mm/Makefile                      |   1 -
>   arch/x86/mm/numa.c                        | 216 +-------------
>   arch/x86/mm/numa_internal.h               |  14 +-
>   drivers/base/arch_numa.c                  |   7 +-
>   include/asm-generic/numa.h                |  33 +++
>   include/linux/percpu.h                    |   2 +-
>   mm/Kconfig                                |   8 +
>   mm/Makefile                               |   1 +
>   arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
>   11 files changed, 373 insertions(+), 253 deletions(-)
>   rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
>
Rongwei Wang Oct. 12, 2023, 1:30 p.m. UTC | #2
On 2023/10/12 20:37, Pierre Gondois wrote:
> Hello Rongwei,
>
> On 10/12/23 04:48, Rongwei Wang wrote:
>> A brief introduction
>> ====================
>>
>> The NUMA emulation can fake more node base on a single
>> node system, e.g.
>>
>> one node system:
>>
>> [root@localhost ~]# numactl -H
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 31788 MB
>> node 0 free: 31446 MB
>> node distances:
>> node   0
>>    0:  10
>>
>> add numa=fake=2 (fake 2 node on each origin node):
>>
>> [root@localhost ~]# numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 15806 MB
>> node 0 free: 15451 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7
>> node 1 size: 16029 MB
>> node 1 free: 15989 MB
>> node distances:
>> node   0   1
>>    0:  10  10
>>    1:  10  10
>>
>> As above shown, a new node has been faked. As cpus, the realization
>> of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
>> better (not sure, next to do if so).
>>
>> Why do this
>> ===========
>>
>> It seems has following reasons:
>>    (1) In x86 host, apply NUMA emulation can fake more nodes environment
>>        to test or verify some performance stuff, but arm64 only has
>>        one method that modify ACPI table to do this. It's troublesome
>>        more or less.
>>    (2) Reduce competition for some locks. Here an example we found:
>>        will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
>>        hotspot on lruvec->lock when test in single environment. What's
>>        more, The performance improved greatly if test in two more nodes
>>        system. The data shows below (more is better):
>>
>> ---------------------------------------------------------------------
>>        threads/process |   1     |     12   |     24   | 48     |   96
>> ---------------------------------------------------------------------
>>        one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  | 
>> 72 4516
>> ---------------------------------------------------------------------
>>        numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 
>> 142 3968
>> ---------------------------------------------------------------------
>>                        | For concurrency 12, no lruvec->lock hotspot. 
>> For 24,
>>        hotspot         | one node has 24% hotspot on lruvec->lock, but
>>                        | two nodes env hasn't.
>> ---------------------------------------------------------------------
>>
>> As for risks (e.g. numa balance...), they need to be discussed here.
>>
>> Lastly, this just is a draft, I can improve next if it's acceptable.
>
> I'm not engaging on the utility/relevance of the patch-set, but I tried
> them on an arm64 system with the 'numa=fake=2' parameter and could not

Sorry, my fault.

I should have mentioned this in the brief introduction: acpi=on numa=fake=2.

The default path of arm64 NUMA initialization is numa_init() -> 
dummy_numa_init() when ACPI is turned off (this path has not been taken 
into account yet in this patchset; I will handle it next).

What's more, if you test this patchset in qemu-kvm, you should add the 
parameters below to the launch script:

-object memory-backend-ram,id=mem0,size=32G \
-numa node,memdev=mem0,cpus=0-7,nodeid=0 \

(The above parameters just make sure the SRAT table carries a NUMA 
configuration, avoiding the numa_init() -> dummy_numa_init() path.)
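
For completeness, a full qemu-system-aarch64 command might look roughly 
like this (only the -object/-numa options and the "acpi=on numa=fake=2" 
parameters come from this thread; the machine, CPU and image options are 
placeholders):

qemu-system-aarch64 \
    -machine virt,gic-version=3 -cpu host -enable-kvm \
    -smp 8 -m 32G \
    -object memory-backend-ram,id=mem0,size=32G \
    -numa node,memdev=mem0,cpus=0-7,nodeid=0 \
    -kernel Image \
    -append "acpi=on numa=fake=2 root=/dev/vda rw" \
    -drive file=rootfs.img,if=virtio,format=raw \
    -nographic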

> see 2 nodes being created under:
>   /sys/devices/system/node/
> Indeed it seems that even though numa_emulation() is moved to a generic
> mm/numa.c file, the function is only called from:
>   arch/x86/mm/numa.c:numa_init()
> (or maybe I'm misinterpreting the intent of the patches).

Here, drivers/base/arch_numa.c:numa_init() calls numa_emulation() (I 
guess it will work if you add acpi=on :-)).


>
> Also I had the following errors when building (still for arm64):
> mm/numa.c:862:8: error: implicit declaration of function 
> 'early_cpu_to_node' is invalid in C99 
> [-Werror,-Wimplicit-function-declaration]
>         nid = early_cpu_to_node(cpu);

It seems CONFIG_DEBUG_PER_CPU_MAPS is enabled in your environment? You 
can disable CONFIG_DEBUG_PER_CPU_MAPS and test again.
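
For reference, it can be toggled with the in-tree scripts/config helper 
before rebuilding (a sketch; the cross-compiler prefix is an assumption):

  # check whether the option is currently set
  grep CONFIG_DEBUG_PER_CPU_MAPS .config

  # disable it and rebuild the arm64 image
  ./scripts/config --file .config -d DEBUG_PER_CPU_MAPS
  make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- olddefconfig Image -j$(nproc)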

I have not tested with CONFIG_DEBUG_PER_CPU_MAPS enabled. This is very 
helpful; I will fix it in the next version.

If you have any questions, please let me know.

Regards,

-wrw

> ^
> mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
> ./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node' 
> declared here
> void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>             ^
> mm/numa.c:874:3: error: implicit declaration of function 
> 'debug_cpumask_set_cpu' is invalid in C99 
> [-Werror,-Wimplicit-function-declaration]
>                 debug_cpumask_set_cpu(cpu, nid, enable);
>                 ^
> mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
> ./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
> static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct 
> cpumask *dstp)
>                             ^
> 2 errors generated.
>
> Regards,
> Pierre
>
>>
>> Thanks!
>>
>> Rongwei Wang (5):
>>    mm/numa: move numa emulation APIs into generic files
>>    mm: percpu: fix variable type of cpu
>>    arch_numa: remove __init in early_cpu_to_node()
>>    mm/numa: support CONFIG_NUMA_EMU for arm64
>>    mm/numa: migrate leftover numa emulation into mm/numa.c
>>
>>   arch/x86/Kconfig                          |   8 -
>>   arch/x86/include/asm/numa.h               |   3 -
>>   arch/x86/mm/Makefile                      |   1 -
>>   arch/x86/mm/numa.c                        | 216 +-------------
>>   arch/x86/mm/numa_internal.h               |  14 +-
>>   drivers/base/arch_numa.c                  |   7 +-
>>   include/asm-generic/numa.h                |  33 +++
>>   include/linux/percpu.h                    |   2 +-
>>   mm/Kconfig                                |   8 +
>>   mm/Makefile                               |   1 +
>>   arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
>>   11 files changed, 373 insertions(+), 253 deletions(-)
>>   rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
>>
Pierre Gondois Oct. 23, 2023, 1:03 p.m. UTC | #3
Hello Rongwei,

On 10/12/23 15:30, Rongwei Wang wrote:
> 
> On 2023/10/12 20:37, Pierre Gondois wrote:
>> Hello Rongwei,
>>
>> On 10/12/23 04:48, Rongwei Wang wrote:
>>> A brief introduction
>>> ====================
>>>
>>> The NUMA emulation can fake more node base on a single
>>> node system, e.g.
>>>
>>> one node system:
>>>
>>> [root@localhost ~]# numactl -H
>>> available: 1 nodes (0)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 31788 MB
>>> node 0 free: 31446 MB
>>> node distances:
>>> node   0
>>>     0:  10
>>>
>>> add numa=fake=2 (fake 2 node on each origin node):
>>>
>>> [root@localhost ~]# numactl -H
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 15806 MB
>>> node 0 free: 15451 MB
>>> node 1 cpus: 0 1 2 3 4 5 6 7
>>> node 1 size: 16029 MB
>>> node 1 free: 15989 MB
>>> node distances:
>>> node   0   1
>>>     0:  10  10
>>>     1:  10  10
>>>
>>> As above shown, a new node has been faked. As cpus, the realization
>>> of x86 NUMA emulation is kept. Maybe each node should has 4 cores is
>>> better (not sure, next to do if so).
>>>
>>> Why do this
>>> ===========
>>>
>>> It seems has following reasons:
>>>     (1) In x86 host, apply NUMA emulation can fake more nodes environment
>>>         to test or verify some performance stuff, but arm64 only has
>>>         one method that modify ACPI table to do this. It's troublesome
>>>         more or less.
>>>     (2) Reduce competition for some locks. Here an example we found:
>>>         will-it-scale/tlb_flush1_processes -t 96 -s 10, it shows obvious
>>>         hotspot on lruvec->lock when test in single environment. What's
>>>         more, The performance improved greatly if test in two more nodes
>>>         system. The data shows below (more is better):
>>>
>>> ---------------------------------------------------------------------
>>>         threads/process |   1     |     12   |     24   | 48     |   96
>>> ---------------------------------------------------------------------
>>>         one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  |
>>> 72 4516
>>> ---------------------------------------------------------------------
>>>         numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 |
>>> 142 3968
>>> ---------------------------------------------------------------------
>>>                         | For concurrency 12, no lruvec->lock hotspot.
>>> For 24,
>>>         hotspot         | one node has 24% hotspot on lruvec->lock, but
>>>                         | two nodes env hasn't.
>>> ---------------------------------------------------------------------
>>>
>>> As for risks (e.g. numa balance...), they need to be discussed here.
>>>
>>> Lastly, this just is a draft, I can improve next if it's acceptable.
>>
>> I'm not engaging on the utility/relevance of the patch-set, but I tried
>> them on an arm64 system with the 'numa=fake=2' parameter and could not
> 
> Sorry, my fault.
> 
> I should mention this in previous brief introduction: acpi=on numa=fake=2.
> 
> The default patch of arm64 numa initialize is numa_init() ->
> dummy_numa_init() if turn off acpi (this path has not been taken into
> account yet in this patch, next will to do).
> 
> What's more, if you test these patchset in qemu-kvm, you should add
> below parameters in the script.
> 
> object memory-backend-ram,id=mem0,size=32G \
> numa node,memdev=mem0,cpus=0-7,nodeid=0 \
> 
> (Above parameters just make sure SRAT table has NUMA configure, avoiding
> path of numa_init() -> dummy_numa_init())
> 
>> see 2 nodes being created under:
>>    /sys/devices/system/node/
>> Indeed it seems that even though numa_emulation() is moved to a generic
>> mm/numa.c file, the function is only called from:
>>    arch/x86/mm/numa.c:numa_init()
>> (or maybe I'm misinterpreting the intent of the patches).
> 
> Here drivers/base/arch_numa.c:numa_init() has called numa_emulation() (I
> guess it works if you add acpi=on :-)).

I don't see numa_emulation() being called from drivers/base/arch_numa.c:numa_init()

I have:
   $ git grep numa_emulation
   arch/x86/mm/numa.c:     numa_emulation(&numa_meminfo, numa_distance_cnt);
   arch/x86/mm/numa_internal.h:extern void __init numa_emulation(struct numa_meminfo *numa_meminfo,
   include/asm-generic/numa.h:void __init numa_emulation(struct numa_meminfo *numa_meminfo,
   mm/numa.c:/* Most of this file comes from x86/numa_emulation.c */
   mm/numa.c: * numa_emulation - Emulate NUMA nodes
   mm/numa.c:void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
so from this, an arm64-based platform should not be able to call numa_emulation().

Is it possible to add a call to dump_stack() in numa_emulation() to see the call stack?

The branch I'm using is based on v6.6-rc5 and has the 5 patches applied:
2af398a87cc7 mm/numa: migrate leftover numa emulation into mm/numa.c
c8e314fb23be mm/numa: support CONFIG_NUMA_EMU for arm64
335b7219d40e arch_numa: remove __init in early_cpu_to_node()
d9358adf1cdc mm: percpu: fix variable type of cpu
1ffbe40a00f5 mm/numa: move numa emulation APIs into generic files
94f6f0550c62 (tag: v6.6-rc5) Linux 6.6-rc5

Regards,
Pierre

> 
> 
>>
>> Also I had the following errors when building (still for arm64):
>> mm/numa.c:862:8: error: implicit declaration of function
>> 'early_cpu_to_node' is invalid in C99
>> [-Werror,-Wimplicit-function-declaration]
>>          nid = early_cpu_to_node(cpu);
> 
> It seems CONFIG_DEBUG_PER_CPU_MAPS enabled in your environment? You can
> disable CONFIG_DEBUG_PER_CPU_MAPS and test it again.
> 
> I have not test it with CONFIG_DEBUG_PER_CPU_MAPS enabled. It's very
> helpful, I will fix it next time.
> 
> If you have any questions, please let me know.
> 
> Regards,
> 
> -wrw
> 
>> ^
>> mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
>> ./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node'
>> declared here
>> void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>>              ^
>> mm/numa.c:874:3: error: implicit declaration of function
>> 'debug_cpumask_set_cpu' is invalid in C99
>> [-Werror,-Wimplicit-function-declaration]
>>                  debug_cpumask_set_cpu(cpu, nid, enable);
>>                  ^
>> mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
>> ./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared here
>> static __always_inline void __cpumask_set_cpu(unsigned int cpu, struct
>> cpumask *dstp)
>>                              ^
>> 2 errors generated.
>>
>> Regards,
>> Pierre
>>
>>>
>>> Thanks!
>>>
>>> Rongwei Wang (5):
>>>     mm/numa: move numa emulation APIs into generic files
>>>     mm: percpu: fix variable type of cpu
>>>     arch_numa: remove __init in early_cpu_to_node()
>>>     mm/numa: support CONFIG_NUMA_EMU for arm64
>>>     mm/numa: migrate leftover numa emulation into mm/numa.c
>>>
>>>    arch/x86/Kconfig                          |   8 -
>>>    arch/x86/include/asm/numa.h               |   3 -
>>>    arch/x86/mm/Makefile                      |   1 -
>>>    arch/x86/mm/numa.c                        | 216 +-------------
>>>    arch/x86/mm/numa_internal.h               |  14 +-
>>>    drivers/base/arch_numa.c                  |   7 +-
>>>    include/asm-generic/numa.h                |  33 +++
>>>    include/linux/percpu.h                    |   2 +-
>>>    mm/Kconfig                                |   8 +
>>>    mm/Makefile                               |   1 +
>>>    arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
>>>    11 files changed, 373 insertions(+), 253 deletions(-)
>>>    rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
>>>
Rongwei Wang Feb. 20, 2024, 11:36 a.m. UTC | #4
A brief introduction
====================

NUMA emulation can fake more nodes based on a single-node
system, e.g.

one node system:

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31788 MB
node 0 free: 31446 MB
node distances:
node   0
  0:  10

after adding numa=fake=2 (fakes 2 nodes on each original node):

[root@localhost ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15806 MB
node 0 free: 15451 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16029 MB
node 1 free: 15989 MB
node distances:
node   0   1
  0:  10  10
  1:  10  10

As shown above, a new node has been faked. For CPUs, the behaviour of
the x86 NUMA emulation is kept (every fake node lists all CPUs). Maybe
giving each node 4 cores would be better (not sure; to do next if so).

Why do this
===========

There are mainly two reasons:
  (1) On an x86 host, NUMA emulation can fake a multi-node environment
      to test or verify performance behaviour, but on arm64 the only
      way to do this is to modify the ACPI tables, which is more or
      less troublesome.
  (2) Reduce contention on some locks. Here is an example we found:
      will-it-scale/tlb_flush1_processes -t 96 -s 10 shows an obvious
      hotspot on lruvec->lock when run on a single-node system. What's
      more, performance improves greatly when run on a system with two
      or more nodes. The data is shown below (higher is better):

      ---------------------------------------------------------------------
      threads/process |   1     |     12   |     24   |   48     |   96
      ---------------------------------------------------------------------
      one node        | 14 1122 | 110 5372 | 111 2615 | 79 7084  | 72 4516
      ---------------------------------------------------------------------
      numa=fake=2     | 14 1168 | 144 4848 | 215 9070 | 157 0412 | 142 3968
      ---------------------------------------------------------------------
                      | For concurrency 12, no lruvec->lock hotspot. For 24,
      hotspot         | one node has 24% hotspot on lruvec->lock, but
                      | two nodes env hasn't.
      ---------------------------------------------------------------------

As for the risks (e.g. NUMA balancing...), they still need to be discussed here.

Lastly, implementing x86 and the other generic architectures separately
does not seem like a good choice, but it does avoid adjusting some
architecture-related APIs and eases future maintenance. The previous RFC
can be found at [1].

Any advice is welcome. Thanks!

Change log
==========

RFC v1 -> v1
* add a new CONFIG_NUMA_FAKE option for generic archs.
* keep the x86 implementation, and realize NUMA emulation in drivers/base/
  for generic archs, e.g. arm64.

[1] RFC v1: https://patchwork.kernel.org/project/linux-arm-kernel/cover/20231012024842.99703-1-rongwei.wang@linux.alibaba.com/

Rongwei Wang (2):
  arch_numa: remove __init for early_cpu_to_node
  numa: introduce numa emulation for genertic arch

 drivers/base/Kconfig          |   9 +
 drivers/base/Makefile         |   1 +
 drivers/base/arch_numa.c      |  32 +-
 drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
 drivers/base/numa_emulation.h |  41 ++
 include/asm-generic/numa.h    |   2 +-
 6 files changed, 992 insertions(+), 2 deletions(-)
 create mode 100644 drivers/base/numa_emulation.c
 create mode 100644 drivers/base/numa_emulation.h
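
For anyone trying this version: after building with the new option enabled
(the NUMA_FAKE symbol name comes from the change log above) and booting
with "acpi=on numa=fake=2", the fake nodes can be checked roughly like this
(a generic sketch, not part of the series):

  ls /sys/devices/system/node/        # expect node0 and node1
  numactl -H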
Mike Rapoport Feb. 21, 2024, 6:12 a.m. UTC | #5
On Tue, Feb 20, 2024 at 07:36:00PM +0800, Rongwei Wang wrote:
> A brief introduction
> ====================
> 
> The NUMA emulation can fake more node base on a single
> node system, e.g.

... 
 
> Lastly, it seems not a good choice to realize x86 and other genertic
> archs separately. But it can indeed avoid some architecture related
> APIs adjustments and alleviate future maintenance.

Why is it a good choice? Copying 1k lines from x86 to a new place and
having to maintain two copies does not sound like a good choice to me.

> The previous RFC link see [1].
> 
> Any advice are welcome, Thanks!
> 
> Change log
> ==========
> 
> RFC v1 -> v1
> * add new CONFIG_NUMA_FAKE for genertic archs.
> * keep x86 implementation, realize numa emulation in driver/base/ for
>   genertic arch, e.g, arm64.
> 
> [1] RFC v1: https://patchwork.kernel.org/project/linux-arm-kernel/cover/20231012024842.99703-1-rongwei.wang@linux.alibaba.com/
> 
> Rongwei Wang (2):
>   arch_numa: remove __init for early_cpu_to_node
>   numa: introduce numa emulation for genertic arch
> 
>  drivers/base/Kconfig          |   9 +
>  drivers/base/Makefile         |   1 +
>  drivers/base/arch_numa.c      |  32 +-
>  drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
>  drivers/base/numa_emulation.h |  41 ++
>  include/asm-generic/numa.h    |   2 +-
>  6 files changed, 992 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/base/numa_emulation.c
>  create mode 100644 drivers/base/numa_emulation.h
> 
> -- 
> 2.32.0.3.gf3a3e56d6
> 
>
Pierre Gondois Feb. 21, 2024, 3:51 p.m. UTC | #6
On 2/21/24 07:12, Mike Rapoport wrote:
> On Tue, Feb 20, 2024 at 07:36:00PM +0800, Rongwei Wang wrote:
>> A brief introduction
>> ====================
>>
>> The NUMA emulation can fake more node base on a single
>> node system, e.g.
> 
> ...
>   
>> Lastly, it seems not a good choice to realize x86 and other genertic
>> archs separately. But it can indeed avoid some architecture related
>> APIs adjustments and alleviate future maintenance.
> 
> Why is it a good choice? Copying 1k lines from x86 to a new place and
> having to maintain two copies does not sound like a good choice to me.

I agree it would be better to avoid duplication and extract the common
code from the original x86 implementation. The RFC seemed to go more
in this direction.
Also NITs:
- genertic -> generic
- there is an 'ifdef CONFIG_X86' in drivers/base/numa_emulation.c,
   but the file should not be used by x86 as the arch doesn't set
   CONFIG_GENERIC_ARCH_NUMA

Regards,
Pierre

> 
>> The previous RFC link see [1].
>>
>> Any advice are welcome, Thanks!
>>
>> Change log
>> ==========
>>
>> RFC v1 -> v1
>> * add new CONFIG_NUMA_FAKE for genertic archs.
>> * keep x86 implementation, realize numa emulation in driver/base/ for
>>    genertic arch, e.g, arm64.
>>
>> [1] RFC v1: https://patchwork.kernel.org/project/linux-arm-kernel/cover/20231012024842.99703-1-rongwei.wang@linux.alibaba.com/
>>
>> Rongwei Wang (2):
>>    arch_numa: remove __init for early_cpu_to_node
>>    numa: introduce numa emulation for genertic arch
>>
>>   drivers/base/Kconfig          |   9 +
>>   drivers/base/Makefile         |   1 +
>>   drivers/base/arch_numa.c      |  32 +-
>>   drivers/base/numa_emulation.c | 909 ++++++++++++++++++++++++++++++++++
>>   drivers/base/numa_emulation.h |  41 ++
>>   include/asm-generic/numa.h    |   2 +-
>>   6 files changed, 992 insertions(+), 2 deletions(-)
>>   create mode 100644 drivers/base/numa_emulation.c
>>   create mode 100644 drivers/base/numa_emulation.h
>>
>> -- 
>> 2.32.0.3.gf3a3e56d6
>>
>>
>
Rongwei Wang Feb. 29, 2024, 3:26 a.m. UTC | #7
On 2/21/24 11:51 PM, Pierre Gondois wrote:
>
>
> On 2/21/24 07:12, Mike Rapoport wrote:
>> On Tue, Feb 20, 2024 at 07:36:00PM +0800, Rongwei Wang wrote:
>>> A brief introduction
>>> ====================
>>>
>>> The NUMA emulation can fake more node base on a single
>>> node system, e.g.
>>
>> ...
>>> Lastly, it seems not a good choice to realize x86 and other genertic
>>> archs separately. But it can indeed avoid some architecture related
>>> APIs adjustments and alleviate future maintenance.
>>
>> Why is it a good choice? Copying 1k lines from x86 to a new place and
>> having to maintain two copies does not sound like a good choice to me.
Hi Pierre
> I agree it would be better to avoid duplication and extract the common
> code from the original x86 implementation. The RFC seemed to go more
> in this direction.
> Also NITs:
> - genertic -> generic
Thanks, my fault; Zhaoyu also found this (thanks).
> - there is a 'ifdef CONFIG_X86' in drivers/base/numa_emulation.c,
>   but the file should not be used by x86 as the arch doesn't set
>   CONFIG_GENERIC_ARCH_NUMA
>
Actually, I had not thought about how to ask this question. I also tried 
the original direction of the RFC version, but found that many APIs need 
to be updated, and many of them are similar with only small differences. 
It seems a lot of modification would be needed in more than one arch if 
we go in the original direction.

But if everyone thinks the original method is right, I will continue with 
the RFC approach in the next version.

Thanks for taking the time to review.
> Regards,
> Pierre
>
>>
>>> The previous RFC link see [1].
>>>
>>> Any advice are welcome, Thanks!
>>>
>>> Change log
>>> ==========
>>>
>>> RFC v1 -> v1
>>> * add new CONFIG_NUMA_FAKE for genertic archs.
>>> * keep x86 implementation, realize numa emulation in driver/base/ for
>>>    genertic arch, e.g, arm64.
>>>
>>> [1] RFC v1: 
>>> https://patchwork.kernel.org/project/linux-arm-kernel/cover/20231012024842.99703-1-rongwei.wang@linux.alibaba.com/
>>>
>>> Rongwei Wang (2):
>>>    arch_numa: remove __init for early_cpu_to_node
>>>    numa: introduce numa emulation for genertic arch
>>>
>>>   drivers/base/Kconfig          |   9 +
>>>   drivers/base/Makefile         |   1 +
>>>   drivers/base/arch_numa.c      |  32 +-
>>>   drivers/base/numa_emulation.c | 909 
>>> ++++++++++++++++++++++++++++++++++
>>>   drivers/base/numa_emulation.h |  41 ++
>>>   include/asm-generic/numa.h    |   2 +-
>>>   6 files changed, 992 insertions(+), 2 deletions(-)
>>>   create mode 100644 drivers/base/numa_emulation.c
>>>   create mode 100644 drivers/base/numa_emulation.h
>>>
>>> -- 
>>> 2.32.0.3.gf3a3e56d6
>>>
>>>
>>