
[v3,10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system

Message ID 20240102184633.748113-11-urezki@gmail.com (mailing list archive)
State New
Series Mitigate a vmap lock contention v3

Commit Message

Uladzislau Rezki Jan. 2, 2024, 6:46 p.m. UTC
The number of nodes used in the alloc/free paths is set based on
num_possible_cpus() in a system. Note, though, that a high-limit
threshold is fixed at 128 nodes.

For 32-bit or single-core systems, access to a global vmap heap is
not balanced. Such small systems do not suffer from lock contention
due to their low number of CPUs; in such cases nr_nodes is set to 1.

Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
 94.41%     0.89%  [kernel]        [k] _raw_spin_lock
 93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
 76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
 72.96%     0.81%  [kernel]        [k] alloc_vmap_area
 56.94%     0.00%  [kernel]        [k] __get_vm_area_node
 41.95%     0.00%  [kernel]        [k] vmalloc
 37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
 35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
 35.17%     0.00%  [kernel]        [k] ret_from_fork
 35.17%     0.00%  [kernel]        [k] kthread
 35.08%     0.00%  [test_vmalloc]  [k] test_func
 34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
 28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
 23.53%     0.25%  [kernel]        [k] vfree.part.0
 21.72%     0.00%  [kernel]        [k] remove_vm_area
 20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
 82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
 63.36%     0.02%  [kernel]        [k] vmalloc
 63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
 30.42%     4.46%  [kernel]        [k] vfree.part.0
 28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
 27.28%     0.19%  [kernel]        [k] __get_vm_area_node
 26.13%     1.50%  [kernel]        [k] alloc_vmap_area
 21.72%    21.67%  [kernel]        [k] clear_page_rep
 19.51%     2.43%  [kernel]        [k] _raw_spin_lock
 16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
 13.40%     2.07%  [kernel]        [k] free_unref_page
 10.62%     0.01%  [kernel]        [k] remove_vm_area
  9.02%     8.73%  [kernel]        [k] insert_vmap_area
  8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%     0.00%  [kernel]        [k] ret_from_fork
  8.94%     0.00%  [kernel]        [k] kthread
  8.29%     0.00%  [test_vmalloc]  [k] test_func
  7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%     4.73%  [kernel]        [k] purge_vmap_node
  4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

confirms that native_queued_spin_lock_slowpath drops from 93.07%
down to 16.51%.

The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

Comments

Dave Chinner Jan. 11, 2024, 9:25 a.m. UTC | #1
On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> A number of nodes which are used in the alloc/free paths is
> set based on num_possible_cpus() in a system. Please note a
> high limit threshold though is fixed and corresponds to 128
> nodes.

Large CPU count machines are NUMA machines. All of the allocation
and reclaim is NUMA node based, i.e. a pgdat per NUMA node.

Shrinkers are also able to be run in a NUMA aware mode so that
per-node structures can be reclaimed similar to how per-node LRU
lists are scanned for reclaim.

Hence I'm left to wonder if it would be better to have a vmalloc
area per pgdat (or sub-node cluster) rather than just base the
number on CPU count and then have an arbitrary maximum number when
we get to 128 CPU cores. We can have 128 CPU cores in a
single socket these days, so not being able to scale the vmalloc
areas beyond a single socket seems like a bit of a limitation.

Scaling out the vmalloc areas in a NUMA aware fashion allows the
shrinker to be run in numa aware mode, which gets rid of the need
for the global shrinker to loop over every single vmap area in every
shrinker invocation. Only the vm areas on the node that has a memory
shortage need to be scanned and reclaimed, it doesn't need reclaim
everything globally when a single node runs out of memory.

Yes, this may not give quite as good microbenchmark scalability
results, but being able to locate each vm area in node local memory
and have operation on them largely isolated to node-local tasks and
vmalloc area reclaim will work much better on large multi-socket
NUMA machines.

Cheers,

Dave.
Uladzislau Rezki Jan. 15, 2024, 7:09 p.m. UTC | #2
> On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > A number of nodes which are used in the alloc/free paths is
> > set based on num_possible_cpus() in a system. Please note a
> > high limit threshold though is fixed and corresponds to 128
> > nodes.
> 
> Large CPU count machines are NUMA machines. All of the allocation
> and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> 
> Shrinkers are also able to be run in a NUMA aware mode so that
> per-node structures can be reclaimed similar to how per-node LRU
> lists are scanned for reclaim.
> 
> Hence I'm left to wonder if it would be better to have a vmalloc
> area per pgdat (or sub-node cluster) rather than just base the
> number on CPU count and then have an arbitrary maximum number when
> we get to 128 CPU cores. We can have 128 CPU cores in a
> single socket these days, so not being able to scale the vmalloc
> areas beyond a single socket seems like a bit of a limitation.
> 
> Scaling out the vmalloc areas in a NUMA aware fashion allows the
> shrinker to be run in numa aware mode, which gets rid of the need
> for the global shrinker to loop over every single vmap area in every
> shrinker invocation. Only the vm areas on the node that has a memory
> shortage need to be scanned and reclaimed, it doesn't need reclaim
> everything globally when a single node runs out of memory.
> 
> Yes, this may not give quite as good microbenchmark scalability
> results, but being able to locate each vm area in node local memory
> and have operation on them largely isolated to node-local tasks and
> vmalloc area reclaim will work much better on large multi-socket
> NUMA machines.
> 
Currently I fix the max number of nodes to 128. This is because I do
not have access to such big NUMA systems, whereas I do have access to
systems with around 128 CPUs. That is why I have decided to stop at
that number for now.

We can easily set nr_nodes to num_possible_cpus() and let it scale for
anyone. But before doing that, I would like to try the current approach
as a first step, because I have not tested it well on really big NUMA
systems.

Thanks for your NUMA-aware input.

--
Uladzislau Rezki
Dave Chinner Jan. 16, 2024, 10:06 p.m. UTC | #3
On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > > A number of nodes which are used in the alloc/free paths is
> > > set based on num_possible_cpus() in a system. Please note a
> > > high limit threshold though is fixed and corresponds to 128
> > > nodes.
> > 
> > Large CPU count machines are NUMA machines. All of the allocation
> > and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> > 
> > Shrinkers are also able to be run in a NUMA aware mode so that
> > per-node structures can be reclaimed similar to how per-node LRU
> > lists are scanned for reclaim.
> > 
> > Hence I'm left to wonder if it would be better to have a vmalloc
> > area per pgdat (or sub-node cluster) rather than just base the
> > number on CPU count and then have an arbitrary maximum number when
> > we get to 128 CPU cores. We can have 128 CPU cores in a
> > single socket these days, so not being able to scale the vmalloc
> > areas beyond a single socket seems like a bit of a limitation.
> > 
> > Scaling out the vmalloc areas in a NUMA aware fashion allows the
> > shrinker to be run in numa aware mode, which gets rid of the need
> > for the global shrinker to loop over every single vmap area in every
> > shrinker invocation. Only the vm areas on the node that has a memory
> > shortage need to be scanned and reclaimed, it doesn't need reclaim
> > everything globally when a single node runs out of memory.
> > 
> > Yes, this may not give quite as good microbenchmark scalability
> > results, but being able to locate each vm area in node local memory
> > and have operation on them largely isolated to node-local tasks and
> > vmalloc area reclaim will work much better on large multi-socket
> > NUMA machines.
> > 
> Currently i fix the max nodes number to 128. This is because i do not
> have an access to such big NUMA systems whereas i do have an access to
> around ~128 ones. That is why i have decided to stop on that number as
> of now.

I suspect you are confusing number of CPUs with number of NUMA nodes.

A NUMA system with 128 nodes is a large NUMA system that will have
thousands of CPU cores, whilst above you talk about basing the
count on CPU cores and that a single socket can have 128 cores?

> We can easily set nr_nodes to num_possible_cpus() and let it scale for
> anyone. But before doing this, i would like to give it a try as a first
> step because i have not tested it well on really big NUMA systems.

I don't think you need to have large NUMA systems to test it. We
have the "fakenuma" feature for a reason.  Essentially, once you
have enough CPU cores that catastrophic lock contention can be
generated in a fast path (can take as few as 4-5 CPU cores), then
you can effectively test NUMA scalability with fakenuma by creating
nodes with >=8 CPUs each.
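For reference, a fakenuma setup along the lines Dave describes might look like this (a sketch; the x86 `numa=fake=` boot parameter splits the system into the given number of emulated nodes, and the qemu image names and topology values here are illustrative assumptions):

```sh
# Boot a 32-vCPU guest split into 4 emulated NUMA nodes (8 CPUs each)
# via the kernel command line, then re-run the benchmark.
qemu-system-x86_64 -enable-kvm -smp 32 -m 16G \
    -kernel bzImage \
    -append "numa=fake=4 console=ttyS0 root=/dev/vda" \
    -drive file=rootfs.img,if=virtio

# Inside the guest: confirm the emulated topology, then benchmark.
numactl --hardware
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
```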

This is how I've done testing of numa aware algorithms (like
shrinkers!) for the past decade - I haven't had direct access to a
big NUMA machine since 2008, yet it's relatively trivial to test
NUMA based scalability algorithms without them these days.

-Dave.
Uladzislau Rezki Jan. 18, 2024, 6:23 p.m. UTC | #4
On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > A number of nodes which are used in the alloc/free paths is
> > > > set based on num_possible_cpus() in a system. Please note a
> > > > high limit threshold though is fixed and corresponds to 128
> > > > nodes.
> > > 
> > > Large CPU count machines are NUMA machines. All of the allocation
> > > and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> > > 
> > > Shrinkers are also able to be run in a NUMA aware mode so that
> > > per-node structures can be reclaimed similar to how per-node LRU
> > > lists are scanned for reclaim.
> > > 
> > > Hence I'm left to wonder if it would be better to have a vmalloc
> > > area per pgdat (or sub-node cluster) rather than just base the
> > > number on CPU count and then have an arbitrary maximum number when
> > > we get to 128 CPU cores. We can have 128 CPU cores in a
> > > single socket these days, so not being able to scale the vmalloc
> > > areas beyond a single socket seems like a bit of a limitation.
> > > 
> > > Scaling out the vmalloc areas in a NUMA aware fashion allows the
> > > shrinker to be run in numa aware mode, which gets rid of the need
> > > for the global shrinker to loop over every single vmap area in every
> > > shrinker invocation. Only the vm areas on the node that has a memory
> > > shortage need to be scanned and reclaimed, it doesn't need reclaim
> > > everything globally when a single node runs out of memory.
> > > 
> > > Yes, this may not give quite as good microbenchmark scalability
> > > results, but being able to locate each vm area in node local memory
> > > and have operation on them largely isolated to node-local tasks and
> > > vmalloc area reclaim will work much better on large multi-socket
> > > NUMA machines.
> > > 
> > Currently i fix the max nodes number to 128. This is because i do not
> > have an access to such big NUMA systems whereas i do have an access to
> > around ~128 ones. That is why i have decided to stop on that number as
> > of now.
> 
> I suspect you are confusing number of CPUs with number of NUMA nodes.
> 
I do not think so :)

>
> A NUMA system with 128 nodes is a large NUMA system that will have
> thousands of CPU cores, whilst above you talk about basing the
> count on CPU cores and that a single socket can have 128 cores?
> 
> > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > anyone. But before doing this, i would like to give it a try as a first
> > step because i have not tested it well on really big NUMA systems.
> 
> I don't think you need to have large NUMA systems to test it. We
> have the "fakenuma" feature for a reason.  Essentially, once you
> have enough CPU cores that catastrophic lock contention can be
> generated in a fast path (can take as few as 4-5 CPU cores), then
> you can effectively test NUMA scalability with fakenuma by creating
> nodes with >=8 CPUs each.
> 
> This is how I've done testing of numa aware algorithms (like
> shrinkers!) for the past decade - I haven't had direct access to a
> big NUMA machine since 2008, yet it's relatively trivial to test
> NUMA based scalability algorithms without them these days.
> 
I see your point. NUMA-aware scalability requires rework, adding an
extra layer that allows such scaling.

If the socket has 256 CPUs, how do we scale VAs inside that node among
those CPUs?

--
Uladzislau Rezki
Dave Chinner Jan. 18, 2024, 9:28 p.m. UTC | #5
On Thu, Jan 18, 2024 at 07:23:47PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> > On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > > anyone. But before doing this, i would like to give it a try as a first
> > > step because i have not tested it well on really big NUMA systems.
> > 
> > I don't think you need to have large NUMA systems to test it. We
> > have the "fakenuma" feature for a reason.  Essentially, once you
> > have enough CPU cores that catastrophic lock contention can be
> > generated in a fast path (can take as few as 4-5 CPU cores), then
> > you can effectively test NUMA scalability with fakenuma by creating
> > nodes with >=8 CPUs each.
> > 
> > This is how I've done testing of numa aware algorithms (like
> > shrinkers!) for the past decade - I haven't had direct access to a
> > big NUMA machine since 2008, yet it's relatively trivial to test
> > NUMA based scalability algorithms without them these days.
> > 
> I see your point. NUMA-aware scalability require reworking adding extra
> layer that allows such scaling.
> 
> If the socket has 256 CPUs, how do scale VAs inside that node among
> those CPUs?

It's called "sub-numa clustering" and is a bios option that presents
large core count CPU packages as multiple NUMA nodes. See:

https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html

Essentially, large core count CPUs are a cluster of smaller core
groups with their own resources and memory controllers. This is how
they are laid out either on a single die (intel) or as a collection
of smaller dies (AMD compute complexes) that are tied together by
the interconnect between the LLCs and memory controllers. They only
appear as a "unified" CPU because they are configured that way by
the bios, but can also be configured to actually expose their inner
non-uniform memory access topology for operating systems and
application stacks that are NUMA aware (like Linux).

This means a "256 core" CPU would probably present as 16 smaller 16
core CPUs each with their own L1/2/3 caches and memory controllers.
IOWs, a single socket appears to the kernel as a 16 node NUMA system
with 16 cores per node. Most NUMA aware scalability algorithms will
work just fine with this sort setup - it's just another set of
numbers in the NUMA distance table...

Cheers,

Dave.
Uladzislau Rezki Jan. 19, 2024, 10:32 a.m. UTC | #6
On Fri, Jan 19, 2024 at 08:28:05AM +1100, Dave Chinner wrote:
> On Thu, Jan 18, 2024 at 07:23:47PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > > > anyone. But before doing this, i would like to give it a try as a first
> > > > step because i have not tested it well on really big NUMA systems.
> > > 
> > > I don't think you need to have large NUMA systems to test it. We
> > > have the "fakenuma" feature for a reason.  Essentially, once you
> > > have enough CPU cores that catastrophic lock contention can be
> > > generated in a fast path (can take as few as 4-5 CPU cores), then
> > > you can effectively test NUMA scalability with fakenuma by creating
> > > nodes with >=8 CPUs each.
> > > 
> > > This is how I've done testing of numa aware algorithms (like
> > > shrinkers!) for the past decade - I haven't had direct access to a
> > > big NUMA machine since 2008, yet it's relatively trivial to test
> > > NUMA based scalability algorithms without them these days.
> > > 
> > I see your point. NUMA-aware scalability require reworking adding extra
> > layer that allows such scaling.
> > 
> > If the socket has 256 CPUs, how do scale VAs inside that node among
> > those CPUs?
> 
> It's called "sub-numa clustering" and is a bios option that presents
> large core count CPU packages as multiple NUMA nodes. See:
> 
> https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html
> 
> Essentially, large core count CPUs are a cluster of smaller core
> groups with their own resources and memory controllers. This is how
> they are laid out either on a single die (intel) or as a collection
> of smaller dies (AMD compute complexes) that are tied together by
> the interconnect between the LLCs and memory controllers. They only
> appear as a "unified" CPU because they are configured that way by
> the bios, but can also be configured to actually expose their inner
> non-uniform memory access topology for operating systems and
> application stacks that are NUMA aware (like Linux).
> 
> This means a "256 core" CPU would probably present as 16 smaller 16
> core CPUs each with their own L1/2/3 caches and memory controllers.
> IOWs, a single socket appears to the kernel as a 16 node NUMA system
> with 16 cores per node. Most NUMA aware scalability algorithms will
> work just fine with this sort setup - it's just another set of
> numbers in the NUMA distance table...
> 
Thank you for your input. I will go through it to see what we can do
in terms of NUMA awareness with thousands of CPUs in total.

Thanks!

--
Uladzislau Rezki

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0c671cb96151..ef534c76daef 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4879,10 +4879,27 @@  static void vmap_init_free_space(void)
 static void vmap_init_nodes(void)
 {
 	struct vmap_node *vn;
-	int i, j;
+	int i, n;
+
+#if BITS_PER_LONG == 64
+	/* A high threshold of max nodes is fixed and bound to 128. */
+	n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
+
+	if (n > 1) {
+		vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN);
+		if (vn) {
+			/* Node partition is 16 pages. */
+			vmap_zone_size = (1 << 4) * PAGE_SIZE;
+			nr_vmap_nodes = n;
+			vmap_nodes = vn;
+		} else {
+			pr_err("Failed to allocate an array. Disable a node layer\n");
+		}
+	}
+#endif
 
-	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+	for (n = 0; n < nr_vmap_nodes; n++) {
+		vn = &vmap_nodes[n];
 		vn->busy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->busy.head);
 		spin_lock_init(&vn->busy.lock);
@@ -4891,9 +4908,9 @@  static void vmap_init_nodes(void)
 		INIT_LIST_HEAD(&vn->lazy.head);
 		spin_lock_init(&vn->lazy.lock);
 
-		for (j = 0; j < MAX_VA_SIZE_PAGES; j++) {
-			INIT_LIST_HEAD(&vn->pool[j].head);
-			WRITE_ONCE(vn->pool[j].len, 0);
+		for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
+			INIT_LIST_HEAD(&vn->pool[i].head);
+			WRITE_ONCE(vn->pool[i].len, 0);
 		}
 
 		spin_lock_init(&vn->pool_lock);