
[RFC,1/3] mm/slub: increase the maximum slab order to 4 for big systems

Message ID 20230905141348.32946-2-feng.tang@intel.com (mailing list archive)
State New
Series mm/slub: reduce contention for per-node list_lock for large systems

Commit Message

Feng Tang Sept. 5, 2023, 2:13 p.m. UTC
There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test [1][2] on server systems, and
similar contention is also seen when running the 'mmap1' case of
will-it-scale on big systems. As the trend is for one processor
(socket) to have more and more CPUs (100+, 200+), the contention
could get much more severe and become a scalability issue.

One way to help reduce the contention is to increase the maximum
slab order from 3 to 4 for big systems.

Unconditionally increasing the order could bring trouble to client
devices with very limited memory, which may care more about memory
footprint; allocating order-4 pages is also harder under memory
pressure. So the increase is only done for big systems like servers,
which are usually equipped with plenty of memory and are more likely
to hit the lock contention issue.
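
As a rough back-of-the-envelope illustration (a user-space sketch, not
kernel code; the 4KB page size, 256-byte object size and 1M-object working
set are assumed example values), going from order 3 to order 4 doubles the
objects per slab and so roughly halves the number of slabs that have to
cycle through the per-node partial list:

/*
 * Sketch: objects per slab and slab count at order 3 vs order 4.
 * All sizes here are assumptions for illustration only.
 */
#include <stdio.h>

int main(void)
{
	unsigned int page_size = 4096;		/* assumed 4KB base pages */
	unsigned int object_size = 256;		/* assumed example object size */
	unsigned int total_objs = 1 << 20;	/* assumed 1M live objects */
	unsigned int order;

	for (order = 3; order <= 4; order++) {
		unsigned int objs_per_slab = (page_size << order) / object_size;

		printf("order %u: %u objects/slab, ~%u slabs needed\n",
		       order, objs_per_slab, total_objs / objs_per_slab);
	}
	return 0;
}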

Following is some performance data:

will-it-scale/mmap1
-------------------
Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket Sapphire
Rapids server (112 cores / 224 threads) with 256 GB DRAM, in 3
configurations with parallel test threads at 25%, 50% and 100% of the
number of CPUs. The data is below (base is the vanilla v6.5 kernel; the
middle column is the relative change):

		     base                      base+patch
wis-mmap1-25%	    223670           +33.3%     298205        per_process_ops
wis-mmap1-50%	    186020           +51.8%     282383        per_process_ops
wis-mmap1-100%       89200           +65.0%     147139        per_process_ops

Taking the perf-profile comparison of the 50% test case (columns: base %,
change, patched %), the lock contention is greatly reduced:

      43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
      0.85            -0.2        0.65        pp.self.___slab_alloc
      0.41            -0.1        0.27        pp.self.__unfreeze_partials
      0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial

hackbench
---------

Run the same hackbench test case mentioned in [1], using the same HW/SW as for will-it-scale:

		     base                      base+patch
hackbench	    759951           +10.5%     839601        hackbench.throughput

perf-profile diff:
     22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
      0.82            -0.2        0.59        pp.self.___slab_alloc
      0.33            -0.2        0.13        pp.self.__unfreeze_partials

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

Comments

Hyeonggon Yoo Sept. 12, 2023, 4:52 a.m. UTC | #1
On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
>
> There are reports of severe lock contention on slub's per-node
> 'list_lock' in the 'hackbench' test [1][2] on server systems, and
> similar contention is also seen when running the 'mmap1' case of
> will-it-scale on big systems. As the trend is for one processor
> (socket) to have more and more CPUs (100+, 200+), the contention
> could get much more severe and become a scalability issue.
>
> One way to help reduce the contention is to increase the maximum
> slab order from 3 to 4 for big systems.

Hello Feng,

Increasing the order with a higher number of CPUs (and so with more
memory) makes sense to me. IIUC the contention here becomes worse when
the number of slabs increases, so it makes sense to decrease the number
of slabs by increasing the order.

By the way, my silly question here is:
In the first place, is it worth taking 1/2 of s->cpu_partial_slabs in
the slowpath when the slab is frequently used? Wouldn't the cpu partial
slab list be re-filled again by free if free operations are frequently
performed?

> Unconditionally increasing the order could bring trouble to client
> devices with very limited memory, which may care more about memory
> footprint; allocating order-4 pages is also harder under memory
> pressure. So the increase is only done for big systems like servers,
> which are usually equipped with plenty of memory and are more likely
> to hit the lock contention issue.

Also, does it make sense not to increase the order when PAGE_SIZE > 4096?

> Following is some performance data:
>
> will-it-scale/mmap1
> -------------------
> Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket Sapphire
> Rapids server (112 cores / 224 threads) with 256 GB DRAM, in 3
> configurations with parallel test threads at 25%, 50% and 100% of the
> number of CPUs. The data is below (base is the vanilla v6.5 kernel; the
> middle column is the relative change):
>
>                      base                      base+patch
> wis-mmap1-25%       223670           +33.3%     298205        per_process_ops
> wis-mmap1-50%       186020           +51.8%     282383        per_process_ops
> wis-mmap1-100%       89200           +65.0%     147139        per_process_ops
>
> Taking the perf-profile comparison of the 50% test case (columns: base %,
> change, patched %), the lock contention is greatly reduced:
>
>       43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
>       0.85            -0.2        0.65        pp.self.___slab_alloc
>       0.41            -0.1        0.27        pp.self.__unfreeze_partials
>       0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial
>
> hackbench
> ---------
>
> Run the same hackbench test case mentioned in [1], using the same HW/SW as for will-it-scale:
>
>                      base                      base+patch
> hackbench           759951           +10.5%     839601        hackbench.throughput
>
> perf-profile diff:
>      22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
>       0.82            -0.2        0.59        pp.self.___slab_alloc
>       0.33            -0.2        0.13        pp.self.__unfreeze_partials
>
> [1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
> [2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> Signed-off-by: Feng Tang <feng.tang@intel.com>

> ---
>  mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f7940048138c..09ae1ed642b7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
>   */
>  static unsigned int slub_min_order;
>  static unsigned int slub_max_order =
> -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
> +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
>  static unsigned int slub_min_objects;
>
>  /*
> @@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
>         return order;
>  }
Feng Tang Sept. 12, 2023, 3:52 p.m. UTC | #2
Hi Hyeonggon,

Many thanks for the review!

On Tue, Sep 12, 2023 at 01:52:19PM +0900, Hyeonggon Yoo wrote:
> On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > There are reports of severe lock contention on slub's per-node
> > 'list_lock' in the 'hackbench' test [1][2] on server systems, and
> > similar contention is also seen when running the 'mmap1' case of
> > will-it-scale on big systems. As the trend is for one processor
> > (socket) to have more and more CPUs (100+, 200+), the contention
> > could get much more severe and become a scalability issue.
> >
> > One way to help reduce the contention is to increase the maximum
> > slab order from 3 to 4 for big systems.
> 
> Hello Feng,
> 
> Increasing the order with a higher number of CPUs (and so with more
> memory) makes sense to me. IIUC the contention here becomes worse when
> the number of slabs increases, so it makes sense to decrease the number
> of slabs by increasing the order.
> 
> By the way, my silly question here is:
> In the first place, is it worth taking 1/2 of s->cpu_partial_slabs in
> the slowpath when the slab is frequently used? Wouldn't the cpu partial
> slab list be re-filled again by free if free operations are frequently
> performed?

My understanding is that the contention is related to the number of
objects available to each CPU (the current slab plus the per-cpu
partial list); if they are used up more easily, the per-node lock
will be contended.

This patch increases the order (I should have also considered the
CPU number) while keeping the number of per-cpu partial slabs
unchanged, as it also doubles 'nr_objects' in set_cpu_partial().

But the 2/3 patch only increases the per-cpu partial count and keeps
the order unchanged. From the performance data in the cover letter,
1/3 and 2/3 can each individually reduce the contention for
will-it-scale/mmap1, as they both increase the number of objects
available per CPU.
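
As a quick sanity check of the numbers, below is a user-space model (not
kernel code; fls() is reimplemented locally) of the min_objects heuristic
as modified by this patch, evaluated for the 224-CPU test machine above:

/* Model of the patched min_objects heuristic from calculate_order(). */
#include <stdio.h>

/* Minimal stand-in for the kernel's fls(): 1-based index of the highest set bit. */
static int fls_model(unsigned int x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	unsigned int nr_cpus = 224;	/* 2-socket SPR, 224 threads */
	unsigned int min_objects = 4 * (fls_model(nr_cpus) + 1);

	/* The patch doubles the target on systems with >= 32 CPUs. */
	if (nr_cpus >= 32)
		min_objects *= 2;

	printf("nr_cpus=%u -> min_objects=%u\n", nr_cpus, min_objects);
	return 0;
}

With nr_cpus = 224, fls() returns 8, so min_objects goes from 36 to 72
once the >= 32 CPU doubling is applied.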

> 
> > Unconditionally increasing the order could bring trouble to client
> > devices with very limited memory, which may care more about memory
> > footprint; allocating order-4 pages is also harder under memory
> > pressure. So the increase is only done for big systems like servers,
> > which are usually equipped with plenty of memory and are more likely
> > to hit the lock contention issue.
> 
> Also, does it make sense not to increase the order when PAGE_SIZE > 4096?

Good point! Some other discussion on the mm list earlier this week
also reminded me that there are ARCHs supporting bigger pages like
64KB, and these patches need to take that into account.
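(For example, with 4KB base pages an order-4 slab is 64KB, while with a
64KB base page the same order would already mean a 1MB slab.)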

> > Following is some performance data:
> >
> > will-it-scale/mmap1
> > -------------------
> > Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket Sapphire
> > Rapids server (112 cores / 224 threads) with 256 GB DRAM, in 3
> > configurations with parallel test threads at 25%, 50% and 100% of the
> > number of CPUs. The data is below (base is the vanilla v6.5 kernel; the
> > middle column is the relative change):
> >
> >                      base                      base+patch
> > wis-mmap1-25%       223670           +33.3%     298205        per_process_ops
> > wis-mmap1-50%       186020           +51.8%     282383        per_process_ops
> > wis-mmap1-100%       89200           +65.0%     147139        per_process_ops
> >
> > Taking the perf-profile comparison of the 50% test case (columns: base %,
> > change, patched %), the lock contention is greatly reduced:
> >
> >       43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
> >       0.85            -0.2        0.65        pp.self.___slab_alloc
> >       0.41            -0.1        0.27        pp.self.__unfreeze_partials
> >       0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial
> >
> > hackbench
> > ---------
> >
> > Run the same hackbench test case mentioned in [1], using the same HW/SW as for will-it-scale:
> >
> >                      base                      base+patch
> > hackbench           759951           +10.5%     839601        hackbench.throughput
> >
> > perf-profile diff:
> >      22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
> >       0.82            -0.2        0.59        pp.self.___slab_alloc
> >       0.33            -0.2        0.13        pp.self.__unfreeze_partials
> >
> > [1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
> > [2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> > ---
> >  mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 38 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f7940048138c..09ae1ed642b7 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
> >   */
> >  static unsigned int slub_min_order;
> >  static unsigned int slub_max_order =
> > -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
> > +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
> >  static unsigned int slub_min_objects;
> >
> >  /*
> > @@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
> >         return order;
> >  }
>

Patch

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..09ae1ed642b7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4081,7 +4081,7 @@  EXPORT_SYMBOL(kmem_cache_alloc_bulk);
  */
 static unsigned int slub_min_order;
 static unsigned int slub_max_order =
-	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
+	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
 static unsigned int slub_min_objects;
 
 /*
@@ -4134,6 +4134,26 @@  static inline unsigned int calc_slab_order(unsigned int size,
 	return order;
 }
 
+static inline int num_cpus(void)
+{
+	int nr_cpus;
+
+	/*
+	 * Some architectures will only update present cpus when
+	 * onlining them, so don't trust the number if it's just 1. But
+	 * we also don't want to use nr_cpu_ids always, as on some other
+	 * architectures, there can be many possible cpus, but never
+	 * onlined. Here we compromise between trying to avoid too high
+	 * order on systems that appear larger than they are, and too
+	 * low order on systems that appear smaller than they are.
+	 */
+	nr_cpus = num_present_cpus();
+	if (nr_cpus <= 1)
+		nr_cpus = nr_cpu_ids;
+
+	return nr_cpus;
+}
+
 static inline int calculate_order(unsigned int size)
 {
 	unsigned int order;
@@ -4151,19 +4171,17 @@  static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects) {
-		/*
-		 * Some architectures will only update present cpus when
-		 * onlining them, so don't trust the number if it's just 1. But
-		 * we also don't want to use nr_cpu_ids always, as on some other
-		 * architectures, there can be many possible cpus, but never
-		 * onlined. Here we compromise between trying to avoid too high
-		 * order on systems that appear larger than they are, and too
-		 * low order on systems that appear smaller than they are.
-		 */
-		nr_cpus = num_present_cpus();
-		if (nr_cpus <= 1)
-			nr_cpus = nr_cpu_ids;
+		nr_cpus = num_cpus();
 		min_objects = 4 * (fls(nr_cpus) + 1);
+
+		/*
+		 * If nr_cpus >= 32, the platform is likely to be a server,
+		 * which usually has much more memory and is more easily
+		 * hurt by the scalability issue, so enlarge min_objects to
+		 * reduce the possible contention on the per-node 'list_lock'.
+		 */
+		if (nr_cpus >= 32)
+			min_objects *= 2;
 	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
@@ -4361,6 +4379,13 @@  static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger systems more buffer to reduce the scalability issue,
+	 * similar to the handling in calculate_order().
+	 */
+	if (num_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
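
For illustration, here is a much-simplified user-space model of how the
raised slub_max_order and the larger min_objects interact; it is not the
real calc_slab_order(), which also weighs the fraction of wasted space per
slab, and the 512-byte object size is an assumed example:

/* Simplified model: lowest order whose slab holds min_objects, capped at max_order. */
#include <stdio.h>

static unsigned int pick_order(unsigned int size, unsigned int min_objects,
			       unsigned int max_order)
{
	unsigned int order;

	for (order = 0; order < max_order; order++) {
		unsigned int objs = (4096u << order) / size;

		if (objs >= min_objects)
			break;
	}
	return order;	/* falls back to max_order if no lower order is enough */
}

int main(void)
{
	unsigned int size = 512;	/* assumed example object size */

	printf("old: max_order=3, min_objects=36 -> order %u\n",
	       pick_order(size, 36, 3));
	printf("new: max_order=4, min_objects=72 -> order %u\n",
	       pick_order(size, 72, 4));
	return 0;
}

Under this simplified model, the 512-byte cache stays at order 3 with the
old limits but moves to order 4 (twice the objects per slab) with the
patched ones.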