[RFC,3/3] mm/slub: set up maximum per-node partial according to CPU numbers

Message ID 20230905141348.32946-4-feng.tang@intel.com (mailing list archive)
State New
Series mm/slub: reduce contention for per-node list_lock for large systems

Commit Message

Feng Tang Sept. 5, 2023, 2:13 p.m. UTC
Currently most slabs' min_partial is set to 5 (as MIN_PARTIAL
is 5). This is fine for older or small systems, but can be too
small for a large system with hundreds of CPUs, where the per-node
'list_lock' is contended when allocating from and freeing to the
per-node partial list.

So enlarge it based on the number of CPUs per node.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/nodemask.h | 1 +
 mm/slub.c                | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)
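
To get a feel for what the new calculation works out to, here is a quick
userspace sketch (plain C, not kernel code; the helpers are simplified
stand-ins for the kernel's rounddown_pow_of_two()/ilog2(), and the CPU and
node counts are passed in by hand instead of coming from num_cpus() /
num_cpu_nodes()):

#include <stdio.h>

#define MIN_PARTIAL	5

/* simplified stand-in for the kernel's rounddown_pow_of_two() */
static unsigned long rounddown_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

/* simplified stand-in for the kernel's ilog2() */
static unsigned long ilog2(unsigned long n)
{
	unsigned long l = 0;

	while (n >>= 1)
		l++;
	return l;
}

/* mirrors the min_partial calculation added to kmem_cache_open() */
static unsigned long new_min_partial(unsigned long cpus, unsigned long nodes,
				     unsigned long size)
{
	unsigned long min_partial = rounddown_pow_of_two(cpus / nodes);
	unsigned long mp;

	if (min_partial < MIN_PARTIAL)
		min_partial = MIN_PARTIAL;

	mp = ilog2(size) / 2;
	if (mp > min_partial * 2)
		mp = min_partial * 2;
	if (mp < min_partial)
		mp = min_partial;
	return mp;
}

int main(void)
{
	/* 2-socket / 2-node, 96-CPU box, as used for the data below */
	printf("96 CPUs, 2 nodes, 4k objects: %lu\n",
	       new_min_partial(96, 2, 4096));	/* -> 32 */
	/* small single-node, 8-CPU system */
	printf(" 8 CPUs, 1 node,  4k objects: %lu\n",
	       new_min_partial(8, 1, 4096));	/* -> 8 */
	return 0;
}

On a 2-socket, 2-node, 96-CPU machine (the configuration measured in the
replies) this gives 32 for practically every cache, matching the
"effectively increasing MIN_PARTIAL to 32" observation below; the old code
gave values between MIN_PARTIAL (5) and MAX_PARTIAL (10).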

Comments

Hyeonggon Yoo Sept. 12, 2023, 4:48 a.m. UTC | #1
On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
>
> Currently most slabs' min_partial is set to 5 (as MIN_PARTIAL
> is 5). This is fine for older or small systems, but can be too
> small for a large system with hundreds of CPUs, where the per-node
> 'list_lock' is contended when allocating from and freeing to the
> per-node partial list.
>
> So enlarge it based on the number of CPUs per node.
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  include/linux/nodemask.h | 1 +
>  mm/slub.c                | 9 +++++++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 8d07116caaf1..6e22caab186d 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -530,6 +530,7 @@ static inline int node_random(const nodemask_t *maskp)
>
>  #define num_online_nodes()     num_node_state(N_ONLINE)
>  #define num_possible_nodes()   num_node_state(N_POSSIBLE)
> +#define num_cpu_nodes()                num_node_state(N_CPU)
>  #define node_online(node)      node_state((node), N_ONLINE)
>  #define node_possible(node)    node_state((node), N_POSSIBLE)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 09ae1ed642b7..984e012d7bbc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4533,6 +4533,7 @@ static int calculate_sizes(struct kmem_cache *s)
>
>  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>  {
> +       unsigned long min_partial;
>         s->flags = kmem_cache_flags(s->size, flags, s->name);
>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>         s->random = get_random_long();
> @@ -4564,8 +4565,12 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>          * The larger the object size is, the more slabs we want on the partial
>          * list to avoid pounding the page allocator excessively.
>          */
> -       s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
> -       s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
> +
> +       min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
> +       min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
> +
> +       s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
> +       s->min_partial = max_t(unsigned long, min_partial, s->min_partial);

Hello Feng,

How much memory is consumed by this change on your machine?

I won't argue that it would be huge for large machines, but it increases
the minimum value for every cache (even for those that are not contended)
and there is no way to reclaim this.

Maybe a way to reclaim a full slab on memory pressure (on buddy side)
wouldn't hurt?

>         set_cpu_partial(s);
>
> --
> 2.27.0
>
Feng Tang Sept. 14, 2023, 7:05 a.m. UTC | #2
Hi Hyeonggon,

On Tue, Sep 12, 2023 at 01:48:23PM +0900, Hyeonggon Yoo wrote:
> On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Currently most slabs' min_partial is set to 5 (as MIN_PARTIAL
> > is 5). This is fine for older or small systems, but can be too
> > small for a large system with hundreds of CPUs, where the per-node
> > 'list_lock' is contended when allocating from and freeing to the
> > per-node partial list.
> >
> > So enlarge it based on the number of CPUs per node.
> >
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  include/linux/nodemask.h | 1 +
> >  mm/slub.c                | 9 +++++++--
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > index 8d07116caaf1..6e22caab186d 100644
> > --- a/include/linux/nodemask.h
> > +++ b/include/linux/nodemask.h
> > @@ -530,6 +530,7 @@ static inline int node_random(const nodemask_t *maskp)
> >
> >  #define num_online_nodes()     num_node_state(N_ONLINE)
> >  #define num_possible_nodes()   num_node_state(N_POSSIBLE)
> > +#define num_cpu_nodes()                num_node_state(N_CPU)
> >  #define node_online(node)      node_state((node), N_ONLINE)
> >  #define node_possible(node)    node_state((node), N_POSSIBLE)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 09ae1ed642b7..984e012d7bbc 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4533,6 +4533,7 @@ static int calculate_sizes(struct kmem_cache *s)
> >
> >  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> >  {
> > +       unsigned long min_partial;
> >         s->flags = kmem_cache_flags(s->size, flags, s->name);
> >  #ifdef CONFIG_SLAB_FREELIST_HARDENED
> >         s->random = get_random_long();
> > @@ -4564,8 +4565,12 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> >          * The larger the object size is, the more slabs we want on the partial
> >          * list to avoid pounding the page allocator excessively.
> >          */
> > -       s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
> > -       s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
> > +
> > +       min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
> > +       min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
> > +
> > +       s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
> > +       s->min_partial = max_t(unsigned long, min_partial, s->min_partial);
> 
> Hello Feng,
> 
> How much memory is consumed by this change on your machine?

As the code mostly touches the per-node partial lists, I did some profiling
by checking the 'partial' entry of each slab in /sys/kernel/slab/, both
after boot and after running the will-it-scale/mmap1 case with all CPUs.
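
(For reference, those per-slab 'partial' counts can be dumped with a small
walker over /sys/kernel/slab/, e.g. the minimal userspace sketch below;
this is only illustrative, not necessarily the tooling used for the
numbers here.)

#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *dir = opendir("/sys/kernel/slab");
	struct dirent *de;
	char path[512], line[256];
	FILE *f;

	if (!dir)
		return 1;

	while ((de = readdir(dir))) {
		if (de->d_name[0] == '.')	/* skip "." and ".." */
			continue;
		snprintf(path, sizeof(path), "/sys/kernel/slab/%s/partial",
			 de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		/*
		 * 'partial' shows the total plus the per-node breakdown,
		 * e.g. "8 N0=5 N1=3", as in the tables below.
		 */
		if (fgets(line, sizeof(line), f))
			printf("%s/partial:%s", de->d_name, line);
		fclose(f);
	}
	closedir(dir);
	return 0;
}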

The HW is a 2S 48C/96T platform, running CentOS 9. The kernel is
6.6-rc1 with and without this patch (effectively MIN_PARTIAL
increasing to 32).

There are 246 slabs in total on the system, and after boot, 27 of
them show a difference:

	6.6-rc1                         |    6.6-rc1 + node_partial patch
-----------------------------------------------------------------------------

anon_vma_chain/partial:8 N0=5 N1=3      | anon_vma_chain/partial:29 N0=22 N1=7
anon_vma/partial:1 N0=1                 | anon_vma/partial:22 N0=22
bio-184/partial:0                       | bio-184/partial:6 N0=6
buffer_head/partial:0                   | buffer_head/partial:29 N1=29
dentry/partial:2 N0=2                   | dentry/partial:3 N1=3
filp/partial:5 N0=5                     | filp/partial:44 N0=28 N1=16
ioat/partial:10 N0=5 N1=5               | ioat/partial:62 N0=31 N1=31
kmalloc-128/partial:0                   | kmalloc-128/partial:1 N0=1
kmalloc-16/partial:1 N1=1               | kmalloc-16/partial:0
kmalloc-1k/partial:5 N0=5               | kmalloc-1k/partial:12 N0=12
kmalloc-32/partial:2 N0=1 N1=1          | kmalloc-32/partial:0
kmalloc-512/partial:4 N0=4              | kmalloc-512/partial:5 N0=4 N1=1
kmalloc-64/partial:1 N0=1               | kmalloc-64/partial:0
kmalloc-8k/partial:6 N0=6               | kmalloc-8k/partial:28 N0=28
kmalloc-96/partial:24 N0=23 N1=1        | kmalloc-96/partial:44 N0=41 N1=3
kmalloc-cg-32/partial:1 N0=1            | kmalloc-cg-32/partial:0
maple_node/partial:10 N0=6 N1=4         | maple_node/partial:55 N0=27 N1=28
pool_workqueue/partial:1 N0=1           | pool_workqueue/partial:0
radix_tree_node/partial:0               | radix_tree_node/partial:2 N0=1 N1=1
sighand_cache/partial:4 N0=4            | sighand_cache/partial:0
signal_cache/partial:0                  | signal_cache/partial:2 N0=2
skbuff_head_cache/partial:4 N0=2 N1=2   | skbuff_head_cache/partial:27 N0=27
skbuff_small_head/partial:5 N0=5        | skbuff_small_head/partial:32 N0=32
task_struct/partial:1 N0=1              | task_struct/partial:17 N0=17
vma_lock/partial:6 N0=4 N1=2            | vma_lock/partial:32 N0=25 N1=7
vmap_area/partial:1 N0=1                | vmap_area/partial:53 N0=32 N1=21
vm_area_struct/partial:14 N0=8 N1=6     | vm_area_struct/partial:38 N0=15 N1=23


After running the will-it-scale/mmap1 case with 96 processes, 30 slabs show differences:

	6.6-rc1                         |    6.6-rc1 + node_partial patch
-----------------------------------------------------------------------------

anon_vma_chain/partial:8 N0=5 N1=3      | anon_vma_chain/partial:29 N0=22 N1=7
anon_vma/partial:1 N0=1                 | anon_vma/partial:22 N0=22
bio-184/partial:0                       | bio-184/partial:6 N0=6
buffer_head/partial:0                   | buffer_head/partial:29 N1=29
cred_jar/partial:0                      | cred_jar/partial:6 N1=6
dentry/partial:8 N0=3 N1=5              | dentry/partial:22 N0=6 N1=16
filp/partial:6 N0=1 N1=5                | filp/partial:48 N0=28 N1=20
ioat/partial:10 N0=5 N1=5               | ioat/partial:62 N0=31 N1=31
kmalloc-128/partial:0                   | kmalloc-128/partial:1 N0=1
kmalloc-16/partial:2 N0=1 N1=1          | kmalloc-16/partial:3 N0=3
kmalloc-1k/partial:94 N0=49 N1=45       | kmalloc-1k/partial:100 N0=58 N1=42
kmalloc-32/partial:2 N0=1 N1=1          | kmalloc-32/partial:0
kmalloc-512/partial:209 N0=120 N1=89    | kmalloc-512/partial:205 N0=156 N1=49
kmalloc-64/partial:1 N0=1               | kmalloc-64/partial:0
kmalloc-8k/partial:6 N0=6               | kmalloc-8k/partial:28 N0=28
kmalloc-8/partial:0                     | kmalloc-8/partial:1 N0=1
kmalloc-96/partial:25 N0=23 N1=2        | kmalloc-96/partial:36 N0=33 N1=3
kmalloc-cg-32/partial:1 N0=1            | kmalloc-cg-32/partial:0
lsm_inode_cache/partial:0               | lsm_inode_cache/partial:8 N0=8
maple_node/partial:89 N0=46 N1=43       | maple_node/partial:116 N0=63 N1=53
pool_workqueue/partial:1 N0=1           | pool_workqueue/partial:0
radix_tree_node/partial:0               | radix_tree_node/partial:2 N0=1 N1=1
sighand_cache/partial:4 N0=4            | sighand_cache/partial:0
signal_cache/partial:0                  | signal_cache/partial:2 N0=2
skbuff_head_cache/partial:4 N0=2 N1=2   | skbuff_head_cache/partial:27 N0=27
skbuff_small_head/partial:5 N0=5        | skbuff_small_head/partial:32 N0=32
task_struct/partial:1 N0=1              | task_struct/partial:41 N0=32 N1=9
vma_lock/partial:71 N0=40 N1=31         | vma_lock/partial:110 N0=65 N1=45
vmap_area/partial:1 N0=1                | vmap_area/partial:59 N0=38 N1=21
vm_area_struct/partial:106 N0=58 N1=48  | vm_area_struct/partial:151 N0=88 N1=63

There is a measurable increase for some slabs, but not that much.

> I won't argue that it would be huge for large machines, but it increases
> the minimum value for every cache (even for those that are not contended)
> and there is no way to reclaim this.

For slabs with less contention, the per-node partial list may also be
less likely to grow? From the above data, about 10% of the slabs are
affected by the change. Maybe we can also limit the change to large
systems?

One reason I wanted to revisit MIN_PARTIAL is that it was changed from
2 to 5 in 2007 by Christoph, in commit 76be895001f2 ("SLUB: Improve
hackbench speed"), and systems have grown much larger since then.
Currently, while a per-cpu partial list can already hold 5 or more slabs,
the limit for a node with possibly 100+ CPUs could be reconsidered.

> Maybe a way to reclaim a full slab on memory pressure (on buddy side)
> wouldn't hurt?


Sorry, I don't follow. Do you mean to reclaim a slab with 0 'inuse'
objects, like the work done in __kmem_cache_do_shrink()?

Thanks,
Feng

> 
> >         set_cpu_partial(s);
> >
> > --
> > 2.27.0
> >
Lameter, Christopher Sept. 15, 2023, 2:40 a.m. UTC | #3
On Thu, 14 Sep 2023, Feng Tang wrote:

> One reason I wanted to revisit MIN_PARTIAL is that it was changed from
> 2 to 5 in 2007 by Christoph, in commit 76be895001f2 ("SLUB: Improve
> hackbench speed"), and systems have grown much larger since then.
> Currently, while a per-cpu partial list can already hold 5 or more slabs,
> the limit for a node with possibly 100+ CPUs could be reconsidered.

Well, the trick that I keep using on large systems with lots of memory is
to use huge-page-sized slab page allocation. The applications on those
systems already use the same page size. Doing so usually removes a lot of
overhead and speeds things up significantly.

Try booting with "slab_min_order=9"
Feng Tang Sept. 15, 2023, 5:05 a.m. UTC | #4
On Thu, Sep 14, 2023 at 07:40:22PM -0700, Lameter, Christopher wrote:
> On Thu, 14 Sep 2023, Feng Tang wrote:
> 
> > One reason I wanted to revisit MIN_PARTIAL is that it was changed from
> > 2 to 5 in 2007 by Christoph, in commit 76be895001f2 ("SLUB: Improve
> > hackbench speed"), and systems have grown much larger since then.
> > Currently, while a per-cpu partial list can already hold 5 or more slabs,
> > the limit for a node with possibly 100+ CPUs could be reconsidered.
> 
> Well, the trick that I keep using on large systems with lots of memory is
> to use huge-page-sized slab page allocation. The applications on those
> systems already use the same page size. Doing so usually removes a lot of
> overhead and speeds things up significantly.
> 
> Try booting with "slab_min_order=9"

Thanks for sharing the trick! I tried it and it works here. But it is
kind of extreme and fits some special use cases, while these patches
try to be useful for generic usage.

Thanks,
Feng
Lameter, Christopher Sept. 15, 2023, 4:13 p.m. UTC | #5
On Fri, 15 Sep 2023, Feng Tang wrote:

> Thanks for sharing the trick! I tried it and it works here. But it is
> kind of extreme and fits some special use cases, while these patches
> try to be useful for generic usage.

Having a couple of TB of main storage is becoming more and more customary
for servers.

Patch

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 8d07116caaf1..6e22caab186d 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -530,6 +530,7 @@  static inline int node_random(const nodemask_t *maskp)
 
 #define num_online_nodes()	num_node_state(N_ONLINE)
 #define num_possible_nodes()	num_node_state(N_POSSIBLE)
+#define num_cpu_nodes()		num_node_state(N_CPU)
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
diff --git a/mm/slub.c b/mm/slub.c
index 09ae1ed642b7..984e012d7bbc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4533,6 +4533,7 @@  static int calculate_sizes(struct kmem_cache *s)
 
 static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 {
+	unsigned long min_partial;
 	s->flags = kmem_cache_flags(s->size, flags, s->name);
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	s->random = get_random_long();
@@ -4564,8 +4565,12 @@  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	 * The larger the object size is, the more slabs we want on the partial
 	 * list to avoid pounding the page allocator excessively.
 	 */
-	s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
-	s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
+
+	min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
+	min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
+
+	s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
+	s->min_partial = max_t(unsigned long, min_partial, s->min_partial);
 
 	set_cpu_partial(s);