
[RFC,v0] mm/slub: Let number of online CPUs determine the slub page order

Message ID 20201118082759.1413056-1-bharata@linux.ibm.com (mailing list archive)
State New, archived

Commit Message

Bharata B Rao Nov. 18, 2020, 8:27 a.m. UTC
The page order of the slab that gets chosen for a given slab
cache depends on the number of objects that can fit in the
slab while meeting other requirements. We start with a
minimum-objects value based on nr_cpu_ids, which is driven by
the possible number of CPUs and hence could be higher than the
actual number of CPUs present in the system. This leads to
calculate_order() choosing a page order that is on the higher
side, leading to increased slab memory consumption on systems
that have bigger page sizes.

Hence rely on the number of online CPUs when determining the
minimum objects, thereby increasing the chances of choosing
a lower, more conservative page order for the slab.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
This is a generic change and I am unsure how it would affect
other archs, but as a start, here are some numbers from
PowerPC pseries KVM guest with and without this patch:

This table shows how this change has affected some of the slab
caches.
===================================================================
		Current				Patched
Cache	<objperslab> <pagesperslab>	<objperslab> <pagesperslab>
===================================================================
TCPv6		53    2			26    1
net_namespace	53    4			26    2
dtl		32    2			16    1
names_cache	32    2			16    1
task_struct	53    8			13    2
thread_stack	32    8			8     2
pgtable-2^11	16    8			8     4
pgtable-2^8	32    2			16    1
kmalloc-32k	16    8			8     4
kmalloc-16k	32    8			8     2
kmalloc-8k	32    4			8     1
kmalloc-4k	32    2			16    1
===================================================================

Slab memory (kB) consumption comparison
==================================================================
			Current		Patched
==================================================================
After-boot		205760		156096
During-hackbench	629145		506752 (Avg of 5 runs)
After-hackbench		474176		331840 (after drop_caches)
==================================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
==========================================
Current		Patched
==========================================
10.990		11.010
==========================================

Measuring the effect due to CPU hotplug
----------------------------------------
Since the patch doesn't consider all the possible CPUs for the page
order calculation, let's see how it affects the case when CPUs are
hotplugged. Here I compare a system that is booted with 64 CPUs
against a system that is booted with 16 CPUs and has 48 CPUs
hotplugged after boot. These numbers are with the patch applied.

Slab memory (kB) consumption comparison
===================================================================
			64bootCPUs	16bootCPUs+48HotPluggedCPUs
===================================================================
After-boot		390272		159744
After-hotplug		-		251328
During-hackbench	1001267		941926 (Avg of 5 runs)
After-hackbench		913600		827200 (after drop_caches)
===================================================================

Hackbench Time (Avg of 5 runs)
(hackbench -s 1024 -l 200 -g 200 -f 25 -P)
===========================================
64bootCPUs	16bootCPUs+48HotPluggedCPUs
===========================================
12.554		12.589
===========================================
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Vlastimil Babka Nov. 18, 2020, 11:25 a.m. UTC | #1
On 11/18/20 9:27 AM, Bharata B Rao wrote:
> The page order of the slab that gets chosen for a given slab
> cache depends on the number of objects that can be fit in the
> slab while meeting other requirements. We start with a value
> of minimum objects based on nr_cpu_ids that is driven by
> possible number of CPUs and hence could be higher than the
> actual number of CPUs present in the system. This leads to
> calculate_order() chosing a page order that is on the higher
> side leading to increased slab memory consumption on systems
> that have bigger page sizes.
> 
> Hence rely on the number of online CPUs when determining the
> mininum objects, thereby increasing the chances of chosing
> a lower conservative page order for the slab.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Ideally, we would react to hotplug events and update existing caches 
accordingly. But for that, recalculation of order for existing caches 
would have to be made safe, while not affecting hot paths. We have 
removed the sysfs interface with 32a6f409b693 ("mm, slub: remove runtime 
allocation order changes") as it didn't seem easy and worth the trouble.

In case somebody wants to start with a large order right from the boot 
because they know they will hotplug lots of cpus later, they can use 
slub_min_objects= boot param to override this heuristic. So in case this 
change regresses somebody's performance, there's a way around it and 
thus the risk is low IMHO.
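As a concrete (hypothetical) example of that workaround, a machine expected to grow to 64 CPUs could pin the heuristic via the command line, since 4 * (fls(64) + 1) = 32:

```shell
# Hypothetical /etc/default/grub fragment: size slabs as if 64 CPUs
# were online from boot, overriding the online-CPU heuristic.
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} slub_min_objects=32"
# Then regenerate the grub config (e.g. update-grub) and reboot.
```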

> ---
> This is a generic change and I am unsure how it would affect
> other archs, but as a start, here are some numbers from
> PowerPC pseries KVM guest with and without this patch:
> 
> This table shows how this change has affected some of the slab
> caches.
> ===================================================================
> 		Current				Patched
> Cache	<objperslab> <pagesperslab>	<objperslab> <pagesperslab>
> ===================================================================
> TCPv6		53    2			26    1
> net_namespace	53    4			26    2
> dtl		32    2			16    1
> names_cache	32    2			16    1
> task_struct	53    8			13    2
> thread_stack	32    8			8     2
> pgtable-2^11	16    8			8     4
> pgtable-2^8	32    2			16    1
> kmalloc-32k	16    8			8     4
> kmalloc-16k	32    8			8     2
> kmalloc-8k	32    4			8     1
> kmalloc-4k	32    2			16    1
> ===================================================================
> 
> Slab memory (kB) consumption comparision
> ==================================================================
> 			Current		Patched
> ==================================================================
> After-boot		205760		156096
> During-hackbench	629145		506752 (Avg of 5 runs)
> After-hackbench		474176		331840 (after drop_caches)
> ==================================================================
> 
> Hackbench Time (Avg of 5 runs)
> (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> ==========================================
> Current		Patched
> ==========================================
> 10.990		11.010
> ==========================================
> 
> Measuring the effect due to CPU hotplug
> ----------------------------------------
> Since the patch doesn't consider all the possible CPUs for page
> order calcluation, let's see how affects the case when CPUs are
> hotplugged. Here I compare a system that is booted with 64CPUs
> with a system that is booted with 16CPUs but hotplugged with
> 48CPUs after boot. These numbers are with the patch applied.
> 
> Slab memory (kB) consumption comparision
> ===================================================================
> 			64bootCPUs	16bootCPUs+48HotPluggedCPUs
> ===================================================================
> After-boot		390272		159744
> After-hotplug		-		251328
> During-hackbench	1001267		941926 (Avg of 5 runs)
> After-hackbench		913600		827200 (after drop_caches)
> ===================================================================
> 
> Hackbench Time (Avg of 5 runs)
> (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> ===========================================
> 64bootCPUs	16bootCPUs+48HotPluggedCPUs
> ===========================================
> 12.554		12.589
> ===========================================
>   mm/slub.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 34dcc09e2ec9..8342c0a167b2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
>   	 */
>   	min_objects = slub_min_objects;
>   	if (!min_objects)
> -		min_objects = 4 * (fls(nr_cpu_ids) + 1);
> +		min_objects = 4 * (fls(num_online_cpus()) + 1);
>   	max_objects = order_objects(slub_max_order, size);
>   	min_objects = min(min_objects, max_objects);
>   
>
Roman Gushchin Nov. 18, 2020, 7:34 p.m. UTC | #2
On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > The page order of the slab that gets chosen for a given slab
> > cache depends on the number of objects that can be fit in the
> > slab while meeting other requirements. We start with a value
> > of minimum objects based on nr_cpu_ids that is driven by
> > possible number of CPUs and hence could be higher than the
> > actual number of CPUs present in the system. This leads to
> > calculate_order() chosing a page order that is on the higher
> > side leading to increased slab memory consumption on systems
> > that have bigger page sizes.
> > 
> > Hence rely on the number of online CPUs when determining the
> > mininum objects, thereby increasing the chances of chosing
> > a lower conservative page order for the slab.
> > 
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Ideally, we would react to hotplug events and update existing caches
> accordingly. But for that, recalculation of order for existing caches would
> have to be made safe, while not affecting hot paths. We have removed the
> sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation
> order changes") as it didn't seem easy and worth the trouble.
> 
> In case somebody wants to start with a large order right from the boot
> because they know they will hotplug lots of cpus later, they can use
> slub_min_objects= boot param to override this heuristic. So in case this
> change regresses somebody's performance, there's a way around it and thus
> the risk is low IMHO.

I agree. For the absolute majority of users there will be no difference.
And there is a good workaround for the rest.

Acked-by: Roman Gushchin <guro@fb.com>
David Rientjes Nov. 18, 2020, 7:53 p.m. UTC | #3
On Wed, 18 Nov 2020, Roman Gushchin wrote:

> On Wed, Nov 18, 2020 at 12:25:38PM +0100, Vlastimil Babka wrote:
> > On 11/18/20 9:27 AM, Bharata B Rao wrote:
> > > The page order of the slab that gets chosen for a given slab
> > > cache depends on the number of objects that can be fit in the
> > > slab while meeting other requirements. We start with a value
> > > of minimum objects based on nr_cpu_ids that is driven by
> > > possible number of CPUs and hence could be higher than the
> > > actual number of CPUs present in the system. This leads to
> > > calculate_order() chosing a page order that is on the higher
> > > side leading to increased slab memory consumption on systems
> > > that have bigger page sizes.
> > > 
> > > Hence rely on the number of online CPUs when determining the
> > > mininum objects, thereby increasing the chances of chosing
> > > a lower conservative page order for the slab.
> > > 
> > > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > 
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > Ideally, we would react to hotplug events and update existing caches
> > accordingly. But for that, recalculation of order for existing caches would
> > have to be made safe, while not affecting hot paths. We have removed the
> > sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation
> > order changes") as it didn't seem easy and worth the trouble.
> > 
> > In case somebody wants to start with a large order right from the boot
> > because they know they will hotplug lots of cpus later, they can use
> > slub_min_objects= boot param to override this heuristic. So in case this
> > change regresses somebody's performance, there's a way around it and thus
> > the risk is low IMHO.
> 
> I agree. For the absolute majority of users there will be no difference.
> And there is a good workaround for the rest.
> 
> Acked-by: Roman Gushchin <guro@fb.com>
> 

Acked-by: David Rientjes <rientjes@google.com>
Vincent Guittot Jan. 20, 2021, 5:36 p.m. UTC | #4
Hi,

On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> The page order of the slab that gets chosen for a given slab
> cache depends on the number of objects that can be fit in the
> slab while meeting other requirements. We start with a value
> of minimum objects based on nr_cpu_ids that is driven by
> possible number of CPUs and hence could be higher than the
> actual number of CPUs present in the system. This leads to
> calculate_order() chosing a page order that is on the higher
> side leading to increased slab memory consumption on systems
> that have bigger page sizes.
>
> Hence rely on the number of online CPUs when determining the
> mininum objects, thereby increasing the chances of chosing
> a lower conservative page order for the slab.
>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
> This is a generic change and I am unsure how it would affect
> other archs, but as a start, here are some numbers from
> PowerPC pseries KVM guest with and without this patch:
>
> This table shows how this change has affected some of the slab
> caches.
> ===================================================================
>                 Current                         Patched
> Cache   <objperslab> <pagesperslab>     <objperslab> <pagesperslab>
> ===================================================================
> TCPv6           53    2                 26    1
> net_namespace   53    4                 26    2
> dtl             32    2                 16    1
> names_cache     32    2                 16    1
> task_struct     53    8                 13    2
> thread_stack    32    8                 8     2
> pgtable-2^11    16    8                 8     4
> pgtable-2^8     32    2                 16    1
> kmalloc-32k     16    8                 8     4
> kmalloc-16k     32    8                 8     2
> kmalloc-8k      32    4                 8     1
> kmalloc-4k      32    2                 16    1
> ===================================================================
>
> Slab memory (kB) consumption comparision
> ==================================================================
>                         Current         Patched
> ==================================================================
> After-boot              205760          156096
> During-hackbench        629145          506752 (Avg of 5 runs)
> After-hackbench         474176          331840 (after drop_caches)
> ==================================================================
>
> Hackbench Time (Avg of 5 runs)
> (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> ==========================================
> Current         Patched
> ==========================================
> 10.990          11.010
> ==========================================
>
> Measuring the effect due to CPU hotplug
> ----------------------------------------
> Since the patch doesn't consider all the possible CPUs for page
> order calcluation, let's see how affects the case when CPUs are
> hotplugged. Here I compare a system that is booted with 64CPUs
> with a system that is booted with 16CPUs but hotplugged with
> 48CPUs after boot. These numbers are with the patch applied.
>
> Slab memory (kB) consumption comparision
> ===================================================================
>                         64bootCPUs      16bootCPUs+48HotPluggedCPUs
> ===================================================================
> After-boot              390272          159744
> After-hotplug           -               251328
> During-hackbench        1001267         941926 (Avg of 5 runs)
> After-hackbench         913600          827200 (after drop_caches)
> ===================================================================
>
> Hackbench Time (Avg of 5 runs)
> (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> ===========================================
> 64bootCPUs      16bootCPUs+48HotPluggedCPUs
> ===========================================
> 12.554          12.589
> ===========================================
>  mm/slub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

I'm facing a significant performance regression on a large arm64 server
system (224 CPUs). The regression is also present on a small arm64 system
(8 CPUs), but at a far smaller magnitude.

On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
v5.11-rc4 : 9.135sec (+/- 0.45%)
v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
v5.10: 3.136sec (+/- 0.40%)

This is a 191% regression compared to v5.10.

The problem is that calculate_order() is called a number of times
before the secondary CPUs are booted, when num_online_cpus() returns 1
instead of 224. This makes the use of num_online_cpus() irrelevant in
those cases.

After adding "slub_min_objects=36" to my command line, which equals
4 * (fls(num_online_cpus()) + 1) with the correct num_online_cpus == 224,
the regression disappears:

9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)



> diff --git a/mm/slub.c b/mm/slub.c
> index 34dcc09e2ec9..8342c0a167b2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
>          */
>         min_objects = slub_min_objects;
>         if (!min_objects)
> -               min_objects = 4 * (fls(nr_cpu_ids) + 1);
> +               min_objects = 4 * (fls(num_online_cpus()) + 1);
>         max_objects = order_objects(slub_max_order, size);
>         min_objects = min(min_objects, max_objects);
>
> --
> 2.26.2
>
Bharata B Rao Jan. 21, 2021, 5:30 a.m. UTC | #5
On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> Hi,
> 
> On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > The page order of the slab that gets chosen for a given slab
> > cache depends on the number of objects that can be fit in the
> > slab while meeting other requirements. We start with a value
> > of minimum objects based on nr_cpu_ids that is driven by
> > possible number of CPUs and hence could be higher than the
> > actual number of CPUs present in the system. This leads to
> > calculate_order() chosing a page order that is on the higher
> > side leading to increased slab memory consumption on systems
> > that have bigger page sizes.
> >
> > Hence rely on the number of online CPUs when determining the
> > mininum objects, thereby increasing the chances of chosing
> > a lower conservative page order for the slab.
> >
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > ---
> > This is a generic change and I am unsure how it would affect
> > other archs, but as a start, here are some numbers from
> > PowerPC pseries KVM guest with and without this patch:
> >
> > This table shows how this change has affected some of the slab
> > caches.
> > ===================================================================
> >                 Current                         Patched
> > Cache   <objperslab> <pagesperslab>     <objperslab> <pagesperslab>
> > ===================================================================
> > TCPv6           53    2                 26    1
> > net_namespace   53    4                 26    2
> > dtl             32    2                 16    1
> > names_cache     32    2                 16    1
> > task_struct     53    8                 13    2
> > thread_stack    32    8                 8     2
> > pgtable-2^11    16    8                 8     4
> > pgtable-2^8     32    2                 16    1
> > kmalloc-32k     16    8                 8     4
> > kmalloc-16k     32    8                 8     2
> > kmalloc-8k      32    4                 8     1
> > kmalloc-4k      32    2                 16    1
> > ===================================================================
> >
> > Slab memory (kB) consumption comparision
> > ==================================================================
> >                         Current         Patched
> > ==================================================================
> > After-boot              205760          156096
> > During-hackbench        629145          506752 (Avg of 5 runs)
> > After-hackbench         474176          331840 (after drop_caches)
> > ==================================================================
> >
> > Hackbench Time (Avg of 5 runs)
> > (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> > ==========================================
> > Current         Patched
> > ==========================================
> > 10.990          11.010
> > ==========================================
> >
> > Measuring the effect due to CPU hotplug
> > ----------------------------------------
> > Since the patch doesn't consider all the possible CPUs for page
> > order calcluation, let's see how affects the case when CPUs are
> > hotplugged. Here I compare a system that is booted with 64CPUs
> > with a system that is booted with 16CPUs but hotplugged with
> > 48CPUs after boot. These numbers are with the patch applied.
> >
> > Slab memory (kB) consumption comparision
> > ===================================================================
> >                         64bootCPUs      16bootCPUs+48HotPluggedCPUs
> > ===================================================================
> > After-boot              390272          159744
> > After-hotplug           -               251328
> > During-hackbench        1001267         941926 (Avg of 5 runs)
> > After-hackbench         913600          827200 (after drop_caches)
> > ===================================================================
> >
> > Hackbench Time (Avg of 5 runs)
> > (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> > ===========================================
> > 64bootCPUs      16bootCPUs+48HotPluggedCPUs
> > ===========================================
> > 12.554          12.589
> > ===========================================
> >  mm/slub.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> 
> I'm facing significant performances regression on a large arm64 server
> system (224 CPUs). Regressions is also present on small arm64 system
> (8 CPUs) but in a far smaller order of magnitude
> 
> On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> v5.11-rc4 : 9.135sec (+/- 0.45%)
> v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> v5.10: 3.136sec (+/- 0.40%)
> 
> This is a 191% regression compared to v5.10.
> 
> The problem is that calculate_order() is called a number of times
> before secondaries CPUs are booted and it returns 1 instead of 224.
> This makes the use of num_online_cpus() irrelevant for those cases
> 
> After adding in my command line "slub_min_objects=36" which equals to
> 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> , the regression diseapears:
> 
> 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)

Should we have switched to num_present_cpus() rather than
num_online_cpus()? If so, the below patch should address the
above problem.

From 252b332ccbee7152da1e18f1fff5b83f8e01b8df Mon Sep 17 00:00:00 2001
From: Bharata B Rao <bharata@linux.ibm.com>
Date: Thu, 21 Jan 2021 10:35:08 +0530
Subject: [PATCH] mm/slub: let number of present CPUs determine the slub
 page order

Commit 045ab8c9487b ("mm/slub: let number of online CPUs determine
the slub page order") changed the slub page order to depend on
num_online_cpus() instead of nr_cpu_ids. However, we find that certain
caches (kmalloc) are initialized even before the secondary CPUs
are onlined, resulting in a lower slub page order and a subsequent
performance regression.

Switch to num_present_cpus() instead.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index d9e4e10683cc..2f3e412c849d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(num_online_cpus()) + 1);
+		min_objects = 4 * (fls(num_present_cpus()) + 1);
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
Vincent Guittot Jan. 21, 2021, 9:09 a.m. UTC | #6
On Thu, 21 Jan 2021 at 06:31, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Wed, Jan 20, 2021 at 06:36:31PM +0100, Vincent Guittot wrote:
> > Hi,
> >
> > On Wed, 18 Nov 2020 at 09:28, Bharata B Rao <bharata@linux.ibm.com> wrote:
> > >
> > > The page order of the slab that gets chosen for a given slab
> > > cache depends on the number of objects that can be fit in the
> > > slab while meeting other requirements. We start with a value
> > > of minimum objects based on nr_cpu_ids that is driven by
> > > possible number of CPUs and hence could be higher than the
> > > actual number of CPUs present in the system. This leads to
> > > calculate_order() chosing a page order that is on the higher
> > > side leading to increased slab memory consumption on systems
> > > that have bigger page sizes.
> > >
> > > Hence rely on the number of online CPUs when determining the
> > > mininum objects, thereby increasing the chances of chosing
> > > a lower conservative page order for the slab.
> > >
> > > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > > ---
> > > This is a generic change and I am unsure how it would affect
> > > other archs, but as a start, here are some numbers from
> > > PowerPC pseries KVM guest with and without this patch:
> > >
> > > This table shows how this change has affected some of the slab
> > > caches.
> > > ===================================================================
> > >                 Current                         Patched
> > > Cache   <objperslab> <pagesperslab>     <objperslab> <pagesperslab>
> > > ===================================================================
> > > TCPv6           53    2                 26    1
> > > net_namespace   53    4                 26    2
> > > dtl             32    2                 16    1
> > > names_cache     32    2                 16    1
> > > task_struct     53    8                 13    2
> > > thread_stack    32    8                 8     2
> > > pgtable-2^11    16    8                 8     4
> > > pgtable-2^8     32    2                 16    1
> > > kmalloc-32k     16    8                 8     4
> > > kmalloc-16k     32    8                 8     2
> > > kmalloc-8k      32    4                 8     1
> > > kmalloc-4k      32    2                 16    1
> > > ===================================================================
> > >
> > > Slab memory (kB) consumption comparision
> > > ==================================================================
> > >                         Current         Patched
> > > ==================================================================
> > > After-boot              205760          156096
> > > During-hackbench        629145          506752 (Avg of 5 runs)
> > > After-hackbench         474176          331840 (after drop_caches)
> > > ==================================================================
> > >
> > > Hackbench Time (Avg of 5 runs)
> > > (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> > > ==========================================
> > > Current         Patched
> > > ==========================================
> > > 10.990          11.010
> > > ==========================================
> > >
> > > Measuring the effect due to CPU hotplug
> > > ----------------------------------------
> > > Since the patch doesn't consider all the possible CPUs for page
> > > order calcluation, let's see how affects the case when CPUs are
> > > hotplugged. Here I compare a system that is booted with 64CPUs
> > > with a system that is booted with 16CPUs but hotplugged with
> > > 48CPUs after boot. These numbers are with the patch applied.
> > >
> > > Slab memory (kB) consumption comparision
> > > ===================================================================
> > >                         64bootCPUs      16bootCPUs+48HotPluggedCPUs
> > > ===================================================================
> > > After-boot              390272          159744
> > > After-hotplug           -               251328
> > > During-hackbench        1001267         941926 (Avg of 5 runs)
> > > After-hackbench         913600          827200 (after drop_caches)
> > > ===================================================================
> > >
> > > Hackbench Time (Avg of 5 runs)
> > > (hackbench -s 1024 -l 200 -g 200 -f 25 -P)
> > > ===========================================
> > > 64bootCPUs      16bootCPUs+48HotPluggedCPUs
> > > ===========================================
> > > 12.554          12.589
> > > ===========================================
> > >  mm/slub.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> >
> > I'm facing significant performances regression on a large arm64 server
> > system (224 CPUs). Regressions is also present on small arm64 system
> > (8 CPUs) but in a far smaller order of magnitude
> >
> > On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> > v5.11-rc4 : 9.135sec (+/- 0.45%)
> > v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> > v5.10: 3.136sec (+/- 0.40%)
> >
> > This is a 191% regression compared to v5.10.
> >
> > The problem is that calculate_order() is called a number of times
> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > This makes the use of num_online_cpus() irrelevant for those cases
> >
> > After adding in my command line "slub_min_objects=36" which equals to
> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > , the regression diseapears:
> >
> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

The problem is the same with num_present_cpus(), which is initialized
at the same time as num_online_cpus().
Only num_possible_cpus() returns a correct value, just like nr_cpu_ids.

Both num_possible_cpus() and nr_cpu_ids reflect the number of CPUs of
the platform, not NR_CPUS:
num_possible_cpus() = nr_cpu_ids = 224 from the beginning, whereas
NR_CPUS=256 on my large system;
num_possible_cpus() = nr_cpu_ids = 8 from the beginning, whereas
NR_CPUS=256 on my small system.


>
> From 252b332ccbee7152da1e18f1fff5b83f8e01b8df Mon Sep 17 00:00:00 2001
> From: Bharata B Rao <bharata@linux.ibm.com>
> Date: Thu, 21 Jan 2021 10:35:08 +0530
> Subject: [PATCH] mm/slub: let number of present CPUs determine the slub
>  page order
>
> Commit 045ab8c9487b ("mm/slub: let number of online CPUs determine
> the slub page order") changed the slub page order to depend on
> num_online_cpus() from nr_cpu_ids. However we find that certain
> caches (kmalloc) are initialized even before the secondary CPUs
> are onlined resulting in lower slub page order and subsequent
> regression.
>
> Switch to num_present_cpus() instead.
>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
> ---
>  mm/slub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d9e4e10683cc..2f3e412c849d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3433,7 +3433,7 @@ static inline int calculate_order(unsigned int size)
>          */
>         min_objects = slub_min_objects;
>         if (!min_objects)
> -               min_objects = 4 * (fls(num_online_cpus()) + 1);
> +               min_objects = 4 * (fls(num_present_cpus()) + 1);
>         max_objects = order_objects(slub_max_order, size);
>         min_objects = min(min_objects, max_objects);
>
> --
> 2.26.2
>
>
>
Christoph Lameter Jan. 21, 2021, 10:01 a.m. UTC | #7
On Thu, 21 Jan 2021, Bharata B Rao wrote:

> > The problem is that calculate_order() is called a number of times
> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > This makes the use of num_online_cpus() irrelevant for those cases
> >
> > After adding in my command line "slub_min_objects=36" which equals to
> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > , the regression diseapears:
> >
> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

There is certainly an initcall after secondaries are booted where we could
redo the calculate_order?

Or the num_online_cpus needs to be up to date earlier. Why does this issue
not occur on x86? Does x86 have an up to date num_online_cpus earlier?
Vincent Guittot Jan. 21, 2021, 10:48 a.m. UTC | #8
On Thu, 21 Jan 2021 at 11:01, Christoph Lameter <cl@linux.com> wrote:
>
> On Thu, 21 Jan 2021, Bharata B Rao wrote:
>
> > > The problem is that calculate_order() is called a number of times
> > > before secondaries CPUs are booted and it returns 1 instead of 224.
> > > This makes the use of num_online_cpus() irrelevant for those cases
> > >
> > > After adding in my command line "slub_min_objects=36" which equals to
> > > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > > , the regression diseapears:
> > >
> > > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > Should we have switched to num_present_cpus() rather than
> > num_online_cpus()? If so, the below patch should address the
> > above problem.
>
> There is certainly an initcall after secondaries are booted where we could
> redo the calculate_order?
>
> Or the num_online_cpus needs to be up to date earlier. Why does this issue
> not occur on x86? Does x86 have an up to date num_online_cpus earlier?

I have added a printk in calculate_order():
        pr_info("SLUB calculate_order cmd %d min %d online %d present %d possible %d cpus %d",
                slub_min_objects, min_objects, num_online_cpus(),
                num_present_cpus(), num_possible_cpus(), nr_cpu_ids);

And checked with
qemu-system-x86_64 -kernel bzImage -nographic -smp 4 -append "console=ttyS0"

[    0.064927]  SLUB calculate_order cmd  0 min 8 online 1 present 4
possible 4 cpus 4
[    0.064970] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1

num_online_cpus() behaves the same as on my platform: it is still 1
when kmem_cache_init() is called.

Only num_present_cpus() is 4 from the beginning, but that's probably
just because it runs in a VM.

Also, it's interesting to note that num_possible_cpus() and nr_cpu_ids
are set to the correct value.

>
>
Vlastimil Babka Jan. 21, 2021, 6:19 p.m. UTC | #9
On 1/21/21 11:01 AM, Christoph Lameter wrote:
> On Thu, 21 Jan 2021, Bharata B Rao wrote:
> 
>> > The problem is that calculate_order() is called a number of times
>> > before secondaries CPUs are booted and it returns 1 instead of 224.
>> > This makes the use of num_online_cpus() irrelevant for those cases
>> >
>> > After adding in my command line "slub_min_objects=36" which equals to
>> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
>> > , the regression diseapears:
>> >
>> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)

I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
supposed to be a scheduler benchmark? What exactly is going on?

>> Should we have switched to num_present_cpus() rather than
>> num_online_cpus()? If so, the below patch should address the
>> above problem.
> 
> There is certainly an initcall after secondaries are booted where we could
> redo the calculate_order?

We could do it even in hotplug handler. But in practice that means making sure
it's safe, i.e. all users of oo_order/oo_objects must handle the value changing.

Consider e.g. init_cache_random_seq() which uses oo_objects(s->oo) to allocate
s->random_seq when cache s is created. Then shuffle_freelist() will use the
current value of oo_objects(s->oo) to index s->random_seq, for a newly allocated
slab - what if the page order has increased meanwhile due to secondary booting
or hotplug? Array overflow. That's why I just made the former sysfs handler for
changing order read-only.

Things would be easier if we could trust *on all arches* either

- num_present_cpus() to count what the hardware really physically has during
boot, even if not yet onlined, at the time we init slab. This would still not
handle later hotplug (probably mostly in a VM scenario, not that somebody would
bring a bunch of actual new cpu boards to a running bare metal system?).

- num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
where it's not really possible to plug more CPU's. In a VM scenario we could
still have an opposite problem, where theoretically "anything is possible" but
the virtual cpus are never added later.

We could also start questioning the very assumption that number of cpus should
affect slab page size in the first place. Should it? After all, each CPU will
have one or more slab pages privately cached, as we discuss in the other
thread... So why make the slab pages also larger?

> Or the num_online_cpus needs to be up to date earlier. Why does this issue
> not occur on x86? Does x86 have an up to date num_online_cpus earlier?
> 
>
Vincent Guittot Jan. 22, 2021, 8:03 a.m. UTC | #10
On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >
> >> > The problem is that calculate_order() is called a number of times
> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >
> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> >> > , the regression diseapears:
> >> >
> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?
>

From the hackbench description:
Hackbench is both a benchmark and a stress test for the Linux kernel
scheduler. Its main job is to create a specified number of pairs of
schedulable entities (either threads or traditional processes) which
communicate via either sockets or pipes, and to time how long it takes
for each pair to send data back and forth.

> >> Should we have switched to num_present_cpus() rather than
> >> num_online_cpus()? If so, the below patch should address the
> >> above problem.
> >
> > There is certainly an initcall after secondaries are booted where we could
> > redo the calculate_order?
>
> We could do it even in hotplug handler. But in practice that means making sure
> it's safe, i.e. all users of oo_order/oo_objects must handle the value changing.
>
> Consider e.g. init_cache_random_seq() which uses oo_objects(s->oo) to allocate
> s->random_seq when cache s is created. Then shuffle_freelist() will use the
> current value of oo_objects(s->oo) to index s->random_seq, for a newly allocated
> slab - what if the page order has increased meanwhile due to secondary booting
> or hotplug? Array overflow. That's why I just made the former sysfs handler for
> changing order read-only.
>
> Things would be easier if we could trust *on all arches* either
>
> - num_present_cpus() to count what the hardware really physically has during
> boot, even if not yet onlined, at the time we init slab. This would still not
> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> bring bunch of actual new cpu boards to a running bare metal system?).
>
> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> where it's not really possible to plug more CPU's. In a VM scenario we could
> still have an opposite problem, where theoretically "anything is possible" but
> the virtual cpus are never added later.

On all the systems that I have tested, num_possible_cpus()/nr_cpu_ids
were correctly initialized:

large arm64 acpi system
small arm64 DT based system
VM on x86 system

>
> We could also start questioning the very assumption that number of cpus should
> affect slab page size in the first place. Should it? After all, each CPU will
> have one or more slab pages privately cached, as we discuss in the other
> thread... So why make the slab pages also larger?
>
> > Or the num_online_cpus needs to be up to date earlier. Why does this issue
> > not occur on x86? Does x86 have an up to date num_online_cpus earlier?
> >
> >
>
Vlastimil Babka Jan. 22, 2021, 12:03 p.m. UTC | #11
On 1/22/21 9:03 AM, Vincent Guittot wrote:
> On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/21/21 11:01 AM, Christoph Lameter wrote:
>> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
>> >
>> >> > The problem is that calculate_order() is called a number of times
>> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
>> >> > This makes the use of num_online_cpus() irrelevant for those cases
>> >> >
>> >> > After adding in my command line "slub_min_objects=36" which equals to
>> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
>> >> > , the regression diseapears:
>> >> >
>> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>>
>> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
>> supposed to be a scheduler benchmark? What exactly is going on?
>>
> 
> From hackbench description:
> Hackbench is both a benchmark and a stress test for the Linux kernel
> scheduler. It's  main
>        job  is  to  create a specified number of pairs of schedulable
> entities (either threads or
>        traditional processes) which communicate via either sockets or
> pipes and time how long  it
>        takes for each pair to send data back and forth.

Yep, so I wonder which slab entities this is stressing that much.

>> Things would be easier if we could trust *on all arches* either
>>
>> - num_present_cpus() to count what the hardware really physically has during
>> boot, even if not yet onlined, at the time we init slab. This would still not
>> handle later hotplug (probably mostly in a VM scenario, not that somebody would
>> bring bunch of actual new cpu boards to a running bare metal system?).
>>
>> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
>> where it's not really possible to plug more CPU's. In a VM scenario we could
>> still have an opposite problem, where theoretically "anything is possible" but
>> the virtual cpus are never added later.
> 
> On all the system that I have tested num_possible_cpus()/nr_cpu_ids
> were correctly initialized
> 
> large arm64 acpi system
> small arm64 DT based system
> VM on x86 system

So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
by bios or the hypervisor? How does num_present_cpus() look there?

What about a heuristic:
- num_online_cpus() > 1 - we trust that and use it
- otherwise nr_cpu_ids
Would that work? Too arbitrary?


>> We could also start questioning the very assumption that number of cpus should
>> affect slab page size in the first place. Should it? After all, each CPU will
>> have one or more slab pages privately cached, as we discuss in the other
>> thread... So why make the slab pages also larger?
>>
>> > Or the num_online_cpus needs to be up to date earlier. Why does this issue
>> > not occur on x86? Does x86 have an up to date num_online_cpus earlier?
>> >
>> >
>>
>
Jann Horn Jan. 22, 2021, 1:05 p.m. UTC | #12
On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >
> >> > The problem is that calculate_order() is called a number of times
> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >
> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> >> > , the regression diseapears:
> >> >
> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> supposed to be a scheduler benchmark? What exactly is going on?

Uuuh, I think powerpc doesn't have cmpxchg_double?

"vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
<https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
says under "POWERPC": "no DW LL/SC"

So powerpc is probably hitting the page-bitlock-based implementation
all the time for stuff like __slab_free()? Do you have detailed
profiling results from "perf top" or something like that?

(I actually have some WIP patches and a design document for getting
rid of cmpxchg_double in struct page that I hacked together in the
last couple days; I'm currently in the process of sending them over to
some other folks in the company who hopefully have cycles to
review/polish/benchmark them so that they can be upstreamed, assuming
that those folks think they're important enough. I don't have the
cycles for it...)
Jann Horn Jan. 22, 2021, 1:09 p.m. UTC | #13
On Fri, Jan 22, 2021 at 2:05 PM Jann Horn <jannh@google.com> wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > The problem is that calculate_order() is called a number of times
> > >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > >> > This makes the use of num_online_cpus() irrelevant for those cases
> > >> >
> > >> > After adding in my command line "slub_min_objects=36" which equals to
> > >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > >> > , the regression diseapears:
> > >> >
> > >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > supposed to be a scheduler benchmark? What exactly is going on?
>
> Uuuh, I think powerpc doesn't have cmpxchg_double?
>
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
>
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slub_free()? Do you have detailed
> profiling results from "perf top" or something like that?
>
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

(The stuff I have in mind will only work on 64-bit though. We are
talking about PPC64 here, right?)
Vincent Guittot Jan. 22, 2021, 1:16 p.m. UTC | #14
On Fri, 22 Jan 2021 at 13:03, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >> >
> >> >> > The problem is that calculate_order() is called a number of times
> >> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> >> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >> >
> >> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> >> >> > , the regression diseapears:
> >> >> >
> >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >>
> >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> >> supposed to be a scheduler benchmark? What exactly is going on?
> >>
> >
> > From hackbench description:
> > Hackbench is both a benchmark and a stress test for the Linux kernel
> > scheduler. It's  main
> >        job  is  to  create a specified number of pairs of schedulable
> > entities (either threads or
> >        traditional processes) which communicate via either sockets or
> > pipes and time how long  it
> >        takes for each pair to send data back and forth.
>
> Yep, so I wonder which slab entities this is stressing that much.
>
> >> Things would be easier if we could trust *on all arches* either
> >>
> >> - num_present_cpus() to count what the hardware really physically has during
> >> boot, even if not yet onlined, at the time we init slab. This would still not
> >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> >> bring bunch of actual new cpu boards to a running bare metal system?).
> >>
> >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> >> where it's not really possible to plug more CPU's. In a VM scenario we could
> >> still have an opposite problem, where theoretically "anything is possible" but
> >> the virtual cpus are never added later.
> >
> > On all the system that I have tested num_possible_cpus()/nr_cpu_ids
> > were correctly initialized
> >
> > large arm64 acpi system
> > small arm64 DT based system
> > VM on x86 system
>
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> by bios or the hypervisor? How does num_present_cpus() look there?

num_present_cpus() starts at 1 until secondary CPUs boot in the arm64 case

>
> What about heuristic:
> - num_online_cpus() > 1 - we trust that and use it
> - otherwise nr_cpu_ids
> Would that work? Too arbitrary?
>
>
> >> We could also start questioning the very assumption that number of cpus should
> >> affect slab page size in the first place. Should it? After all, each CPU will
> >> have one or more slab pages privately cached, as we discuss in the other
> >> thread... So why make the slab pages also larger?
> >>
> >> > Or the num_online_cpus needs to be up to date earlier. Why does this issue
> >> > not occur on x86? Does x86 have an up to date num_online_cpus earlier?
> >> >
> >> >
> >>
> >
>
Vlastimil Babka Jan. 22, 2021, 3:27 p.m. UTC | #15
On 1/22/21 2:05 PM, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 1/21/21 11:01 AM, Christoph Lameter wrote:
>> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
>> >
>> >> > The problem is that calculate_order() is called a number of times
>> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
>> >> > This makes the use of num_online_cpus() irrelevant for those cases
>> >> >
>> >> > After adding in my command line "slub_min_objects=36" which equals to
>> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
>> >> > , the regression diseapears:
>> >> >
>> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>>
>> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
>> supposed to be a scheduler benchmark? What exactly is going on?
> 
> Uuuh, I think powerpc doesn't have cmpxchg_double?

The benchmark was done by Vincent on arm64, AFAICS. PowerPC (ppc64) was what
Bharata had used to demonstrate the order calculation change in his patch.

There seems to be some implementation dependency on CONFIG_ARM64_LSE_ATOMICS but
AFAICS that doesn't determine if cmpxchg_double is provided.

> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"

Interesting find in any case.

> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slub_free()? Do you have detailed
> profiling results from "perf top" or something like that?
> 
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

I'm curious, so I hope this works out :)
Bharata B Rao Jan. 23, 2021, 5:16 a.m. UTC | #16
On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> >> >
> >> >> > The problem is that calculate_order() is called a number of times
> >> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> >> >> > This makes the use of num_online_cpus() irrelevant for those cases
> >> >> >
> >> >> > After adding in my command line "slub_min_objects=36" which equals to
> >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> >> >> > , the regression diseapears:
> >> >> >
> >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >>
> >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> >> supposed to be a scheduler benchmark? What exactly is going on?
> >>
> > 
> > From hackbench description:
> > Hackbench is both a benchmark and a stress test for the Linux kernel
> > scheduler. It's  main
> >        job  is  to  create a specified number of pairs of schedulable
> > entities (either threads or
> >        traditional processes) which communicate via either sockets or
> > pipes and time how long  it
> >        takes for each pair to send data back and forth.
> 
> Yep, so I wonder which slab entities this is stressing that much.
> 
> >> Things would be easier if we could trust *on all arches* either
> >>
> >> - num_present_cpus() to count what the hardware really physically has during
> >> boot, even if not yet onlined, at the time we init slab. This would still not
> >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> >> bring bunch of actual new cpu boards to a running bare metal system?).
> >>
> >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> >> where it's not really possible to plug more CPU's. In a VM scenario we could
> >> still have an opposite problem, where theoretically "anything is possible" but
> >> the virtual cpus are never added later.
> > 
> > On all the system that I have tested num_possible_cpus()/nr_cpu_ids
> > were correctly initialized
> > 
> > large arm64 acpi system
> > small arm64 DT based system
> > VM on x86 system
> 
> So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> by bios or the hypervisor? How does num_present_cpus() look there?

PowerPC PowerNV Host: (160 cpus)
num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160 

PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160 

That's what I see on powerpc, hence I thought num_present_cpus() could
be the correct one to use in slub page order calculation.

> 
> What about heuristic:
> - num_online_cpus() > 1 - we trust that and use it
> - otherwise nr_cpu_ids
> Would that work? Too arbitrary?

Looking at the following snippet from include/linux/cpumask.h, it
appears that num_present_cpus() should be a reasonable compromise
between online and possible/nr_cpu_ids to use here.

/*
 * The following particular system cpumasks and operations manage
 * possible, present, active and online cpus.
 *
 *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
 *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
 *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
 *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
 *
 *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
 *
 *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
 *  that it is possible might ever be plugged in at anytime during the
 *  life of that system boot.  The cpu_present_mask is dynamic(*),
 *  representing which CPUs are currently plugged in.  And
 *  cpu_online_mask is the dynamic subset of cpu_present_mask,
 *  indicating those CPUs available for scheduling.
 *
 *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
 *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
 *  ACPI reports present at boot.
 *
 *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
 *  depending on what ACPI reports as currently plugged in, otherwise
 *  cpu_present_mask is just a copy of cpu_possible_mask.
 *
 *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
 *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
 */

So for host systems, present is (usually) equal to possible, and for
guest systems present should indicate the CPUs found to be present
at boot time. The intention of my original patch was to use this
metric in the slub page order calculation rather than nr_cpu_ids
or num_possible_cpus(), which could be high on guest systems that
typically support CPU hotplug.

Regards,
Bharata.
Vincent Guittot Jan. 23, 2021, 12:32 p.m. UTC | #17
+Adding arch arm64 Maintainers

On Sat, 23 Jan 2021 at 06:16, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Fri, Jan 22, 2021 at 01:03:57PM +0100, Vlastimil Babka wrote:
> > On 1/22/21 9:03 AM, Vincent Guittot wrote:
> > > On Thu, 21 Jan 2021 at 19:19, Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > >> > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >> >
> > >> >> > The problem is that calculate_order() is called a number of times
> > >> >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > >> >> > This makes the use of num_online_cpus() irrelevant for those cases
> > >> >> >
> > >> >> > After adding in my command line "slub_min_objects=36" which equals to
> > >> >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > >> >> > , the regression diseapears:
> > >> >> >
> > >> >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> > >>
> > >> I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > >> supposed to be a scheduler benchmark? What exactly is going on?
> > >>
> > >
> > > From hackbench description:
> > > Hackbench is both a benchmark and a stress test for the Linux kernel
> > > scheduler. It's  main
> > >        job  is  to  create a specified number of pairs of schedulable
> > > entities (either threads or
> > >        traditional processes) which communicate via either sockets or
> > > pipes and time how long  it
> > >        takes for each pair to send data back and forth.
> >
> > Yep, so I wonder which slab entities this is stressing that much.
> >
> > >> Things would be easier if we could trust *on all arches* either
> > >>
> > >> - num_present_cpus() to count what the hardware really physically has during
> > >> boot, even if not yet onlined, at the time we init slab. This would still not
> > >> handle later hotplug (probably mostly in a VM scenario, not that somebody would
> > >> bring bunch of actual new cpu boards to a running bare metal system?).
> > >>
> > >> - num_possible_cpus()/nr_cpu_ids not to be excessive (broken BIOS?) on systems
> > >> where it's not really possible to plug more CPU's. In a VM scenario we could
> > >> still have an opposite problem, where theoretically "anything is possible" but
> > >> the virtual cpus are never added later.
> > >
> > > On all the system that I have tested num_possible_cpus()/nr_cpu_ids
> > > were correctly initialized
> > >
> > > large arm64 acpi system
> > > small arm64 DT based system
> > > VM on x86 system
> >
> > So it's just powerpc that has this issue with too large nr_cpu_ids? Is it caused
> > by bios or the hypervisor? How does num_present_cpus() look there?
>
> PowerPC PowerNV Host: (160 cpus)
> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
>
> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
>
> That's what I see on powerpc, hence I thought num_present_cpus() could
> be the correct one to use in slub page order calculation.

num_present_cpus() is set to 1 on arm64 until secondary CPUs boot

arm64 224cpus acpi host:
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
arm64 8cpus DT host:
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8

Then present and online increase to num_possible_cpus() once all CPUs are booted

>
> >
> > What about heuristic:
> > - num_online_cpus() > 1 - we trust that and use it
> > - otherwise nr_cpu_ids
> > Would that work? Too arbitrary?
>
> Looking at the following snippet from include/linux/cpumask.h, it
> appears that num_present_cpus() should be reasonable compromise
> between online and possible/nr_cpus_ids to use here.
>
> /*
>  * The following particular system cpumasks and operations manage
>  * possible, present, active and online cpus.
>  *
>  *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
>  *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
>  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
>  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
>  *
>  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
>  *
>  *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
>  *  that it is possible might ever be plugged in at anytime during the
>  *  life of that system boot.  The cpu_present_mask is dynamic(*),
>  *  representing which CPUs are currently plugged in.  And
>  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
>  *  indicating those CPUs available for scheduling.
>  *
>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>  *  ACPI reports present at boot.
>  *
>  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>  *  depending on what ACPI reports as currently plugged in, otherwise
>  *  cpu_present_mask is just a copy of cpu_possible_mask.
>  *
>  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
>  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
>  */
>
> So for host systems, present is (usually) equal to possible and for

But "cpu_present_mask varies dynamically,  depending on what ACPI
reports as currently plugged in"

So it should vary as secondary CPUs are booted

> guest systems present should indicate the CPUs found to be present
> at boottime. The intention of my original patch was to use this
> metric in slub page order calculation rather than nr_cpus_ids
> or num_cpus_possible() which could be high on guest systems that
> typically support CPU hotplug.
>
> Regards,
> Bharata.
Bharata B Rao Jan. 25, 2021, 4:28 a.m. UTC | #18
On Fri, Jan 22, 2021 at 02:05:47PM +0100, Jann Horn wrote:
> On Thu, Jan 21, 2021 at 7:19 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > On 1/21/21 11:01 AM, Christoph Lameter wrote:
> > > On Thu, 21 Jan 2021, Bharata B Rao wrote:
> > >
> > >> > The problem is that calculate_order() is called a number of times
> > >> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > >> > This makes the use of num_online_cpus() irrelevant for those cases
> > >> >
> > >> > After adding in my command line "slub_min_objects=36" which equals to
> > >> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > >> > , the regression diseapears:
> > >> >
> > >> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
> >
> > I'm surprised that hackbench is that sensitive to slab performance, anyway. It's
> > supposed to be a scheduler benchmark? What exactly is going on?
> 
> Uuuh, I think powerpc doesn't have cmpxchg_double?
> 
> "vgrep cmpxchg_double arch/" just spits out arm64, s390 and x86? And
> <https://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family>
> says under "POWERPC": "no DW LL/SC"
> 
> So powerpc is probably hitting the page-bitlock-based implementation
> all the time for stuff like __slub_free()? Do you have detailed
> profiling results from "perf top" or something like that?

I can check that, but the current patch was aimed at reducing
the page order of the slub caches so that they don't end up
consuming more memory on 64K systems.

> 
> (I actually have some WIP patches and a design document for getting
> rid of cmpxchg_double in struct page that I hacked together in the
> last couple days; I'm currently in the process of sending them over to
> some other folks in the company who hopefully have cycles to
> review/polish/benchmark them so that they can be upstreamed, assuming
> that those folks think they're important enough. I don't have the
> cycles for it...)

Sounds interesting, I will keep an eye on it to see its effect on powerpc.

Regards,
Bharata.
Vlastimil Babka Jan. 25, 2021, 11:20 a.m. UTC | #19
On 1/23/21 1:32 PM, Vincent Guittot wrote:
>> PowerPC PowerNV Host: (160 cpus)
>> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
>>
>> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
>> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
>>
>> That's what I see on powerpc, hence I thought num_present_cpus() could
>> be the correct one to use in slub page order calculation.
> 
> num_present_cpus() is set to 1 on arm64 until secondaries cpus boot
> 
> arm64 224cpus acpi host:
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
> arm64 8cpus DT host:
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
> num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8

I would have expected num_present_cpus to be 224, 8, 8, respectively.

> Then present and online increase to num_possible_cpus once all cpus are booted
> 
>>
>> >
>> > What about heuristic:
>> > - num_online_cpus() > 1 - we trust that and use it
>> > - otherwise nr_cpu_ids
>> > Would that work? Too arbitrary?
>>
>> Looking at the following snippet from include/linux/cpumask.h, it
>> appears that num_present_cpus() should be reasonable compromise
>> between online and possible/nr_cpus_ids to use here.
>>
>> /*
>>  * The following particular system cpumasks and operations manage
>>  * possible, present, active and online cpus.
>>  *
>>  *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
>>  *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
>>  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
>>  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
>>  *
>>  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
>>  *
>>  *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
>>  *  that it is possible might ever be plugged in at anytime during the
>>  *  life of that system boot.  The cpu_present_mask is dynamic(*),
>>  *  representing which CPUs are currently plugged in.  And
>>  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
>>  *  indicating those CPUs available for scheduling.
>>  *
>>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
>>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
>>  *  ACPI reports present at boot.
>>  *
>>  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>>  *  depending on what ACPI reports as currently plugged in, otherwise
>>  *  cpu_present_mask is just a copy of cpu_possible_mask.
>>  *
>>  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
>>  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
>>  */
>>
>> So for host systems, present is (usually) equal to possible and for
> 
> But "cpu_present_mask varies dynamically,  depending on what ACPI
> reports as currently plugged in"
> 
> So it should varies when secondaries cpus are booted

Hm, but booting the secondaries is just a software (kernel) action? They are
already physically there, so it seems to me as if the cpu_present_mask is not
populated correctly on arm64, and it's just a mirror of cpu_online_mask?
Michal Hocko Jan. 26, 2021, 8:52 a.m. UTC | #20
On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
[...]
> We could also start questioning the very assumption that number of cpus should
> affect slab page size in the first place. Should it? After all, each CPU will
> have one or more slab pages privately cached, as we discuss in the other
> thread... So why make the slab pages also larger?

I do agree. What is the actual justification for this scaling?
        /*
         * Attempt to find best configuration for a slab. This
         * works by first attempting to generate a layout with
         * the best configuration and backing off gradually.
         *
         * First we increase the acceptable waste in a slab. Then
         * we reduce the minimum objects required in a slab.
         */

doesn't speak about CPUs.  9b2cd506e5f2 ("slub: Calculate min_objects
based on number of processors.") does talk about hackbench "This has
been shown to address the performance issues in hackbench on 16p etc."
but it doesn't give any more detail about _why_ that actually works.

This thread shows that this is still somehow related to performance, but
the real reason is not clear. I believe we should focus on the actual
reasons for the performance impact rather than playing with fancy math
and tuning for a benchmark on a particular machine, which doesn't work
for others due to subtle initialization timing issues.

Fundamentally, why should a higher number of CPUs imply a larger slab
size in the first place?
Vincent Guittot Jan. 26, 2021, 1:38 p.m. UTC | #21
On Tue, 26 Jan 2021 at 09:52, Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
> [...]
> > We could also start questioning the very assumption that number of cpus should
> > affect slab page size in the first place. Should it? After all, each CPU will
> > have one or more slab pages privately cached, as we discuss in the other
> > thread... So why make the slab pages also larger?
>
> I do agree. What is the acutal justification for this scaling?
>         /*
>          * Attempt to find best configuration for a slab. This
>          * works by first attempting to generate a layout with
>          * the best configuration and backing off gradually.
>          *
>          * First we increase the acceptable waste in a slab. Then
>          * we reduce the minimum objects required in a slab.
>          */
>
> doesn't speak about CPUs.  9b2cd506e5f2 ("slub: Calculate min_objects
> based on number of processors.") does talk about hackbench "This has
> been shown to address the performance issues in hackbench on 16p etc."
> but it doesn't give any more details to tell actually _why_ that works.
>
> This thread shows that this is still somehow related to performance but
> the real reason is not clear. I believe we should be focusing on the
> actual reasons for the performance impact than playing with some fancy
> math and tuning for a benchmark on a particular machine which doesn't
> work for others due to subtle initialization timing issues.
>
> Fundamentally why should higher number of CPUs imply the size of slab in
> the first place?

A first answer is that activity and the number of threads involved
scale with the number of CPUs. Taking the hackbench benchmark as
an example, the number of groups/threads rises to a higher level on the
server than on the small system, which doesn't seem unreasonable.

On 8 CPUs, I run hackbench with up to 16 groups which means 16*40
threads. But I raise up to 256 groups, which means 256*40 threads, on
the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
regress on the 224 CPUs  system.  The next test with 4 groups starts
to regress by -7%. But the next one: hackbench -g 16 regresses by 187%
(duration is almost 3 times longer). It seems reasonable to assume
that the number of running threads and resources scale with the number
of CPUs because we want to run more stuff.


> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 26, 2021, 1:59 p.m. UTC | #22
On Tue 26-01-21 14:38:14, Vincent Guittot wrote:
> On Tue, 26 Jan 2021 at 09:52, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 21-01-21 19:19:21, Vlastimil Babka wrote:
> > [...]
> > > We could also start questioning the very assumption that number of cpus should
> > > affect slab page size in the first place. Should it? After all, each CPU will
> > > have one or more slab pages privately cached, as we discuss in the other
> > > thread... So why make the slab pages also larger?
> >
> > I do agree. What is the acutal justification for this scaling?
> >         /*
> >          * Attempt to find best configuration for a slab. This
> >          * works by first attempting to generate a layout with
> >          * the best configuration and backing off gradually.
> >          *
> >          * First we increase the acceptable waste in a slab. Then
> >          * we reduce the minimum objects required in a slab.
> >          */
> >
> > doesn't speak about CPUs.  9b2cd506e5f2 ("slub: Calculate min_objects
> > based on number of processors.") does talk about hackbench "This has
> > been shown to address the performance issues in hackbench on 16p etc."
> > but it doesn't give any more details to tell actually _why_ that works.
> >
> > This thread shows that this is still somehow related to performance but
> > the real reason is not clear. I believe we should be focusing on the
> > actual reasons for the performance impact than playing with some fancy
> > math and tuning for a benchmark on a particular machine which doesn't
> > work for others due to subtle initialization timing issues.
> >
> > Fundamentally why should higher number of CPUs imply the size of slab in
> > the first place?
> 
> A 1st answer is that the activity and the number of threads involved
> scales with the number of CPUs. Regarding the hackbench benchmark as
> an example, the number of group/threads raise to a higher level on the
> server than on the small system which doesn't seem unreasonable.
> 
> On 8 CPUs, I run hackbench with up to 16 groups which means 16*40
> threads. But I raise up to 256 groups, which means 256*40 threads, on
> the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
> regress on the 224 CPUs  system.  The next test with 4 groups starts
> to regress by -7%. But the next one: hackbench -g 16 regresses by 187%
> (duration is almost 3 times longer). It seems reasonable to assume
> that the number of running threads and resources scale with the number
> of CPUs because we want to run more stuff.

OK, I do understand that more jobs scale with the number of CPUs, but I
would also expect that higher-order pages are generally more expensive
to get, so this is not really clear cut, especially under heavier
demand on the memory. So the question really is whether this is just
optimizing for artificial conditions.
Will Deacon Jan. 26, 2021, 11:03 p.m. UTC | #23
On Mon, Jan 25, 2021 at 12:20:14PM +0100, Vlastimil Babka wrote:
> On 1/23/21 1:32 PM, Vincent Guittot wrote:
> >> PowerPC PowerNV Host: (160 cpus)
> >> num_online_cpus 1 num_present_cpus 160 num_possible_cpus 160 nr_cpu_ids 160
> >>
> >> PowerPC pseries KVM guest: (-smp 16,maxcpus=160)
> >> num_online_cpus 1 num_present_cpus 16 num_possible_cpus 160 nr_cpu_ids 160
> >>
> >> That's what I see on powerpc, hence I thought num_present_cpus() could
> >> be the correct one to use in slub page order calculation.
> > 
> > num_present_cpus() is set to 1 on arm64 until secondaries cpus boot
> > 
> > arm64 224cpus acpi host:
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 224 nr_cpu_ids 224
> > arm64 8cpus DT host:
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> > arm64 8cpus qemu-system-aarch64 (-smp 8,maxcpus=256)
> > num_online_cpus 1 num_present_cpus 1 num_possible_cpus 8 nr_cpu_ids 8
> 
> I would have expected num_present_cpus to be 224, 8, 8, respectively.
> 
> > Then present and online increase to num_possible_cpus once all cpus are booted
> > 
> >>
> >> >
> >> > What about heuristic:
> >> > - num_online_cpus() > 1 - we trust that and use it
> >> > - otherwise nr_cpu_ids
> >> > Would that work? Too arbitrary?
> >>
> >> Looking at the following snippet from include/linux/cpumask.h, it
> >> appears that num_present_cpus() should be reasonable compromise
> >> between online and possible/nr_cpus_ids to use here.
> >>
> >> /*
> >>  * The following particular system cpumasks and operations manage
> >>  * possible, present, active and online cpus.
> >>  *
> >>  *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
> >>  *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
> >>  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
> >>  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
> >>  *
> >>  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
> >>  *
> >>  *  The cpu_possible_mask is fixed at boot time, as the set of CPU id's
> >>  *  that it is possible might ever be plugged in at anytime during the
> >>  *  life of that system boot.  The cpu_present_mask is dynamic(*),
> >>  *  representing which CPUs are currently plugged in.  And
> >>  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
> >>  *  indicating those CPUs available for scheduling.
> >>  *
> >>  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> >>  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> >>  *  ACPI reports present at boot.
> >>  *
> >>  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
> >>  *  depending on what ACPI reports as currently plugged in, otherwise
> >>  *  cpu_present_mask is just a copy of cpu_possible_mask.
> >>  *
> >>  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
> >>  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
> >>  */
> >>
> >> So for host systems, present is (usually) equal to possible and for
> > 
> > But "cpu_present_mask varies dynamically,  depending on what ACPI
> > reports as currently plugged in"
> > 
> > So it should varies when secondaries cpus are booted
> 
> Hm, but booting the secondaries is just a software (kernel) action? They are
> already physically there, so it seems to me as if the cpu_present_mask is not
> populated correctly on arm64, and it's just a mirror of cpu_online_mask?

I think the present_mask retains CPUs if they are hotplugged off, whereas
the online mask does not. We can't really do any better on arm64, as there's
no way of telling that a CPU is present until we've seen it.

Will
Christoph Lameter Jan. 27, 2021, 9:10 a.m. UTC | #24
On Tue, 26 Jan 2021, Will Deacon wrote:

> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > already physically there, so it seems to me as if the cpu_present_mask is not
> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>
> I think the present_mask retains CPUs if they are hotplugged off, whereas
> the online mask does not. We can't really do any better on arm64, as there's
> no way of telling that a CPU is present until we've seen it.

The order of each page in a kmem cache --and therefore also the number
of objects in a slab page-- can be different because that information is
stored in the page struct.

Therefore it is possible to retune the order while the cache is in operation.

This means you can run an initcall after all cpus have been brought up to
set the order and number of objects in a slab page differently.

The older slab pages will continue to exist with the old orders until they
are freed.
Vlastimil Babka Jan. 27, 2021, 11:04 a.m. UTC | #25
On 1/27/21 10:10 AM, Christoph Lameter wrote:
> On Tue, 26 Jan 2021, Will Deacon wrote:
> 
>> > Hm, but booting the secondaries is just a software (kernel) action? They are
>> > already physically there, so it seems to me as if the cpu_present_mask is not
>> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>>
>> I think the present_mask retains CPUs if they are hotplugged off, whereas
>> the online mask does not. We can't really do any better on arm64, as there's
>> no way of telling that a CPU is present until we've seen it.
> 
> The order of each page in a kmem cache --and therefore also the number
> of objects in a slab page-- can be different because that information is
> stored in the page struct.
> 
> Therefore it is possible to retune the order while the cache is in operaton.

Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
is enabled, see [1].

But as a quick fix for the regression, the heuristic idea could work reasonably
on all architectures?
- if num_present_cpus() is > 1, trust that it doesn't have the issue such as
arm64, and use it
- otherwise use nr_cpu_ids

Long-term we can attempt to do the retuning safe, or decide that number of cpus
shouldn't determine the order...

[1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/

> This means you can run an initcall after all cpus have been brought up to
> set the order and number of objects in a slab page differently.
> 
> The older slab pages will continue to exist with the old orders until they
> are freed.
>
Mel Gorman Jan. 28, 2021, 1:45 p.m. UTC | #26
On Tue, Jan 26, 2021 at 02:59:18PM +0100, Michal Hocko wrote:
> > > This thread shows that this is still somehow related to performance but
> > > the real reason is not clear. I believe we should be focusing on the
> > > actual reasons for the performance impact than playing with some fancy
> > > math and tuning for a benchmark on a particular machine which doesn't
> > > work for others due to subtle initialization timing issues.
> > >
> > > Fundamentally why should higher number of CPUs imply the size of slab in
> > > the first place?
> > 
> > A 1st answer is that the activity and the number of threads involved
> > scales with the number of CPUs. Regarding the hackbench benchmark as
> > an example, the number of group/threads raise to a higher level on the
> > server than on the small system which doesn't seem unreasonable.
> > 
> > On 8 CPUs, I run hackbench with up to 16 groups which means 16*40
> > threads. But I raise up to 256 groups, which means 256*40 threads, on
> > the 224 CPUs system. In fact, hackbench -g 1 (with 1 group) doesn't
> > regress on the 224 CPUs  system.  The next test with 4 groups starts
> > to regress by -7%. But the next one: hackbench -g 16 regresses by 187%
> > (duration is almost 3 times longer). It seems reasonable to assume
> > that the number of running threads and resources scale with the number
> > of CPUs because we want to run more stuff.
> 
> OK, I do understand that more jobs scale with the number of CPUs but I
> would also expect that higher order pages are generally more expensive
> to get so this is not really a clear cut especially under some more
> demand on the memory where allocations are smooth. So the question
> really is whether this is not just optimizing for artificial conditions.

The flip side is that smaller orders increase zone lock contention, and
contention can scale with the number of CPUs, so it's partially related.
hackbench-sockets is an extreme case (pipetest is not affected) but it's
the messenger in this case.

On a x86-64 2-socket 40 core (80 threads) machine then comparing a revert
of the patch with vanilla 5.11-rc5 is

hackbench-process-sockets
                            5.11-rc5               5.11-rc5
                     revert-lockstat       vanilla-lockstat
Amean     1        1.1560 (   0.00%)      1.0633 *   8.02%*
Amean     4        2.0797 (   0.00%)      2.5470 * -22.47%*
Amean     7        3.2693 (   0.00%)      4.3433 * -32.85%*
Amean     12       5.2043 (   0.00%)      6.5600 * -26.05%*
Amean     21      10.5817 (   0.00%)     11.3320 *  -7.09%*
Amean     30      13.3923 (   0.00%)     15.5817 * -16.35%*
Amean     48      20.3893 (   0.00%)     23.6733 * -16.11%*
Amean     79      31.4210 (   0.00%)     38.2787 * -21.83%*
Amean     110     43.6177 (   0.00%)     53.8847 * -23.54%*
Amean     141     56.3840 (   0.00%)     68.4257 * -21.36%*
Amean     172     70.0577 (   0.00%)     85.0077 * -21.34%*
Amean     203     81.9717 (   0.00%)    100.7137 * -22.86%*
Amean     234     95.1900 (   0.00%)    116.0280 * -21.89%*
Amean     265    108.9097 (   0.00%)    130.4307 * -19.76%*
Amean     296    119.7470 (   0.00%)    142.3637 * -18.89%*

i.e. the patch incurs a 7% to 32% performance penalty. This bisected
cleanly yesterday when I was looking for the regression and then found
the thread.

Numerous caches change size. For example, kmalloc-512 goes from order-0
(vanilla) to order-2 with the revert.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VANILLA
                             &zone->lock:       1202731        1203433           0.07         120.55     1555485.48           1.29        8920825       12537091           0.06          84.10     9855085.12           0.79
                             -----------
                             &zone->lock          61903          [<00000000b47dc96a>] free_one_page+0x3f/0x530
                             &zone->lock           7655          [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
                             &zone->lock          36529          [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock        1097346          [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370
                             -----------
                             &zone->lock          44716          [<00000000099f6e05>] get_page_from_freelist+0x475/0x1370
                             &zone->lock          69813          [<0000000075b9b918>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock          31596          [<00000000b47dc96a>] free_one_page+0x3f/0x530
                             &zone->lock        1057308          [<00000000b8e4950a>] get_page_from_freelist+0xaf0/0x1370

REVERT
                             &zone->lock:        735827         739037           0.06          66.12      699661.56           0.95        4095299        7757942           0.05          54.35     5670083.68           0.73
                             -----------
                             &zone->lock         101927          [<00000000a60d5f86>] free_one_page+0x3f/0x530
                             &zone->lock         626426          [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370
                             &zone->lock           9207          [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock           1477          [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
                             -----------
                             &zone->lock           6249          [<00000000f856e720>] get_page_from_freelist+0x475/0x1370
                             &zone->lock          92224          [<00000000a60d5f86>] free_one_page+0x3f/0x530
                             &zone->lock          19690          [<0000000068b9c9a1>] free_pcppages_bulk+0x1ac/0x7d0
                             &zone->lock         620874          [<00000000122cecf3>] get_page_from_freelist+0xaf0/0x1370

Each individual wait time is small but the maximum waittime-max is roughly
double (120us vanilla vs 66us reverting the patch). Total wait time is
roughly doubled also due to the patch. Acquisitions are almost doubled.

So mostly this is down to the number of times SLUB calls into the page
allocator which only caches order-0 pages on a per-cpu basis. I do have
a prototype for a high-order per-cpu allocator but it is very rough --
high watermarks stop making sense, code is rough, memory needed for the
pcpu structures quadruples etc.
Michal Hocko Jan. 28, 2021, 1:57 p.m. UTC | #27
On Thu 28-01-21 13:45:12, Mel Gorman wrote:
[...]
> So mostly this is down to the number of times SLUB calls into the page
> allocator which only caches order-0 pages on a per-cpu basis. I do have
> a prototype for a high-order per-cpu allocator but it is very rough --
> high watermarks stop making sense, code is rough, memory needed for the
> pcpu structures quadruples etc.

Thanks, this is really useful. But it really begs the question whether this
is the general case or more of an exception. And as such maybe we want to
define high-throughput caches which would be given higher-order pages to
keep pace with allocation and reduce the churn, or deploy some other
techniques to reduce direct page allocator involvement.
Mel Gorman Jan. 28, 2021, 2:42 p.m. UTC | #28
On Thu, Jan 28, 2021 at 02:57:10PM +0100, Michal Hocko wrote:
> On Thu 28-01-21 13:45:12, Mel Gorman wrote:
> [...]
> > So mostly this is down to the number of times SLUB calls into the page
> > allocator which only caches order-0 pages on a per-cpu basis. I do have
> > a prototype for a high-order per-cpu allocator but it is very rough --
> > high watermarks stop making sense, code is rough, memory needed for the
> > pcpu structures quadruples etc.
> 
> Thanks this is really useful. But it really begs a question whether this
> is a general case or more an exception. And as such maybe we want to
> define high throughput caches which would gain a higher order pages to
> keep pace with allocation and reduce the churn or deploy some other
> techniques to reduce the direct page allocator involvement.

I don't think we want to define "high throughput caches" because it'll
be workload-dependent and a game of whack-a-mole. If the "high throughput
cache" is a kmalloc cache for some set of workloads and one of the inode
caches or dcaches for another one, there will be no setting that is
universally good.
Bharata B Rao Feb. 3, 2021, 11:10 a.m. UTC | #29
On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > On Tue, 26 Jan 2021, Will Deacon wrote:
> > 
> >> > Hm, but booting the secondaries is just a software (kernel) action? They are
> >> > already physically there, so it seems to me as if the cpu_present_mask is not
> >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> >>
> >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> >> the online mask does not. We can't really do any better on arm64, as there's
> >> no way of telling that a CPU is present until we've seen it.
> > 
> > The order of each page in a kmem cache --and therefore also the number
> > of objects in a slab page-- can be different because that information is
> > stored in the page struct.
> > 
> > Therefore it is possible to retune the order while the cache is in operaton.
> 
> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> is enabled, see [1].
> 
> But as a quick fix for the regression, the heuristic idea could work reasonably
> on all architectures?
> - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
> arm64, and use it
> - otherwise use nr_cpu_ids
> 
> Long-term we can attempt to do the retuning safe, or decide that number of cpus
> shouldn't determine the order...
> 
> [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/

So what is preferable here now? The heuristic above, some other quick fix,
or reverting the original commit?

Regards,
Bharata.
Vincent Guittot Feb. 4, 2021, 7:32 a.m. UTC | #30
On Wed, 3 Feb 2021 at 12:10, Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
> > On 1/27/21 10:10 AM, Christoph Lameter wrote:
> > > On Tue, 26 Jan 2021, Will Deacon wrote:
> > >
> > >> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > >> > already physically there, so it seems to me as if the cpu_present_mask is not
> > >> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
> > >>
> > >> I think the present_mask retains CPUs if they are hotplugged off, whereas
> > >> the online mask does not. We can't really do any better on arm64, as there's
> > >> no way of telling that a CPU is present until we've seen it.
> > >
> > > The order of each page in a kmem cache --and therefore also the number
> > > of objects in a slab page-- can be different because that information is
> > > stored in the page struct.
> > >
> > > Therefore it is possible to retune the order while the cache is in operaton.
> >
> > Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
> > is enabled, see [1].
> >
> > But as a quick fix for the regression, the heuristic idea could work reasonably
> > on all architectures?
> > - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
> > arm64, and use it
> > - otherwise use nr_cpu_ids
> >
> > Long-term we can attempt to do the retuning safe, or decide that number of cpus
> > shouldn't determine the order...
> >
> > [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
>
> So what is preferrable here now? Above or other quick fix or reverting
> the original commit?

I'm fine with whatever the solution is, as long as we can keep using
nr_cpu_ids when other values like num_present_cpus don't correctly
reflect the system

Regards,
Vincent

>
> Regards,
> Bharata.
Christoph Lameter Feb. 4, 2021, 9:07 a.m. UTC | #31
On Thu, 4 Feb 2021, Vincent Guittot wrote:

> > So what is preferable here now? The above, some other quick fix, or reverting
> > the original commit?
>
> I'm fine with whatever the solution is, as long as we can keep using
> nr_cpu_ids when other values like num_present_cpus() don't correctly
> reflect the system.

AFAICT they are correctly reflecting the current state of the system.

The problem here is the bringup of the system and the tuning done during it.

One additional thing that may help: the slab caches can work in a degraded
mode where no fastpath allocations can occur. That mode is used primarily
for debugging, but maybe it can also help during bootstrap to avoid
having to deal with the per-CPU data and so on.

In degraded mode SLUB will take a lock for each operation on an object.

In this mode the following is true

  kmem_cache_cpu->page == NULL
  kmem_cache_cpu->freelist == NULL
  kmem_cache_debug(s) == true

So if you define a new debug mode and include it in SLAB_DEBUG_FLAGS then
you can force SLUB to fallback to operations where a lock is taken and
where slab allocation can be stopped. This may be ok for bring up.

The debug flags are also tied to some wizardry that can patch the code at
runtime to optimize for debugging or fast operations. You would tie into
that as well: start in debug mode by default and switch to fast
operations after all processors are up.
Vlastimil Babka Feb. 4, 2021, 9:33 a.m. UTC | #32
On 2/3/21 12:10 PM, Bharata B Rao wrote:
> On Wed, Jan 27, 2021 at 12:04:01PM +0100, Vlastimil Babka wrote:
>> Yes, but it's tricky to do the retuning safely, e.g. if freelist randomization
>> is enabled, see [1].
>> 
>> But as a quick fix for the regression, the heuristic idea could work reasonably
>> on all architectures?
>> - if num_present_cpus() is > 1, trust that it doesn't have the issue such as
>> arm64, and use it
>> - otherwise use nr_cpu_ids
>> 
>> Long-term we can attempt to make the retuning safe, or decide that the number
>> of cpus shouldn't determine the order...
>> 
>> [1] https://lore.kernel.org/linux-mm/d7fb9425-9a62-c7b8-604d-5828d7e6b1da@suse.cz/
> 
> So what is preferable here now? The above, some other quick fix, or reverting
> the original commit?

I would try the above first; in case it doesn't work, revert. That is the
immediate fix for the regression that people can safely backport.
Anything more complex will take more time and would be riskier to backport.

> Regards,
> Bharata.
>
diff mbox series

Patch

diff --git a/mm/slub.c b/mm/slub.c
index 34dcc09e2ec9..8342c0a167b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3433,7 +3433,7 @@  static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = 4 * (fls(num_online_cpus()) + 1);
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);