mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup

Message ID 1720169301-21002-1-git-send-email-ssengar@linux.microsoft.com
State New
Series mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup

Commit Message

Saurabh Singh Sengar July 5, 2024, 8:48 a.m. UTC
The refresh_zone_stat_thresholds function has two loops, which are
expensive for higher numbers of CPUs and NUMA nodes.

Below is a rough estimate of the total iterations done by these loops,
based on the number of NUMA nodes and CPUs.

Total number of iterations: nCPU * 2 * Numa * mCPU
Where:
 nCPU = total number of CPUs
 Numa = total number of NUMA nodes
 mCPU = mean number of CPUs online during bringup (e.g., 512 for 1024 total CPUs)

For the system under test with 16 NUMA nodes and 1024 CPUs, this
results in a substantial increase in the number of loop iterations
during boot-up when NUMA is enabled:

No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
takes around 224 ms total for all the CPUs in the system under test.
16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
takes around 4.5 seconds total for all the CPUs in the system under test.
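
The cost comes from the shape of refresh_zone_stat_thresholds itself. A
simplified sketch of its structure (paraphrased from mm/vmstat.c; the
percpu_drift_mark update and the pressure-threshold variant are omitted):

void refresh_zone_stat_thresholds(void)
{
	struct pglist_data *pgdat;
	struct zone *zone;
	int cpu, threshold;

	/* Loop 1: zero every online node's per-CPU threshold. */
	for_each_online_pgdat(pgdat)
		for_each_online_cpu(cpu)
			per_cpu_ptr(pgdat->per_cpu_nodestats, cpu)->stat_threshold = 0;

	/* Loop 2: recompute and store the threshold for every zone x CPU pair. */
	for_each_populated_zone(zone) {
		threshold = calculate_normal_threshold(zone);
		for_each_online_cpu(cpu)
			per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold = threshold;
	}
}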

Calling this for each CPU is expensive when there are a large number
of CPUs along with multiple NUMA nodes. Fix this by deferring
refresh_zone_stat_thresholds so that it is called only once, after all
the secondary CPUs are up. Also, register the DYN hooks to keep the
existing hotplug functionality intact.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
 mm/vmstat.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Andrew Morton July 5, 2024, 8:59 p.m. UTC | #1
On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:

> The refresh_zone_stat_thresholds function has two loops, which are
> expensive for higher numbers of CPUs and NUMA nodes.
> 
> Below is a rough estimate of the total iterations done by these loops,
> based on the number of NUMA nodes and CPUs.
> 
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
>  nCPU = total number of CPUs
>  Numa = total number of NUMA nodes
>  mCPU = mean number of CPUs online during bringup (e.g., 512 for 1024 total CPUs)
> 
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
> 
> No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.

Did you measure the overall before-and-after times?  IOW, how much of
that 4.5s do we reclaim?

> Calling this for each CPU is expensive when there are a large number
> of CPUs along with multiple NUMA nodes. Fix this by deferring
> refresh_zone_stat_thresholds so that it is called only once, after all
> the secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
> 

Seems risky - we'll now have online CPUs which have uninitialized data,
yes?  What assurance do we have that this data won't be accessed?

Another approach might be to make the code a bit smarter - instead of
calculating thresholds for the whole world, we make incremental changes
to the existing thresholds on behalf of the new resource which just
became available?
Saurabh Singh Sengar July 9, 2024, 4:57 a.m. UTC | #2
On Fri, Jul 05, 2024 at 01:59:11PM -0700, Andrew Morton wrote:
> On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:
> 
> > The refresh_zone_stat_thresholds function has two loops, which are
> > expensive for higher numbers of CPUs and NUMA nodes.
> > 
> > Below is a rough estimate of the total iterations done by these loops,
> > based on the number of NUMA nodes and CPUs.
> > 
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> >  nCPU = total number of CPUs
> >  Numa = total number of NUMA nodes
> >  mCPU = mean number of CPUs online during bringup (e.g., 512 for 1024 total CPUs)
> > 
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> > 
> > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > takes around 224 ms total for all the CPUs in the system under test.
> > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > takes around 4.5 seconds total for all the CPUs in the system under test.
> 
> Did you measure the overall before-and-after times?  IOW, how much of
> that 4.5s do we reclaim?

This entire gain is reflected in the overall boot time. Most of the Linux
kernel boot process is sequential and doesn't take advantage of SMP.

> 
> > Calling this for each CPU is expensive when there are a large number
> > of CPUs along with multiple NUMA nodes. Fix this by deferring
> > refresh_zone_stat_thresholds so that it is called only once, after all
> > the secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> > 
> 
> Seems risky - we'll now have online CPUs which have uninitialized data,
> yes?  What assurance do we have that this data won't be accessed?

I understand that this data is only accessed by userspace tools, and they can
only access it post late_initcall. Please let me know if there are any other
cases; I will look to address them.

> 
> Another approach might be to make the code a bit smarter - instead of
> calculating thresholds for the whole world, we make incremental changes
> to the existing thresholds on behalf of the new resource which just
> became available?

I agree, and I have spent a good amount of time understanding the calculation,
but couldn't find any obvious way to code everything it does in an incremental way.

I would be happy to assist if you have any suggestions on how to do this.

- Saurabh
Andrew Morton July 9, 2024, 9:45 p.m. UTC | #3
On Mon, 8 Jul 2024 21:57:50 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:

> this data is only accessed by userspace tools, and they can
> only access it post late_initcall

OK, thanks.  We're at -rc7 now, so I'll queue this for the
post-6.11-rc1 flood.  That way it should get plenty of testing.
Andrew Morton Aug. 9, 2024, 5:20 a.m. UTC | #4
On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:

> The refresh_zone_stat_thresholds function has two loops, which are
> expensive for higher numbers of CPUs and NUMA nodes.
> 
> Below is a rough estimate of the total iterations done by these loops,
> based on the number of NUMA nodes and CPUs.
> 
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
>  nCPU = total number of CPUs
>  Numa = total number of NUMA nodes
>  mCPU = mean number of CPUs online during bringup (e.g., 512 for 1024 total CPUs)
> 
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
> 
> No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.
> 
> Calling this for each CPU is expensive when there are a large number
> of CPUs along with multiple NUMA nodes. Fix this by deferring
> refresh_zone_stat_thresholds so that it is called only once, after all
> the secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
>
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -31,6 +31,7 @@
>  
>  #include "internal.h"
>  
> +static int vmstat_late_init_done;
>  #ifdef CONFIG_NUMA
>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>  
> @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
>  
>  static int vmstat_cpu_online(unsigned int cpu)
>  {
> -	refresh_zone_stat_thresholds();
> +	if (vmstat_late_init_done)
> +		refresh_zone_stat_thresholds();
>  
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>  
> +static int __init vmstat_late_init(void)
> +{
> +	refresh_zone_stat_thresholds();
> +	vmstat_late_init_done = 1;
> +
> +	return 0;
> +}
> +late_initcall(vmstat_late_init);

OK, so what's happening here.  Once all CPUs are online and running
around doing heaven knows what, one of the CPUs sets up everyone's
thresholds.  So for a period, all the other CPUs are running with
inappropriate threshold values.

So what are all the other CPUs doing at this point in time, and why is
it safe to leave their thresholds in an inappropriate state while they
are doing it?
Saurabh Singh Sengar Aug. 9, 2024, 5:49 a.m. UTC | #5
On Thu, Aug 08, 2024 at 10:20:06PM -0700, Andrew Morton wrote:
> On Fri,  5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@linux.microsoft.com> wrote:
> 
> > The refresh_zone_stat_thresholds function has two loops, which are
> > expensive for higher numbers of CPUs and NUMA nodes.
> > 
> > Below is a rough estimate of the total iterations done by these loops,
> > based on the number of NUMA nodes and CPUs.
> > 
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> >  nCPU = total number of CPUs
> >  Numa = total number of NUMA nodes
> >  mCPU = mean number of CPUs online during bringup (e.g., 512 for 1024 total CPUs)
> > 
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> > 
> > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > takes around 224 ms total for all the CPUs in the system under test.
> > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > takes around 4.5 seconds total for all the CPUs in the system under test.
> > 
> > Calling this for each CPU is expensive when there are a large number
> > of CPUs along with multiple NUMA nodes. Fix this by deferring
> > refresh_zone_stat_thresholds so that it is called only once, after all
> > the secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -31,6 +31,7 @@
> >  
> >  #include "internal.h"
> >  
> > +static int vmstat_late_init_done;
> >  #ifdef CONFIG_NUMA
> >  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
> >  
> > @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
> >  
> >  static int vmstat_cpu_online(unsigned int cpu)
> >  {
> > -	refresh_zone_stat_thresholds();
> > +	if (vmstat_late_init_done)
> > +		refresh_zone_stat_thresholds();
> >  
> >  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
> >  		node_set_state(cpu_to_node(cpu), N_CPU);
> > @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
> >  	return 0;
> >  }
> >  
> > +static int __init vmstat_late_init(void)
> > +{
> > +	refresh_zone_stat_thresholds();
> > +	vmstat_late_init_done = 1;
> > +
> > +	return 0;
> > +}
> > +late_initcall(vmstat_late_init);
> 
> OK, so what's happening here.  Once all CPUs are online and running
> around doing heaven knows what, one of the CPUs sets up everyone's
> thresholds.  So for a period, all the other CPUs are running with
> inappropriate threshold values.
> 
> So what are all the other CPUs doing at this point in time, and why is
> it safe to leave their thresholds in an inappropriate state while they
> are doing it?

From what I understand, these threshold values are primarily used by
userspace tools, and this data will only be useful post late_initcall.

If there’s a more effective approach to handle this, please let me know,
and I can investigate further.

- Saurabh
Andrew Morton Aug. 10, 2024, 7:04 a.m. UTC | #6
On Mon, 8 Jul 2024 21:57:50 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:

> > > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > > takes around 224 ms total for all the CPUs in the system under test.
> > > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > > takes around 4.5 seconds total for all the CPUs in the system under test.
> > 
> > Did you measure the overall before-and-after times?  IOW, how much of
> > that 4.5s do we reclaim?
> 
> This entire gain is reflected in the overall boot time. Most of the Linux
> kernel boot process is sequential and doesn't take advantage of SMP.

Again, if you were able to measure 4.5s without the patch then you are
able to measure how long this delay is with the patch.  Please share
that number.
Saurabh Singh Sengar Aug. 12, 2024, 4:37 a.m. UTC | #7
On Sat, Aug 10, 2024 at 12:04:04AM -0700, Andrew Morton wrote:
> On Mon, 8 Jul 2024 21:57:50 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:
> 
> > > > No NUMA = 1024*2*1*512  =   1,048,576 : Here refresh_zone_stat_thresholds
> > > > takes around 224 ms total for all the CPUs in the system under test.
> > > > 16 NUMA = 1024*2*16*512 =  16,777,216 : Here refresh_zone_stat_thresholds
> > > > takes around 4.5 seconds total for all the CPUs in the system under test.
> > > 
> > > Did you measure the overall before-and-after times?  IOW, how much of
> > > that 4.5s do we reclaim?
> > 
> > This entire gain is reflected in the overall boot time. Most of the Linux
> > kernel boot process is sequential and doesn't take advantage of SMP.
> 
> Again, if you were able to measure 4.5s without the patch then you are
> able to measure how long this delay is with the patch.  Please share
> that number.

If I understand your question correctly, you're asking about the total time
taken by refresh_zone_stat_thresholds across all its executions before and
after the fix.

Without this patch, refresh_zone_stat_thresholds was being called 1024 times.
After applying the patch, it is called only once, which is the same as the last
iteration of the earlier 1024 calls. In further testing with this patch, I observed
a 4.5-second improvement in the overall boot time due to this fix, which is the
same as the total time taken by refresh_zone_stat_thresholds without this patch,
leading me to reasonably conclude that refresh_zone_stat_thresholds now takes a
negligible amount of time (likely just a few milliseconds). If you would like
precise timing details for a single refresh_zone_stat_thresholds execution, please
let me know, and I can run the tests again and provide the results in a few days.

- Saurabh
Andrew Morton Aug. 13, 2024, 11:37 p.m. UTC | #8
On Sun, 11 Aug 2024 21:37:54 -0700 Saurabh Singh Sengar <ssengar@linux.microsoft.com> wrote:

> Without this patch, refresh_zone_stat_thresholds was being called 1024 times.
> After applying the patch, it is called only once, which is the same as the last
> iteration of the earlier 1024 calls. In further testing with this patch, I observed
> a 4.5-second improvement in the overall boot time due to this fix, which is the
> same as the total time taken by refresh_zone_stat_thresholds without this patch,
> leading me to reasonably conclude that refresh_zone_stat_thresholds now takes a
> negligible amount of time (likely just a few milliseconds).

Thanks, nice.  I pasted this into the changelog.
Christoph Lameter (Ampere) Aug. 23, 2024, 3:32 p.m. UTC | #9
On Mon, 8 Jul 2024, Saurabh Singh Sengar wrote:

> > > Calling this for each CPU is expensive when there are a large number
> > > of CPUs along with multiple NUMA nodes. Fix this by deferring
> > > refresh_zone_stat_thresholds so that it is called only once, after all
> > > the secondary CPUs are up. Also, register the DYN hooks to keep the
> > > existing hotplug functionality intact.
> > >
> >
> > Seems risky - we'll now have online CPUs which have uninitialized data,
> > yes?  What assurance do we have that this data won't be accessed?
>
> I understand that this data is only accessed by userspace tools, and they can
> only access it post late_initcall. Please let me know if there are any other
> cases; I will look to address them.

stat_threshold is used in all statistics functions that modify VM
counters. It is core to the functioning of the VM statistics.

However, if the threshold is zero (not initialized) then the VM counter
handling will simply be less effective because it will not do the per-CPU
counter diffs anymore. This may increase contention and eat up the benefit
you are getting from deferring the calculation of the thresholds.
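
For reference, the fast path that consumes stat_threshold looks roughly like
this (a simplified sketch of __mod_zone_page_state() from mm/vmstat.c, with
preemption handling omitted; the node-stat variant is analogous):

void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
			   long delta)
{
	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
	s8 __percpu *p = pcp->vm_stat_diff + item;
	long x = delta + __this_cpu_read(*p);
	long t = __this_cpu_read(pcp->stat_threshold);

	/* With t == 0, every single update spills into the shared counter. */
	if (unlikely(abs(x) > t)) {
		zone_page_state_add(x, zone, item);
		x = 0;
	}
	__this_cpu_write(*p, x);
}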

What may be more promising is to make it possible to calculate the
thresholds per CPU instead of recalculating the thresholds for every zone
again and again.
Saurabh Singh Sengar Aug. 28, 2024, 5:37 a.m. UTC | #10
On Fri, Aug 23, 2024 at 08:32:43AM -0700, Christoph Lameter (Ampere) wrote:
> On Mon, 8 Jul 2024, Saurabh Singh Sengar wrote:
> 
> > > > Calling this for each CPU is expensive when there are a large number
> > > > of CPUs along with multiple NUMA nodes. Fix this by deferring
> > > > refresh_zone_stat_thresholds so that it is called only once, after all
> > > > the secondary CPUs are up. Also, register the DYN hooks to keep the
> > > > existing hotplug functionality intact.
> > > >
> > >
> > > Seems risky - we'll now have online CPUs which have uninitialized data,
> > > yes?  What assurance do we have that this data won't be accessed?
> >
> > I understand that this data is only accessed by userspace tools, and they can
> > only access it post late_initcall. Please let me know if there are any other
> > cases; I will look to address them.
> 
> stat_threshold is used in all statistics functions that modify VM
> counters. It is core to the functioning of the VM statistics.
> 
> However, if the threshold is zero (not initialized) then the VM counter
> handling will simply be less effective because it will not do the per-CPU
> counter diffs anymore. This may increase contention and eat up the benefit
> you are getting from deferring the calculation of the thresholds.

Christoph,

Thank you for your review. I would like to gain a better understanding of how
to measure contention. Could you please inform me if there is any recommended
method for doing so?

In my testing, this patch has resulted in a significant improvement in boot time.

> 
> What may be more promising is to make it possible to calculate the
> thresholds per CPU instead of recalculating the thresholds for every zone
> again and again.

I am happy to explore alternatives; can you please share more details about
this approach? Are you referring to avoiding the repetition of the calculation
below?

mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);
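
For reference, that line sits in calculate_normal_threshold(), whose
calculation is roughly (simplified from mm/vmstat.c):

static int calculate_normal_threshold(struct zone *zone)
{
	int threshold;
	int mem;	/* memory in 128 MB units */

	mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);
	threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

	/* Maximum threshold is 125. */
	return min(125, threshold);
}

Note that it also depends on num_online_cpus(), which is why the existing
code recomputes it on every CPU online.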

- Saurabh
Christoph Lameter (Ampere) Aug. 28, 2024, 3:43 p.m. UTC | #11
On Tue, 27 Aug 2024, Saurabh Singh Sengar wrote:

> Thank you for your review. I would like to gain a better understanding of how
> to measure contention. Could you please inform me if there is any recommended
> method for doing so?
>
> In my testing, this patch has resulted in a significant improvement in boot time.

Good.

Andrew raised the question of what happens if the threshold is not
initialized. I have no specific benchmarks to measure the effect.

> > What may be more promising is to make it possible to calculate the
> > thresholds per CPU instead of recalculating the thresholds for every zone
> > again and again.
>
> I am happy to explore alternatives; can you please share more details about
> this approach? Are you referring to avoiding the repetition of the calculation
> below?
>
> mem = zone_managed_pages(zone) >> (27 - PAGE_SHIFT);

The thresholds vary with the zone setup in a NUMA node. So one approach
would be to calculate these values for each new node once, and only
update them when memory is brought online or offline.

Then the parameters for each per-CPU pcp could be set from the per-node /
zone information when a CPU is brought up. The overhead would be minimal
and there would be no need for the loops.
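
A hypothetical sketch of that direction (cached_threshold and
node_refresh_thresholds() are illustrative names, not existing kernel API,
and this glosses over the num_online_cpus() dependence in
calculate_normal_threshold()):

/* Recompute once per node, only when its memory layout changes. */
static void node_refresh_thresholds(struct pglist_data *pgdat)
{
	struct zone *zone;

	for (zone = pgdat->node_zones;
	     zone < pgdat->node_zones + MAX_NR_ZONES; zone++)
		if (populated_zone(zone))
			zone->cached_threshold = calculate_normal_threshold(zone);
}

static int vmstat_cpu_online(unsigned int cpu)
{
	struct zone *zone;

	/* Copy the cached per-zone values; no loops over all other CPUs. */
	for_each_populated_zone(zone)
		per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold =
			zone->cached_threshold;

	/* remainder (N_CPU node state handling) unchanged */
	return 0;
}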

My company has systems with similar numbers of CPUs, and it would be
great to have a clean solution here.

Patch

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6e3347789eb2..5c0df62238d9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -31,6 +31,7 @@ 
 
 #include "internal.h"
 
+static int vmstat_late_init_done;
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
@@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
 
 static int vmstat_cpu_online(unsigned int cpu)
 {
-	refresh_zone_stat_thresholds();
+	if (vmstat_late_init_done)
+		refresh_zone_stat_thresholds();
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
@@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+static int __init vmstat_late_init(void)
+{
+	refresh_zone_stat_thresholds();
+	vmstat_late_init_done = 1;
+
+	return 0;
+}
+late_initcall(vmstat_late_init);
 #endif
 
 struct workqueue_struct *mm_percpu_wq;