[2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch

Message ID 20210525080119.5455-3-mgorman@techsingularity.net
State New, archived
Series Calculate pcp->high based on zone sizes and active CPUs

Commit Message

Mel Gorman May 25, 2021, 8:01 a.m. UTC
The pcp high watermark is based on the batch size, but there is no
inherent relationship between them other than that the batch size is
convenient to use early in boot.

This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone, while the
batch size remains the same to avoid increasing allocation latencies.
The intent behind the default pcp->high is "set the number of PCP pages
such that if they are all full, background reclaim is not started
prematurely".

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments, but not after CPU hotplug events, which are handled later
in the series.

On a test KVM instance:

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  378
              batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63
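
For reference, the old value came from high = 6 * batch (hence
6 * 63 = 378 above), while the new value is
low_wmark_pages(zone) / nr_local_cpus clamped to a minimum of
batch * 4, which works out to 649 on this instance.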

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 18 deletions(-)

Comments

Vlastimil Babka May 26, 2021, 6:14 p.m. UTC | #1
On 5/25/21 10:01 AM, Mel Gorman wrote:
> The pcp high watermark is based on the batch size, but there is no
> inherent relationship between them other than that the batch size is
> convenient to use early in boot.
> 
> This patch takes the first step and bases pcp->high on the zone low
> watermark split across the number of CPUs local to a zone, while the
> batch size remains the same to avoid increasing allocation latencies.
> The intent behind the default pcp->high is "set the number of PCP pages
> such that if they are all full, background reclaim is not started
> prematurely".
> 
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments, but not after CPU hotplug events, which are handled later
> in the series.
> 
> On a test KVM instance:
> 
> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  378
>               batch: 63
> 
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
>               high:  649
>               batch: 63
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

...

> @@ -6637,6 +6628,34 @@ static int zone_batchsize(struct zone *zone)
>  #endif
>  }
>  
> +static int zone_highsize(struct zone *zone, int batch)
> +{
> +#ifdef CONFIG_MMU
> +	int high;
> +	int nr_local_cpus;
> +
> +	/*
> +	 * The high value of the pcp is based on the zone low watermark
> +	 * so that if they are full then background reclaim will not be
> +	 * started prematurely. The value is split across all online CPUs
> +	 * local to the zone. Note that early in boot CPUs may not be
> +	 * online yet.
> +	 */
> +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> +	high = low_wmark_pages(zone) / nr_local_cpus;
> +
> +	/*
> +	 * Ensure high is at least batch*4. The multiple is based on the
> +	 * historical relationship between high and batch.
> +	 */
> +	high = max(high, batch << 2);
> +
> +	return high;
> +#else
> +	return 0;
> +#endif
> +}
> +
>  /*
>   * pcp->high and pcp->batch values are related and generally batch is lower
>   * than high. They are also related to pcp->count such that count is lower
> @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
>   */
>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
> -	unsigned long new_high, new_batch;
> +	int new_high, new_batch;
>  
> -	new_batch = zone_batchsize(zone);
> -	new_high = 6 * new_batch;
> -	new_batch = max(1UL, 1 * new_batch);
> +	new_batch = max(1, zone_batchsize(zone));
> +	new_high = zone_highsize(zone, new_batch);
>  
>  	if (zone->pageset_high == new_high &&
>  	    zone->pageset_batch == new_batch)
> @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
>  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>  
> +		/*
> +		 * The watermark sizes have changed so update the pcpu batch
> +		 * and high limits or the limits may be inappropriate.
> +		 */
> +		zone_set_pageset_high_and_batch(zone);

Hm so this puts the call in the path of various watermark related sysctl
handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
help against zone_pcp_update() from a hotplug handler. On the other hand, since
hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
calls there are now redundant and could be removed, no?
But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
thus that one will not be protected against the watermark related sysctl
handlers that reach here.

To solve all this, seems like the static lock in setup_per_zone_wmarks() could
become a top-level visible lock and pcp high/batch updates could switch to that
one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
handlers could be removed.
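
I.e. something along these lines (untested sketch, lock name made up):

/* hypothetical: the lock that is currently function-local in
 * setup_per_zone_wmarks(), promoted to a visible top-level lock so
 * the pcp high/batch updaters can take it too */
DEFINE_SPINLOCK(zone_wmark_lock);

void setup_per_zone_wmarks(void)
{
	spin_lock(&zone_wmark_lock);
	__setup_per_zone_wmarks();
	spin_unlock(&zone_wmark_lock);
}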

> +
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
>  
>
Mel Gorman May 27, 2021, 10:52 a.m. UTC | #2
On Wed, May 26, 2021 at 08:14:13PM +0200, Vlastimil Babka wrote:
> > @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
> >   */
> >  static void zone_set_pageset_high_and_batch(struct zone *zone)
> >  {
> > -	unsigned long new_high, new_batch;
> > +	int new_high, new_batch;
> >  
> > -	new_batch = zone_batchsize(zone);
> > -	new_high = 6 * new_batch;
> > -	new_batch = max(1UL, 1 * new_batch);
> > +	new_batch = max(1, zone_batchsize(zone));
> > +	new_high = zone_highsize(zone, new_batch);
> >  
> >  	if (zone->pageset_high == new_high &&
> >  	    zone->pageset_batch == new_batch)
> > @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
> >  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
> >  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
> >  
> > +		/*
> > +		 * The watermark sizes have changed so update the pcpu batch
> > +		 * and high limits or the limits may be inappropriate.
> > +		 */
> > +		zone_set_pageset_high_and_batch(zone);
> 
> Hm so this puts the call in the path of various watermark related sysctl
> handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
> help against zone_pcp_update() from a hotplug handler. On the other hand, since
> hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
> calls there are now redundant and could be removed, no?
> But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
> thus that one will not be protected against the watermark related sysctl
> handlers that reach here.
> 
> To solve all this, seems like the static lock in setup_per_zone_wmarks() could
> become a top-level visible lock and pcp high/batch updates could switch to that
> one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
> handlers could be removed.
> 

Hmm, the locking has very different hold times. The static lock in
setup_per_zone_wmarks is a spinlock that protects against parallel updates
of watermarks and is held for a short duration. The pcp_batch_high_lock
is a mutex that is held for a relatively long time while memory is being
offlined and can sleep. Memory hotplug updates the watermarks without
pcp_batch_high_lock held, so overall unifying the locking there should
be a separate series.
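
Roughly, on the hotplug side (paths abbreviated):

offline_pages()
  zone_pcp_disable()
    mutex_lock(&pcp_batch_high_lock)   <- can sleep
  ... memory is offlined ...
  zone_pcp_enable()
    mutex_unlock(&pcp_batch_high_lock) <- held for the whole operation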

How about this as a fix for this patch?

---8<---
mm/page_alloc: Disassociate the pcp->high from pcp->batch -fix

Vlastimil Babka noted that __setup_per_zone_wmarks updating pcp->high
did not protect watermark-related sysctl handlers from parallel
memory hotplug operations. This patch moves the PCP update to
setup_per_zone_wmarks and updates the PCP high value while protected
by the pcp_batch_high_lock mutex.

This is a fix to the mmotm patch mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch.patch.
It'll cause a conflict with mm-page_alloc-adjust-pcp-high-after-cpu-hotplug-events.patch
but the resolution is simply to change the caller in setup_per_zone_wmarks
to zone_pcp_update(zone, 0).

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 329b71e41db4..b1b3c66e9d88 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8199,12 +8199,6 @@ static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
-		/*
-		 * The watermark sizes have changed so update the pcpu batch
-		 * and high limits or the limits may be inappropriate.
-		 */
-		zone_set_pageset_high_and_batch(zone);
-
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
@@ -8221,11 +8215,19 @@ static void __setup_per_zone_wmarks(void)
  */
 void setup_per_zone_wmarks(void)
 {
+	struct zone *zone;
 	static DEFINE_SPINLOCK(lock);
 
 	spin_lock(&lock);
 	__setup_per_zone_wmarks();
 	spin_unlock(&lock);
+
+	/*
+	 * The watermark sizes have changed so update the pcpu batch
+	 * and high limits or the limits may be inappropriate.
+	 */
+	for_each_zone(zone)
+		zone_pcp_update(zone);
 }
 
 /*
Vlastimil Babka May 28, 2021, 10:27 a.m. UTC | #3
On 5/27/21 12:52 PM, Mel Gorman wrote:
> On Wed, May 26, 2021 at 08:14:13PM +0200, Vlastimil Babka wrote:
>> > @@ -6698,11 +6717,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
>> >   */
>> >  static void zone_set_pageset_high_and_batch(struct zone *zone)
>> >  {
>> > -	unsigned long new_high, new_batch;
>> > +	int new_high, new_batch;
>> >  
>> > -	new_batch = zone_batchsize(zone);
>> > -	new_high = 6 * new_batch;
>> > -	new_batch = max(1UL, 1 * new_batch);
>> > +	new_batch = max(1, zone_batchsize(zone));
>> > +	new_high = zone_highsize(zone, new_batch);
>> >  
>> >  	if (zone->pageset_high == new_high &&
>> >  	    zone->pageset_batch == new_batch)
>> > @@ -8170,6 +8188,12 @@ static void __setup_per_zone_wmarks(void)
>> >  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>> >  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>> >  
>> > +		/*
>> > +		 * The watermark sizes have changed so update the pcpu batch
>> > +		 * and high limits or the limits may be inappropriate.
>> > +		 */
>> > +		zone_set_pageset_high_and_batch(zone);
>> 
>> Hm so this puts the call in the path of various watermark related sysctl
>> handlers, but it's not protected by pcp_batch_high_lock. The zone lock won't
>> help against zone_pcp_update() from a hotplug handler. On the other hand, since
>> hotplug handlers also call __setup_per_zone_wmarks(), the zone_pcp_update()
>> calls there are now redundant and could be removed, no?
>> But later there will be a new sysctl in patch 6/6 using pcp_batch_high_lock,
>> thus that one will not be protected against the watermark related sysctl
>> handlers that reach here.
>> 
>> To solve all this, seems like the static lock in setup_per_zone_wmarks() could
>> become a top-level visible lock and pcp high/batch updates could switch to that
>> one instead of own pcp_batch_high_lock. And zone_pcp_update() calls from hotplug
>> handlers could be removed.
>> 
> 
> Hmm, the locking has very different hold times. The static lock in
> setup_per_zone_wmarks is a spinlock that protects against parallel updates
> of watermarks and is held for a short duration. The pcp_batch_high_lock
> is a mutex that is held for a relatively long time while memory is being
> offlined and can sleep. Memory hotplug updates the watermarks without
> pcp_batch_high_lock held, so overall unifying the locking there should
> be a separate series.
> 
> How about this as a fix for this patch?
> 
> ---8<---
> mm/page_alloc: Disassociate the pcp->high from pcp->batch -fix
> 
> Vlastimil Babka noted that __setup_per_zone_wmarks updating pcp->high
> did not protect watermark-related sysctl handlers from parallel
> memory hotplug operations. This patch moves the PCP update to
> setup_per_zone_wmarks and updates the PCP high value while protected
> by the pcp_batch_high_lock mutex.
> 
> This is a fix to the mmotm patch mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch.patch.
> It'll cause a conflict with mm-page_alloc-adjust-pcp-high-after-cpu-hotplug-events.patch
> but the resolution is simply to change the caller in setup_per_zone_wmarks
> to zone_pcp_update(zone, 0).
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Looks fine. But I would also remove the redundancy introduced by this patch+fix,
as part of the patch:

online_pages()
  zone_pcp_update(zone); <- this predates the patch
  init_per_zone_wmark_min()
    setup_per_zone_wmarks()
      for_each_zone(zone)
         zone_pcp_update(zone); <- new in this patch

offline_pages() similarly
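
A sketch of that cleanup (hunks abbreviated, untested):

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ ... online_pages() ...
-	zone_pcp_update(zone);
@@ ... offline_pages() ...
-	zone_pcp_update(zone);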

In any case, for the fixed version:
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 329b71e41db4..b1b3c66e9d88 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8199,12 +8199,6 @@ static void __setup_per_zone_wmarks(void)
>  		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>  		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
>  
> -		/*
> -		 * The watermark sizes have changed so update the pcpu batch
> -		 * and high limits or the limits may be inappropriate.
> -		 */
> -		zone_set_pageset_high_and_batch(zone);
> -
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	}
>  
> @@ -8221,11 +8215,19 @@ static void __setup_per_zone_wmarks(void)
>   */
>  void setup_per_zone_wmarks(void)
>  {
> +	struct zone *zone;
>  	static DEFINE_SPINLOCK(lock);
>  
>  	spin_lock(&lock);
>  	__setup_per_zone_wmarks();
>  	spin_unlock(&lock);
> +
> +	/*
> +	 * The watermark sizes have changed so update the pcpu batch
> +	 * and high limits or the limits may be inappropriate.
> +	 */
> +	for_each_zone(zone)
> +		zone_pcp_update(zone);
>  }
>  
>  /*
>

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a48f305f0381..c0536e5d088a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2163,14 +2163,6 @@  void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);
 
-	/*
-	 * The number of managed pages has changed due to the initialisation
-	 * so the pcpu batch and high limits needs to be updated or the limits
-	 * will be artificially small.
-	 */
-	for_each_populated_zone(zone)
-		zone_pcp_update(zone);
-
 	/*
 	 * We initialized the rest of the deferred pages.  Permanently disable
 	 * on-demand struct page initialization.
@@ -6594,13 +6586,12 @@  static int zone_batchsize(struct zone *zone)
 	int batch;
 
 	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.
+	 * The number of pages to batch allocate is either ~0.1%
+	 * of the zone or 1MB, whichever is smaller. The batch
+	 * size strikes a balance between allocation latency
+	 * and zone lock contention.
 	 */
-	batch = zone_managed_pages(zone) / 1024;
-	/* But no more than a meg. */
-	if (batch * PAGE_SIZE > 1024 * 1024)
-		batch = (1024 * 1024) / PAGE_SIZE;
+	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
 	batch /= 4;		/* We effectively *= 4 below */
 	if (batch < 1)
 		batch = 1;
@@ -6637,6 +6628,34 @@  static int zone_batchsize(struct zone *zone)
 #endif
 }
 
+static int zone_highsize(struct zone *zone, int batch)
+{
+#ifdef CONFIG_MMU
+	int high;
+	int nr_local_cpus;
+
+	/*
+	 * The high value of the pcp is based on the zone low watermark
+	 * so that if they are full then background reclaim will not be
+	 * started prematurely. The value is split across all online CPUs
+	 * local to the zone. Note that early in boot CPUs may not be
+	 * online yet.
+	 */
+	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+	high = low_wmark_pages(zone) / nr_local_cpus;
+
+	/*
+	 * Ensure high is at least batch*4. The multiple is based on the
+	 * historical relationship between high and batch.
+	 */
+	high = max(high, batch << 2);
+
+	return high;
+#else
+	return 0;
+#endif
+}
+
 /*
  * pcp->high and pcp->batch values are related and generally batch is lower
  * than high. They are also related to pcp->count such that count is lower
@@ -6698,11 +6717,10 @@  static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
-	unsigned long new_high, new_batch;
+	int new_high, new_batch;
 
-	new_batch = zone_batchsize(zone);
-	new_high = 6 * new_batch;
-	new_batch = max(1UL, 1 * new_batch);
+	new_batch = max(1, zone_batchsize(zone));
+	new_high = zone_highsize(zone, new_batch);
 
 	if (zone->pageset_high == new_high &&
 	    zone->pageset_batch == new_batch)
@@ -8170,6 +8188,12 @@  static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
+		/*
+		 * The watermark sizes have changed so update the pcpu batch
+		 * and high limits or the limits may be inappropriate.
+		 */
+		zone_set_pageset_high_and_batch(zone);
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}