
[3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events

Message ID 20210521102826.28552-4-mgorman@techsingularity.net (mailing list archive)
State New, archived
Series Calculate pcp->high based on zone sizes and active CPUs

Commit Message

Mel Gorman May 21, 2021, 10:28 a.m. UTC
The PCP high watermark is based on the number of online CPUs so the
watermarks must be adjusted during CPU hotplug. At the time of
hot-remove, the number of online CPUs is already adjusted but during
hot-add, a delta needs to be applied to update PCP to the correct
value. After this patch is applied, the high watermarks are adjusted
correctly.

  # grep high: /proc/zoneinfo  | tail -1
              high:  649
  # echo 0 > /sys/devices/system/cpu/cpu4/online
  # grep high: /proc/zoneinfo  | tail -1
              high:  664
  # echo 1 > /sys/devices/system/cpu/cpu4/online
  # grep high: /proc/zoneinfo  | tail -1
              high:  649

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/cpuhotplug.h |  2 +-
 mm/internal.h              |  2 +-
 mm/memory_hotplug.c        |  4 ++--
 mm/page_alloc.c            | 35 +++++++++++++++++++++++++----------
 4 files changed, 29 insertions(+), 14 deletions(-)

Comments

Dave Hansen May 21, 2021, 10:13 p.m. UTC | #1
On 5/21/21 3:28 AM, Mel Gorman wrote:
> The PCP high watermark is based on the number of online CPUs so the
> watermarks must be adjusted during CPU hotplug. At the time of
> hot-remove, the number of online CPUs is already adjusted but during
> hot-add, a delta needs to be applied to update PCP to the correct
> value. After this patch is applied, the high watermarks are adjusted
> correctly.
> 
>   # grep high: /proc/zoneinfo  | tail -1
>               high:  649
>   # echo 0 > /sys/devices/system/cpu/cpu4/online
>   # grep high: /proc/zoneinfo  | tail -1
>               high:  664
>   # echo 1 > /sys/devices/system/cpu/cpu4/online
>   # grep high: /proc/zoneinfo  | tail -1
>               high:  649

This is actually a comment more about the previous patch, but it doesn't
really become apparent until the example above.

In your example, you mentioned increased exit() performance by using
"vm.percpu_pagelist_fraction to increase the pcp->high value".  That's
presumably because of the increased batching effects and fewer lock
acquisitions.
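
(For reference, the batching effect comes from the free path taking
zone->lock only once per pcp->batch pages. A simplified sketch of what
the free_unref_page() path boils down to -- not the exact kernel code:

	/* free to the per-cpu list without touching zone->lock */
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= READ_ONCE(pcp->high))
		/* only now take zone->lock, returning batch pages at once */
		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);

so a larger pcp->high directly means fewer zone->lock acquisitions per
page freed.)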

But, logically, doesn't that mean that, the more CPUs you have in a
node, the *higher* you want pcp->high to be?  If we took this to the
extreme and had an absurd number of CPUs in a node, we could end up with
a too-small pcp->high value.
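
(Putting hypothetical numbers on the extreme: a zone with a low watermark
of ~29000 pages and 256 local CPUs would get pcp->high = 29000 / 256 = 113
pages, not far above a typical pcp->batch of 63.)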

Also, do you worry at all about a zone with a low min_free_kbytes seeing
increased zone lock contention?

...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf5cdc466e6c..2761b03b3a44 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone)
>  #endif
>  }
>  
> -static int zone_highsize(struct zone *zone)
> +static int zone_highsize(struct zone *zone, int cpu_online)
>  {
>  #ifdef CONFIG_MMU
>  	int high;
> @@ -6640,7 +6640,7 @@ static int zone_highsize(struct zone *zone)
>  	 * CPUs local to a zone. Note that early in boot that CPUs may
>  	 * not be online yet.
>  	 */
> -	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
>  	high = low_wmark_pages(zone) / nr_local_cpus;

Is this "+ cpu_online" bias because the CPU isn't in cpumask_of_node()
when the CPU hotplug callback occurs?  If so, it might be nice to mention.
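
(Plugging hypothetical but plausible numbers into the formula, consistent
with the changelog example:

	  low_wmark_pages(zone) = 29232, 45 local CPUs: 29232 / 45 = 649
	  one CPU offlined, 44 remaining:               29232 / 44 = 664

which reproduces the high: values in the /proc/zoneinfo output above.)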
Mel Gorman May 24, 2021, 9:07 a.m. UTC | #2
On Fri, May 21, 2021 at 03:13:35PM -0700, Dave Hansen wrote:
> On 5/21/21 3:28 AM, Mel Gorman wrote:
> > The PCP high watermark is based on the number of online CPUs so the
> > watermarks must be adjusted during CPU hotplug. At the time of
> > hot-remove, the number of online CPUs is already adjusted but during
> > hot-add, a delta needs to be applied to update PCP to the correct
> > value. After this patch is applied, the high watermarks are adjusted
> > correctly.
> > 
> >   # grep high: /proc/zoneinfo  | tail -1
> >               high:  649
> >   # echo 0 > /sys/devices/system/cpu/cpu4/online
> >   # grep high: /proc/zoneinfo  | tail -1
> >               high:  664
> >   # echo 1 > /sys/devices/system/cpu/cpu4/online
> >   # grep high: /proc/zoneinfo  | tail -1
> >               high:  649
> 
> This is actually a comment more about the previous patch, but it doesn't
> really become apparent until the example above.
> 
> In your example, you mentioned increased exit() performance by using
> "vm.percpu_pagelist_fraction to increase the pcp->high value".  That's
> presumably because of the increased batching effects and fewer lock
> acquisitions.
> 

Yes

> But, logically, doesn't that mean that, the more CPUs you have in a
> node, the *higher* you want pcp->high to be?  If we took this to the
> extreme and had an absurd number of CPUs in a node, we could end up with
> a too-small pcp->high value.
> 

I see your point but I don't think increasing pcp->high for larger
numbers of CPUs is the right answer because then reclaim can be
triggered simply because too many PCPs have pages.
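
(The current formula keeps the worst case bounded: pages sitting on PCP
lists are not visible to the watermark checks, and with the previous
patch's calculation the total that the PCPs can cache is roughly

	  nr_cpus * pcp->high = nr_cpus * (low_wmark_pages / nr_cpus)
	                      ~= low_wmark_pages

independent of the CPU count. Scaling pcp->high up with the number of
CPUs would instead let the PCPs strand a multiple of the watermark, which
is what can spuriously trigger reclaim.)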

To address your point requires much deeper surgery. zone->lock would have
to be split to being a metadata lock and a free page lock. Then the free
areas would have to be split based on some factor -- number of CPUs or
memory size. That gets complex because the page allocator loop would need
to walk multiple arenas as well as multiple zones, and to consider which
arena should be examined first. Fragmentation would also have to be
considered because a decision would need to be made on whether a pageblock
should fragment or whether other local arenas should be examined. Anything that
walks PFNs such as compaction would also need to be aware of arenas and
their associated locks. Finally every acquisition of zone->lock would
have to be audited to determine exactly what it is protecting. Even with
all that, it still makes sense to disassociate pcp->high from pcp->batch
as this series does.
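
(A hypothetical sketch of the shape such a split might take -- purely
illustrative, not something proposed by this series:

	/* free pages split into arenas, each with its own lock */
	struct free_arena {
		spinlock_t		lock;		/* protects free_area[] */
		struct free_area	free_area[MAX_ORDER];
		unsigned long		nr_free;
	};

	/* zone->lock reduced to a metadata lock */
	struct zone {
		spinlock_t		meta_lock;
		struct free_arena	*arenas;	/* sized by CPUs or memory */
		unsigned int		nr_arenas;
		/* ... */
	};

Every current zone->lock site would then have to choose between meta_lock
and the correct arena lock, which is the audit described above.)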

There is value in doing something like this, but it's beyond what this
series is trying to do, and doing the work without introducing regressions
would be very difficult.

> Also, do you worry at all about a zone with a low min_free_kbytes seeing
> increased zone lock contention?
> 
> ...
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bf5cdc466e6c..2761b03b3a44 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone)
> >  #endif
> >  }
> >  
> > -static int zone_highsize(struct zone *zone)
> > +static int zone_highsize(struct zone *zone, int cpu_online)
> >  {
> >  #ifdef CONFIG_MMU
> >  	int high;
> > @@ -6640,7 +6640,7 @@ static int zone_highsize(struct zone *zone)
> >  	 * CPUs local to a zone. Note that early in boot that CPUs may
> >  	 * not be online yet.
> >  	 */
> > -	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> > +	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> >  	high = low_wmark_pages(zone) / nr_local_cpus;
> 
> Is this "+ cpu_online" bias because the CPU isn't in cpumask_of_node()
> when the CPU hotplug callback occurs?  If so, it might be nice to mention.

Fixed.
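
(Presumably via a comment at the nr_local_cpus calculation; the exact v2
wording isn't shown here. The point would be something like:

	/*
	 * The CPU being onlined is not yet set in cpumask_of_node() when
	 * the hotplug callback runs, hence the cpu_online delta. No delta
	 * is needed for hot-remove as the mask is already updated.
	 */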
Dave Hansen May 24, 2021, 3:52 p.m. UTC | #3
On 5/24/21 2:07 AM, Mel Gorman wrote:
> On Fri, May 21, 2021 at 03:13:35PM -0700, Dave Hansen wrote:
>> On 5/21/21 3:28 AM, Mel Gorman wrote:
>>> The PCP high watermark is based on the number of online CPUs so the
>>> watermarks must be adjusted during CPU hotplug. At the time of
>>> hot-remove, the number of online CPUs is already adjusted but during
>>> hot-add, a delta needs to be applied to update PCP to the correct
>>> value. After this patch is applied, the high watermarks are adjusted
>>> correctly.
>>>
>>>   # grep high: /proc/zoneinfo  | tail -1
>>>               high:  649
>>>   # echo 0 > /sys/devices/system/cpu/cpu4/online
>>>   # grep high: /proc/zoneinfo  | tail -1
>>>               high:  664
>>>   # echo 1 > /sys/devices/system/cpu/cpu4/online
>>>   # grep high: /proc/zoneinfo  | tail -1
>>>               high:  649
>> This is actually a comment more about the previous patch, but it doesn't
>> really become apparent until the example above.
>>
>> In your example, you mentioned increased exit() performance by using
>> "vm.percpu_pagelist_fraction to increase the pcp->high value".  That's
>> presumably because of the increased batching effects and fewer lock
>> acquisitions.
>>
> Yes
> 
>> But, logically, doesn't that mean that, the more CPUs you have in a
>> node, the *higher* you want pcp->high to be?  If we took this to the
>> extreme and had an absurd number of CPUs in a node, we could end up with
>> a too-small pcp->high value.
>>
> I see your point but I don't think increasing pcp->high for larger
> numbers of CPUs is the right answer because then reclaim can be
> triggered simply because too many PCPs have pages.
> 
> To address your point requires much deeper surgery.
...
> There is value in doing something like this, but it's beyond what this
> series is trying to do, and doing the work without introducing regressions
> would be very difficult.

Agreed, such a solution is outside of the scope of what this set is
trying to do.

It would be nice to touch on this counter-intuitive property in the
changelog, and *maybe* add a WARN_ON_ONCE() if we hit an edge case.
Maybe WARN_ON_ONCE() if pcp->high gets below pcp->batch*SOMETHING.
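
Something along these lines would do it -- the placement and multiplier
here are only illustrative:

	/* e.g. at the end of zone_set_pageset_high_and_batch() */
	WARN_ON_ONCE(new_high < new_batch * 4);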
Mel Gorman May 24, 2021, 4:01 p.m. UTC | #4
On Mon, May 24, 2021 at 08:52:02AM -0700, Dave Hansen wrote:
> > To address your point requires much deeper surgery.
> ...
> > There is value to doing something like this but it's beyond what this
> > series is trying to do and doing the work without introducing regressions
> > would be very difficult.
> 
> Agreed, such a solution is outside of the scope of what this set is
> trying to do.
> 
> It would be nice to touch on this counter-intuitive property in the
> changelog, and *maybe* add a WARN_ON_ONCE() if we hit an edge case.
> Maybe WARN_ON_ONCE() if pcp->high gets below pcp->batch*SOMETHING.
> 

I think it's reasonable to ensure pcp->batch is never above pcp->high so
I have this in zone_highsize now:

+       /*
+        * Ensure high is at least batch*4. The multiple is based on the
+        * historical relationship between high and batch.
+        */
+       high = max(high, batch << 2);
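
On a typical x86 configuration, where zone_batchsize() tops out at 63,
that floors high at 63 << 2 = 252 pages per CPU.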

Performance tests are running and I'll post a v2 assuming they pass.

Patch

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..47e13582d9fc 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -54,7 +54,7 @@  enum cpuhp_state {
 	CPUHP_MM_MEMCQ_DEAD,
 	CPUHP_PERCPU_CNT_DEAD,
 	CPUHP_RADIX_DEAD,
-	CPUHP_PAGE_ALLOC_DEAD,
+	CPUHP_PAGE_ALLOC,
 	CPUHP_NET_DEV_DEAD,
 	CPUHP_PCI_XGENE_DEAD,
 	CPUHP_IOMMU_IOVA_DEAD,
diff --git a/mm/internal.h b/mm/internal.h
index 54bd0dc2c23c..651250e59ef5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -221,7 +221,7 @@  extern int user_min_free_kbytes;
 extern void free_unref_page(struct page *page);
 extern void free_unref_page_list(struct list_head *list);
 
-extern void zone_pcp_update(struct zone *zone);
+extern void zone_pcp_update(struct zone *zone, int cpu_online);
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
 extern void zone_pcp_enable(struct zone *zone);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..bebb3cead810 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -961,7 +961,7 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z
 	node_states_set_node(nid, &arg);
 	if (need_zonelists_rebuild)
 		build_all_zonelists(NULL);
-	zone_pcp_update(zone);
+	zone_pcp_update(zone, 0);
 
 	/* Basic onlining is complete, allow allocation of onlined pages. */
 	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1835,7 +1835,7 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 		zone_pcp_reset(zone);
 		build_all_zonelists(NULL);
 	} else
-		zone_pcp_update(zone);
+		zone_pcp_update(zone, 0);
 
 	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf5cdc466e6c..2761b03b3a44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6628,7 +6628,7 @@  static int zone_batchsize(struct zone *zone)
 #endif
 }
 
-static int zone_highsize(struct zone *zone)
+static int zone_highsize(struct zone *zone, int cpu_online)
 {
 #ifdef CONFIG_MMU
 	int high;
@@ -6640,7 +6640,7 @@  static int zone_highsize(struct zone *zone)
 	 * CPUs local to a zone. Note that early in boot that CPUs may
 	 * not be online yet.
 	 */
-	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
 	high = low_wmark_pages(zone) / nr_local_cpus;
 
 	return high;
@@ -6708,12 +6708,12 @@  static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone based on the zone's size.
  */
-static void zone_set_pageset_high_and_batch(struct zone *zone)
+static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
 {
 	int new_high, new_batch;
 
 	new_batch = max(1, zone_batchsize(zone));
-	new_high = zone_highsize(zone);
+	new_high = zone_highsize(zone, cpu_online);
 
 	if (zone->pageset_high == new_high &&
 	    zone->pageset_batch == new_batch)
@@ -6743,7 +6743,7 @@  void __meminit setup_zone_pageset(struct zone *zone)
 		per_cpu_pages_init(pcp, pzstats);
 	}
 
-	zone_set_pageset_high_and_batch(zone);
+	zone_set_pageset_high_and_batch(zone, 0);
 }
 
 /*
@@ -8001,6 +8001,7 @@  void __init set_dma_reserve(unsigned long new_dma_reserve)
 
 static int page_alloc_cpu_dead(unsigned int cpu)
 {
+	struct zone *zone;
 
 	lru_add_drain_cpu(cpu);
 	drain_pages(cpu);
@@ -8021,6 +8022,19 @@  static int page_alloc_cpu_dead(unsigned int cpu)
 	 * race with what we are doing.
 	 */
 	cpu_vm_stats_fold(cpu);
+
+	for_each_populated_zone(zone)
+		zone_pcp_update(zone, 0);
+
+	return 0;
+}
+
+static int page_alloc_cpu_online(unsigned int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zone_pcp_update(zone, 1);
 	return 0;
 }
 
@@ -8046,8 +8060,9 @@  void __init page_alloc_init(void)
 		hashdist = 0;
 #endif
 
-	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD,
-					"mm/page_alloc:dead", NULL,
+	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC,
+					"mm/page_alloc:pcp",
+					page_alloc_cpu_online,
 					page_alloc_cpu_dead);
 	WARN_ON(ret < 0);
 }
@@ -8185,7 +8200,7 @@  static void __setup_per_zone_wmarks(void)
 		 * The watermark size have changed so update the pcpu batch
 		 * and high limits or the limits may be inappropriate.
 		 */
-		zone_set_pageset_high_and_batch(zone);
+		zone_set_pageset_high_and_batch(zone, 0);
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -9007,10 +9022,10 @@  EXPORT_SYMBOL(free_contig_range);
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalculated.
  */
-void __meminit zone_pcp_update(struct zone *zone)
+void zone_pcp_update(struct zone *zone, int cpu_online)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	zone_set_pageset_high_and_batch(zone);
+	zone_set_pageset_high_and_batch(zone, cpu_online);
 	mutex_unlock(&pcp_batch_high_lock);
 }