[RFC,0/6] Drain remote per-cpu directly v2

Message ID 20220509130805.20335-1-mgorman@techsingularity.net (mailing list archive)

Message

Mel Gorman May 9, 2022, 1:07 p.m. UTC
Changelog since v1
o Fix unsafe RT locking scheme
o Use spin_trylock on UP PREEMPT_RT

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority
task due to a workqueue item draining per-cpu page lists. While many
workloads can tolerate a brief interruption, it may cause a real-time
task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
the draining is non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
The local_lock on its own prevents migration and the IRQ disabling protects
from corruption due to an interrupt arriving while a page allocation is
in progress. The locking is inherently unsafe for remote access unless
the CPU is hot-removed.

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.
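
As a rough illustration of the split (a sketch only -- the struct layout
is simplified and the helper names local_pcp_op()/remote_drain_zone()
are invented for this mail, they are not names used in the patches):

#include <linux/spinlock.h>
#include <linux/local_lock.h>

struct per_cpu_pages {
	spinlock_t lock;	/* protects the lists below */
	int count;		/* pages on the lists */
	/* ... high, batch and the per-migratetype lists ... */
};

/* Local fast path: pin the CPU and exclude IRQs, then take the list lock. */
static void local_pcp_op(struct per_cpu_pages *pcp)
{
	unsigned long flags;

	local_lock_irqsave(&pagesets.lock, flags);	/* no migration, no IRQ re-entry */
	spin_lock(&pcp->lock);				/* protects the list contents */
	/* ... add or remove pages on the pcp lists ... */
	spin_unlock(&pcp->lock);
	local_unlock_irqrestore(&pagesets.lock, flags);
}

/* Remote drain: only the list contents need protecting. */
static void remote_drain_zone(struct zone *zone, struct per_cpu_pages *pcp)
{
	unsigned long flags;

	/* IRQs disabled only because free_pcppages_bulk expects zone->lock to be IRQ-safe */
	spin_lock_irqsave(&pcp->lock, flags);
	free_pcppages_bulk(zone, pcp->count, pcp, 0);
	spin_unlock_irqrestore(&pcp->lock, flags);
}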

This series is a partial series. Follow-on work should allow the
local_irq_save to be converted to a local_irq to avoid IRQs being
disabled/enabled in most cases. Consequently, there are some TODO comments
highlighting the places that would change if local_irq was used. However,
there are enough corner cases that it deserves a series on its own
separated by one kernel release and the priority right now is to avoid
interference of high priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
	and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
	this is not necessary but it avoids per_cpu_pages consuming another
	cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary for
	a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
	relying on local_lock to prevent migration, stabilise the pcp
	lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
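
Conceptually, patch 6 amounts to the following sketch (assumed shape
only; drain_all_pages_direct() is an invented name, the real patch keeps
the existing drain_all_pages()/drain_pages_zone() structure, and the
zone->per_cpu_pageset field name is assumed from 5.18): instead of
queueing a work item on each CPU, the requesting CPU walks the other
CPUs and frees their lists directly under pcp->lock.

/*
 * Conceptual sketch, not the actual hunk. Before the series a remote
 * drain had to be deferred to the target CPU's workqueue, roughly:
 *
 *	queue_work_on(cpu, mm_percpu_wq, &work);	-- deferred, may be delayed
 *
 * With pcp->lock protecting the list contents, the caller can do it inline:
 */
static void drain_all_pages_direct(struct zone *zone)
{
	unsigned int cpu;

	for_each_online_cpu(cpu) {
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

		if (pcp->count)
			drain_pages_zone(cpu, zone);	/* takes pcp->lock with IRQs disabled */
	}
}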

 include/linux/mm_types.h |   5 +
 include/linux/mmzone.h   |  12 +-
 mm/page_alloc.c          | 342 +++++++++++++++++++++++++--------------
 3 files changed, 230 insertions(+), 129 deletions(-)

Comments

Minchan Kim May 9, 2022, 3:58 p.m. UTC | #1
On Mon, May 09, 2022 at 02:07:59PM +0100, Mel Gorman wrote:
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT

Mel,


Is the only change from the previous version the delta you fixed based
on feedback from Vlastimil and me?

And is it still RFC?

Do you have some benchmark data?

I'd like to give Acked-by/Tested-by (even though there are a few
more places to align with the new field names in 1/6) since I have
tested this patchset in my workload and didn't spot any other
problems.

I really support merging this patchset since draining the pcp lists
via a workqueue is really harmful in the reclaim/CMA path of the
Android workload, which has a variety of process priority controls
and gets stuck on the pcp draining.

> 
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-time
> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> the draining in non-deterministic.
> 
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
> 
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
> 
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.
> 
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> 	and when it is storing per-cpu pages.
> 
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> 	this is not necessary but it avoids per_cpu_pages consuming another
> 	cache line.
> 
> Patch 3 is a preparation patch to avoid code duplication.
> 
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
> 	a later patch to avoid code duplication.
> 
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> 	relying on local_lock to prevent migration, stabilise the pcp
> 	lookup and prevent IRQ reentrancy.
> 
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
> 
>  include/linux/mm_types.h |   5 +
>  include/linux/mmzone.h   |  12 +-
>  mm/page_alloc.c          | 342 +++++++++++++++++++++++++--------------
>  3 files changed, 230 insertions(+), 129 deletions(-)
> 
> -- 
> 2.34.1
> 
>
Mel Gorman May 10, 2022, 9:27 a.m. UTC | #2
On Mon, May 09, 2022 at 08:58:51AM -0700, Minchan Kim wrote:
> On Mon, May 09, 2022 at 02:07:59PM +0100, Mel Gorman wrote:
> > Changelog since v1
> > o Fix unsafe RT locking scheme
> > o Use spin_trylock on UP PREEMPT_RT
> 
> Mel,
> 
> 
> Is this only change from previous last version which has some
> delta you fixed based on Vlastimil and me?
> 

Full diff is below although it can also be generated by
comparing the mm-pcpdrain-v1r8..mm-pcpdrain-v2r1 branches in
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/

> And is it still RFC?
> 

It's still RFC because it's a different approach to Nicolas' series and
I want at least his Acked-by before taking the RFC label out and sending
it to Andrew.

> Do you have some benchmark data?
> 

Yes, but as reclaim is not fundamentally altered the main difference
in behaviour is that work is done inline instead of being deferred to a
workqueue. That means in some cases, system CPU usage of a task will be
higher because it's paying the cost directly.

The workloads I used just hit reclaim directly to make sure it's
functionally not broken. There is no change in page aging decisions,
only timing of drains. I didn't check interference of a heavy workload
interfering with a CPU-bound workload running on NOHZ CPUs as I assumed
both you and Nicolas had a test case ready to use.

The main one I paid interest to was a fault latency benchmark in
the presence of heavy reclaim called stutterp. It simulates a simple
workload. One part uses a lot of anonymous memory, a second measures mmap
latency and a third copies a large file.  The primary metric is checking
for mmap latency. It was originally put together to debug interactivity
issues on a desktop in the presence of heavy IO where the desktop
applications were being pushed to backing storage.

stutterp
                              5.18.0-rc1             5.18.0-rc1
                                 vanilla       mm-pcpdrain-v2r1
1st-qrtle mmap-4      15.9557 (   0.00%)     15.4045 (   3.45%)
1st-qrtle mmap-6      10.8025 (   0.00%)     11.1204 (  -2.94%)
1st-qrtle mmap-8      16.9338 (   0.00%)     17.0595 (  -0.74%)
1st-qrtle mmap-12     41.4746 (   0.00%)      9.4003 (  77.33%)
1st-qrtle mmap-18     47.7123 (   0.00%)    100.0275 (-109.65%)
1st-qrtle mmap-24     17.7098 (   0.00%)     16.9633 (   4.22%)
1st-qrtle mmap-30     69.2565 (   0.00%)     38.2205 (  44.81%)
1st-qrtle mmap-32     49.1295 (   0.00%)     46.8819 (   4.57%)
3rd-qrtle mmap-4      18.4706 (   0.00%)     17.4799 (   5.36%)
3rd-qrtle mmap-6      11.4526 (   0.00%)     11.5567 (  -0.91%)
3rd-qrtle mmap-8      19.5903 (   0.00%)     19.0046 (   2.99%)
3rd-qrtle mmap-12     50.3095 (   0.00%)     25.3254 (  49.66%)
3rd-qrtle mmap-18     67.3319 (   0.00%)    147.6404 (-119.27%)
3rd-qrtle mmap-24     41.3779 (   0.00%)     84.4035 (-103.98%)
3rd-qrtle mmap-30    127.1375 (   0.00%)    148.8884 ( -17.11%)
3rd-qrtle mmap-32     79.7423 (   0.00%)    182.3042 (-128.62%)
Max-99    mmap-4      46.9123 (   0.00%)     49.7731 (  -6.10%)
Max-99    mmap-6      42.5414 (   0.00%)     16.6173 (  60.94%)
Max-99    mmap-8      43.1237 (   0.00%)     23.3478 (  45.86%)
Max-99    mmap-12     62.8025 (   0.00%)   1947.3862 (-3000.81%)
Max-99    mmap-18  27936.8695 (   0.00%)    232.7122 (  99.17%)
Max-99    mmap-24 204543.9436 (   0.00%)   5805.2478 (  97.16%)
Max-99    mmap-30   2350.0289 (   0.00%)  10300.6344 (-338.32%)
Max-99    mmap-32  56164.2271 (   0.00%)   7789.7526 (  86.13%)
Max       mmap-4     840.3468 (   0.00%)   1137.4462 ( -35.35%)
Max       mmap-6  255233.3996 (   0.00%)  91304.0952 (  64.23%)
Max       mmap-8  210910.6497 (   0.00%) 117931.0796 (  44.08%)
Max       mmap-12 108268.9537 (   0.00%) 319971.6910 (-195.53%)
Max       mmap-18 608805.3195 (   0.00%) 197483.2205 (  67.56%)
Max       mmap-24 327697.5605 (   0.00%) 382842.5356 ( -16.83%)
Max       mmap-30 688684.5335 (   0.00%) 669992.7705 (   2.71%)
Max       mmap-32 396842.0114 (   0.00%) 415978.2539 (  -4.82%)

                  5.18.0-rc1  5.18.0-rc1
                     vanilla   mm-pcpdrain-v2r1
Duration User        1438.08     1637.21
Duration System     12267.41    10307.96
Duration Elapsed     3929.70     3443.53


It's a mixed bag but this workload is always a mixed bag and it's stressing
reclaim.  At some points, latencies are worse, in others better. Overall,
it completed faster and this was on a 1-socket machine.

On a 2-socket machine, the overall completion times were worse

                  5.18.0-rc1  5.18.0-rc1
                     vanilla   mm-pcpdrain-v2r1
Duration User        3713.75     2899.90
Duration System    303507.56   378909.94
Duration Elapsed    15444.59    19067.40

In general this type of workload is variable given the nature of what it
does and can give different results on each execution. When originally
designed, it was to deal with stalls lasting several seconds to reduce
them to the sub-millisecond range.

The intent of the series is switching out-of-line work to in-line so
what it should be measuring is interference effects and not straight-line
performance and I haven't written a specific test case yet. When writing
the series initially, it was to investigate if the PCP could be lockless
and failing that, if disabling IRQs could be avoided in the common case.
It just turned out that part of that made remote draining possible and
I focused closer on that because it's more important.

> I'd like to give Acked-by/Tested-by(even though there are a few
> more places to align with new fields name in 1/6)

Which ones are of concern?

Some of the page->lru references I left alone in the init paths simply
because in those contexts, the page wasn't on a buddy or PCP list. In
free_unref_page_list the page is not on the LRU, it's just been isolated
from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
and is just a list placeholder so I left it alone. In
free_tail_pages_check, the context was a page that was likely previously
on an LRU.
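
(For reference, the aliasing that patch 1 documents is roughly the
following -- a simplified sketch, not the exact mm_types.h hunk:)

/*
 * Simplified sketch of the struct page list-field aliasing, kernel
 * context assumed. The same storage is viewed as page->lru when the
 * page sits on an LRU list, page->buddy_list on a free_area list and
 * page->pcp_list on a per-cpu list, so list walkers should name the
 * list they actually expect, e.g.
 *
 *	page = list_first_entry(list, struct page, pcp_list);
 */
struct page {
	union {
		struct list_head lru;		/* page is on an LRU list */
		struct list_head buddy_list;	/* page is on a buddy free list */
		struct list_head pcp_list;	/* page is on a per-cpu list */
	};
	/* ... the rest of struct page is unchanged ... */
};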

> since I have
> tested these patchset in my workload and didn't spot any other
> problems.
> 

Can you describe this workload, is it available anywhere and does it
require Android to execute?

If you have positive results, it would be appreciated if you could post
them or just note in a Tested-by/Acked-by that it had a measurable impact
on the reclaim/cma path.

> I really support this patch to be merged since the pcp draining
> using workqueue is really harmful in the reclaim/cma path of
> Android workload which has a variety of process priority control
> and be stuck on the pcp draining.
> 

Diff between v1 and v2

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17d11eb0413e..4ac39d30ec8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,8 +132,11 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
-#ifdef CONFIG_SMP
-/* On SMP, spin_trylock is sufficient protection */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
 #define pcp_trylock_prepare(flags)	do { } while (0)
 #define pcp_trylock_finish(flag)	do { } while (0)
 #else
@@ -3091,14 +3094,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	if (to_drain > 0) {
 		unsigned long flags;
 
-		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
-		local_irq_save(flags);
-
-		spin_lock(&pcp->lock);
+		/*
+		 * free_pcppages_bulk expects IRQs disabled for zone->lock
+		 * so even though pcp->lock is not intended to be IRQ-safe,
+		 * it's needed in this context.
+		 */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock(&pcp->lock);
-
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
 	}
 }
 #endif
@@ -3114,14 +3117,10 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	if (pcp->count) {
 		unsigned long flags;
 
-		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
-		local_irq_save(flags);
-
-		spin_lock(&pcp->lock);
+		/* See drain_zone_pages on why this is disabling IRQs */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
-		spin_unlock(&pcp->lock);
-
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
 	}
 }
 
@@ -3480,6 +3479,13 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
 
+	/*
+	 * Preparation could have drained the list due to failing to prepare
+	 * or all pages are being isolated.
+	 */
+	if (list_empty(list))
+		return;
+
 	VM_BUG_ON(in_hardirq());
 
 	local_lock_irqsave(&pagesets.lock, flags);
@@ -3515,6 +3521,9 @@ void free_unref_page_list(struct list_head *list)
 		 * allocator directly. This is expensive as the zone lock will
 		 * be acquired multiple times but if a drain is in progress
 		 * then an expensive operation is already taking place.
+		 *
+		 * TODO: Always false at the moment due to local_lock_irqsave
+		 *       and is preparation for converting to local_lock.
 		 */
 		if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
 			free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
@@ -3717,9 +3726,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
 	 * uses rmqueue_buddy.
 	 *
-	 * TODO: Convert local_lock_irqsave to local_lock. Care
-	 * 	 is needed as the type of local_lock would need a
-	 * 	 PREEMPT_RT version due to threaded IRQs.
+	 * TODO: Convert local_lock_irqsave to local_lock.
 	 */
 	if (unlikely(!locked)) {
 		pcp_trylock_prepare(UP_flags);
Minchan Kim May 10, 2022, 6:13 p.m. UTC | #3
On Tue, May 10, 2022 at 10:27:33AM +0100, Mel Gorman wrote:
> On Mon, May 09, 2022 at 08:58:51AM -0700, Minchan Kim wrote:
> > On Mon, May 09, 2022 at 02:07:59PM +0100, Mel Gorman wrote:
> > > Changelog since v1
> > > o Fix unsafe RT locking scheme
> > > o Use spin_trylock on UP PREEMPT_RT
> > 
> > Mel,
> > 
> > 
> > Is this only change from previous last version which has some
> > delta you fixed based on Vlastimil and me?
> > 
> 
> Full diff is below although it can also be generated by
> comparing the mm-pcpdrain-v1r8..mm-pcpdrain-v2r1 branches in
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/

I took the delta when I started testing so the testing result
would be valid. Thanks.

> 
> > And is it still RFC?
> > 
> 
> It's still RFC because it's a different approach to Nicolas' series and
> I want at least his Acked-by before taking the RFC label out and sending
> it to Andrew.
> 
> > Do you have some benchmark data?
> > 
> 
> Yes, but as reclaim is not fundamentally altered the main difference
> in behavious is that work is done inline instead of being deferred to a
> workqueue. That means in some cases, system CPU usage of a task will be
> higher because it's paying the cost directly.

Sure, but the reclaim path is already expensive so I doubt we could
see a sizable difference in the system CPU usage.

What I wanted to see was whether we have a regression due to adding
spin_lock/unlock instructions in the hot path. Since it is squeezed
into a cacheline, I expected the regression would be just marginal.

> 
> The workloads I used just hit reclaim directly to make sure it's
> functionally not broken. There is no change in page aging decisions,
> only timing of drains. I didn't check interference of a heavy workload
> interfering with a CPU-bound workload running on NOHZ CPUs as I assumed
> both you and Nicolas had a test case ready to use.

My workload is not about NOHZ CPUs; it runs apps under heavy memory
pressure so they go into direct reclaim and get stuck on drain_all_pages
until the work item on the workqueue runs.

unit: nanosecond
max(dur)        avg(dur)                count(dur)
166713013       487511.77786438033      1283

From traces, the system encountered drain_all_pages 1283 times; the
worst case was 166ms and the average was 487us.

The other problem was alloc_contig_range in CMA. The PCP draining
sometimes takes several hundred milliseconds even though there is no
memory pressure and only a few pages need to be migrated out, because
the CPUs are fully booked.

Your patch perfectly removed that wasted time.

> 
> The main one I paid interest to was a fault latency benchmark in
> the presense of heavy reclaim called stutterp. It simulates a simple
> workload. One part uses a lot of anonymous memory, a second measures mmap
> latency and a third copies a large file.  The primary metric is checking
> for mmap latency. It was originally put together to debug interactivity
> issues on a desktop in the presense of heavy IO where the desktop
> applications were being pushed to backing storage.
> 
> stutterp
>                               5.18.0-rc1             5.18.0-rc1
>                                  vanilla       mm-pcpdrain-v2r1
> 1st-qrtle mmap-4      15.9557 (   0.00%)     15.4045 (   3.45%)
> 1st-qrtle mmap-6      10.8025 (   0.00%)     11.1204 (  -2.94%)
> 1st-qrtle mmap-8      16.9338 (   0.00%)     17.0595 (  -0.74%)
> 1st-qrtle mmap-12     41.4746 (   0.00%)      9.4003 (  77.33%)
> 1st-qrtle mmap-18     47.7123 (   0.00%)    100.0275 (-109.65%)
> 1st-qrtle mmap-24     17.7098 (   0.00%)     16.9633 (   4.22%)
> 1st-qrtle mmap-30     69.2565 (   0.00%)     38.2205 (  44.81%)
> 1st-qrtle mmap-32     49.1295 (   0.00%)     46.8819 (   4.57%)
> 3rd-qrtle mmap-4      18.4706 (   0.00%)     17.4799 (   5.36%)
> 3rd-qrtle mmap-6      11.4526 (   0.00%)     11.5567 (  -0.91%)
> 3rd-qrtle mmap-8      19.5903 (   0.00%)     19.0046 (   2.99%)
> 3rd-qrtle mmap-12     50.3095 (   0.00%)     25.3254 (  49.66%)
> 3rd-qrtle mmap-18     67.3319 (   0.00%)    147.6404 (-119.27%)
> 3rd-qrtle mmap-24     41.3779 (   0.00%)     84.4035 (-103.98%)
> 3rd-qrtle mmap-30    127.1375 (   0.00%)    148.8884 ( -17.11%)
> 3rd-qrtle mmap-32     79.7423 (   0.00%)    182.3042 (-128.62%)
> Max-99    mmap-4      46.9123 (   0.00%)     49.7731 (  -6.10%)
> Max-99    mmap-6      42.5414 (   0.00%)     16.6173 (  60.94%)
> Max-99    mmap-8      43.1237 (   0.00%)     23.3478 (  45.86%)
> Max-99    mmap-12     62.8025 (   0.00%)   1947.3862 (-3000.81%)
> Max-99    mmap-18  27936.8695 (   0.00%)    232.7122 (  99.17%)
> Max-99    mmap-24 204543.9436 (   0.00%)   5805.2478 (  97.16%)
> Max-99    mmap-30   2350.0289 (   0.00%)  10300.6344 (-338.32%)
> Max-99    mmap-32  56164.2271 (   0.00%)   7789.7526 (  86.13%)
> Max       mmap-4     840.3468 (   0.00%)   1137.4462 ( -35.35%)
> Max       mmap-6  255233.3996 (   0.00%)  91304.0952 (  64.23%)
> Max       mmap-8  210910.6497 (   0.00%) 117931.0796 (  44.08%)
> Max       mmap-12 108268.9537 (   0.00%) 319971.6910 (-195.53%)
> Max       mmap-18 608805.3195 (   0.00%) 197483.2205 (  67.56%)
> Max       mmap-24 327697.5605 (   0.00%) 382842.5356 ( -16.83%)
> Max       mmap-30 688684.5335 (   0.00%) 669992.7705 (   2.71%)
> Max       mmap-32 396842.0114 (   0.00%) 415978.2539 (  -4.82%)
> 
>                   5.18.0-rc1  5.18.0-rc1
>                      vanillamm-pcpdrain-v2r1
> Duration User        1438.08     1637.21
> Duration System     12267.41    10307.96
> Duration Elapsed     3929.70     3443.53
> 
> 
> It's a mixed bag but this workload is always a mixed bag and it's stressing
> reclaim.  At some points, latencies are worse, in others better. Overall,
> it completed faster and this was on a 1-socket machine.
> 
> On a 2-socket machine, the overall completions times were worse
> 
>                   5.18.0-rc1  5.18.0-rc1
>                      vanillamm-pcpdrain-v2r1
> Duration User        3713.75     2899.90
> Duration System    303507.56   378909.94
> Duration Elapsed    15444.59    19067.40
> 
> In general this type of workload is variable given the nature of what it
> does and can give different results on each execution. When originally
> designed, it was to deal with stalls lasting several seconds to reduce
> them to the sub-millisecond range.
> 
> The intent of the series is switching out-of-line work to in-line so
> what it should be measuring is interference effects and not straight-line
> performance and I haven't written a specific test case yet. When writing
> the series initially, it was to investigate if the PCP could be lockless
> and failing that, if disabling IRQs could be avoided in the common case.
> It just turned out that part of that made remote draining possible and
> I focused closer on that because it's more important.
> 
> > I'd like to give Acked-by/Tested-by(even though there are a few
> > more places to align with new fields name in 1/6)
> 
> Which ones are of concern?
> 
> Some of the page->lru references I left alone in the init paths simply
> because in those contexts, the page wasn't on a buddy or PCP list. In
> free_unref_page_list the page is not on the LRU, it's just been isolated
> from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> and is just a list placeholder so I left it alone. In
> free_tail_pages_check the context was a page that was likely previously
> on a LRU.

Just nits: all are list macros.

free_pcppages_bulk's list_last_entry should be pcp_list.

mark_free_pages's list_for_each_entry should be buddy_list.

__rmqueue_pcplist's list_first_entry should be pcp_list.

> 
> > since I have
> > tested these patchset in my workload and didn't spot any other
> > problems.
> > 
> 
> Can you describe this workload, is it available anywhere and does it
> require Android to execute?

I wrote it down above. It runs on Android, but I don't think it's an
Android-specific issue; anyone could see such a long latency from PCP
draining once one of the cores is monopolized by higher-priority
processes or too many pending kworks.

> 
> If you have positive results, it would be appreciated if you could post
> them or just note in a Tested-by/Acked-by that it had a measurable impact
> on the reclaim/cma path.

Sure.

All patches in this series.

Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks.
Mel Gorman May 11, 2022, 12:47 p.m. UTC | #4
On Tue, May 10, 2022 at 11:13:05AM -0700, Minchan Kim wrote:
> > Yes, but as reclaim is not fundamentally altered the main difference
> > in behavious is that work is done inline instead of being deferred to a
> > workqueue. That means in some cases, system CPU usage of a task will be
> > higher because it's paying the cost directly.
> 
> Sure but the reclaim path is already expensive so I doubt we could
> see the sizable measurement on the system CPU usage.
> 

It would be difficult to distinguish from the noise.

> What I wanted to see was whether we have regression due to adding
> spin_lock/unlock instructions in hot path. Due to squeeze it to
> a cacheline, I expected the regression would be just marginal.
> 

Ah, yes, I did test for this. page-fault-test hits the relevant paths
very heavily and did show minor differences.

                                     5.18.0-rc1               5.18.0-rc1
                                        vanilla         mm-pcpdrain-v2r1
Hmean     faults/sec-1   886331.5718 (   0.00%)   885462.7479 (  -0.10%)
Hmean     faults/sec-3  2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
Hmean     faults/sec-5  2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
Hmean     faults/sec-7  3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
Hmean     faults/sec-8  3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
Stddev    faults/sec-1     2302.9105 (   0.00%)     2065.0845 (  10.33%)
Stddev    faults/sec-3     7275.2442 (   0.00%)     6033.2620 (  17.07%)
Stddev    faults/sec-5    24726.0328 (   0.00%)    12525.1026 (  49.34%)
Stddev    faults/sec-7     9974.2542 (   0.00%)     9543.9627 (   4.31%)
Stddev    faults/sec-8     9468.0191 (   0.00%)     7958.2607 (  15.95%)
CoeffVar  faults/sec-1        0.2598 (   0.00%)        0.2332 (  10.24%)
CoeffVar  faults/sec-3        0.3112 (   0.00%)        0.2587 (  16.87%)
CoeffVar  faults/sec-5        0.8670 (   0.00%)        0.4404 (  49.21%)
CoeffVar  faults/sec-7        0.2815 (   0.00%)        0.2714 (   3.60%)
CoeffVar  faults/sec-8        0.2399 (   0.00%)        0.2032 (  15.28%)

There is a small hit in the number of faults per second but it's within
the noise and the results are more stable with the series so I'd mark it
down as a small but potentially measurable impact.

> > 
> > The workloads I used just hit reclaim directly to make sure it's
> > functionally not broken. There is no change in page aging decisions,
> > only timing of drains. I didn't check interference of a heavy workload
> > interfering with a CPU-bound workload running on NOHZ CPUs as I assumed
> > both you and Nicolas had a test case ready to use.
> 
> The my workload is not NOHZ CPUs but run apps under heavy memory
> pressure so they goes to direct reclaim and be stuck on drain_all_pages
> until work on workqueue run.
> 
> unit: nanosecond
> max(dur)        avg(dur)                count(dur)
> 166713013       487511.77786438033      1283
> 
> From traces, system encountered the drain_all_pages 1283 times and
> worst case was 166ms and avg was 487us.
> 
> The other problem was alloc_contig_range in CMA. The PCP draining
> takes several hundred millisecond sometimes though there is no
> memory pressure or a few of pages to be migrated out but CPU were
> fully booked.
> 
> Your patch perfectly removed those wasted time.
> 

Those stalls are painful and it's a direct impact where a workload does
not make progress. The NOHZ stall is different in that it's worried
about interference. Both problems should have the same solution.

Do you mind if I quote these paragraphs in the leader to v3?

> > Which ones are of concern?
> > 
> > Some of the page->lru references I left alone in the init paths simply
> > because in those contexts, the page wasn't on a buddy or PCP list. In
> > free_unref_page_list the page is not on the LRU, it's just been isolated
> > from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> > and is just a list placeholder so I left it alone. In
> > free_tail_pages_check the context was a page that was likely previously
> > on a LRU.
> 
> Just nits: all are list macros.
> 
> free_pcppages_bulk's list_last_entry should be pcp_list.
> 
> mark_free_pages's list_for_each_entry should be buddy_list
> 
> __rmqueue_pcplist's list_first_enty should be pcp_list.
> 

Ah, you're completely correct.

> > 
> > > since I have
> > > tested these patchset in my workload and didn't spot any other
> > > problems.
> > > 
> > 
> > Can you describe this workload, is it available anywhere and does it
> > require Android to execute?
> 
> I wrote down above. It runs on Android but I don't think it's
> android specific issue but anyone could see such a long latency
> from PCP draining once one of cores are monopolized by higher
> priority processes or too many pending kworks.
> 

Yeah, I agree it's not an Android-specific problem. It could be detected by
tracing the time spent in drain_all_pages for any arbitrary workload. The
BCC funclatency tool could measure it.

> > 
> > If you have positive results, it would be appreciated if you could post
> > them or just note in a Tested-by/Acked-by that it had a measurable impact
> > on the reclaim/cma path.
> 
> Sure.
> 
> All patches in this series.
> 
> Tested-by: Minchan Kim <minchan@kernel.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> 

Thanks, I've added that to all the patches. I'll wait another day for
more feedback before sending out a v3. The following is the diff between
v2 and v3 based on your feedback.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ac39d30ec8f..0f5a6a5b0302 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1497,7 +1497,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		do {
 			int mt;
 
-			page = list_last_entry(list, struct page, lru);
+			page = list_last_entry(list, struct page, pcp_list);
 			mt = get_pcppage_migratetype(page);
 
 			/* must delete to avoid corrupting pcp list */
@@ -3276,7 +3276,7 @@ void mark_free_pages(struct zone *zone)
 
 	for_each_migratetype_order(order, t) {
 		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], lru) {
+				&zone->free_area[order].free_list[t], buddy_list) {
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
@@ -3761,7 +3761,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			}
 		}
 
-		page = list_first_entry(list, struct page, lru);
+		page = list_first_entry(list, struct page, pcp_list);
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));
Minchan Kim May 11, 2022, 5:20 p.m. UTC | #5
On Wed, May 11, 2022 at 01:47:00PM +0100, Mel Gorman wrote:
> On Tue, May 10, 2022 at 11:13:05AM -0700, Minchan Kim wrote:
> > > Yes, but as reclaim is not fundamentally altered the main difference
> > > in behavious is that work is done inline instead of being deferred to a
> > > workqueue. That means in some cases, system CPU usage of a task will be
> > > higher because it's paying the cost directly.
> > 
> > Sure but the reclaim path is already expensive so I doubt we could
> > see the sizable measurement on the system CPU usage.
> > 
> 
> It would be difficult to distinguish from the noise.
> 
> > What I wanted to see was whether we have regression due to adding
> > spin_lock/unlock instructions in hot path. Due to squeeze it to
> > a cacheline, I expected the regression would be just marginal.
> > 
> 
> Ah, yes, I did test for this. page-fault-test hits the relevant paths
> very heavily and did show minor differences.
> 
>                                      5.18.0-rc1               5.18.0-rc1
>                                         vanilla         mm-pcpdrain-v2r1
> Hmean     faults/sec-1   886331.5718 (   0.00%)   885462.7479 (  -0.10%)
> Hmean     faults/sec-3  2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
> Hmean     faults/sec-5  2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
> Hmean     faults/sec-7  3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
> Hmean     faults/sec-8  3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
> Stddev    faults/sec-1     2302.9105 (   0.00%)     2065.0845 (  10.33%)
> Stddev    faults/sec-3     7275.2442 (   0.00%)     6033.2620 (  17.07%)
> Stddev    faults/sec-5    24726.0328 (   0.00%)    12525.1026 (  49.34%)
> Stddev    faults/sec-7     9974.2542 (   0.00%)     9543.9627 (   4.31%)
> Stddev    faults/sec-8     9468.0191 (   0.00%)     7958.2607 (  15.95%)
> CoeffVar  faults/sec-1        0.2598 (   0.00%)        0.2332 (  10.24%)
> CoeffVar  faults/sec-3        0.3112 (   0.00%)        0.2587 (  16.87%)
> CoeffVar  faults/sec-5        0.8670 (   0.00%)        0.4404 (  49.21%)
> CoeffVar  faults/sec-7        0.2815 (   0.00%)        0.2714 (   3.60%)
> CoeffVar  faults/sec-8        0.2399 (   0.00%)        0.2032 (  15.28%)
> 
> There is a small hit in the number of faults per second but it's within
> the noise and the results are more stable with the series so I'd mark it
> down as a small but potentially measurable impact.

Thanks for sharing. It would be great to have it in the description, too.

> 
> > > 
> > > The workloads I used just hit reclaim directly to make sure it's
> > > functionally not broken. There is no change in page aging decisions,
> > > only timing of drains. I didn't check interference of a heavy workload
> > > interfering with a CPU-bound workload running on NOHZ CPUs as I assumed
> > > both you and Nicolas had a test case ready to use.
> > 
> > The my workload is not NOHZ CPUs but run apps under heavy memory
> > pressure so they goes to direct reclaim and be stuck on drain_all_pages
> > until work on workqueue run.
> > 
> > unit: nanosecond
> > max(dur)        avg(dur)                count(dur)
> > 166713013       487511.77786438033      1283
> > 
> > From traces, system encountered the drain_all_pages 1283 times and
> > worst case was 166ms and avg was 487us.
> > 
> > The other problem was alloc_contig_range in CMA. The PCP draining
> > takes several hundred millisecond sometimes though there is no
> > memory pressure or a few of pages to be migrated out but CPU were
> > fully booked.
> > 
> > Your patch perfectly removed those wasted time.
> > 
> 
> Those stalls are painful and it's a direct impact where a workload does
> not make progress. The NOHZ stall is different in that it's worried
> about interference. Both problems should have the same solution.
> 
> Do you mind if I quote these paragraphs in the leader to v3?

Please have it in the description.

> 
> > > Which ones are of concern?
> > > 
> > > Some of the page->lru references I left alone in the init paths simply
> > > because in those contexts, the page wasn't on a buddy or PCP list. In
> > > free_unref_page_list the page is not on the LRU, it's just been isolated
> > > from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> > > and is just a list placeholder so I left it alone. In
> > > free_tail_pages_check the context was a page that was likely previously
> > > on a LRU.
> > 
> > Just nits: all are list macros.
> > 
> > free_pcppages_bulk's list_last_entry should be pcp_list.
> > 
> > mark_free_pages's list_for_each_entry should be buddy_list
> > 
> > __rmqueue_pcplist's list_first_enty should be pcp_list.
> > 
> 
> Ah, you're completely correct.
> 
> > > 
> > > > since I have
> > > > tested these patchset in my workload and didn't spot any other
> > > > problems.
> > > > 
> > > 
> > > Can you describe this workload, is it available anywhere and does it
> > > require Android to execute?
> > 
> > I wrote down above. It runs on Android but I don't think it's
> > android specific issue but anyone could see such a long latency
> > from PCP draining once one of cores are monopolized by higher
> > priority processes or too many pending kworks.
> > 
> 
> Yeah, I agree it's not an Android-specific problem. It could be detected by
> tracing the time spent in drain_all_pages for any arbitrary workload. The
> BCC funclatency tool could measure it.
> 
> > > 
> > > If you have positive results, it would be appreciated if you could post
> > > them or just note in a Tested-by/Acked-by that it had a measurable impact
> > > on the reclaim/cma path.
> > 
> > Sure.
> > 
> > All patches in this series.
> > 
> > Tested-by: Minchan Kim <minchan@kernel.org>
> > Acked-by: Minchan Kim <minchan@kernel.org>
> > 
> 
> Thanks, I've added that to all the patches. I'll wait another day for
> more feedback before sending out a v3. The following is the diff between
> v2 and v3 based on your feedback.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4ac39d30ec8f..0f5a6a5b0302 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1497,7 +1497,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		do {
>  			int mt;
>  
> -			page = list_last_entry(list, struct page, lru);
> +			page = list_last_entry(list, struct page, pcp_list);
>  			mt = get_pcppage_migratetype(page);
>  
>  			/* must delete to avoid corrupting pcp list */
> @@ -3276,7 +3276,7 @@ void mark_free_pages(struct zone *zone)
>  
>  	for_each_migratetype_order(order, t) {
>  		list_for_each_entry(page,
> -				&zone->free_area[order].free_list[t], lru) {
> +				&zone->free_area[order].free_list[t], buddy_list) {
>  			unsigned long i;
>  
>  			pfn = page_to_pfn(page);
> @@ -3761,7 +3761,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  			}
>  		}
>  
> -		page = list_first_entry(list, struct page, lru);
> +		page = list_first_entry(list, struct page, pcp_list);
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
>  	} while (check_new_pcp(page, order));

Looks good to me. Please keep my Tested-by/Acked-by.

Thanks, Mel.