[00/25] Increase success rates and reduce latency of compaction v2

Message ID: 20190104125011.16071-1-mgorman@techsingularity.net (mailing list archive)

Message

Mel Gorman Jan. 4, 2019, 12:49 p.m. UTC
This series reduces scan rates and increases success rates of compaction,
primarily by using the free lists to shorten scans, better controlling skip
information and whether multiple scanners can target the same block, and
capturing pageblocks before they are stolen by parallel requests. The series
is based on the 4.21/5.0 merge window after Andrew's tree had been merged.
It's known to rebase cleanly.

Primarily I'm using thpscale to measure the impact of the series. The
benchmark creates a large file, maps it, faults it, punches holes in the
mapping so that the virtual address space is fragmented and then tries
to allocate THP. It re-executes for different numbers of threads. From a
fragmentation perspective, the workload is relatively benign but it does
stress compaction.
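
For illustration, a minimal userspace sketch of that sequence is below. It
is not the actual mmtests thpscale implementation; the file name, sizes and
hole pattern are placeholders chosen only to show the map/fault/punch-hole/
allocate-THP steps described above.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE  (1UL << 30)               /* 1G backing file, size is arbitrary */
#define HPAGE (2UL << 20)               /* assume 2M huge pages */

int main(void)
{
        int fd = open("thpscale.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
        char *filemap, *anon;
        unsigned long off;

        if (fd < 0 || ftruncate(fd, SIZE))
                return 1;

        /* Map and fault the file so its pages are resident. */
        filemap = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (filemap == MAP_FAILED)
                return 1;
        memset(filemap, 1, SIZE);

        /* Punch holes in alternating huge-page-sized regions to fragment memory. */
        for (off = 0; off < SIZE; off += 2 * HPAGE)
                fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          off, HPAGE);

        /* Fault an anonymous region with THP requested; these faults are timed. */
        anon = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (anon == MAP_FAILED)
                return 1;
        madvise(anon, SIZE, MADV_HUGEPAGE);
        memset(anon, 1, SIZE);

        munmap(anon, SIZE);
        munmap(filemap, SIZE);
        close(fd);
        unlink("thpscale.dat");
        return 0;
}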

The overall impact on latencies for a 1-socket machine is

				      baseline		      patches
Amean     fault-both-3      5362.80 (   0.00%)     4446.89 *  17.08%*
Amean     fault-both-5      9488.75 (   0.00%)     5660.86 *  40.34%*
Amean     fault-both-7     11909.86 (   0.00%)     8549.63 *  28.21%*
Amean     fault-both-12    16185.09 (   0.00%)    11508.36 *  28.90%*
Amean     fault-both-18    12057.72 (   0.00%)    19013.48 * -57.69%*
Amean     fault-both-24    23939.95 (   0.00%)    19676.16 *  17.81%*
Amean     fault-both-30    26606.14 (   0.00%)    27363.23 (  -2.85%)
Amean     fault-both-32    31677.12 (   0.00%)    23154.09 *  26.91%*

While there is a glitch at the 18-thread mark, it's known that base page
allocation latency was much lower there and huge page allocations were
taking longer -- partially due to the high allocation success rate.

The allocation success rates are much improved

			 	 baseline		 patches
Percentage huge-3        70.93 (   0.00%)       98.30 (  38.60%)
Percentage huge-5        56.02 (   0.00%)       83.36 (  48.81%)
Percentage huge-7        60.98 (   0.00%)       89.04 (  46.01%)
Percentage huge-12       73.02 (   0.00%)       94.36 (  29.23%)
Percentage huge-18       94.37 (   0.00%)       95.87 (   1.58%)
Percentage huge-24       84.95 (   0.00%)       97.41 (  14.67%)
Percentage huge-30       83.63 (   0.00%)       96.69 (  15.62%)
Percentage huge-32       81.69 (   0.00%)       96.10 (  17.65%)

With the patches applied, the allocation success rate is close to ideal at
most thread counts.

The biggest impact is on the scan rates

                                 baseline     patches
Compaction migrate scanned      106520811    26934599
Compaction free scanned        4180735040    26584944

The number of pages scanned for migration was reduced by 74% and the
number scanned by the free scanner was reduced by 99.36% -- much less
work in exchange for lower latency and better success rates.

The series was also evaluated using a workload that heavily fragments
memory; the benefits there are also significant, although those results
are not presented here.

It was commented that we should be rethinking scanning entirely and to
a large extent I agree. However, to achieve that you need a lot of this
series in place first, so it's best to make the linear scanners as good
as possible before ripping them out.

 include/linux/compaction.h |    3 +-
 include/linux/gfp.h        |    7 +-
 include/linux/mmzone.h     |    2 +
 include/linux/sched.h      |    4 +
 kernel/sched/core.c        |    3 +
 mm/compaction.c            | 1031 ++++++++++++++++++++++++++++++++++----------
 mm/internal.h              |   23 +-
 mm/migrate.c               |    2 +-
 mm/page_alloc.c            |   70 ++-
 9 files changed, 908 insertions(+), 237 deletions(-)

Comments

Andrew Morton Jan. 7, 2019, 11:43 p.m. UTC | #1
On Fri,  4 Jan 2019 12:49:46 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> This series reduces scan rates and increases success rates of compaction,
> primarily by using the free lists to shorten scans, better controlling skip
> information and whether multiple scanners can target the same block, and
> capturing pageblocks before they are stolen by parallel requests. The series
> is based on the 4.21/5.0 merge window after Andrew's tree had been merged.
> It's known to rebase cleanly.
> 
> ...
>
>  include/linux/compaction.h |    3 +-
>  include/linux/gfp.h        |    7 +-
>  include/linux/mmzone.h     |    2 +
>  include/linux/sched.h      |    4 +
>  kernel/sched/core.c        |    3 +
>  mm/compaction.c            | 1031 ++++++++++++++++++++++++++++++++++----------
>  mm/internal.h              |   23 +-
>  mm/migrate.c               |    2 +-
>  mm/page_alloc.c            |   70 ++-
>  9 files changed, 908 insertions(+), 237 deletions(-)

Boy that's a lot of material.  I just tossed it in there unread for
now.  Do you have any suggestions as to how we can move ahead with
getting this appropriately reviewed and tested?
Mel Gorman Jan. 8, 2019, 9:12 a.m. UTC | #2
On Mon, Jan 07, 2019 at 03:43:54PM -0800, Andrew Morton wrote:
> On Fri,  4 Jan 2019 12:49:46 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This series reduces scan rates and increases success rates of compaction,
> > primarily by using the free lists to shorten scans, better controlling skip
> > information and whether multiple scanners can target the same block, and
> > capturing pageblocks before they are stolen by parallel requests. The series
> > is based on the 4.21/5.0 merge window after Andrew's tree had been merged.
> > It's known to rebase cleanly.
> > 
> > ...
> >
> >  include/linux/compaction.h |    3 +-
> >  include/linux/gfp.h        |    7 +-
> >  include/linux/mmzone.h     |    2 +
> >  include/linux/sched.h      |    4 +
> >  kernel/sched/core.c        |    3 +
> >  mm/compaction.c            | 1031 ++++++++++++++++++++++++++++++++++----------
> >  mm/internal.h              |   23 +-
> >  mm/migrate.c               |    2 +-
> >  mm/page_alloc.c            |   70 ++-
> >  9 files changed, 908 insertions(+), 237 deletions(-)
> 
> Boy that's a lot of material. 

It's unfortunate, I know. It just turned out that a lot had to change to
make the most important patches in the series work without obvious
side-effects.

> I just tossed it in there unread for
> now.  Do you have any suggestions as to how we can move ahead with
> getting this appropriately reviewed and tested?
> 

The main workloads that should see a difference are those that use
MADV_HUGEPAGE or change /sys/kernel/mm/transparent_hugepage/defrag. I'm
expecting MADV_HUGEPAGE is more common in practice. By default, there
should be little change as direct compaction is not used heavily for THP.
SLUB workloads might also see a difference given a long enough uptime,
but it will be relatively difficult to detect.
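
For reference, the per-mapping opt-in is just an madvise() call on the
range; a minimal sketch is below (the mapping size is arbitrary and this
only illustrates the opt-in path, it is not part of the series). The
system-wide alternative is to write a stricter policy such as "always" or
"madvise" to /sys/kernel/mm/transparent_hugepage/defrag.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (64UL << 20)                /* 64M region, size is arbitrary */

int main(void)
{
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /* Explicitly request THP for this range. */
        if (madvise(p, LEN, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* Faulting the range is where any compaction latency shows up. */
        memset(p, 1, LEN);

        munmap(p, LEN);
        return 0;
}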

As this was partially motivated by the __GFP_THISNODE discussion, I
would like to hear from David whether this series makes any impact when
starting Google workloads on a fragmented system.

Similarly, I would be interested in hearing whether Andrea's KVM startup
times see any benefit. I'm expecting less here as that workload is likely
still bound by reclaim thrashing the local node. Still, a confirmation
would be nice, and any benefit is a plus even if the workload gets
reclaimed excessively.

Local tests didn't show anything interesting *other* than what is
already in the changelogs, as those workloads specifically target those
paths. Intel LKP has not reported any regressions (functional or
performance) despite the series being on git.kernel.org for a few weeks.
However, as they are using default configurations, this is not much of
a surprise.

Review is harder. Vlastimil would normally be the best fit as he has
worked on compaction, but he, like anyone else, is probably dealing with
a backlog after the holidays. I know I still have to get to Vlastimil's
recent series on THP allocations, so I'm guilty of the same crime with
respect to review.
Vlastimil Babka Jan. 15, 2019, 11:59 a.m. UTC | #3
On 1/4/19 1:49 PM, Mel Gorman wrote:
> It's non-obvious that high-order free pages are split into order-0 pages
> from the function name. Fix it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>