[v3] mm: Proactive compaction

For some applications we need to allocate almost all memory as
hugepages. However, on a running system, higher order allocations can
fail if the memory is fragmented. Linux kernel currently does on-demand
compaction as we request more hugepages but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) shows that kernel is able
to restore a highly fragmented memory state to a fairly compacted memory
state within <1 sec for a 32G system. Such data suggests that a more
proactive compaction can help us allocate a large fraction of memory as
hugepages keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define
a new tunable called 'proactiveness' which dictates bounds for external
fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
maintain.

The tunable is exposed through sysfs:
  /sys/kernel/mm/compaction/proactiveness

It takes value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}) but this one reduces them
to just one (proactiveness). Also, the new tunable is an opaque value
instead of asking for specific bounds of "external fragmentation" which
would have been difficult to estimate. The internal interpretation of
this opaque value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"proactive compaction score" thresholds (low=100-proactiveness,
high=low+10%). The score for a node is defined as weighed mean of per-zone
external fragmentation wrt HUGETLB_PAGE_ORDER order. A zone's present_pages
determines its weight. Proactive compaction is triggered when a node's
score exceeds its high threshold value and continues till it reaches its
the low value.

To periodically check per-node score, we reuse per-node kcompactd
threads which are woken up every 500 milliseconds to check the same. If
a node's score exceeds its high threshold (as derived from user provided
proactiveness value), proactive compaction is started till its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko posted here:
https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/

Performance data
================

System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness

  percentile latency
  –––––––––– –––––––
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
762G total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness

  percentile latency
  –––––––––– –––––––
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
762G total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request
the heap to be allocated with THP hugepages. We also set THP to madvise
to allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15 as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
benchmarks. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java benchmark, proactiveness=20. The test starts with a
node's score of 80 or higher, depending on the delay between the
fragmentation step and actually starting the benchmark, which gives
more-or-less time for the initial round of compaction. As the benchmark
consumes hugepages, node's score quickly rises above the high threshold
(90) and proactive compaction starts again which (hopefully) brings down
extfrag levels to the low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produces a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
To: Mel Gorman <mgorman@techsingularity.net>
To: Michal Hocko <mhocko@suse.com>
To: Vlastimil Babka <vbabka@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mike Kravetz <mike.kravetz@oracle.com>
CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
CC: David Rientjes <rientjes@google.com>
CC: Nitin Gupta <nitin@nitingupta.dev>
CC: linux-kernel <linux-kernel@vger.kernel.org>
CC: linux-mm <linux-mm@kvack.org>
CC: Linux API <linux-api@vger.kernel.org>

---
Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also upadated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 include/linux/compaction.h |   1 +
 mm/compaction.c            | 195 +++++++++++++++++++++++++++++++++++--
 mm/internal.h              |   1 +
 mm/page_alloc.c            |   1 +
 mm/vmstat.c                |  12 +++
 5 files changed, 204 insertions(+), 6 deletions(-)

Message ID	20200310222539.1981-1-nigupta@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=qj00=43=kvack.org=owner-linux-mm@kernel.org> Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B4B4A138D for <patchwork-linux-mm@patchwork.kernel.org>; Tue, 10 Mar 2020 22:25:43 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 5875B24655 for <patchwork-linux-mm@patchwork.kernel.org>; Tue, 10 Mar 2020 22:25:43 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=nvidia.com header.i=@nvidia.com header.b="AjwpdAAw" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5875B24655 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=nvidia.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 858686B0003; Tue, 10 Mar 2020 18:25:42 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 82DD86B0006; Tue, 10 Mar 2020 18:25:42 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6CFC06B0007; Tue, 10 Mar 2020 18:25:42 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0254.hostedemail.com [216.40.44.254]) by kanga.kvack.org (Postfix) with ESMTP id 4C9FF6B0003 for <linux-mm@kvack.org>; Tue, 10 Mar 2020 18:25:42 -0400 (EDT) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 25153824999B for <linux-mm@kvack.org>; Tue, 10 Mar 2020 22:25:42 +0000 (UTC) X-FDA: 76580885724.08.bread07_148effc41010e X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,nigupta@nvidia.com,,RULES_HIT:30029:30034:30045:30051:30054:30064:30069:30070:30075:30090,0,RBL:216.228.121.64:@nvidia.com:.lbl8.mailshell.net-64.10.201.10 62.18.0.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: bread07_148effc41010e X-Filterd-Recvd-Size: 22096 Received: from hqnvemgate25.nvidia.com (hqnvemgate25.nvidia.com [216.228.121.64]) by imf16.hostedemail.com (Postfix) with ESMTP for <linux-mm@kvack.org>; Tue, 10 Mar 2020 22:25:41 +0000 (UTC) Received: from hqpgpgate101.nvidia.com (Not Verified[216.228.121.13]) by hqnvemgate25.nvidia.com (using TLS: TLSv1.2, DES-CBC3-SHA) id <B5e6813b60000>; Tue, 10 Mar 2020 15:24:54 -0700 Received: from hqmail.nvidia.com ([172.20.161.6]) by hqpgpgate101.nvidia.com (PGP Universal service); Tue, 10 Mar 2020 15:25:39 -0700 X-PGP-Universal: processed; by hqpgpgate101.nvidia.com on Tue, 10 Mar 2020 15:25:39 -0700 Received: from HQMAIL111.nvidia.com (172.20.187.18) by HQMAIL107.nvidia.com (172.20.187.13) with Microsoft SMTP Server (TLS) id 15.0.1473.3; Tue, 10 Mar 2020 22:25:39 +0000 Received: from hqnvemgw03.nvidia.com (10.124.88.68) by HQMAIL111.nvidia.com (172.20.187.18) with Microsoft SMTP Server (TLS) id 15.0.1473.3 via Frontend Transport; Tue, 10 Mar 2020 22:25:39 +0000 Received: from ng-desktop.nvidia.com (Not Verified[10.110.48.88]) by hqnvemgw03.nvidia.com with Trustwave SEG (v7,5,8,10121) id <B5e6813e20001>; Tue, 10 Mar 2020 15:25:38 -0700 From: Nitin Gupta <nigupta@nvidia.com> To: Mel Gorman <mgorman@techsingularity.net>, Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz> CC: Matthew Wilcox <willy@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, Mike Kravetz <mike.kravetz@oracle.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, David Rientjes <rientjes@google.com>, Nitin Gupta <nitin@nitingupta.dev>, linux-kernel <linux-kernel@vger.kernel.org>, linux-mm <linux-mm@kvack.org>, Linux API <linux-api@vger.kernel.org> Subject: [PATCH v3] mm: Proactive compaction Date: Tue, 10 Mar 2020 15:25:39 -0700 Message-ID: <20200310222539.1981-1-nigupta@nvidia.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1583879094; bh=yk0sjnOBXvL3+E9y22Jyb4JDzGxqG60PIJUBHAl3V0c=; h=X-PGP-Universal:From:To:CC:Subject:Date:Message-ID:X-Mailer: MIME-Version:Content-Type:Content-Transfer-Encoding; b=AjwpdAAwLWIXD64vXhdeajJSaqvyGQLUCdskb+cSefXWqd8Q2en0iqPHX+aKjSnk6 IoPm4QmIRHwYaIGn9Wa2dfCRLCRtNU2iejwackOJvwmGP6lzSCcKfPNmzvpSef2yre 9FRWuyXsuD+7C54dOF28TifQT0dZIC5JPhZg3FvTY8TJ3m6BV6kVyvXED9sMMHRbrA kKQIvuIsNbTjujX6/z6OKiBZeFm0FXK2kh/v3594i/to+7G3fkgmuV6RbdpU1+dTdN H/bPQw3a8Wi6Kn9pVJJEVuaTh5WMeHzBkOznJjEJPB2/f7mlrIMJoqG6E5Kg7cHeuM 0MCKoHybwzYdA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: <linux-mm.kvack.org>
Series	[v3] mm: Proactive compaction \| expand [v3] mm: Proactive compaction

[v3] mm: Proactive compaction

Commit Message

Comments

Patch