[014/165] mm: proactive compaction

From: Nitin Gupta <nigupta@nvidia.com>
Subject: mm: proactive compaction

For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  The Linux kernel currently does on-demand compaction
as we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that the kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages while keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10).
The score for a node is defined as the weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.
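
As an illustration only, the translation and the per-node score described
above can be sketched in plain C.  This is not the code from the patch:
struct zone_sample and the three function names are hypothetical, the high
threshold is taken as ten score points above the low one (consistent with
the defaults low=80, high=90), and clamping it at 100 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone; not the kernel's struct zone. */
struct zone_sample {
	unsigned long present_pages;	/* weight of this zone */
	unsigned int extfrag;		/* external fragmentation, 0..100 */
};

/* low = 100 - proactiveness, for proactiveness in [0, 100] */
static unsigned int score_low(unsigned int proactiveness)
{
	assert(proactiveness <= 100);
	return 100 - proactiveness;
}

/* high = low + 10; the clamp to 100 is an assumption of this sketch */
static unsigned int score_high(unsigned int proactiveness)
{
	unsigned int high = score_low(proactiveness) + 10;

	return high > 100 ? 100 : high;
}

/* Node score: mean of per-zone extfrag, weighted by present_pages */
static unsigned int node_score(const struct zone_sample *zones, size_t n)
{
	unsigned long long weighted = 0, total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		weighted += (unsigned long long)zones[i].present_pages *
			    zones[i].extfrag;
		total += zones[i].present_pages;
	}
	return total ? (unsigned int)(weighted / total) : 0;
}
```

With proactiveness=20 this gives low=80 and high=90.  A node with one zone
of 3000 pages at extfrag 80 and one zone of 1000 pages at extfrag 40 gets a
score of 70.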

To periodically check the per-node score, we reuse per-node kcompactd
threads, which are woken up every 500 milliseconds to perform this check.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
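
The start/stop rule above is a hysteresis between the two thresholds.  A
minimal model of it, assuming one call per 500 ms wakeup, might look like
the following.  struct proactive_state and proactive_tick() are hypothetical
names; in the patch, compaction runs inside kcompactd until the low
threshold is reached rather than being stepped per tick.

```c
#include <stdbool.h>

/* Hypothetical per-node state for this sketch */
struct proactive_state {
	bool compacting;
};

/*
 * Model of one periodic check.  Compaction starts only when the score
 * exceeds the high threshold, and once started it continues until the
 * score has come down to the low threshold.
 * Returns true if compaction is active after this check.
 */
static bool proactive_tick(struct proactive_state *st, unsigned int score,
			   unsigned int low, unsigned int high)
{
	if (!st->compacting && score > high)
		st->compacting = true;
	else if (st->compacting && score <= low)
		st->compacting = false;
	return st->compacting;
}
```

With low=80 and high=90, a score of 85 does not start compaction, but it
also does not stop a round that began when the score was above 90.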

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
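
The fragmentation program itself is not part of this patch.  A rough sketch
of the idea, on a small region instead of all memory, could look like this.
fragment_region() and SECTION_SIZE are hypothetical names, and freeing the
first 3/4 of each section (rather than some other 3/4 of its base pages) is
an assumption.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SECTION_SIZE (2UL << 20)	/* 2M */

/*
 * Map nr_sections 2M-aligned sections of anonymous memory, touch them so
 * that pages are really allocated, then unmap the first 3/4 of each one.
 * Returns the number of sections fragmented, or -1 on error.
 */
static long fragment_region(size_t nr_sections, void **out_base)
{
	/* One extra section so the start can be rounded up to 2M */
	size_t len = (nr_sections + 1) * SECTION_SIZE;
	char *raw, *base;
	long done = 0;
	size_t i;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	base = (char *)(((uintptr_t)raw + SECTION_SIZE - 1) &
			~(uintptr_t)(SECTION_SIZE - 1));

	for (i = 0; i < nr_sections; i++) {
		char *sec = base + i * SECTION_SIZE;

		memset(sec, 1, SECTION_SIZE);	/* fault in every page */
		if (munmap(sec, SECTION_SIZE / 4 * 3) != 0)
			return -1;
		done++;
	}
	*out_base = base;
	return done;
}
```

The last quarter of every section stays mapped, so no 2M-aligned range in
the region is entirely free once the call returns.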

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. Java heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15 as proactiveness is increased further.

Note that proactive compaction happens throughout the runtime of these
workloads.  The case where a one-time compaction is sufficient to supply
hugepages for the allocation stream that follows can probably happen for
more extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As the benchmark consumes
hugepages, the node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  A kcompactd thread consumes 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

The above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing the first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).
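
The "~30 seconds" figure can be reproduced with a small model of the
deferral.  This is a sketch under stated assumptions, not the patch's code:
CHECK_INTERVAL_MSEC mirrors the 500 ms wakeup period mentioned above,
MAX_DEFER_SHIFT=6 is assumed as the maximum defer shift, and struct backoff
and its helpers are hypothetical names.

```c
#include <stdbool.h>

#define CHECK_INTERVAL_MSEC	500	/* wakeup period from the text */
#define MAX_DEFER_SHIFT		6	/* assumption for this sketch */

/* Hypothetical per-node backoff state */
struct backoff {
	unsigned int defer;	/* wakeups still to skip */
};

/* Returns true if proactive compaction should be tried on this wakeup */
static bool backoff_should_try(struct backoff *b)
{
	if (b->defer) {
		b->defer--;
		return false;
	}
	return true;
}

/* If a round did not lower the score, skip the maximum number of wakeups */
static void backoff_record(struct backoff *b, unsigned int prev_score,
			   unsigned int new_score)
{
	if (new_score >= prev_score)
		b->defer = 1U << MAX_DEFER_SHIFT;
}

/* Wait between retries while backing off: 64 * 500 ms = 32000 ms */
static unsigned int backoff_wait_msec(void)
{
	return (1U << MAX_DEFER_SHIFT) * CHECK_INTERVAL_MSEC;
}
```

Under these assumptions a round that makes no progress causes the next 64
wakeups to be skipped, which is 32 seconds at one wakeup per 500 ms.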

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/sysctl/vm.rst |   15 +
 include/linux/compaction.h              |    2 
 kernel/sysctl.c                         |    9 +
 mm/compaction.c                         |  183 +++++++++++++++++++++-
 mm/internal.h                           |    1 
 mm/vmstat.c                             |   18 ++
 6 files changed, 223 insertions(+), 5 deletions(-)

Message ID	20200812013100.8bvFBw5C4%akpm@linux-foundation.org (mailing list archive)
State	New, archived
