From patchwork Wed Nov 7 18:38:18 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10672879
From: Mel Gorman
To: Linux-MM
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
 Zi Yan, LKML, Mel Gorman
Subject: [PATCH 1/5] mm, page_alloc: Spread allocations across zones before
 introducing fragmentation
Date: Wed, 7 Nov 2018 18:38:18 +0000
Message-Id: <20181107183822.15567-2-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181107183822.15567-1-mgorman@techsingularity.net>
References: <20181107183822.15567-1-mgorman@techsingularity.net>

The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THPs. It's implemented in mmtests as the following
configurations

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)

2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate is
   not used (fallocate=none). With multiple IO issuers this creates a mix
   of slab and page cache allocations over time. The total size of the
   files is 150% of physical memory so that the slabs and page cache pages
   get mixed

3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it took
   to create the files.
   It'll fault back in old data and further interleave slab and page
   cache allocations. As it's now low on memory due to step 2,
   fragmentation occurs as pageblocks get stolen.

4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with
   fewer than 8 cores. The benchmark records whether huge pages were
   allocated and what the fault latency was in microseconds

5. Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation success
   rate.

6. Cleanup

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag event. If the fallback_order is smaller than a
pageblock order (order-9 on 64-bit x86) then it's considered an event that
may cause external fragmentation issues in the future. Hence, the primary
metric here is the number of external fragmentation events that occur with
order < 9. The secondary metrics are allocation latency and huge page
allocation success rate, but note that differences in latency and in the
success rate can themselves affect the number of external fragmentation
events, which is why they are secondary metrics.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Min       fault-base-1      588.00 (   0.00%)      557.00 (   5.27%)
Min       fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
Amean     fault-base-1      663.58 (   0.00%)      663.65 (  -0.01%)
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are reduced while allocation success rates remain at zero
as this configuration does not make any heavy effort to allocate THP and
fio is heavily active at the time and filling memory. However, a 65%
reduction of serious fragmentation events reduces the chances of external
fragmentation being a problem in the future.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 (1% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-1     1517.06 (   0.00%)     1531.37 (  -0.94%)
Amean     fault-huge-1     1129.50 (   0.00%)     1160.95 (  -2.78%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-1       78.01 (   0.00%)       78.97 (   1.23%)

Nothing dramatic. Fragmentation events were only reduced slightly which is
very different to what was reported in V1. A big difference with V1 is the
relative size of Normal to the DMA32 zone. This machine has double the
memory so the impact of using a small zone to avoid fragmentation events
is much lower.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-5     1324.93 (   0.00%)     1334.99 (  -0.76%)
Amean     fault-huge-5     4681.71 (   0.00%)     2428.43 (  48.13%)

                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-5        1.05 (   0.00%)        1.13 (   7.94%)

The reduction of external fragmentation events is expected. A careful
reader may spot that the reduction is lower than it was on v1. This is due
to the removal of __GFP_THISNODE in commit ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings") as THP allocations can now
spill over to remote nodes instead of fragmenting local memory. This
reduces the impact of the use of a lower zone to avoid fragmentation. It's
also worth noting relative to v1 that the allocation success rate is
slightly higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  167464
4.20-rc1+patch:                     130081 (22% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-5     7721.82 (   0.00%)     6652.67 (  13.85%)
Amean     fault-huge-5     3896.10 (   0.00%)     2486.89 *  36.17%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-5       95.02 (   0.00%)       94.49 (  -0.56%)

In this case, there was both a reduction in the external fragmentation
causing events and the huge page allocation success latency with little
change in the allocation success rates which were already high. A careful
reader will note that V1 had very different outcomes both in terms of the
number of fragmentation events and the allocation success rates.
In this case, it's due to the baseline including the THP __GFP_THISNODE
removal patch.

Overall, the patch significantly reduces the number of external
fragmentation causing events so the success of THP over long periods of
time would be improved for this adverse workload. While there are large
differences compared to how V1 behaved, this is almost entirely accounted
for by ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
mappings").

Signed-off-by: Mel Gorman
---
 mm/internal.h   |  13 +++++---
 mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 291eb2b6d1d8..544355156c92 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,10 +480,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif
 
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
+#define ALLOC_HARDER		 0x10 /* try to alloc harder */
+#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
+#define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
+#ifdef CONFIG_ZONE_DMA32
+#define ALLOC_NOFRAGMENT	0x100 /* avoid mixing pageblock types */
+#else
+#define ALLOC_NOFRAGMENT	  0x0
+#endif
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a919ba5cb3c8..5db746c642df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,20 +2375,30 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
  * condition simpler.
  */
 static __always_inline bool
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+						unsigned int alloc_flags)
 {
 	struct free_area *area;
 	int current_order;
+	int min_order = order;
 	struct page *page;
 	int fallback_mt;
 	bool can_steal;
 
+	/*
+	 * Do not steal pages from freelists belonging to other pageblocks
+	 * i.e. orders < pageblock_order. In the event there is no local
+	 * zone free, the allocation will retry later.
+	 */
+	if (alloc_flags & ALLOC_NOFRAGMENT)
+		min_order = pageblock_order;
+
 	/*
 	 * Find the largest available free page in the other list. This roughly
 	 * approximates finding the pageblock with the most free pages, which
 	 * would be too costly to do exactly.
 	 */
-	for (current_order = MAX_ORDER - 1; current_order >= order;
+	for (current_order = MAX_ORDER - 1; current_order >= min_order;
 				--current_order) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
@@ -2447,7 +2457,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype)
+__rmqueue(struct zone *zone, unsigned int order, int migratetype,
+						unsigned int alloc_flags)
 {
 	struct page *page;
 
@@ -2457,7 +2468,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);
 
-		if (!page && __rmqueue_fallback(zone, order, migratetype))
+		if (!page && __rmqueue_fallback(zone, order, migratetype,
+								alloc_flags))
 			goto retry;
 	}
 
@@ -2472,13 +2484,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype)
+			int migratetype, unsigned int alloc_flags)
 {
 	int i, alloced = 0;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype,
+								alloc_flags);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2934,6 +2947,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 
 /* Remove page from the per-cpu list, caller must protect the list */
 static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
 {
@@ -2943,7 +2957,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
-					migratetype);
+					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
 				return NULL;
 		}
@@ -2959,7 +2973,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2969,7 +2984,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone, migratetype, pcp, list);
+	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
@@ -2992,7 +3007,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	if (likely(order == 0)) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				gfp_flags, migratetype);
+				gfp_flags, migratetype, alloc_flags);
 		goto out;
 	}
 
@@ -3011,7 +3026,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 			trace_mm_page_alloc_zone_locked(page, order, migratetype);
 		}
 		if (!page)
-			page = __rmqueue(zone, order, migratetype);
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
 	} while (page && check_new_pages(page, order));
 	spin_unlock(&zone->lock);
 	if (!page)
@@ -3253,6 +3268,36 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 }
 #endif	/* CONFIG_NUMA */
 
+#ifdef CONFIG_ZONE_DMA32
+/*
+ * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
+ * fragmentation is subtle. If the preferred zone was HIGHMEM then
+ * premature use of a lower zone may cause lowmem pressure problems that
+ * are worse than fragmentation. If the next zone is ZONE_DMA then it is
+ * probably too small. It only makes sense to spread allocations to avoid
+ * fragmentation between the Normal and DMA32 zones.
+ */
+static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+{
+	if (zone_idx(zone) != ZONE_NORMAL)
+		return 0;
+
+	/*
+	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
+	 * the pointer is within zone->zone_pgdat->node_zones[].
+	 */
+	if (!populated_zone(--zone))
+		return 0;
+
+	return ALLOC_NOFRAGMENT;
+}
+#else
+static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+{
+	return 0;
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3264,11 +3309,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool no_fallback;
 
+retry:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
+	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3307,6 +3355,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		if (no_fallback) {
+			int local_nid;
+
+			/*
+			 * If moving to a remote node, retry but allow
+			 * fragmenting fallbacks. Locality is more important
+			 * than fragmentation avoidance.
+			 */
+			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			if (zone_to_nid(zone) != local_nid) {
+				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				goto retry;
+			}
+		}
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac_classzone_idx(ac), alloc_flags)) {
@@ -3374,6 +3437,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
+	/*
+	 * It's possible on a UMA machine to get through all zones that are
+	 * fragmented. If avoiding fragmentation, reset and try again
+	 */
+	if (no_fallback) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
+
 	return NULL;
 }
 
@@ -4371,6 +4443,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	finalise_ac(gfp_mask, &ac);
 
+	/*
+	 * Forbid the first pass from falling back to types that fragment
+	 * memory until all local zones are considered.
+	 */
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))

From patchwork Wed Nov 7 18:38:20 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10672873
From: Mel Gorman
To: Linux-MM
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
 Zi Yan, LKML, Mel Gorman
Subject: [PATCH 3/5] mm: Reclaim small amounts of memory when an external
 fragmentation event occurs
Date: Wed, 7 Nov 2018 18:38:20 +0000
Message-Id: <20181107183822.15567-4-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181107183822.15567-1-mgorman@techsingularity.net>
References: <20181107183822.15567-1-mgorman@techsingularity.net>

An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if there
are enough sparsely populated pageblocks then the problem can still occur
as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation causing
event occurs. The boosting will stall allocations that would decrease free
memory below the boosted low watermark and kswapd is woken unconditionally
to reclaim an amount of memory relative to the size of the high watermark
and the watermark_boost_factor until the boost is cleared. When kswapd
finishes, it wakes kcompactd at the pageblock order to clean some of the
pageblocks that may have been affected by the fragmentation event. kswapd
avoids any writeback or swap from reclaim context during this operation to
avoid excessive system disruption in the name of fragmentation avoidance.
Care is taken so that kswapd will do normal reclaim work if the system is
really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-1      663.65 (   0.00%)      659.85 *   0.57%*
Amean     fault-huge-1        0.00 (   0.00%)      172.19 * -99.00%*

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-1        0.00 (   0.00%)        1.68 ( 100.00%)

Note that external fragmentation causing events are massively reduced by
this patch whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 ( 1% reduction)
4.20-rc1+patch1-3:                   12801 (96% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-1     1531.37 (   0.00%)     1578.91 (  -3.10%)
Amean     fault-huge-1     1160.95 (   0.00%)     1090.23 *   6.09%*

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-1       78.97 (   0.00%)       82.59 (   4.58%)

As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)
4.20-rc1+patch1-3:                   11240 (95% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-5     1334.99 (   0.00%)     1395.28 (  -4.52%)
Amean     fault-huge-5     2428.43 (   0.00%)      539.69 (  77.78%)

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-5        1.13 (   0.00%)        0.53 ( -52.94%)

This is an illustration of why latencies are not the primary metric.
There is a 95% reduction in fragmentation causing events but the huge page
latencies look fantastic until you account for the fact it might be
because the success rate was lower. Given how low it was initially, this
is partially down to luck.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  167464
4.20-rc1+patch:                     130081 (22% reduction)
4.20-rc1+patch1-3:                   12057 (92% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-5     6652.67 (   0.00%)     8691.83 * -30.65%*
Amean     fault-huge-5     2486.89 (   0.00%)     2899.83 * -16.60%*

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-5       94.49 (   0.00%)       95.55 (   1.13%)

There is a large reduction in fragmentation events with a very slightly
higher THP allocation success rate. The latencies look bad but a closer
look at the data seems to indicate the problem is at the tails. Given the
high THP allocation success rate, the system is under quite some pressure.
Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 19 +++++++ include/linux/mm.h | 1 + include/linux/mmzone.h | 11 ++-- kernel/sysctl.c | 8 +++ mm/page_alloc.c | 53 +++++++++++++++++-- mm/vmscan.c | 123 ++++++++++++++++++++++++++++++++++++++++---- 6 files changed, 199 insertions(+), 16 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 7d73882e2c27..2244520d7913 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -63,6 +63,7 @@ files can be found in mm/swap.c. - swappiness - user_reserve_kbytes - vfs_cache_pressure +- watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -856,6 +857,24 @@ ten times more freeable objects than there are. ============================================================= +watermark_boost_factor: + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do, which increases the +success rate of future high-order allocations such as SLUB allocations, +THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor parameter, +the unit is in fractions of 10,000. The default value of 15000 means +that 150% of the high watermark will be reclaimed in the event of a +pageblock being mixed due to fragmentation. If this value is smaller +than a pageblock then a pageblock's worth of pages will be reclaimed (e.g. +2MB on 64-bit x86). A boost factor of 0 will disable the feature. + +============================================================= + watermark_scale_factor: This factor controls the aggressiveness of kswapd.
It defines the diff --git a/include/linux/mm.h b/include/linux/mm.h index fcf9cc9d535f..81926daf6dfb 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2194,6 +2194,7 @@ extern void zone_pcp_reset(struct zone *zone); /* page_alloc.c */ extern int min_free_kbytes; +extern int watermark_boost_factor; extern int watermark_scale_factor; /* nommu.c */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e43e8e79db99..d352c1dab486 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,10 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) -#define wmark_pages(z, i) (z->_watermark[i]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) +#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -364,6 +364,7 @@ struct zone { /* zone watermarks, access with *_wmark_pages(zone) macros */ unsigned long _watermark[NR_WMARK]; + unsigned long watermark_boost; unsigned long nr_reserved_highatomic; @@ -885,6 +886,8 @@ static inline int is_highmem(struct zone *zone) struct ctl_table; int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int watermark_boost_factor_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 5fc724e4e454..1825f712e73b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = { 
.proc_handler = min_free_kbytes_sysctl_handler, .extra1 = &zero, }, + { + .procname = "watermark_boost_factor", + .data = &watermark_boost_factor, + .maxlen = sizeof(watermark_boost_factor), + .mode = 0644, + .proc_handler = watermark_boost_factor_sysctl_handler, + .extra1 = &zero, + }, { .procname = "watermark_scale_factor", .data = &watermark_scale_factor, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ad996a769bd5..4abac725a149 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -263,6 +263,7 @@ compound_page_dtor * const compound_page_dtors[] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; +int watermark_boost_factor __read_mostly = 15000; int watermark_scale_factor = 10; static unsigned long nr_kernel_pages __meminitdata; @@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } +static inline void boost_watermark(struct zone *zone) +{ + unsigned long max_boost; + + if (!watermark_boost_factor) + return; + + max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH), + watermark_boost_factor, 10000); + max_boost = max(pageblock_nr_pages, max_boost); + + zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, + max_boost); +} + /* * This function implements actual steal behaviour. If order is large enough, * we can steal whole pageblock. If not, we first move freepages in this @@ -2160,6 +2176,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; } + /* + * Boost watermarks to increase reclaim pressure to reduce the + * likelihood of future fallbacks. Wake kswapd now as the node + * may be balanced overall and kswapd will not wake naturally. + */ + boost_watermark(zone); + wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -3277,11 +3301,19 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) * probably too small. 
It only makes sense to spread allocations to avoid * fragmentation between the Normal and DMA32 zones. */ -static inline unsigned int alloc_flags_nofragment(struct zone *zone) +static inline unsigned int +alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask) { if (zone_idx(zone) != ZONE_NORMAL) return 0; + /* + * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT + * may break that so such callers can introduce fragmentation. + */ + if (!(gfp_mask & __GFP_KSWAPD_RECLAIM)) + return 0; + /* * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and * the pointer is within zone->zone_pgdat->node_zones[]. @@ -3292,7 +3324,8 @@ static inline unsigned int alloc_flags_nofragment(struct zone *zone) return ALLOC_NOFRAGMENT; } #else -static inline unsigned int alloc_flags_nofragment(struct zone *zone) +static inline unsigned int +alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask) { return 0; } @@ -4447,7 +4480,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, * Forbid the first pass from falling back to types that fragment * memory until all local zones are considered. 
*/ - alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone); + alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, + gfp_mask); /* First allocation attempt */ page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac); @@ -7438,6 +7472,7 @@ static void __setup_per_zone_wmarks(void) zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->watermark_boost = 0; spin_unlock_irqrestore(&zone->lock, flags); } @@ -7538,6 +7573,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write, return 0; } +int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 62ac0c488624..5ba76ec4f01e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3378,6 +3378,30 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } +static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) +{ + int i; + struct zone *zone; + + /* + * Check for watermark boosts top-down as the higher zones + * are more likely to be boosted. Both watermarks and boosts + * should not be checked at the same time as reclaim would + * start prematurely when there is no boosting and a lower + * zone is balanced.
+ */ + for (i = classzone_idx; i >= 0; i--) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + if (zone->watermark_boost) + return true; + } + + return false; +} + /* * Returns true if there is an eligible zone balanced for the request order * and classzone_idx @@ -3388,9 +3412,12 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) unsigned long mark = -1; struct zone *zone; + /* + * Check watermarks bottom-up as lower zones are more likely to + * meet watermarks. + */ for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; - if (!managed_zone(zone)) continue; @@ -3516,14 +3543,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; unsigned long pflags; + unsigned long nr_boost_reclaim; + unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; + bool boosted; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .order = order, - .priority = DEF_PRIORITY, - .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = 1, }; psi_memstall_enter(&pflags); @@ -3531,9 +3558,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) count_vm_event(PAGEOUTRUN); + /* + * Account for the reclaim boost. Note that the zone boost is left in + * place so that parallel allocations that are near the watermark will + * stall or direct reclaim until kswapd is finished. 
+ */ + nr_boost_reclaim = 0; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + nr_boost_reclaim += zone->watermark_boost; + zone_boosts[i] = zone->watermark_boost; + } + boosted = nr_boost_reclaim; + +restart: + sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; bool raise_priority = true; + bool balanced; bool ret; sc.reclaim_idx = classzone_idx; @@ -3560,13 +3606,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. Note that - * sc.reclaim_idx is not used as buffer_heads_over_limit may - * have adjusted it. + * If the pgdat is imbalanced then ignore boosting and preserve + * the watermarks for a later time and restart. Note that the + * zone watermarks will be still reset at the end of balancing + * on the grounds that the normal reclaim should be enough to + * re-evaluate if boosting is required when kswapd next wakes. + */ + balanced = pgdat_balanced(pgdat, sc.order, classzone_idx); + if (!balanced && nr_boost_reclaim) { + nr_boost_reclaim = 0; + goto restart; + } + + /* + * If boosting is not active then only reclaim if there are no + * eligible zones. Note that sc.reclaim_idx is not used as + * buffer_heads_over_limit may have adjusted it. */ - if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + if (!nr_boost_reclaim && balanced) goto out; + /* Limit the priority of boosting to avoid reclaim writeback */ + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) + raise_priority = false; + + /* + * Do not writeback or swap pages for boosted reclaim. The + * intent is to relieve pressure not issue sub-optimal IO + * from reclaim context. If no pages are reclaimed, the + * reclaim will be aborted. 
+ */ + sc.may_writepage = !laptop_mode && !nr_boost_reclaim; + sc.may_swap = !nr_boost_reclaim; + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -3618,6 +3690,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * progress in reclaiming pages */ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; + nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed); + + /* + * If reclaim made no progress for a boost, stop reclaim as + * IO cannot be queued and it could be an infinite loop in + * extreme circumstances. + */ + if (nr_boost_reclaim && !nr_reclaimed) + break; + if (raise_priority || !nr_reclaimed) sc.priority--; } while (sc.priority >= 1); @@ -3626,6 +3708,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) pgdat->kswapd_failures++; out: + /* If reclaim was boosted, account for the reclaim done in this pass */ + if (boosted) { + unsigned long flags; + + for (i = 0; i <= classzone_idx; i++) { + if (!zone_boosts[i]) + continue; + + /* Increments are under the zone lock */ + zone = pgdat->node_zones + i; + spin_lock_irqsave(&zone->lock, flags); + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]); + spin_unlock_irqrestore(&zone->lock, flags); + } + + /* + * As there is now likely space, wakeup kcompact to defragment + * pageblocks. + */ + wakeup_kcompactd(pgdat, pageblock_order, classzone_idx); + } + snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); psi_memstall_leave(&pflags); @@ -3854,7 +3958,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, /* Hopeless node, leave it to direct reclaim if possible */ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || - pgdat_balanced(pgdat, order, classzone_idx)) { + (pgdat_balanced(pgdat, order, classzone_idx) && + !pgdat_watermark_boosted(pgdat, classzone_idx))) { /* * There may be plenty of free memory available, but it's too * fragmented for high-order allocations. 
Wake up kcompactd
From patchwork Wed Nov 7 18:38:21 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10672875
From: Mel Gorman
To: Linux-MM
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, LKML, Mel Gorman
Subject: [PATCH 4/5] mm: Stall movable allocations until kswapd progresses during serious external fragmentation event
Date: Wed, 7 Nov 2018 18:38:21 +0000
Message-Id: <20181107183822.15567-5-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To:
<20181107183822.15567-1-mgorman@techsingularity.net>
References: <20181107183822.15567-1-mgorman@techsingularity.net>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

An event that potentially causes external fragmentation problems has
already been described but there are degrees of severity. A "serious"
event is defined as one that steals a contiguous range of pages of an
order lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER + 1 by
default). If a movable allocation request that is allowed to sleep needs
to steal a small block then it schedules until kswapd makes progress or
a timeout passes. The watermarks are also boosted slightly faster so
that kswapd makes greater effort to reclaim enough pages to avoid the
fragmentation event.

This stall is not guaranteed to avoid serious fragmentation events.
If memory pressure is high enough, the pages freed by kswapd may be
reallocated or the free pages may not be in pageblocks that contain
only movable pages. Furthermore an allocation request that cannot stall
(e.g. atomic allocations) or unmovable/reclaimable allocations will
still proceed without stalling.

The worst-case scenario for stalling is a combination of both high
memory pressure, where kswapd is having trouble keeping free pages over
the pfmemalloc_reserve, and movable allocations that are fragmenting
memory. In this case, an allocation request may sleep for longer. There
are both vmstats to identify that stalls are happening and a tracepoint
to quantify what the stall durations are. Note that the granularity of
the stall detection is a jiffy so the delay accounting is not precise.
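For readers skimming the diff, the two decisions described above can be
sketched in userspace C. This is only an illustration under stated
assumptions: the sketch_ names are hypothetical, the pageblock size of
512 pages, MIGRATE_PCPTYPES being 3 and MAX_ORDER being 11 are assumed
values for a 4.20 x86-64 configuration, and the flag value is copied
from the mm/internal.h hunk of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Copied from the mm/internal.h hunk in this patch. */
#define SKETCH_ALLOC_FRAGMENT_STALL 0x200u

/* Assumed values for illustration only. */
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL
#define SKETCH_MIGRATE_PCPTYPES   3UL
#define SKETCH_MAX_ORDER          11u

/*
 * A fallback is refused (so that the caller can stall) when the
 * allocation is flagged as allowed to stall and the block that would
 * be stolen is of an order below fragment_stall_order. Writing 0 to
 * the tunable therefore disables stalling.
 */
static bool sketch_should_stall(unsigned int alloc_flags,
				unsigned int current_order,
				unsigned int fragment_stall_order)
{
	/* The sysctl table bounds the tunable to [0, MAX_ORDER]. */
	assert(fragment_stall_order <= SKETCH_MAX_ORDER);

	return (alloc_flags & SKETCH_ALLOC_FRAGMENT_STALL) &&
	       current_order < fragment_stall_order;
}

/*
 * Pages added to the zone boost per event. The stall path requests a
 * fast boost, which adds a further MIGRATE_PCPTYPES * 2 pageblocks.
 */
static unsigned long sketch_boost_step(bool fast_boost)
{
	unsigned long nr = SKETCH_PAGEBLOCK_NR_PAGES;

	if (fast_boost)
		nr += SKETCH_PAGEBLOCK_NR_PAGES * (SKETCH_MIGRATE_PCPTYPES << 1);

	return nr;
}
```

Under these assumptions a stalling allocation boosts the watermark by
seven pageblocks per event instead of one, which is what "boosted
slightly faster" amounts to in practice.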
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)
4.20-rc1+patch1-4:                     1351 (99.9% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   boost-v2r4             stall-v2r6
Amean     fault-base-1      659.85 (   0.00%)      648.66 *   1.70%*
Amean     fault-huge-1      172.19 (   0.00%)      167.79 (   2.56%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                              boost-v2r4             stall-v2r6
Percentage huge-1        1.68 (   0.00%)        1.16 ( -30.69%)

Fragmentation events are now reduced to negligible levels. The latencies
and allocation success rates are roughly similar. Over the course of 16
minutes, there were 100 stalls due to fragmentation avoidance with a
total stall time of 0.4 seconds.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 ( 1% reduction)
4.20-rc1+patch1-3:                   12801 (96% reduction)
4.20-rc1+patch1-4:                    1117 (99.7% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   boost-v2r4             stall-v2r6
Amean     fault-base-1     1578.91 (   0.00%)    43404.60 (-2649.02%)
Amean     fault-huge-1     1090.23 (   0.00%)     1424.32 * -30.64%*

                              4.20.0-rc1             4.20.0-rc1
                              boost-v2r4             stall-v2r6
Percentage huge-1       82.59 (   0.00%)       99.92 (  20.97%)

The fragmentation events were reduced but the latencies went a bit
crazy. The "problem" is that the allocation success rates were very high
and forward progress was being made. This put the system under further
pressure and while compactions were succeeding, the latencies were high
in cases where compaction failed.
The THP allocation vm stats are illustrative in this case

                                4.20.0-rc1     4.20.0-rc1
                                boost-v2r4     stall-v2r6
THP fault alloc                       4974           6016
THP fault fallback                    1048              5
THP collapse alloc                      65             56
THP collapse fail                        4              4
THP split                                0           3719
THP split failed                         0            224

Note the THP fault alloc stats: almost all faults succeeded (5 fallbacks
versus 1048 for the baseline). While the latencies are much higher, it
is the case that the application specifically requested THP while the
system was under heavy memory pressure. There were 314 stalls over the
course of 16 minutes for a total stall time of roughly 11 seconds. The
distribution of stalls (count, duration in microseconds) is as follows

    205 4000
      1 8000
      1 20000
      1 32000
      1 36000
      6 40000
      1 56000
     98 100000

This is showing that 98 of the stalls waited until the timeout expired
at 25 jiffies, which is 100000 microseconds on this particular
configuration. If this is considered problematic, the timeout can be
reduced to trade off fault times against fragmentation avoidance.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)
4.20-rc1+patch1-3:                   11240 (95% reduction)
4.20-rc1+patch1-4:                   13241 (93% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   boost-v2r4             stall-v2r6
Amean     fault-base-5     1395.28 (   0.00%)     1508.94 *  -8.15%*
Amean     fault-huge-5      539.69 (   0.00%)      614.88 * -13.93%*

                              4.20.0-rc1             4.20.0-rc1
                              boost-v2r4             stall-v2r6
Percentage huge-5        0.53 (   0.00%)        3.38 ( 534.38%)

There is a slight increase in fragmentation events but given that it's
already heavily reduced, there are elements of luck. There is a small
increase in latencies which is partially offset by a slight increase in
THP allocation success rates. There were 65 stalls over the course of 87
minutes with a total stall time of roughly 0.4 milliseconds.
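The stall distribution reported for the 1-socket MADV_HUGEPAGE run can
be cross-checked with a few lines of userspace C. This is a sketch and
not part of the patch: the sketch_ names are hypothetical, the bucket
values are copied from the distribution in this message, and HZ=250 is
an inference from the statement that 25 jiffies is 100000 microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the reported distribution: stall count and duration. */
struct sketch_stall_bucket {
	unsigned long count;
	unsigned long usecs;
};

/* Values copied from the distribution quoted in this message. */
static const struct sketch_stall_bucket sketch_buckets[] = {
	{ 205,   4000 }, { 1,  8000 }, { 1, 20000 }, { 1,  32000 },
	{   1,  36000 }, { 6, 40000 }, { 1, 56000 }, { 98, 100000 },
};

#define SKETCH_NR_BUCKETS (sizeof(sketch_buckets) / sizeof(sketch_buckets[0]))

/* Total number of stalls across all buckets. */
static unsigned long sketch_total_stalls(const struct sketch_stall_bucket *b,
					 size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += b[i].count;
	return total;
}

/* Total time stalled in microseconds across all buckets. */
static unsigned long long
sketch_total_usecs(const struct sketch_stall_bucket *b, size_t n)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += (unsigned long long)b[i].count * b[i].usecs;
	return total;
}

/* Simplified jiffies-to-microseconds conversion for an HZ dividing 1000000. */
static unsigned long sketch_jiffies_to_usecs(unsigned long jiffies,
					     unsigned long hz)
{
	assert(hz != 0);
	return jiffies * (1000000UL / hz);
}
```

The buckets add up to 314 stalls and 11012000 microseconds, matching the
"roughly 11 seconds" quoted above, and the HZ/10 timeout at the assumed
HZ=250 is 25 jiffies or 100000 microseconds. The 98 timed-out stalls
account for 9.8 of those 11 seconds.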
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  167464
4.20-rc1+patch:                     130081 (22% reduction)
4.20-rc1+patch1-3:                   12057 (92% reduction)
4.20-rc1+patch1-4:                   11060 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                   boost-v2r4             stall-v2r6
Amean     fault-base-5     8691.83 (   0.00%)     9363.89 (  -7.73%)
Amean     fault-huge-5     2899.83 (   0.00%)     3638.29 * -25.47%*

                              4.20.0-rc1             4.20.0-rc1
                              boost-v2r4             stall-v2r6
Percentage huge-5       95.55 (   0.00%)       99.27 (   3.89%)

The fragmentation events are reduced and while there is some hit on the
latency, the success rate is near 100% while under heavy pressure. There
were 2486 stalls over the course of 85 minutes with a total stall time
of roughly 12 seconds.

This patch does reduce fragmentation rates overall but it's not free as
some allocations can stall for short periods of time and there are
knock-on effects to latency when THP allocation success rates are
higher. While it's within acceptable limits for the adverse test case,
there may be other workloads that cannot tolerate the stalls. If this
occurs, the tunable can be used to disable the feature or, more ideally,
the test case can be made available for analysis to see if the stall
behaviour can be reduced while still limiting the fragmentation events.
On the flip-side, it has been checked that setting the
fragment_stall_order to 9 eliminated fragmentation events entirely.
Signed-off-by: Mel Gorman --- Documentation/sysctl/vm.txt | 23 +++++++++++ include/linux/mm.h | 1 + include/linux/mmzone.h | 2 + include/linux/vm_event_item.h | 1 + include/trace/events/kmem.h | 21 ++++++++++ kernel/sysctl.c | 10 +++++ mm/internal.h | 1 + mm/page_alloc.c | 94 +++++++++++++++++++++++++++++++++++++------ mm/vmstat.c | 1 + 9 files changed, 142 insertions(+), 12 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 2244520d7913..f7d3fcb9d4ce 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -31,6 +31,7 @@ files can be found in mm/swap.c. - dirty_writeback_centisecs - drop_caches - extfrag_threshold +- fragment_stall_order - hugetlb_shm_group - laptop_mode - legacy_va_layout @@ -275,6 +276,28 @@ any throttling. ============================================================== +fragment_stall_order + +External fragmentation control is managed on a pageblock level where the +page allocator tries to avoid mixing pages of different mobility within page +blocks (e.g. order 9 on 64-bit x86). If external fragmentation is perfectly +controlled then a THP allocation will often succeed up to the number of +movable pageblocks in the system as reported by /proc/pagetypeinfo. + +When memory is low, the system may have to mix pageblocks and will wake +kswapd to try to control future fragmentation. fragment_stall_order controls if +the allocating task will stall if possible until kswapd makes some progress +in preference to fragmenting the system. This incurs a small stall penalty +in exchange for future success at allocating huge pages. If the stalls +are undesirable and high-order allocations are irrelevant then this can +be disabled by writing 0 to the tunable. Writing the pageblock order will +strongly (but not perfectly) control external fragmentation. + +The default will stall for fragmenting allocations up to and including +PAGE_ALLOC_COSTLY_ORDER (defined as order-3 at the time of writing).
+ +============================================================== + hugetlb_shm_group hugetlb_shm_group contains group id that is allowed to create SysV diff --git a/include/linux/mm.h b/include/linux/mm.h index 81926daf6dfb..ef98eb3f8360 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2196,6 +2196,7 @@ extern void zone_pcp_reset(struct zone *zone); extern int min_free_kbytes; extern int watermark_boost_factor; extern int watermark_scale_factor; +extern int fragment_stall_order; /* nommu.c */ extern atomic_long_t mmap_pages_allocated; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d352c1dab486..cffec484ac8a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -890,6 +890,8 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int fragment_stall_order_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 47a3441cf4c4..7661abe5236e 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PAGEOUTRUN, PGROTATED, DROP_PAGECACHE, DROP_SLAB, OOM_KILL, + FRAGMENTSTALL, #ifdef CONFIG_NUMA_BALANCING NUMA_PTE_UPDATES, NUMA_HUGE_PTE_UPDATES, diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index eb57e3037deb..caadd8681ac5 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -315,6 +315,27 @@ TRACE_EVENT(mm_page_alloc_extfrag, __entry->change_ownership) ); +TRACE_EVENT(mm_fragmentation_stall, + + TP_PROTO(int nid, unsigned long duration), + + TP_ARGS(nid, duration), + + 
TP_STRUCT__entry( + __field( int, nid ) + __field( unsigned long, duration ) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->duration = duration + ), + + TP_printk("nid=%d duration=%lu", + __entry->nid, + __entry->duration) +); + #endif /* _TRACE_KMEM_H */ /* This part must be outside protection */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 1825f712e73b..eb09c79ddbef 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -126,6 +126,7 @@ static int zero; static int __maybe_unused one = 1; static int __maybe_unused two = 2; static int __maybe_unused four = 4; +static int __maybe_unused max_order = MAX_ORDER; static unsigned long one_ul = 1; static int one_hundred = 100; static int one_thousand = 1000; @@ -1479,6 +1480,15 @@ static struct ctl_table vm_table[] = { .extra1 = &one, .extra2 = &one_thousand, }, + { + .procname = "fragment_stall_order", + .data = &fragment_stall_order, + .maxlen = sizeof(fragment_stall_order), + .mode = 0644, + .proc_handler = fragment_stall_order_sysctl_handler, + .extra1 = &zero, + .extra2 = &max_order, + }, { .procname = "percpu_pagelist_fraction", .data = &percpu_pagelist_fraction, diff --git a/mm/internal.h b/mm/internal.h index 544355156c92..5506a4596d59 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -489,6 +489,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, #else #define ALLOC_NOFRAGMENT 0x0 #endif +#define ALLOC_FRAGMENT_STALL 0x200 /* stall if fragmenting heavily */ enum ttu_flags; struct tlbflush_unmap_batch; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4abac725a149..86a6e86c51bb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -265,6 +265,7 @@ int min_free_kbytes = 1024; int user_min_free_kbytes = -1; int watermark_boost_factor __read_mostly = 15000; int watermark_scale_factor = 10; +int fragment_stall_order __read_mostly = (PAGE_ALLOC_COSTLY_ORDER + 1); static unsigned long nr_kernel_pages __meminitdata; static unsigned long nr_all_pages __meminitdata; @@ -2130,9 +2131,10 
@@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } -static inline void boost_watermark(struct zone *zone) +static inline void boost_watermark(struct zone *zone, bool fast_boost) { unsigned long max_boost; + unsigned long nr; if (!watermark_boost_factor) return; @@ -2140,9 +2142,36 @@ static inline void boost_watermark(struct zone *zone) max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH), watermark_boost_factor, 10000); max_boost = max(pageblock_nr_pages, max_boost); + nr = pageblock_nr_pages; - zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, - max_boost); + /* Scale relative to the MIGRATE_PCPTYPES similar to min_free_kbytes */ + if (fast_boost) + nr += pageblock_nr_pages * (MIGRATE_PCPTYPES << 1); + + zone->watermark_boost = min(zone->watermark_boost + nr, max_boost); +} + +static void stall_fragmentation(struct zone *pzone) +{ + DEFINE_WAIT(wait); + long remaining = 0; + long timeout = HZ/10; + pg_data_t *pgdat = pzone->zone_pgdat; + + if (current->flags & PF_MEMALLOC) + return; + + boost_watermark(pzone, true); + prepare_to_wait(&pgdat->pfmemalloc_wait, &wait, TASK_INTERRUPTIBLE); + if (waitqueue_active(&pgdat->kswapd_wait)) + wake_up_interruptible(&pgdat->kswapd_wait); + remaining = schedule_timeout(timeout); + finish_wait(&pgdat->pfmemalloc_wait, &wait); + if (remaining != timeout) { + trace_mm_fragmentation_stall(pgdat->node_id, + jiffies_to_usecs(timeout - remaining)); + count_vm_event(FRAGMENTSTALL); + } } /* @@ -2153,8 +2182,9 @@ static inline void boost_watermark(struct zone *zone) * of pages are free or compatible, we can change migratetype of the pageblock * itself, so pages freed in the future will be put on the correct free list. 
*/ -static void steal_suitable_fallback(struct zone *zone, struct page *page, - int start_type, bool whole_block) +static bool steal_suitable_fallback(struct zone *zone, struct page *page, + int start_type, bool whole_block, + unsigned int alloc_flags) { unsigned int current_order = page_order(page); struct free_area *area; @@ -2181,9 +2211,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * likelihood of future fallbacks. Wake kswapd now as the node * may be balanced overall and kswapd will not wake naturally. */ - boost_watermark(zone); + boost_watermark(zone, false); wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + if ((alloc_flags & ALLOC_FRAGMENT_STALL) && + current_order < fragment_stall_order) { + return false; + } + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -2224,11 +2259,12 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, page_group_by_mobility_disabled) set_pageblock_migratetype(page, start_type); - return; + return true; single_page: area = &zone->free_area[current_order]; list_move(&page->lru, &area->free_list[start_type]); + return true; } /* @@ -2467,13 +2503,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = list_first_entry(&area->free_list[fallback_mt], struct page, lru); - steal_suitable_fallback(zone, page, start_migratetype, can_steal); + if (!steal_suitable_fallback(zone, page, start_migratetype, can_steal, + alloc_flags)) + return false; trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); return true; - } /* @@ -3340,9 +3377,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, const struct alloc_context *ac) { struct zoneref *z = ac->preferred_zoneref; + struct zone *pzone = z->zone; struct zone *zone; struct pglist_data *last_pgdat_dirty_limit = NULL; bool no_fallback; + bool fragment_stall; + int wmark_idx = alloc_flags & 
ALLOC_WMARK_MASK; retry: /* @@ -3350,6 +3390,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * See also __cpuset_node_allowed() comment in kernel/cpuset.c. */ no_fallback = alloc_flags & ALLOC_NOFRAGMENT; + fragment_stall = alloc_flags & ALLOC_FRAGMENT_STALL; + for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) { struct page *page; @@ -3388,7 +3430,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, } } - if (no_fallback) { + if (no_fallback || fragment_stall) { int local_nid; /* @@ -3396,9 +3438,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * fragmenting fallbacks. Locality is more important * than fragmentation avoidance. */ - local_nid = zone_to_nid(ac->preferred_zoneref->zone); + local_nid = zone_to_nid(pzone); if (zone_to_nid(zone) != local_nid) { + if (fragment_stall) + stall_fragmentation(pzone); alloc_flags &= ~ALLOC_NOFRAGMENT; + alloc_flags &= ~ALLOC_FRAGMENT_STALL; goto retry; } } @@ -3474,8 +3519,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * It's possible on a UMA machine to get through all zones that are * fragmented. If avoiding fragmentation, reset and try again */ - if (no_fallback) { + if (no_fallback || fragment_stall) { + if (fragment_stall) + stall_fragmentation(pzone); + alloc_flags &= ~ALLOC_NOFRAGMENT; + alloc_flags &= ~ALLOC_FRAGMENT_STALL; goto retry; } @@ -4197,6 +4246,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, */ alloc_flags = gfp_to_alloc_flags(gfp_mask); + /* + * Consider stalling on heavy fragmentation for movable allocations in preference to + * fragmenting unmovable/reclaimable pageblocks.
+ */ + if ((gfp_mask & (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) == + (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) + alloc_flags |= ALLOC_FRAGMENT_STALL; + /* * We need to recalculate the starting point for the zonelist iterator * because we might have used different nodemask in the fast path, or @@ -4218,6 +4275,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); if (page) goto got_pg; + alloc_flags &= ~ALLOC_FRAGMENT_STALL; /* * For costly allocations, try direct compaction first, as it's likely @@ -7585,6 +7643,18 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, return 0; } +int fragment_stall_order_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmstat.c b/mm/vmstat.c index 6038ce593ce3..9bb78adf4445 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1211,6 +1211,7 @@ const char * const vmstat_text[] = { "drop_pagecache", "drop_slab", "oom_kill", + "fragment_stall", #ifdef CONFIG_NUMA_BALANCING "numa_pte_updates", From patchwork Wed Nov 7 18:38:22 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 10672877 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0AB5F15A6 for ; Wed, 7 Nov 2018 18:38:38 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2E62D2AC8A for ; Wed, 7 Nov 2018 18:38:37 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from 
From: Mel Gorman
To: Linux-MM
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, LKML, Mel Gorman
Subject: [PATCH 5/5] mm: Target compaction on pageblocks that were recently fragmented
Date: Wed, 7 Nov 2018 18:38:22 +0000
Message-Id: <20181107183822.15567-6-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181107183822.15567-1-mgorman@techsingularity.net>
References: <20181107183822.15567-1-mgorman@techsingularity.net>

Despite the earlier patches, external fragmentation events are still
inevitable as not all callers can stall or are appropriate to stall. In the
event the result is a mixed pageblock, it is desirable to move all movable
pages from that block so that unmovable/unreclaimable allocations do not
further pollute the address space.
This patch queues such pageblocks for early compaction and relies on kswapd
to wake kcompactd when some pages are reclaimed. kcompactd is woken only
after kswapd makes progress so that compaction is more likely to have a
suitable migration destination.

This patch may be controversial as there are multiple other design decisions
that can be made. We could refuse to change pageblock ownership in some cases
but great care would need to be taken to avoid premature OOMs or a livelock.
Similarly, we could tag pageblocks as mixed and search for them but that
would increase scanning costs. Finally, there is a corner case: compaction
may fail to clean a mixed pageblock that lies beyond the point where the free
scanner can operate, but addressing that would require a fundamental
alteration to how compaction works.

Unlike the previous patches, the benefit here is harder to quantify as any
work that is queued may or may not help an allocation request in the future.
The timing of the allocation stream is critical and detecting differences in
latency may be within the noise. Hence, the potential benefit of this patch
is more conceptual than quantitative even though there are some positive
results.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)
4.20-rc1+patch1-4:                     1351 (99.9% reduction)
4.20-rc1+patch1-5:                     2554 (99.8% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   stall-v2r6         proactive-v2r6
Amean     fault-base-1      648.66 (   0.00%)      655.18 *  -1.00%*
Amean     fault-huge-1      167.79 (   0.00%)      163.00 (   2.85%)

                              4.20.0-rc1             4.20.0-rc1
                              stall-v2r6         proactive-v2r6
Percentage huge-1        1.16 (   0.00%)        0.03 ( -97.14%)

The performance is similar, but that does not necessarily indicate that the
patch had any effect. There was no reported compaction activity so
essentially the patch was a no-op.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 ( 1% reduction)
4.20-rc1+patch1-3:                   12801 (96% reduction)
4.20-rc1+patch1-4:                    1511 (99.7% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   stall-v2r6         proactive-v2r6
Amean     fault-base-1    43404.60 (   0.00%)        0.00 ( 100.00%)
Amean     fault-huge-1     1424.32 (   0.00%)      540.99 *  62.02%*

                              4.20.0-rc1             4.20.0-rc1
                              stall-v2r6         proactive-v2r6
Percentage huge-1       99.92 (   0.00%)      100.00 (   0.08%)

Slight increase in fragmentation events but the latency was improved and THP
allocations had a 100% success rate.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)
4.20-rc1+patch1-3:                   11240 (95% reduction)
4.20-rc1+patch1-4:                   13241 (93% reduction)
4.20-rc1+patch1-5:                   11916 (94% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                   stall-v2r6         proactive-v2r6
Amean     fault-base-5     1508.94 (   0.00%)     1545.56 (  -2.43%)
Amean     fault-huge-5      614.88 (   0.00%)      557.46 *   9.34%*

                              4.20.0-rc1             4.20.0-rc1
                              stall-v2r6         proactive-v2r6
Percentage huge-5        3.38 (   0.00%)        4.53 (  33.99%)

Fragmentation-causing events are slightly reduced and there is a slight
improvement in THP allocation latencies and success rates. Remember that no
special effort is being made to allocate THP in this workload.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  167464
4.20-rc1+patch:                     130081 (22% reduction)
4.20-rc1+patch1-3:                   12057 (92% reduction)
4.20-rc1+patch1-4:                   11060 (93% reduction)
4.20-rc1+patch1-5:                    8903 (95% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   stall-v2r6         proactive-v2r6
Amean     fault-base-5     9363.89 (   0.00%)     9067.00 (   3.17%)
Amean     fault-huge-5     3638.29 (   0.00%)     1509.51 *  58.51%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                              stall-v2r6         proactive-v2r6
Percentage huge-5       99.27 (   0.00%)       99.93 (   0.67%)

There is a small decrease in fragmentation events but the most notable part
is the decrease in latency with a similarly high THP allocation success rate.

It is less obvious whether this is a universal win as fragmentation-causing
events were already low and in the case of MADV_HUGEPAGE, the allocation
success rates were already high. However, it's encouraging that the THP
allocation latencies were improved.
Signed-off-by: Mel Gorman --- include/linux/compaction.h | 4 ++ include/linux/migrate.h | 7 +- include/linux/mmzone.h | 4 ++ include/trace/events/compaction.h | 62 ++++++++++++++++ mm/compaction.c | 145 +++++++++++++++++++++++++++++++++++--- mm/migrate.c | 6 +- mm/page_alloc.c | 7 ++ 7 files changed, 224 insertions(+), 11 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 68250a57aace..1fc1ad055f66 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -177,6 +177,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, extern int kcompactd_run(int nid); extern void kcompactd_stop(int nid); extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx); +extern void kcompactd_queue_migration(struct zone *zone, struct page *page); #else static inline void reset_isolation_suitable(pg_data_t *pgdat) @@ -225,6 +226,9 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i { } +static inline void kcompactd_queue_migration(struct zone *zone, struct page *page) +{ +} #endif /* CONFIG_COMPACTION */ #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index f2b4abbca55e..f12cee38c0f0 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -61,7 +61,7 @@ static inline struct page *new_page_nodemask(struct page *page, #ifdef CONFIG_MIGRATION -extern void putback_movable_pages(struct list_head *l); +extern unsigned int putback_movable_pages(struct list_head *l); extern int migrate_page(struct address_space *mapping, struct page *newpage, struct page *page, enum migrate_mode mode); @@ -82,7 +82,10 @@ extern int migrate_page_move_mapping(struct address_space *mapping, int extra_count); #else -static inline void putback_movable_pages(struct list_head *l) {} +static inline unsigned int putback_movable_pages(struct list_head *l) +{ + return 0; +} static inline 
int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index cffec484ac8a..980fad03ae8e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -497,6 +497,10 @@ struct zone { unsigned int compact_considered; unsigned int compact_defer_shift; int compact_order_failed; + +#define COMPACT_QUEUE_LENGTH 16 + unsigned long compact_queue[COMPACT_QUEUE_LENGTH]; + int nr_compact; #endif #if defined CONFIG_COMPACTION || defined CONFIG_CMA diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h index 6074eff3d766..6b5b61177d8c 100644 --- a/include/trace/events/compaction.h +++ b/include/trace/events/compaction.h @@ -353,6 +353,68 @@ DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake, TP_ARGS(nid, order, classzone_idx) ); +TRACE_EVENT(mm_compaction_wakeup_kcompactd_queue, + + TP_PROTO( + int nid, + enum zone_type zoneid, + unsigned long pfn, + int nr_queued), + + TP_ARGS(nid, zoneid, pfn, nr_queued), + + TP_STRUCT__entry( + __field(int, nid) + __field(enum zone_type, zoneid) + __field(unsigned long, pfn) + __field(int, nr_queued) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->zoneid = zoneid; + __entry->pfn = pfn; + __entry->nr_queued = nr_queued; + ), + + TP_printk("nid=%d zoneid=%-8s pfn=%lu nr_queued=%d", + __entry->nid, + __print_symbolic(__entry->zoneid, ZONE_TYPE), + __entry->pfn, + __entry->nr_queued) +); + +TRACE_EVENT(mm_compaction_kcompactd_migrated, + + TP_PROTO( + int nid, + enum zone_type zoneid, + int nr_migrated, + int nr_failed), + + TP_ARGS(nid, zoneid, nr_migrated, nr_failed), + + TP_STRUCT__entry( + __field(int, nid) + __field(enum zone_type, zoneid) + __field(int, nr_migrated) + __field(int, nr_failed) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->zoneid = zoneid; + __entry->nr_migrated = nr_migrated; + __entry->nr_failed = nr_failed; + ), + +
TP_printk("nid=%d zoneid=%-8s nr_migrated=%d nr_failed=%d", + __entry->nid, + __print_symbolic(__entry->zoneid, ZONE_TYPE), + __entry->nr_migrated, + __entry->nr_failed) +); + #endif /* _TRACE_COMPACTION_H */ /* This part must be outside protection */ diff --git a/mm/compaction.c b/mm/compaction.c index ef29490b0f46..0fdeecd47a03 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1915,6 +1915,12 @@ void compaction_unregister_node(struct node *node) static inline bool kcompactd_work_requested(pg_data_t *pgdat) { + int zoneid; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) + if (pgdat->node_zones[zoneid].nr_compact) + return true; + return pgdat->kcompactd_max_order > 0 || kthread_should_stop(); } @@ -1938,6 +1944,92 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) return false; } +static void kcompactd_migrate_block(struct compact_control *cc, + unsigned long pfn) +{ + unsigned long end = min(pfn + pageblock_nr_pages, zone_end_pfn(cc->zone)); + unsigned long total_migrated = 0, total_failed = 0; + + cc->migrate_pfn = pfn; + while (pfn && pfn < end) { + int err; + unsigned long nr_migrated, nr_failed = 0; + + pfn = isolate_migratepages_range(cc, pfn, end); + if (!pfn) + break; + + nr_migrated = cc->nr_migratepages; + err = migrate_pages(&cc->migratepages, compaction_alloc, + compaction_free, (unsigned long)cc, + cc->mode, MR_COMPACTION); + if (err) { + nr_failed = putback_movable_pages(&cc->migratepages); + nr_migrated -= nr_failed; + } + cc->nr_migratepages = 0; + total_migrated += nr_migrated; + total_failed += nr_failed; + } + + trace_mm_compaction_kcompactd_migrated(zone_to_nid(cc->zone), + zone_idx(cc->zone), total_migrated, total_failed); +} + +static void kcompactd_init_cc(struct compact_control *cc, struct zone *zone) +{ + cc->nr_freepages = 0; + cc->nr_migratepages = 0; + cc->total_migrate_scanned = 0; + cc->total_free_scanned = 0; + cc->zone = zone; + INIT_LIST_HEAD(&cc->freepages); + INIT_LIST_HEAD(&cc->migratepages); +} + +static void 
kcompactd_do_queue(pg_data_t *pgdat) +{ + /* + * With no special task, compact all zones so that a page of requested + * order is allocatable. + */ + int zoneid; + struct zone *zone; + struct compact_control cc = { + .order = 0, + .total_migrate_scanned = 0, + .total_free_scanned = 0, + .classzone_idx = 0, + .mode = MIGRATE_SYNC, + .ignore_skip_hint = true, + .gfp_mask = GFP_KERNEL, + }; + trace_mm_compaction_kcompactd_wake(pgdat->node_id, 0, -1); + + migrate_prep(); + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + unsigned long pfn = ULONG_MAX; + int limit; + + zone = &pgdat->node_zones[zoneid]; + if (!populated_zone(zone)) + continue; + + kcompactd_init_cc(&cc, zone); + cc.free_pfn = pageblock_start_pfn(zone_end_pfn(zone) - 1); + limit = zone->nr_compact; + while (zone->nr_compact && limit--) { + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + if (zone->nr_compact) + pfn = zone->compact_queue[--zone->nr_compact]; + spin_unlock_irqrestore(&zone->lock, flags); + kcompactd_migrate_block(&cc, pfn); + } + } +} + static void kcompactd_do_work(pg_data_t *pgdat) { /* @@ -1957,7 +2049,6 @@ static void kcompactd_do_work(pg_data_t *pgdat) }; trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order, cc.classzone_idx); - count_compact_event(KCOMPACTD_WAKE); for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) { int status; @@ -1973,13 +2064,7 @@ static void kcompactd_do_work(pg_data_t *pgdat) COMPACT_CONTINUE) continue; - cc.nr_freepages = 0; - cc.nr_migratepages = 0; - cc.total_migrate_scanned = 0; - cc.total_free_scanned = 0; - cc.zone = zone; - INIT_LIST_HEAD(&cc.freepages); - INIT_LIST_HEAD(&cc.migratepages); + kcompactd_init_cc(&cc, zone); if (kthread_should_stop()) return; @@ -2025,6 +2110,19 @@ static void kcompactd_do_work(pg_data_t *pgdat) void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx) { + int i; + + /* Wake kcompact if there are compaction queue entries */ + for (i = 0; i < MAX_NR_ZONES; i++) { + struct zone *zone 
= &pgdat->node_zones[i]; + + if (!managed_zone(zone)) + continue; + + if (zone->nr_compact) + goto wake; + } + if (!order) return; @@ -2044,6 +2142,7 @@ void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx) if (!kcompactd_node_suitable(pgdat)) return; +wake: trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order, classzone_idx); wake_up_interruptible(&pgdat->kcompactd_wait); @@ -2076,6 +2175,8 @@ static int kcompactd(void *p) kcompactd_work_requested(pgdat)); psi_memstall_enter(&pflags); + count_compact_event(KCOMPACTD_WAKE); + kcompactd_do_queue(pgdat); kcompactd_do_work(pgdat); psi_memstall_leave(&pflags); } @@ -2083,6 +2184,34 @@ static int kcompactd(void *p) return 0; } +/* + * Queue a pageblock to have all movable pages migrated from. Note that + * kcompactd is not woken at this point. This assumes that kswapd has + * been woken to reclaim pages above the boosted watermark. kcompactd + * will be woken when kswapd has made progress. + */ +void kcompactd_queue_migration(struct zone *zone, struct page *page) +{ + unsigned long pfn = page_to_pfn(page) & ~(pageblock_nr_pages - 1); + int nr_queued = -1; + + /* Do not overflow the queue */ + if (zone->nr_compact == COMPACT_QUEUE_LENGTH) + goto trace; + + /* Only queue a pageblock once */ + for (nr_queued = 0; nr_queued < zone->nr_compact; nr_queued++) { + if (zone->compact_queue[nr_queued] == pfn) + return; + } + + zone->compact_queue[zone->nr_compact++] = pfn; + +trace: + trace_mm_compaction_wakeup_kcompactd_queue(zone_to_nid(zone), + zone_idx(zone), pfn, nr_queued); +} + /* * This kcompactd start function will be called by init and node-hot-add. * On node-hot-add, kcompactd will moved to proper cpus if cpus are hot-added. diff --git a/mm/migrate.c b/mm/migrate.c index f7e4bfdc13b7..2ee3c38d2269 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -164,12 +164,14 @@ void putback_movable_page(struct page *page) * built from lru, balloon, hugetlbfs page. 
See isolate_migratepages_range() * and isolate_huge_page(). */ -void putback_movable_pages(struct list_head *l) +unsigned int putback_movable_pages(struct list_head *l) { struct page *page; struct page *page2; + unsigned int nr_putback = 0; list_for_each_entry_safe(page, page2, l, lru) { + nr_putback++; if (unlikely(PageHuge(page))) { putback_active_hugepage(page); continue; @@ -195,6 +197,8 @@ void putback_movable_pages(struct list_head *l) putback_lru_page(page); } } + + return nr_putback; } /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 86a6e86c51bb..1e72f757253e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2214,6 +2214,9 @@ static bool steal_suitable_fallback(struct zone *zone, struct page *page, boost_watermark(zone, false); wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + if (start_type == MIGRATE_MOVABLE || old_block_type == MIGRATE_MOVABLE) + kcompactd_queue_migration(zone, page); + if ((alloc_flags & ALLOC_FRAGMENT_STALL) && current_order < fragment_stall_order) { return false; @@ -6457,7 +6460,11 @@ static void pgdat_init_split_queue(struct pglist_data *pgdat) {} #ifdef CONFIG_COMPACTION static void pgdat_init_kcompactd(struct pglist_data *pgdat) { + int i; + init_waitqueue_head(&pgdat->kcompactd_wait); + for (i = 0; i < MAX_NR_ZONES; i++) + pgdat->node_zones[i].nr_compact = 0; } #else static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}