From patchwork Thu Nov 7 20:53:34 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Johannes Weiner X-Patchwork-Id: 11233765 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 332F6139A for ; Thu, 7 Nov 2019 20:53:55 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D840421D6C for ; Thu, 7 Nov 2019 20:53:54 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="GWYhGD3J" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D840421D6C Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 2BAE46B0008; Thu, 7 Nov 2019 15:53:47 -0500 (EST) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 1821D6B000A; Thu, 7 Nov 2019 15:53:47 -0500 (EST) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E9FD66B000C; Thu, 7 Nov 2019 15:53:46 -0500 (EST) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0250.hostedemail.com [216.40.44.250]) by kanga.kvack.org (Postfix) with ESMTP id CD87F6B0008 for ; Thu, 7 Nov 2019 15:53:46 -0500 (EST) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 75CA02C12 for ; Thu, 7 Nov 2019 20:53:46 +0000 (UTC) X-FDA: 76130682852.19.head12_3c0f6390af42 X-Spam-Summary: 2,0,0,227f2b18ee7c4281,d41d8cd98f00b204,hannes@cmpxchg.org,:akpm@linux-foundation.org:aryabinin@virtuozzo.com:surenb@google.com:shakeelb@google.com:riel@surriel.com:mhocko@suse.com::cgroups@vger.kernel.org:linux-kernel@vger.kernel.org:kernel-team@fb.com,RULES_HIT:4:41:69:355:379:541:800:960:966:973:988:989:1260:1311:1314:1345:1359:1437:1515:1605:1730:1747:1777:1792:2196:2198:2199:2200:2393:2559:2562:2639:2690:2693:2731:2892:2897:3138:3139:3140:3141:3142:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4385:4605:5007:6119:6261:6630:6653:7875:7903:7904:8603:9010:9036:9592:10004:10241:11026:11233:11473:11658:11914:12043:12291:12296:12297:12438:12517:12519:12555:12679:12683:12895:12986:13161:13215:13229:13894:14096:14394:21080:21324:21444:21450:21451:21627:21740:30054:30075,0,RBL:209.85.210.174:@cmpxchg.org:.lbl8.mailshell.net-62.2.0.100 66.100.201.201,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fp,MSBL:0,DNSBL:neutral,Custom_ru les:0:0: X-HE-Tag: head12_3c0f6390af42 X-Filterd-Recvd-Size: 16413 Received: from mail-pf1-f174.google.com (mail-pf1-f174.google.com [209.85.210.174]) by imf18.hostedemail.com (Postfix) with ESMTP for ; Thu, 7 Nov 2019 20:53:45 +0000 (UTC) Received: by mail-pf1-f174.google.com with SMTP id x28so3306477pfo.6 for ; Thu, 07 Nov 2019 12:53:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=qRdJZCWoQVHU6tGp0In1fm+6GfN/BUtPT0fclmu2C2U=; b=GWYhGD3JI1LXsbYlNn35A2X+xOoFx+1OOLYxlWuBeImf06c93d9Z8Kdiktu84Wr1iV tMgGg7b39ls+CaM2dqK1rDii62ErZ16eovA6/vgmm0VBhUjsBTDD5Y5aWUreH0TFm6e0 QQ9ClH/4gtW975VCP2fWU2nhinR18YFol6HYbica7go0SG5iR+dA4vT3aUnQlQkClC41 0rApupvCOzLGr91IUSFDkKyMwkGKvvuSQINpLWdA7+yF3WKW43MX9ReNFUUdw1Qb8/7L 0n8d/67uiGWjNnCNbxnK5zo1U+fw8yMrPvwK0YSU/R5QIPyH4pVh8rUsYLvbl5SvAwBi DCYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=qRdJZCWoQVHU6tGp0In1fm+6GfN/BUtPT0fclmu2C2U=; b=KVwGlcTkpkMVpg913VQu6uWDZYu7upBtyKLbCGcZQupv/NVId1YQRQqFfTXdBGoQpB f5xGVZNZ9AQGaoAc+kPnXknU/B4uSkgt5bFvky1yK3lmkx4Ki/5w6ChdC6wi45s7+V+b TKfcsHMcscvzShpol/e0sYdc56wBN448k0PlJSEQIrF1wY9OizrCIpmZZ6RhSwSts2vU usyayZk+meKedKMN0ttnmwC3xwloXViESkvvDQOl/e6Ab8yoYKw+/Tzw1A3g2ExkWvli +warfTZFPvLQCyw5ndDvzDVJZzXtUXgfvxAQxIIVBNVWgtl07gfPss118pWAdhguGHt3 n6cA== X-Gm-Message-State: APjAAAUuGpGfh4BEln4rZEDUQr/2tL2zVIvLhXb9WEEhYrEOEmWgK3F2 AaMp7L9xgl2lUytfvEo3gtPfNQ== X-Google-Smtp-Source: APXvYqx+X69jNbs63NHgyD8oInFU7UoznYXAatLJ3EFFK0G+CUVeBkErCFLiMONS3esFbame6buMyA== X-Received: by 2002:a63:f34f:: with SMTP id t15mr7242229pgj.453.1573160024328; Thu, 07 Nov 2019 12:53:44 -0800 (PST) Received: from localhost ([2620:10d:c090:200::3:3792]) by smtp.gmail.com with ESMTPSA id y26sm3970584pfo.76.2019.11.07.12.53.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 07 Nov 2019 12:53:43 -0800 (PST) From: Johannes Weiner To: Andrew Morton Cc: Andrey Ryabinin , Suren Baghdasaryan , Shakeel Butt , Rik van Riel , Michal Hocko , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 3/3] mm: vmscan: enforce inactive:active ratio at the reclaim root Date: Thu, 7 Nov 2019 12:53:34 -0800 Message-Id: <20191107205334.158354-4-hannes@cmpxchg.org> X-Mailer: git-send-email 2.24.0 In-Reply-To: <20191107205334.158354-1-hannes@cmpxchg.org> References: <20191107205334.158354-1-hannes@cmpxchg.org> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: We split the LRU lists into inactive and an active parts to maximize workingset protection while allowing just enough inactive cache space to faciltate readahead and writeback for one-off file accesses (e.g. a linear scan through a file, or logging); or just enough inactive anon to maintain recent reference information when reclaim needs to swap. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, inactive:active size decisions are done on a per-cgroup level. As a result, we'll reclaim a cgroup's workingset when it doesn't have cold pages, even when one of its siblings has plenty of it that should be reclaimed first. For example: workload A has 50M worth of hot cache but doesn't do any one-off file accesses; meanwhile, parallel workload B scans files and rarely accesses the same page twice. If these workloads were to run in an uncgrouped system, A would be protected from the high rate of cache faults from B. But if they were put in parallel cgroups for memory accounting purposes, B's fast cache fault rate would push out the hot cache pages of A. This is unexpected and undesirable - the "scan resistance" of the page cache is broken. This patch moves inactive:active size balancing decisions to the root of reclaim - the same level where the LRU order is established. It does this by looking at the recursize size of the inactive and the active file sets of the cgroup subtree at the beginning of the reclaim cycle, and then making a decision - scan or skip active pages - that applies throughout the entire run and to every cgroup involved. With that in place, in the test above, the VM will recognize that there are plenty of inactive pages in the combined cache set of workloads A and B and prefer the one-off cache in B over the hot pages in A. The scan resistance of the cache is restored. Signed-off-by: Johannes Weiner Reviewed-by: Suren Baghdasaryan Reviewed-by: Shakeel Butt --- include/linux/mmzone.h | 4 +- mm/vmscan.c | 185 ++++++++++++++++++++++++++--------------- 2 files changed, 118 insertions(+), 71 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 7a09087e8c77..454a230ad417 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -229,12 +229,12 @@ enum lru_list { #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++) -static inline int is_file_lru(enum lru_list lru) +static inline bool is_file_lru(enum lru_list lru) { return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE); } -static inline int is_active_lru(enum lru_list lru) +static inline bool is_active_lru(enum lru_list lru) { return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 527617ee9b73..df859b1d583c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -79,6 +79,13 @@ struct scan_control { */ struct mem_cgroup *target_mem_cgroup; + /* Can active pages be deactivated as part of reclaim? */ +#define DEACTIVATE_ANON 1 +#define DEACTIVATE_FILE 2 + unsigned int may_deactivate:2; + unsigned int force_deactivate:1; + unsigned int skipped_deactivate:1; + /* Writepage batching in laptop mode; RECLAIM_WRITE */ unsigned int may_writepage:1; @@ -101,6 +108,9 @@ struct scan_control { /* One of the zones is ready for compaction */ unsigned int compaction_ready:1; + /* There is easily reclaimable cold cache in the current node */ + unsigned int cache_trim_mode:1; + /* The file pages on the current node are dangerously low */ unsigned int file_is_tiny:1; @@ -2154,6 +2164,20 @@ unsigned long reclaim_pages(struct list_head *page_list) return nr_reclaimed; } +static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, + struct lruvec *lruvec, struct scan_control *sc) +{ + if (is_active_lru(lru)) { + if (sc->may_deactivate & (1 << is_file_lru(lru))) + shrink_active_list(nr_to_scan, lruvec, sc, lru); + else + sc->skipped_deactivate = 1; + return 0; + } + + return shrink_inactive_list(nr_to_scan, lruvec, sc, lru); +} + /* * The inactive anon list should be small enough that the VM never has * to do too much work. @@ -2182,59 +2206,25 @@ unsigned long reclaim_pages(struct list_head *page_list) * 1TB 101 10GB * 10TB 320 32GB */ -static bool inactive_list_is_low(struct lruvec *lruvec, bool file, - struct scan_control *sc, bool trace) +static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru) { - enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE; - struct pglist_data *pgdat = lruvec_pgdat(lruvec); - enum lru_list inactive_lru = file * LRU_FILE; + enum lru_list active_lru = inactive_lru + LRU_ACTIVE; unsigned long inactive, active; unsigned long inactive_ratio; - struct lruvec *target_lruvec; - unsigned long refaults; unsigned long gb; - inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx); - active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx); + inactive = lruvec_page_state(lruvec, inactive_lru); + active = lruvec_page_state(lruvec, active_lru); - /* - * When refaults are being observed, it means a new workingset - * is being established. Disable active list protection to get - * rid of the stale workingset quickly. - */ - target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); - refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); - if (file && target_lruvec->refaults != refaults) { - inactive_ratio = 0; - } else { - gb = (inactive + active) >> (30 - PAGE_SHIFT); - if (gb) - inactive_ratio = int_sqrt(10 * gb); - else - inactive_ratio = 1; - } - - if (trace) - trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx, - lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive, - lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active, - inactive_ratio, file); + gb = (inactive + active) >> (30 - PAGE_SHIFT); + if (gb) + inactive_ratio = int_sqrt(10 * gb); + else + inactive_ratio = 1; return inactive * inactive_ratio < active; } -static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, - struct lruvec *lruvec, struct scan_control *sc) -{ - if (is_active_lru(lru)) { - if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true)) - shrink_active_list(nr_to_scan, lruvec, sc, lru); - return 0; - } - - return shrink_inactive_list(nr_to_scan, lruvec, sc, lru); -} - enum scan_balance { SCAN_EQUAL, SCAN_FRACT, @@ -2296,28 +2286,17 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, /* * If the system is almost out of file pages, force-scan anon. - * But only if there are enough inactive anonymous pages on - * the LRU. Otherwise, the small LRU gets thrashed. */ - if (sc->file_is_tiny && - !inactive_list_is_low(lruvec, false, sc, false) && - lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, - sc->reclaim_idx) >> sc->priority) { + if (sc->file_is_tiny) { scan_balance = SCAN_ANON; goto out; } /* - * If there is enough inactive page cache, i.e. if the size of the - * inactive list is greater than that of the active list *and* the - * inactive list actually has some pages to scan on this priority, we - * do not reclaim anything from the anonymous working set right now. - * Without the second condition we could end up never scanning an - * lruvec even if it has plenty of old anonymous pages unless the - * system is under heavy pressure. + * If there is enough inactive page cache, we do not reclaim + * anything from the anonymous working right now. */ - if (!inactive_list_is_low(lruvec, true, sc, false) && - lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) { + if (sc->cache_trim_mode) { scan_balance = SCAN_FILE; goto out; } @@ -2582,7 +2561,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. */ - if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true)) + if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); } @@ -2722,6 +2701,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) unsigned long nr_reclaimed, nr_scanned; struct lruvec *target_lruvec; bool reclaimable = false; + unsigned long file; target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); @@ -2731,6 +2711,44 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) nr_reclaimed = sc->nr_reclaimed; nr_scanned = sc->nr_scanned; + /* + * Target desirable inactive:active list ratios for the anon + * and file LRU lists. + */ + if (!sc->force_deactivate) { + unsigned long refaults; + + if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) + sc->may_deactivate |= DEACTIVATE_ANON; + else + sc->may_deactivate &= ~DEACTIVATE_ANON; + + /* + * When refaults are being observed, it means a new + * workingset is being established. Deactivate to get + * rid of any stale active pages quickly. + */ + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE); + if (refaults != target_lruvec->refaults || + inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) + sc->may_deactivate |= DEACTIVATE_FILE; + else + sc->may_deactivate &= ~DEACTIVATE_FILE; + } else + sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; + + /* + * If we have plenty of inactive file pages that aren't + * thrashing, try to reclaim those first before touching + * anonymous pages. + */ + file = lruvec_page_state(target_lruvec, LRU_INACTIVE_FILE); + if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) + sc->cache_trim_mode = 1; + else + sc->cache_trim_mode = 0; + /* * Prevent the reclaimer from falling into the cache trap: as * cache pages start out inactive, every cache fault will tip @@ -2741,10 +2759,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) * anon pages. Try to detect this based on file LRU size. */ if (!cgroup_reclaim(sc)) { - unsigned long file; - unsigned long free; - int z; unsigned long total_high_wmark = 0; + unsigned long free, anon; + int z; free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); file = node_page_state(pgdat, NR_ACTIVE_FILE) + @@ -2758,7 +2775,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) total_high_wmark += high_wmark_pages(zone); } - sc->file_is_tiny = file + free <= total_high_wmark; + /* + * Consider anon: if that's low too, this isn't a + * runaway file reclaim problem, but rather just + * extreme pressure. Reclaim as per usual then. + */ + anon = node_page_state(pgdat, NR_INACTIVE_ANON); + + sc->file_is_tiny = + file + free <= total_high_wmark && + !(sc->may_deactivate & DEACTIVATE_ANON) && + anon >> sc->priority; } shrink_node_memcgs(pgdat, sc); @@ -3062,9 +3089,27 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, if (sc->compaction_ready) return 1; + /* + * We make inactive:active ratio decisions based on the node's + * composition of memory, but a restrictive reclaim_idx or a + * memory.low cgroup setting can exempt large amounts of + * memory from reclaim. Neither of which are very common, so + * instead of doing costly eligibility calculations of the + * entire cgroup subtree up front, we assume the estimates are + * good, and retry with forcible deactivation if that fails. + */ + if (sc->skipped_deactivate) { + sc->priority = initial_priority; + sc->force_deactivate = 1; + sc->skipped_deactivate = 0; + goto retry; + } + /* Untapped cgroup reserves? Don't OOM, retry. */ if (sc->memcg_low_skipped) { sc->priority = initial_priority; + sc->force_deactivate = 0; + sc->skipped_deactivate = 0; sc->memcg_low_reclaim = 1; sc->memcg_low_skipped = 0; goto retry; @@ -3347,18 +3392,20 @@ static void age_active_anon(struct pglist_data *pgdat, struct scan_control *sc) { struct mem_cgroup *memcg; + struct lruvec *lruvec; if (!total_swap_pages) return; + lruvec = mem_cgroup_lruvec(NULL, pgdat); + if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON)) + return; + memcg = mem_cgroup_iter(NULL, NULL, NULL); do { - struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - - if (inactive_list_is_low(lruvec, false, sc, true)) - shrink_active_list(SWAP_CLUSTER_MAX, lruvec, - sc, LRU_ACTIVE_ANON); - + lruvec = mem_cgroup_lruvec(memcg, pgdat); + shrink_active_list(SWAP_CLUSTER_MAX, lruvec, + sc, LRU_ACTIVE_ANON); memcg = mem_cgroup_iter(NULL, memcg, NULL); } while (memcg); }