From patchwork Sun Nov 17 11:35:26 2019
X-Patchwork-Submitter: Hillf Danton <hdanton@sina.com>
X-Patchwork-Id: 11248081
From: Hillf Danton <hdanton@sina.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Rong Chen <rong.a.chen@intel.com>,
    linux-kernel <linux-kernel@vger.kernel.org>,
    Hillf Danton <hdanton@sina.com>
Subject: [RFC v3] memcg: add memcg lru
Date: Sun, 17 Nov 2019 19:35:26 +0800
Message-Id: <20191117113526.5640-1-hdanton@sina.com>

Currently soft limit reclaim (slr) is frozen; see
Documentation/admin-guide/cgroup-v2.rst for the reasons. This work adds
a memcg hook into kswapd's logic to bypass slr, paving the way for its
cleanup later.

After commit b23afb93d317 ("memcg: punt high overage reclaim to
return-to-userland path"), high limit breachers (hlb) are reclaimed one
after another, spiraling up through the memcg hierarchy, before
returning to userspace. The current memcg high work helps to add the
lru because we get to collect hlb at zero cost, and in particular
without changing the high work's behavior.

Then a FIFO list, which is essentially a simple copy of the page lru,
is needed to queue up hlb and to rip pages off them in round robin once
kswapd starts doing its job.

Finally, the new hook is added with slr's two problems addressed, i.e.
hierarchy-unaware reclaim and overreclaim.

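Below is a minimal user-space model of the queuing discipline described
above. It is an illustrative sketch only: struct model_memcg, the
model_* helpers and the numbers in main() are made-up stand-ins for the
kernel objects added by the patch further down, and locking as well as
the hierarchical charging of the real page counters are left out.

/*
 * Illustrative user-space model of the memcg lru described above.
 * model_add_lru()/model_pinch_lru() loosely mirror memcg_add_lru()
 * and memcg_pinch_lru() from the patch below.
 */
#include <stdio.h>

struct model_memcg {
	const char *name;
	long usage;			/* stands in for page_counter_read(&memcg->memory) */
	long high;			/* stands in for memcg->high */
	struct model_memcg *parent;
	struct model_memcg *next;	/* lru_node: FIFO linkage */
	int queued;
};

static struct model_memcg *fifo_head, *fifo_tail;

/* queue a high limit breacher at the tail, at most once */
static void model_add_lru(struct model_memcg *m)
{
	if (m->queued)
		return;
	m->queued = 1;
	m->next = NULL;
	if (fifo_tail)
		fifo_tail->next = m;
	else
		fifo_head = m;
	fifo_tail = m;
}

/* pop entries until one is found that still exceeds its high limit */
static struct model_memcg *model_pinch_lru(void)
{
	while (fifo_head) {
		struct model_memcg *m = fifo_head;

		fifo_head = m->next;
		if (!fifo_head)
			fifo_tail = NULL;
		m->queued = 0;
		if (m->usage > m->high)
			return m;
	}
	return NULL;
}

int main(void)
{
	struct model_memcg parent = { "parent", 900, 600, NULL,    NULL, 0 };
	struct model_memcg child  = { "child",  700, 300, &parent, NULL, 0 };
	struct model_memcg *victim;

	/* models the high work collecting breachers on the return-to-userland path */
	model_add_lru(&child);
	model_add_lru(&parent);

	/* models kswapd: one SWAP_CLUSTER_MAX-sized bite per visit, round robin */
	while ((victim = model_pinch_lru())) {
		victim->usage -= 32;	/* one bounded shrink_lruvec() pass */
		printf("reclaimed from %s, usage %ld (high %ld)\n",
		       victim->name, victim->usage, victim->high);
		/* requeue the nearest memcg, victim included, still over its high */
		for (struct model_memcg *m = victim; m; m = m->parent)
			if (m->usage > m->high) {
				model_add_lru(m);
				break;
			}
	}
	return 0;
}

In the kernel the page counters are charged hierarchically, so
shrinking the child also lowers the parent's usage and the queue drains
faster than this standalone model suggests.
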
Thanks to Rong Chen for testing.

V3 is based on next-20191115.

Changes since v2
- fix build error reported by kbuild test robot
- split the hook function into two parts for better round robin
- make the memcg lru able to reclaim dirty pages to cut the risk of the
  premature oom reported by the kernel test robot
- drop the change to the high work's behavior that deferred reclaim

Changes since v1
- drop MEMCG_LRU
- add a hook into kswapd's logic to bypass slr

Changes since v0
- add MEMCG_LRU in init/Kconfig
- drop changes in mm/vmscan.c
- make the memcg lru work in parallel with slr

Signed-off-by: Hillf Danton <hdanton@sina.com>
---

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -218,6 +218,8 @@ struct mem_cgroup {
 	/* Upper bound of normal memory consumption range */
 	unsigned long high;
 
+	struct list_head lru_node;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
@@ -732,6 +734,9 @@ static inline void mod_lruvec_page_state
 	local_irq_restore(flags);
 }
 
+struct mem_cgroup *mem_cgroup_reclaim_high_begin(void);
+void mem_cgroup_reclaim_high_end(struct mem_cgroup *memcg);
+
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
@@ -1123,6 +1128,14 @@ static inline void __mod_lruvec_slab_sta
 	__mod_node_page_state(page_pgdat(page), idx, val);
 }
 
+static inline struct mem_cgroup *mem_cgroup_reclaim_high_begin(void)
+{
+	return NULL;
+}
+static inline void mem_cgroup_reclaim_high_end(struct mem_cgroup *memcg)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2221,15 +2221,81 @@ static int memcg_hotplug_cpu_dead(unsign
 	return 0;
 }
 
+static DEFINE_SPINLOCK(memcg_lru_lock);
+static LIST_HEAD(memcg_lru);	/* a copy of page lru */
+
+static void memcg_add_lru(struct mem_cgroup *memcg)
+{
+	spin_lock_irq(&memcg_lru_lock);
+	if (list_empty(&memcg->lru_node))
+		list_add_tail(&memcg->lru_node, &memcg_lru);
+	spin_unlock_irq(&memcg_lru_lock);
+}
+
+static struct mem_cgroup *memcg_pinch_lru(void)
+{
+	struct mem_cgroup *memcg, *next;
+
+	spin_lock_irq(&memcg_lru_lock);
+
+	list_for_each_entry_safe(memcg, next, &memcg_lru, lru_node) {
+		list_del_init(&memcg->lru_node);
+
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			spin_unlock_irq(&memcg_lru_lock);
+			return memcg;
+		}
+	}
+	spin_unlock_irq(&memcg_lru_lock);
+
+	return NULL;
+}
+
+struct mem_cgroup *mem_cgroup_reclaim_high_begin(void)
+{
+	struct mem_cgroup *memcg, *victim;
+
+	memcg = victim = memcg_pinch_lru();
+	if (!memcg)
+		return NULL;
+
+	while ((memcg = parent_mem_cgroup(memcg)))
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			memcg_memory_event(memcg, MEMCG_HIGH);
+			memcg_add_lru(memcg);
+			break;
+		}
+
+	return victim;
+}
+
+void mem_cgroup_reclaim_high_end(struct mem_cgroup *memcg)
+{
+	while (memcg) {
+		if (page_counter_read(&memcg->memory) > memcg->high) {
+			memcg_memory_event(memcg, MEMCG_HIGH);
+			memcg_add_lru(memcg);
+			return;
+		}
+		memcg = parent_mem_cgroup(memcg);
+	}
+}
+
 static void reclaim_high(struct mem_cgroup *memcg,
 			 unsigned int nr_pages,
 			 gfp_t gfp_mask)
 {
+	bool add_lru = true;
 	do {
 		if (page_counter_read(&memcg->memory) <= memcg->high)
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
 		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		if (add_lru &&
+		    page_counter_read(&memcg->memory) > memcg->high) {
+			memcg_add_lru(memcg);
+			add_lru = false;
+		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -4953,6 +5019,7 @@ static struct mem_cgroup *mem_cgroup_all
 	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
 		goto fail;
 
+	INIT_LIST_HEAD(&memcg->lru_node);
 	INIT_WORK(&memcg->high_work, high_work_func);
 	INIT_LIST_HEAD(&memcg->oom_notify);
 	mutex_init(&memcg->thresholds_lock);
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2910,6 +2910,30 @@ static inline bool compaction_ready(stru
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }
 
+#ifdef CONFIG_MEMCG
+static void mem_cgroup_reclaim_high(struct pglist_data *pgdat,
+				    struct scan_control *sc)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_reclaim_high_begin();
+	if (memcg) {
+		unsigned long ntr = sc->nr_to_reclaim;
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
+		shrink_lruvec(lruvec, sc);
+		sc->nr_to_reclaim = ntr;
+	}
+	mem_cgroup_reclaim_high_end(memcg);
+}
+#else
+static void mem_cgroup_reclaim_high(struct pglist_data *pgdat,
+				    struct scan_control *sc)
+{
+}
+#endif
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2974,6 +2998,9 @@ static void shrink_zones(struct zonelist
 			if (zone->zone_pgdat == last_pgdat)
 				continue;
 
+			mem_cgroup_reclaim_high(zone->zone_pgdat, sc);
+			continue;
+
 			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and
@@ -3690,12 +3717,16 @@ restart:
 		if (sc.priority < DEF_PRIORITY - 2)
 			sc.may_writepage = 1;
 
+		mem_cgroup_reclaim_high(pgdat, &sc);
+		goto soft_limit_reclaim_end;
+
 		/* Call soft limit reclaim before calling shrink_node. */
 		sc.nr_scanned = 0;
 		nr_soft_scanned = 0;
 		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
 						sc.gfp_mask, &nr_soft_scanned);
 		sc.nr_reclaimed += nr_soft_reclaimed;
+soft_limit_reclaim_end:
 
 		/*
 		 * There should be no need to raise the scanning priority if