From patchwork Tue Dec 17 11:29:19 2019
X-Patchwork-Submitter: Yafang Shao
X-Patchwork-Id: 11297359
From: Yafang Shao <laoar.shao@gmail.com>
To: hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
	akpm@linux-foundation.org, viro@zeniv.linux.org.uk
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	Yafang Shao <laoar.shao@gmail.com>, Roman Gushchin <guro@fb.com>,
	Chris Down <chris@chrisdown.name>, Dave Chinner <dchinner@redhat.com>
Subject: [PATCH 4/4] memcg, inode: protect page cache from freeing inode
Date: Tue, 17 Dec 2019 06:29:19 -0500
Message-Id: <1576582159-5198-5-git-send-email-laoar.shao@gmail.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1576582159-5198-1-git-send-email-laoar.shao@gmail.com>
References: <1576582159-5198-1-git-send-email-laoar.shao@gmail.com>

On my server there are some running memcgs protected by memory.{min, low},
but I found the usage of these memcgs abruptly became very small, far
below the protection limit. It confused me, and eventually I found the
cause was inode stealing: once an inode is freed, all of its page cache
is dropped with it, no matter how much page cache it has. So if we intend
to protect the page cache in a memcg, we must protect its host (the
inode) first. Otherwise the memcg protection can easily be bypassed by
freeing inodes, especially when there are big files in that memcg.

The inherent mismatch between memcg and inode is a complication. One
inode can be shared by different memcgs, but that is a very rare case.
If an inode is shared, its page cache may be charged to different
memcgs. Currently there is no perfect solution for this kind of issue,
but the inode majority-writer ownership switching can help it more or
less.
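For illustration only (this is not part of the patch, and the numbers are
made up), the standalone userspace sketch below shows the arithmetic behind
the bypass and behind the check this series adds: an inode is left on the
LRU only when dropping it, together with all of its page cache, would pull
the cgroup's usage below its protection. Values are in 4 KiB pages, as in
the kernel helpers.

#include <stdio.h>

/*
 * Standalone illustration (not kernel code) of the decision this patch
 * makes for each inode under reclaim. Freeing an inode drops all of its
 * page cache at once, so a protected cgroup can fall far below
 * memory.{min,low} in one step. All values are hypothetical, in 4 KiB
 * pages.
 */
int main(void)
{
	unsigned long cgroup_size = 1310720;	/* current usage:      ~5 GiB */
	unsigned long protect     = 1048576;	/* memory.min:         ~4 GiB */
	unsigned long nrpages     =  786432;	/* this inode's cache: ~3 GiB */

	if (nrpages + protect < cgroup_size)
		printf("inode may be reclaimed: usage stays above the protection\n");
	else
		printf("inode is protected: freeing it would drop usage below memory.min\n");

	return 0;
}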
Cc: Roman Gushchin <guro@fb.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/inode.c                 |  9 +++++++++
 include/linux/memcontrol.h | 15 +++++++++++++++
 mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  4 ++++
 4 files changed, 74 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index fef457a..b022447 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+
+	/* Page protection only works in reclaimer */
+	if (inode->i_data.nrpages && current->reclaim_state) {
+		if (mem_cgroup_inode_protected(inode)) {
+			spin_unlock(&inode->i_lock);
+			return LRU_ROTATE;
+		}
+	}
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a315c7..21338f0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,6 +247,9 @@ struct mem_cgroup {
 	unsigned int tcpmem_active : 1;
 	unsigned int tcpmem_pressure : 1;
 
+	/* Soft protection will be ignored if it's true */
+	unsigned int in_low_reclaim : 1;
+
 	int under_oom;
 
 	int	swappiness;
@@ -363,6 +366,7 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
+unsigned long mem_cgroup_inode_protected(struct inode *inode);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -850,6 +854,11 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
 	return MEMCG_PROT_NONE;
 }
 
+static inline unsigned long mem_cgroup_inode_protected(struct inode *inode)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp,
@@ -926,6 +935,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 	return NULL;
 }
 
+static inline struct mem_cgroup *
+mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 234370c..efb53f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,52 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 }
 
 /**
+ * Once an inode is freed, all of its page cache will be dropped as
+ * well, even if there is a lot of page cache. So if we intend to protect
+ * page cache in a memcg, we must protect its host (the inode) first.
+ * Otherwise the memory usage can drop abruptly if there are big files
+ * in this memcg. IOW the memcg protection can be easily bypassed by
+ * freeing inodes. We should prevent that.
+ * The inherent mismatch between memcg and inode is a complication. One
+ * inode can be shared by different memcgs, but it is a very rare case.
+ * If an inode is shared, its page cache may be charged to different
+ * memcgs. Currently there is no perfect solution to fix this kind of
+ * issue, but the inode majority-writer ownership switching can help it
+ * more or less.
+ */
+unsigned long mem_cgroup_inode_protected(struct inode *inode)
+{
+	unsigned long cgroup_size;
+	unsigned long protect = 0;
+	struct bdi_writeback *wb;
+	struct mem_cgroup *memcg;
+
+	wb = inode_to_wb(inode);
+	if (!wb)
+		goto out;
+
+	memcg = mem_cgroup_from_css(wb->memcg_css);
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out;
+
+	protect = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
+	if (!protect)
+		goto out;
+
+	cgroup_size = mem_cgroup_size(memcg);
+	/*
+	 * No need to protect this inode if the usage is still above
+	 * the limit after reclaiming this inode and all of its page
+	 * cache.
+	 */
+	if (inode->i_data.nrpages + protect < cgroup_size)
+		protect = 0;
+
+out:
+	return protect;
+}
+
+/**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
  * @mm: mm context of the victim
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3c4c2da..1cc7fc2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2666,6 +2666,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
+			memcg->in_low_reclaim = 1;
 			memcg_memory_event(memcg, MEMCG_LOW);
 			break;
 		case MEMCG_PROT_NONE:
@@ -2693,6 +2694,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
 			    sc->priority);
 
+		if (memcg->in_low_reclaim)
+			memcg->in_low_reclaim = 0;
+
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
 			   sc->nr_scanned - scanned,
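A note on the mem_cgroup_protection() call used above: the sketch below is
a standalone userspace rendering of how that helper is commonly described
around this kernel version, not the kernel code itself, and its numbers are
hypothetical. When in_low_reclaim is set (memory.low is deliberately being
overridden), only the effective memory.min floor still protects the group;
otherwise the larger of the effective min/low values applies. That is why
the patch records in_low_reclaim in struct mem_cgroup when MEMCG_LOW
reclaim starts and clears it again after shrink_slab().

#include <stdio.h>

/*
 * Standalone sketch (not kernel code): the commonly described behaviour
 * of mem_cgroup_protection(memcg, in_low_reclaim), which the new
 * mem_cgroup_inode_protected() relies on. 'emin'/'elow' stand for the
 * effective memory.min/memory.low values; all numbers are hypothetical.
 */
static unsigned long protection(unsigned long emin, unsigned long elow,
				int in_low_reclaim)
{
	if (in_low_reclaim)	/* memory.low is being overridden ... */
		return emin;	/* ... so only the hard floor remains */
	return elow > emin ? elow : emin;
}

int main(void)
{
	unsigned long emin = 262144;	/* memory.min: ~1 GiB in 4 KiB pages */
	unsigned long elow = 524288;	/* memory.low: ~2 GiB in 4 KiB pages */

	printf("normal reclaim:       protect %lu pages\n", protection(emin, elow, 0));
	printf("low-override reclaim: protect %lu pages\n", protection(emin, elow, 1));
	return 0;
}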