From patchwork Sun Mar 15 09:53:40 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yafang Shao <laoar.shao@gmail.com>
X-Patchwork-Id: 11438737
Return-Path: <SRS0=03jI=5A=vger.kernel.org=linux-fsdevel-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 42FAC1667
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Sun, 15 Mar 2020 07:53:03 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 227BA206E9
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Sun, 15 Mar 2020 07:53:03 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="AnNSrJxr"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727630AbgCOHxC (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Sun, 15 Mar 2020 03:53:02 -0400
Received: from mail-pf1-f195.google.com ([209.85.210.195]:33851 "EHLO
        mail-pf1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727134AbgCOHxB (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Sun, 15 Mar 2020 03:53:01 -0400
Received: by mail-pf1-f195.google.com with SMTP id 23so8006846pfj.1
        for <linux-fsdevel@vger.kernel.org>;
 Sun, 15 Mar 2020 00:53:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references;
        bh=GR+4RQ/CtYjBoLtm6xDE+j3oEJzKJmkJkF8JHfLM9+I=;
        b=AnNSrJxr3L6owNh+RpUvCyIVb9IkGboEQh51GvFU7aHaxGPwVdI06d7i4AW+6kjc10
         CU9osRaYVfKKmQfSZl9vTifThOPbfhRjv+0Pe+t8+JYZWl6GoiaNb/fWM8qICrYBHyIF
         th9YAM0yFugK7p+txE2AbIIzs3wBnjfA9zEFp62SU3R1kSZ4chkUlsmRjlJzUGr4x8nn
         PQAVcywhC699mjASDGfGlFP1itIQbJxS20v79/vv5ag89WofOWxvGSU6tw5YN8LOtQKN
         sTe9umHJ3jbWrxZH7ys2gumvWAqJrmpm1RmsKCo34cDQdMzHGk3WF66SBZ9wP1lQdzyw
         loRg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references;
        bh=GR+4RQ/CtYjBoLtm6xDE+j3oEJzKJmkJkF8JHfLM9+I=;
        b=Ih72sSf4isXi1Nee8aPISj7XncEib7sANgr3QPWK4oytWYIc5wbfNez//WpVUYhy6v
         ozyTSkeOiedBKu5pxldM0i64kqEdq2uavoVlErVviYzkn/orQ+2Eu8qVHbF2xGdbu6Df
         6pTAdG1kPcLSa7gQqbFmlTzOy024A9ms8KLWWnxGoxDtZH9E50a6fUy9z+4qyAaN0aCW
         ic6i6F30vrDktqory1n2Gqmrn1nwidBGpH9L5GyxbTBm1tyAGaHGwB+1zSQNsgd8/QW8
         NAB+JDle5uJXe9IokkLkW2UenIuRPzNnoyFDJJtGr/kv93g/7ZSwvolXWn1T95i48+YH
         TsGA==
X-Gm-Message-State: ANhLgQ1vOeWssn2XHhJz5j7JqVK2GxiYIENPbZMVrxj+jL21OfIIxtyk
        xnUZb5PsU0CsA+rPT8mo0Lc=
X-Google-Smtp-Source: 
 ADFU+vtDK7Ak+zn2clzqTJRjeDSuOpfNuPgnLW/K67KXpqO+HpI+acgrjbUDJ/GsyeXLxy2Xy2P8dw==
X-Received: by 2002:a62:be04:: with SMTP id l4mr22649023pff.234.1584258780484;
        Sun, 15 Mar 2020 00:53:00 -0700 (PDT)
Received: from master.localdomain ([203.100.54.194])
        by smtp.gmail.com with ESMTPSA id
 w11sm62592984pfn.4.2020.03.15.00.52.57
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Sun, 15 Mar 2020 00:52:59 -0700 (PDT)
From: Yafang Shao <laoar.shao@gmail.com>
To: dchinner@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org,
        vdavydov.dev@gmail.com, guro@fb.com, akpm@linux-foundation.org,
        viro@zeniv.linux.org.uk
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
        Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH v5 1/3] mm,
 list_lru: make memcg visible to lru walker isolation function
Date: Sun, 15 Mar 2020 05:53:40 -0400
Message-Id: <20200315095342.10178-2-laoar.shao@gmail.com>
X-Mailer: git-send-email 2.14.1
In-Reply-To: <20200315095342.10178-1-laoar.shao@gmail.com>
References: <20200315095342.10178-1-laoar.shao@gmail.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

The lru walker isolation function may need to know which memcg it is
walking, e.g. the inode isolation function will use the memcg to protect
inodes in a followup patch. So make the memcg visible to the lru walker
isolation function.

One thing worth emphasizing is that this patch replaces
for_each_memcg_cache_index() with for_each_mem_cgroup() in
list_lru_walk_node(). The two macros have different config dependencies:
for_each_mem_cgroup() depends on CONFIG_MEMCG while
for_each_memcg_cache_index() depends on CONFIG_MEMCG_KMEM. But as
list_lru_memcg_aware() returns false if CONFIG_MEMCG_KMEM is not
configured, the replacement is safe.
Another difference is that for_each_memcg_cache_index() excludes the
root_mem_cgroup because its kmemcg_id is -1, while for_each_mem_cgroup()
includes it. So we need to skip the root_mem_cgroup explicitly in the
for loop.

Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
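Note for reviewers (not part of the changelog): the walk order described
above can be modelled in user space as follows. This is only a minimal
sketch under stated assumptions; struct toy_memcg, toy_walk_one() and
toy_walk_node() are hypothetical stand-ins, not kernel symbols, and the
per-memcg list is reduced to an item count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel type. */
struct toy_memcg {
	bool is_root;
	unsigned long nr_items;	/* objects on this memcg's per-node list */
};

/* Walk one list, consuming at most *nr_to_walk items; returns items walked. */
static unsigned long toy_walk_one(const struct toy_memcg *memcg,
				  unsigned long *nr_to_walk)
{
	unsigned long n = memcg->nr_items < *nr_to_walk ?
			  memcg->nr_items : *nr_to_walk;

	*nr_to_walk -= n;
	return n;
}

/*
 * Model of the reworked flow: the global (root) list is walked first,
 * then every non-root memcg until the walk budget is exhausted.
 */
static unsigned long toy_walk_node(const struct toy_memcg *memcgs,
				   size_t count,
				   const struct toy_memcg *root,
				   unsigned long *nr_to_walk)
{
	unsigned long walked = toy_walk_one(root, nr_to_walk);
	size_t i;

	for (i = 0; i < count; i++) {
		/* the iterator also yields the root; it was walked above */
		if (memcgs[i].is_root)
			continue;

		if (*nr_to_walk == 0)
			break;

		walked += toy_walk_one(&memcgs[i], nr_to_walk);
	}
	return walked;
}
```

The explicit root check is what keeps the root list from being walked
twice once the iterator includes the root memcg.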
 include/linux/memcontrol.h | 21 +++++++++++++++++
 mm/list_lru.c              | 47 +++++++++++++++++++++++---------------
 mm/memcontrol.c            | 15 ------------
 3 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5c8d5..a624c423e60b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,21 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+/*
+ * Iteration constructs for visiting all cgroups (under a tree).  If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root)		\
+	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter)			\
+	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(NULL, iter, NULL))
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -945,6 +960,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+#define for_each_mem_cgroup_tree(iter, root)	\
+	for (iter = NULL; iter; )
+
+#define for_each_mem_cgroup(iter)		\
+	for (iter = NULL; iter; )
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0f1f6b06b7f3..6daa8c64d13d 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -207,11 +207,11 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 EXPORT_SYMBOL_GPL(list_lru_count_node);
 
 static unsigned long
-__list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
+__list_lru_walk_one(struct list_lru_node *nlru, struct mem_cgroup *memcg,
 		    list_lru_walk_cb isolate, void *cb_arg,
 		    unsigned long *nr_to_walk)
 {
-
+	int memcg_idx = memcg_cache_id(memcg);
 	struct list_lru_one *l;
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
@@ -273,7 +273,7 @@ list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long ret;
 
 	spin_lock(&nlru->lock);
-	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
+	ret = __list_lru_walk_one(nlru, memcg, isolate, cb_arg,
 				  nr_to_walk);
 	spin_unlock(&nlru->lock);
 	return ret;
@@ -289,7 +289,7 @@ list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
 	unsigned long ret;
 
 	spin_lock_irq(&nlru->lock);
-	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
+	ret = __list_lru_walk_one(nlru, memcg, isolate, cb_arg,
 				  nr_to_walk);
 	spin_unlock_irq(&nlru->lock);
 	return ret;
@@ -299,25 +299,34 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 list_lru_walk_cb isolate, void *cb_arg,
 				 unsigned long *nr_to_walk)
 {
-	long isolated = 0;
-	int memcg_idx;
+	struct list_lru_node *nlru;
+	struct mem_cgroup *memcg;
+	long isolated;
 
-	isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
-				      nr_to_walk);
-	if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
-		for_each_memcg_cache_index(memcg_idx) {
-			struct list_lru_node *nlru = &lru->node[nid];
+	/* iterate the global lru first */
+	isolated = list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
+				     nr_to_walk);
 
-			spin_lock(&nlru->lock);
-			isolated += __list_lru_walk_one(nlru, memcg_idx,
-							isolate, cb_arg,
-							nr_to_walk);
-			spin_unlock(&nlru->lock);
+	if (!list_lru_memcg_aware(lru))
+		goto out;
 
-			if (*nr_to_walk <= 0)
-				break;
-		}
+	nlru = &lru->node[nid];
+	for_each_mem_cgroup(memcg) {
+		/* already scanned the root memcg above */
+		if (mem_cgroup_is_root(memcg))
+			continue;
+
+		if (*nr_to_walk <= 0)
+			break;
+
+		spin_lock(&nlru->lock);
+		isolated += __list_lru_walk_one(nlru, memcg,
+						isolate, cb_arg,
+						nr_to_walk);
+		spin_unlock(&nlru->lock);
 	}
+
+out:
 	return isolated;
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_node);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d09776cd6e10..688d51dbb731 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -222,21 +222,6 @@ enum res_type {
 /* Used for OOM nofiier */
 #define OOM_CONTROL		(0)
 
-/*
- * Iteration constructs for visiting all cgroups (under a tree).  If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root)		\
-	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 static inline bool should_force_charge(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||

From patchwork Sun Mar 15 09:53:41 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yafang Shao <laoar.shao@gmail.com>
X-Patchwork-Id: 11438741
Return-Path: <SRS0=03jI=5A=vger.kernel.org=linux-fsdevel-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 79C6114E5
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Sun, 15 Mar 2020 07:53:06 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 4FBD4206EB
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Sun, 15 Mar 2020 07:53:06 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="a6HzSxBn"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727640AbgCOHxF (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Sun, 15 Mar 2020 03:53:05 -0400
Received: from mail-pg1-f196.google.com ([209.85.215.196]:39698 "EHLO
        mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727134AbgCOHxF (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Sun, 15 Mar 2020 03:53:05 -0400
Received: by mail-pg1-f196.google.com with SMTP id b22so1723524pgb.6
        for <linux-fsdevel@vger.kernel.org>;
 Sun, 15 Mar 2020 00:53:04 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references;
        bh=r9vLV6Qq04cRRMSwOI+dLFQIqvFaYX0FB1SLrcgz+pc=;
        b=a6HzSxBnbEYIPl5ofeVIDVAmAl+juJ9bvXsP5A1bAOFzTp023UsfLTUEyd747zvH7M
         Mq/BT0PkcYddsr0awLrIqFwyHHjsTbD2+3U0FMGxiGEbVYcYhMNlMa89+NYeBVMbme9F
         gs74dco12foKmtPt0XDqi1CgoUSyOhr7KS+eTqx5m4yLI2RBG5p9ATVBG8d2Rk8cyKqU
         FFhA9v529m90kT6vYBRtUzYVK1nNsPAhgnuRJpLAP+vpmu5ysZXZp8FYLfluJIlJV/uP
         nxzBSCsRMZEF52iDxuMqQN/M9wfI0ScGzNeV/7jzYi/rtvoi/lXHV2tQaUKwp9qm1anD
         7yOg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references;
        bh=r9vLV6Qq04cRRMSwOI+dLFQIqvFaYX0FB1SLrcgz+pc=;
        b=GPP8v+AGPHrdESx4f2P2+wOPfBVDFjgIwiv8/ODiJcXYIv5+ZSOk2xpTKlKliHoQce
         7eL9yLy+tMWMV8h/FwcfFzQ7XRk7/u7EUmRvX976C9lBZ1BM6t94mz/m7mb/babQn+Rt
         IJYH3NFadpc0FP4LgvSlD9ACHSf7QoKmlqp1wq0HzqtF0C6HJDd2lA5g/lcbIR4TQcjU
         iVSs6KDGY2arkoy+3iyenD7PpYDXX8srfFWegt4Cb/ga0IHIfw157i+yIHh+ODVe1BtC
         aGc6eAFUdXBeDCcIZYD1izz4+Dk67KzBq2QkKQ0VsCN+3P57pJn6h/PUErjL1dirOZxM
         2Erw==
X-Gm-Message-State: ANhLgQ3bImD96XdCEgJ2+G4F6F7AqJ8/vsFImfjR2op+o9qFHXNL9W01
        34EACogHssr5A5yAt1iZLWE=
X-Google-Smtp-Source: 
 ADFU+vuv2Fvoe3wxjcT4absfB1RnH6gkjP/X1kgSOnsg6owh1BfecuBWq3uSBVuRh4psUGbwZ9ECVQ==
X-Received: by 2002:a63:cd12:: with SMTP id i18mr9138743pgg.98.1584258783452;
        Sun, 15 Mar 2020 00:53:03 -0700 (PDT)
Received: from master.localdomain ([203.100.54.194])
        by smtp.gmail.com with ESMTPSA id
 w11sm62592984pfn.4.2020.03.15.00.53.00
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Sun, 15 Mar 2020 00:53:02 -0700 (PDT)
From: Yafang Shao <laoar.shao@gmail.com>
To: dchinner@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org,
        vdavydov.dev@gmail.com, guro@fb.com, akpm@linux-foundation.org,
        viro@zeniv.linux.org.uk
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
        Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH v5 2/3] mm,
 shrinker: make memcg low reclaim visible to lru walker isolation function
Date: Sun, 15 Mar 2020 05:53:41 -0400
Message-Id: <20200315095342.10178-3-laoar.shao@gmail.com>
X-Mailer: git-send-email 2.14.1
In-Reply-To: <20200315095342.10178-1-laoar.shao@gmail.com>
References: <20200315095342.10178-1-laoar.shao@gmail.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

A new member, memcg_low_reclaim, is introduced in struct shrink_control.
It is copied from struct scan_control and tells the shrinker whether the
reclaim session is under memcg low reclaim or not. To pass it down,
shrink_slab() now takes the scan_control instead of separate gfp_mask and
priority arguments.
The followup patch will use this new member.

Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
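Note for reviewers (not part of the changelog): a minimal user-space
sketch of the plumbing above. The toy_* structs and helpers are
hypothetical simplifications of scan_control and shrink_control, which
have many more fields in the kernel.

```c
#include <stdbool.h>

/* Simplified stand-in for struct scan_control (assumption). */
struct toy_scan_control {
	unsigned int gfp_mask;
	int priority;
	bool memcg_low_reclaim;
};

/* Simplified stand-in for struct shrink_control (assumption). */
struct toy_shrink_control {
	unsigned int gfp_mask;
	int nid;
	bool memcg_low_reclaim;
};

/* Copy the per-session state the shrinker needs out of the scan control. */
static struct toy_shrink_control
toy_make_shrink_control(const struct toy_scan_control *sc, int nid)
{
	struct toy_shrink_control shrinkctl = {
		.gfp_mask = sc->gfp_mask,
		.nid = nid,
		.memcg_low_reclaim = sc->memcg_low_reclaim,
	};

	return shrinkctl;
}

/* The scan target is the object count shifted right by the priority. */
static unsigned long toy_scan_target(unsigned long nr_objects,
				     const struct toy_scan_control *sc)
{
	return nr_objects >> sc->priority;
}
```

A scan control that only sets gfp_mask, as drop_slab_node() does in this
patch, leaves priority at 0 and memcg_low_reclaim at false, so the whole
cache is targeted and no low-reclaim state is reported.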
 include/linux/shrinker.h |  3 +++
 mm/vmscan.c              | 27 ++++++++++++++++-----------
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..dc42ae57e8dc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -31,6 +31,9 @@ struct shrink_control {
 
 	/* current memcg being shrunk (for memcg aware shrinkers) */
 	struct mem_cgroup *memcg;
+
+	/* set during memcg low reclaim; copied from struct scan_control */
+	bool memcg_low_reclaim;
 };
 
 #define SHRINK_STOP (~0UL)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 876370565455..385750840979 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -625,10 +625,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 
 /**
  * shrink_slab - shrink slab caches
- * @gfp_mask: allocation context
- * @nid: node whose slab caches to target
  * @memcg: memory cgroup whose slab caches to target
- * @priority: the reclaim priority
+ * @sc: scan_control struct for this reclaim session
+ * @nid: node whose slab caches to target
  *
  * Call the shrink functions to age shrinkable caches.
  *
@@ -638,15 +637,18 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
  * @memcg specifies the memory cgroup to target. Unaware shrinkers
  * are called only if it is the root cgroup.
  *
- * @priority is sc->priority, we take the number of objects and >> by priority
- * in order to get the scan target.
+ * @sc is the scan_control struct, we take the number of objects
+ * and >> by sc->priority in order to get the scan target.
  *
  * Returns the number of reclaimed slab objects.
  */
-static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
-				 struct mem_cgroup *memcg,
-				 int priority)
+static unsigned long shrink_slab(struct mem_cgroup *memcg,
+				 struct scan_control *sc,
+				 int nid)
 {
+	bool memcg_low_reclaim = sc->memcg_low_reclaim;
+	gfp_t gfp_mask = sc->gfp_mask;
+	int priority = sc->priority;
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
 
@@ -668,6 +670,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 			.gfp_mask = gfp_mask,
 			.nid = nid,
 			.memcg = memcg,
+			.memcg_low_reclaim = memcg_low_reclaim,
 		};
 
 		ret = do_shrink_slab(&sc, shrinker, priority);
@@ -694,6 +697,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 void drop_slab_node(int nid)
 {
 	unsigned long freed;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+	};
 
 	do {
 		struct mem_cgroup *memcg = NULL;
@@ -701,7 +707,7 @@ void drop_slab_node(int nid)
 		freed = 0;
 		memcg = mem_cgroup_iter(NULL, NULL, NULL);
 		do {
-			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+			freed += shrink_slab(memcg, &sc, nid);
 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 	} while (freed > 10);
 }
@@ -2673,8 +2679,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		shrink_lruvec(lruvec, sc);
 
-		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
-			    sc->priority);
+		shrink_slab(memcg, sc, pgdat->node_id);
 
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,

From patchwork Sun Mar 15 09:53:42 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yafang Shao <laoar.shao@gmail.com>
X-Patchwork-Id: 11438745
Return-Path: <SRS0=03jI=5A=vger.kernel.org=linux-fsdevel-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2E13592A
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Sun, 15 Mar 2020 07:53:09 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 03C53206E9
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Sun, 15 Mar 2020 07:53:09 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="K/ju9+Hw"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727649AbgCOHxI (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Sun, 15 Mar 2020 03:53:08 -0400
Received: from mail-pf1-f196.google.com ([209.85.210.196]:33713 "EHLO
        mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1727134AbgCOHxH (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Sun, 15 Mar 2020 03:53:07 -0400
Received: by mail-pf1-f196.google.com with SMTP id n7so8003950pfn.0
        for <linux-fsdevel@vger.kernel.org>;
 Sun, 15 Mar 2020 00:53:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=from:to:cc:subject:date:message-id:in-reply-to:references;
        bh=cqqtyHIeT4vfGVExTGiwALcTdHhyTSr2VPDKQxL1EV0=;
        b=K/ju9+HwkvMQ8RIQcSAyzUw5L+xOv8h/u7e4SKXbAigpSjEJlNZ4VzcqmotFdKTJzi
         MPiFN0U15K7LcdRCMJ5v83DeBnNTH7AYFTcmo4WuiPf31nWn7aYZXZ6qyIYB+hZMSVr9
         ucSqoal9HM18xd8qslsiBrNFzSElyQapeLHCe5QdiUmAr5Jo8TDO7fDtUE9bQ2KZW+9P
         BUwNt62uSh6niPIYPdwZDSaVa4l1KdDW1/gjxskXzAgbX6/ckpWhf6VCSWl+dKypnx2/
         zF+PB0s9jzeLV5lI2CmiMFT9VwHaEkEM7MiWq7uySJEIc6MYwV6qC9KBEhbpmaHEDndP
         zOhw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references;
        bh=cqqtyHIeT4vfGVExTGiwALcTdHhyTSr2VPDKQxL1EV0=;
        b=j6t5tVMAhgvu2IkNbUczus72tQOiKYMom8hMeCzdDWYXXUr9/lHExMZSyUULfV8vRJ
         WbKkri9hNZ3LB01D+r7+jHZ2LVhV5D0QQ65CDWlWcs+DDs9E6iRecwY6qajB/Zc8jTaj
         0B8UAgpQIWXuIhyo/zGNIskOzwgfSPYvJ8LjkGfJ6VivVbJkhcXI7/0plR0ZU9+3Ek5U
         INi3L7ZpiRAajvCPx2VOLNyVij2VhWKzVw/40a5UTfT9iX8pQxw/1Q/pVbks4d/Axuqt
         tKN0LCQ6i4n0A+slalk/J+eMelVvgCfZAThbN7Emmft3GLHoMAr8PGWdiAAU5ToO0L2M
         cCMw==
X-Gm-Message-State: ANhLgQ0KU9J8f/gNC5oa7c3JsHaTu1MV5c27fJD4UInQDVVLj2l9/N3Z
        l49t90ZHs8Tfda7tnsmcaAc=
X-Google-Smtp-Source: 
 ADFU+vtQyTwtKmcMpijmAwbRMTvUiIvUIaWTMRDklRnXTyVMJTEcTHqGnWSogjKN8TVv7nDyL3P6MA==
X-Received: by 2002:a63:27c5:: with SMTP id
 n188mr4284055pgn.345.1584258786402;
        Sun, 15 Mar 2020 00:53:06 -0700 (PDT)
Received: from master.localdomain ([203.100.54.194])
        by smtp.gmail.com with ESMTPSA id
 w11sm62592984pfn.4.2020.03.15.00.53.03
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Sun, 15 Mar 2020 00:53:05 -0700 (PDT)
From: Yafang Shao <laoar.shao@gmail.com>
To: dchinner@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org,
        vdavydov.dev@gmail.com, guro@fb.com, akpm@linux-foundation.org,
        viro@zeniv.linux.org.uk
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
        Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH v5 3/3] inode: protect page cache from freeing inode
Date: Sun, 15 Mar 2020 05:53:42 -0400
Message-Id: <20200315095342.10178-4-laoar.shao@gmail.com>
X-Mailer: git-send-email 2.14.1
In-Reply-To: <20200315095342.10178-1-laoar.shao@gmail.com>
References: <20200315095342.10178-1-laoar.shao@gmail.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On my server there are some running MEMCGs protected by memory.{min, low},
but I found that the usage of these MEMCGs abruptly became very small,
far less than the protection limit. It confused me, and finally I found
that this was caused by inode stealing.
Once an inode is freed, all the page cache pages belonging to it are
dropped as well, no matter how many there are. So if we intend to protect
the page cache in a memcg, we must protect its host (the inode) first.
Otherwise the memcg protection can easily be bypassed by freeing inodes,
especially if there are big files in this memcg.

Suppose we have a memcg whose state is,
        memory.current = 1024M
        memory.min = 512M
and in this memcg there's an inode with 800M of page cache.
Once this memcg is scanned by kswapd or another regular reclaimer,
    kswapd <<<< It can be any of the regular reclaimers.
        shrink_node_memcgs
            switch (mem_cgroup_protected()) <<<< Not protected
                case MEMCG_PROT_NONE:  <<<< Will scan this memcg
                        break;
            shrink_lruvec() <<<< Reclaim the page caches
            shrink_slab()   <<<< It may free this inode and drop all its
                                 page caches (800M).
Freeing that inode alone would leave at most 224M charged, well below
memory.min.
So we must protect the inode first if we want to protect page caches.
Note that this inode may be a cold inode (at the tail of the list lru),
because memcg protection protects all slabs and page cache pages whether
they are cold or hot. IOW, this is a memcg-protection-specific issue.

The inherent mismatch between memcg and inode is troublesome. One inode can
be shared by different MEMCGs, though this is a very rare case. If an inode
is shared, its page cache pages may be charged to different MEMCGs.
Currently there's no perfect solution for this kind of issue, but the
inode majority-writer ownership switching can mitigate it to some extent.

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
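Note for reviewers (not part of the changelog): the decision made by
memcg_can_reclaim_inode() can be modelled in user space as below. This is
a sketch under stated assumptions: the toy_* names are hypothetical, and
toy_protection() assumes that mem_cgroup_protection() returns the
effective min during low reclaim and the larger of the effective min and
low otherwise.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the memcg fields this check depends on. */
struct toy_memcg {
	bool is_root;
	unsigned long emin;	/* effective memory.min (assumption) */
	unsigned long elow;	/* effective memory.low (assumption) */
};

/* Assumed model of mem_cgroup_protection(); see the note above. */
static unsigned long toy_protection(const struct toy_memcg *memcg,
				    bool low_reclaim)
{
	if (low_reclaim)
		return memcg->emin;
	return memcg->emin > memcg->elow ? memcg->emin : memcg->elow;
}

/*
 * Returns true if the inode may be reclaimed. in_reclaim models
 * "current->reclaim_state is set", which is false for drop_caches.
 */
static bool toy_can_reclaim_inode(unsigned long nrpages, bool in_reclaim,
				  const struct toy_memcg *memcg,
				  bool low_reclaim)
{
	if (!nrpages)
		return true;	/* no page cache to protect */

	if (!in_reclaim)
		return true;	/* explicit drop, not memory reclaim */

	if (!memcg || memcg->is_root)
		return true;	/* the root memcg is never protected */

	/* an inode with page cache in a protected memcg is kept */
	return toy_protection(memcg, low_reclaim) == 0;
}
```

In this model an inode with cached pages in a memcg that only has
memory.low set is kept during normal reclaim but becomes reclaimable once
the session enters memcg low reclaim.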
 fs/inode.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 73 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..6373cd09a06d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -55,6 +55,12 @@
  *   inode_hash_lock
  */
 
+struct inode_isolate_control {
+	struct list_head *freeable;
+	struct mem_cgroup *memcg;	/* derived from shrink_control */
+	bool memcg_low_reclaim;		/* derived from scan_control */
+};
+
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
@@ -714,6 +720,59 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	return busy;
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * Once an inode is freed, all the page cache pages belonging to it are
+ * dropped as well, however many there are. So if we intend to protect the
+ * page cache in a memcg, we must protect its host (the inode) first.
+ * Otherwise the memcg protection can easily be bypassed by freeing inodes,
+ * especially if there are big files in this memcg.
+ * Note that the page cache pages may already be charged to the memcg
+ * while the inode hasn't been added to this memcg yet. In this case,
+ * the inode is not protected.
+ * The inherent mismatch between memcg and inode is troublesome. One inode
+ * can be shared by different MEMCGs, though this is a very rare case. If
+ * an inode is shared, its page cache pages may be charged to different
+ * MEMCGs. Currently there's no perfect solution for this kind of issue,
+ * but the inode majority-writer ownership switching can mitigate it to
+ * some extent.
+ */
+static bool memcg_can_reclaim_inode(struct inode *inode,
+				    struct inode_isolate_control *iic)
+{
+	unsigned long protection;
+	struct mem_cgroup *memcg;
+	bool reclaimable = true;
+
+	if (!inode->i_data.nrpages)
+		goto out;
+
+	/* Exclude inodes freed via drop_caches */
+	if (!current->reclaim_state)
+		goto out;
+
+	memcg = iic->memcg;
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out;
+
+	protection = mem_cgroup_protection(memcg, iic->memcg_low_reclaim);
+	if (!protection)
+		goto out;
+
+	if (inode->i_data.nrpages)
+		reclaimable = false;
+
+out:
+	return reclaimable;
+}
+#else /* CONFIG_MEMCG_KMEM */
+static bool memcg_can_reclaim_inode(struct inode *inode,
+				    struct inode_isolate_control *iic)
+{
+	return true;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 /*
  * Isolate the inode from the LRU in preparation for freeing it.
  *
@@ -732,8 +791,9 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 static enum lru_status inode_lru_isolate(struct list_head *item,
 		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
 {
-	struct list_head *freeable = arg;
-	struct inode	*inode = container_of(item, struct inode, i_lru);
+	struct inode_isolate_control *iic = arg;
+	struct list_head *freeable = iic->freeable;
+	struct inode *inode = container_of(item, struct inode, i_lru);
 
 	/*
 	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
@@ -742,6 +802,11 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+	if (!memcg_can_reclaim_inode(inode, iic)) {
+		spin_unlock(&inode->i_lock);
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -799,9 +864,14 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(freeable);
 	long freed;
+	struct inode_isolate_control iic = {
+		.freeable = &freeable,
+		.memcg = sc->memcg,
+		.memcg_low_reclaim = sc->memcg_low_reclaim,
+	};
 
 	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
-				     inode_lru_isolate, &freeable);
+				     inode_lru_isolate, &iic);
 	dispose_list(&freeable);
 	return freed;
 }