[v4,2/4] mm/oom: handle remote ooms

Message ID	20211120045011.3074840-3-almasrymina@google.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
Date: Fri, 19 Nov 2021 20:50:08 -0800
In-Reply-To: <20211120045011.3074840-1-almasrymina@google.com>
Message-Id: <20211120045011.3074840-3-almasrymina@google.com>
Mime-Version: 1.0
References: <20211120045011.3074840-1-almasrymina@google.com>
Subject: [PATCH v4 2/4] mm/oom: handle remote ooms
From: Mina Almasry <almasrymina@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Mina Almasry <almasrymina@google.com>, Jonathan Corbet <corbet@lwn.net>,
	Alexander Viro <viro@zeniv.linux.org.uk>, Hugh Dickins <hughd@google.com>,
	Shuah Khan <shuah@kernel.org>, Shakeel Butt <shakeelb@google.com>,
	Greg Thelen <gthelen@google.com>, Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <willy@infradead.org>, Roman Gushchin <guro@fb.com>,
	"Theodore Ts'o" <tytso@mit.edu>, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series	Deterministic charging of shared memory
[v4,0/4] Deterministic charging of shared memory
[v4,1/4] mm: support deterministic memory charging of filesystems
[v4,2/4] mm/oom: handle remote ooms
[v4,3/4] mm, shmem: add filesystem memcg= option documentation
[v4,4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests

Commit Message

Mina Almasry Nov. 20, 2021, 4:50 a.m. UTC

On remote ooms (OOMs due to remote charging), the oom-killer will attempt
to find a task to kill in the memcg under oom. The oom-killer may be
unable to find a process to kill if there are no killable processes in
the remote memcg. In this case, the oom-killer (out_of_memory()) will return
false, and depending on the gfp, that will generally get bubbled up to
mem_cgroup_charge_mapping() as an ENOMEM.

A few considerations on how to handle this edge case:

1. memcg= is an opt-in feature, so we have some flexibility with the
   behavior that we export to userspace using this feature to carry
   out remote charges that may result in remote ooms. The critical thing
   is to document this behavior so that userspace knows what to expect
   and can handle the edge cases.

2. It is generally not desirable to kill the allocating process, because it's
   not a member of the remote memcg which is under oom, and so killing it
   will almost certainly not free any memory in the memcg under oom.

3. There are allocations that happen in pagefault paths, as well as
   those that happen in non-pagefault paths, and the error returned from
   mem_cgroup_charge_mapping() will be handled by the caller resulting
   in different behavior seen by the userspace in the pagefault and
   non-pagefault paths. For example, currently if mem_cgroup_charge_mapping()
   returns ENOMEM, the caller will generally get an ENOMEM on non-pagefault
   paths, and the caller will be stuck looping the pagefault forever in the
   pagefault path.

4. In general, it's desirable to give userspace the option to gracefully
   handle and recover from a failed remote charge rather than kill the
   process or put it into a situation that's hard to recover from.

With these considerations, the thing that makes most sense here is to
handle this edge case similarly to how we handle an ENOSPC error, and to
return ENOSPC from mem_cgroup_charge_mapping() when the remote charge
fails. This has the following desirable properties:

1. On pagefault allocations, the userspace will get a SIGBUS if the remote
   charge fails, and the userspace is able to catch this signal and handle it
   to recover gracefully as desired.

2. On non-pagefault paths, the userspace will get an ENOSPC error which
   it can also handle gracefully, if desired.

3. We would not leave the remote charging process in a looping
   pagefault (a state somewhat hard to recover from) or kill it.

Implementation notes:

1. To get the ENOSPC behavior described above, mem_cgroup_charge_mapping()
   detects whether charge_memcg() has failed with ENOMEM, and returns
   ENOSPC instead.

2. If the oom-killer is invoked and finds nothing to kill, it prints the
   "Out of memory and no killable processes..." message. This can be
   spammy if the system is executing many remote charges, and it is likely
   to be read as a scary kernel warning even though this is an expected
   edge case that we handle adequately. Therefore, on remote ooms,
   out_of_memory() returns early without printing this warning. This is
   not necessary for the functionality of the remote charges.
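
To make the ENOMEM vs. ENOSPC distinction concrete, here is a small
userspace model of the error translation. It is a sketch under one
assumption: that the pagefault path converts the charge error into a fault
result with a helper shaped like the kernel's vmf_error(), where only
-ENOMEM maps to the OOM result and every other error maps to SIGBUS. The
names fault_model, remote_charge_errno() and fault_result_for() are made up
for this illustration.

```c
#include <errno.h>

/* Stand-ins for the kernel's VM_FAULT_OOM / VM_FAULT_SIGBUS results. */
enum fault_model {
	FAULT_MODEL_OOM,	/* fault is retried / oom handling kicks in */
	FAULT_MODEL_SIGBUS,	/* faulting task receives SIGBUS */
};

/* The translation this patch applies to a failed remote charge. */
int remote_charge_errno(int ret)
{
	return ret == -ENOMEM ? -ENOSPC : ret;
}

/* Modeled on vmf_error(): only -ENOMEM maps to the OOM fault result. */
enum fault_model fault_result_for(int err)
{
	return err == -ENOMEM ? FAULT_MODEL_OOM : FAULT_MODEL_SIGBUS;
}
```

Under that assumption, an untranslated -ENOMEM ends up as the OOM fault
result (the looping pagefault described above), while the translated
-ENOSPC ends up as SIGBUS, which userspace can catch.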

Signed-off-by: Mina Almasry <almasrymina@google.com>


---

Changes in v4:
- Greatly expanded on the commit message to include all my current
thinking.
- Converted the patch to handle remote ooms similarly to ENOSPC, rather
than ENOMEM.

Changes in v3:
- Fixed build failures/warnings Reported-by: kernel test robot <lkp@intel.com>

Changes in v2:
- Moved the remote oom handling as Roman requested.
- Used mem_cgroup_from_task(current) instead of grabbing the memcg from
current->mm

---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 31 ++++++++++++++++++++++++++++++-
 mm/oom_kill.c              |  9 +++++++++
 3 files changed, 45 insertions(+), 1 deletion(-)

--
2.34.0.rc2.393.gf8c9666880-goog

Comments

Matthew Wilcox Nov. 20, 2021, 5:07 a.m. UTC | #1

On Fri, Nov 19, 2021 at 08:50:08PM -0800, Mina Almasry wrote:
> On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> to find a task to kill in the memcg under oom. The oom-killer may be
> unable to find a process to kill if there are no killable processes in
> the remote memcg. In this case, the oom-killer (out_of_memory()) will return
> false, and depending on the gfp, that will generally get bubbled up to
> mem_cgroup_charge_mapping() as an ENOMEM.

Why doesn't it try to run the shrinkers to get back some page cache /
slab cache memory from this memcg?  I understand it might not be able
to (eg if the memory is mlocked), but surely that's rare.

Mina Almasry Nov. 20, 2021, 5:31 a.m. UTC | #2

On Fri, Nov 19, 2021 at 9:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Nov 19, 2021 at 08:50:08PM -0800, Mina Almasry wrote:
> > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > to find a task to kill in the memcg under oom. The oom-killer may be
> > unable to find a process to kill if there are no killable processes in
> > the remote memcg. In this case, the oom-killer (out_of_memory()) will return
> > false, and depending on the gfp, that will generally get bubbled up to
> > mem_cgroup_charge_mapping() as an ENOMEM.
>
> Why doesn't it try to run the shrinkers to get back some page cache /
> slab cache memory from this memcg?  I understand it might not be able
> to (eg if the memory is mlocked), but surely that's rare.
>

Please correct me if I'm wrong, but AFAICT I haven't disabled any
shrinkers from running or disabled any reclaim mechanism on remote
charges. What I see in the code is that when the remote memcg runs out
of memory, try_to_free_mem_cgroup_pages() gets called as normal and we
go through the MAX_RECLAIM_RETRIES as normal.

It's only in the (rare as you point out) case where the kernel is
completely unable to find memory in the remote memcg that we need to
do the special handling that's described in this patch. Although this
situation is probably quite rare in real-world scenarios, it's easily
reproducible (and the attached test in the series does so), so my
feeling is that it needs some sane handling, which I'm attempting to
do in this patch.

Shakeel Butt Nov. 20, 2021, 7:58 a.m. UTC | #3

On Fri, Nov 19, 2021 at 8:50 PM Mina Almasry <almasrymina@google.com> wrote:
>
[...]
> +/*
> + * Returns true if current's memcg is neither memcg_under_oom nor one of its
> + * descendants, false otherwise. This is used by the oom-killer to detect
> + * ooms due to remote charging.
> + */
> +bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
> +{
> +       struct mem_cgroup *current_memcg;
> +       bool is_remote_oom;
> +
> +       if (!memcg_under_oom)
> +               return false;
> +
> +       rcu_read_lock();
> +       current_memcg = mem_cgroup_from_task(current);
> +       if (current_memcg && !css_tryget_online(&current_memcg->css))

No need to grab a reference. You can call mem_cgroup_is_descendant() within rcu.

> +               current_memcg = NULL;
> +       rcu_read_unlock();
> +
> +       if (!current_memcg)
> +               return false;
> +
> +       is_remote_oom =
> +               !mem_cgroup_is_descendant(current_memcg, memcg_under_oom);
> +       css_put(&current_memcg->css);
> +
> +       return is_remote_oom;
> +}
> +
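
For readers following the review, the check under discussion boils down to
a hierarchy walk. The following is a userspace model only, not kernel code:
struct memcg_model and both functions are hypothetical stand-ins, with a
plain parent pointer in place of the css/cgroup hierarchy, and with no
locking or reference counting at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: just a parent pointer. */
struct memcg_model {
	struct memcg_model *parent;
};

/* True if memcg is root itself or sits somewhere below root. */
bool memcg_model_is_descendant(const struct memcg_model *memcg,
			       const struct memcg_model *root)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg == root)
			return true;
	return false;
}

/*
 * Models is_remote_oom(): the oom is "remote" when the charging task's
 * memcg is outside the hierarchy of the memcg that is under oom.
 */
bool is_remote_oom_model(const struct memcg_model *current_memcg,
			 const struct memcg_model *memcg_under_oom)
{
	if (!memcg_under_oom || !current_memcg)
		return false;
	return !memcg_model_is_descendant(current_memcg, memcg_under_oom);
}
```

In this model a task in a sibling cgroup charging into the oom'ing memcg
counts as remote, while a task in the oom'ing memcg or any of its children
does not.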

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0a9b0bba5f3c8..451feebabf160 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,7 @@  int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,

 struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
 void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb);
+bool is_remote_oom(struct mem_cgroup *memcg_under_oom);

 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
@@ -1255,6 +1256,11 @@  static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq,
 {
 }

+static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	return false;
+}
+
 static inline int mem_cgroup_swapin_charge_page(struct page *page,
 			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4ba7f364c214..3e5bc2c32c9b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2668,6 +2668,35 @@  void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
 	__putname(buf);
 }

+/*
+ * Returns true if current's memcg is neither memcg_under_oom nor one of its
+ * descendants, false otherwise. This is used by the oom-killer to detect
+ * ooms due to remote charging.
+ */
+bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	struct mem_cgroup *current_memcg;
+	bool is_remote_oom;
+
+	if (!memcg_under_oom)
+		return false;
+
+	rcu_read_lock();
+	current_memcg = mem_cgroup_from_task(current);
+	if (current_memcg && !css_tryget_online(&current_memcg->css))
+		current_memcg = NULL;
+	rcu_read_unlock();
+
+	if (!current_memcg)
+		return false;
+
+	is_remote_oom =
+		!mem_cgroup_is_descendant(current_memcg, memcg_under_oom);
+	css_put(&current_memcg->css);
+
+	return is_remote_oom;
+}
+
 /*
  * Set or clear (if @memcg is NULL) charge association from file system to
  * memcg.  If @memcg != NULL, then a css reference must be held by the caller to
@@ -6814,7 +6843,7 @@  int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm,
 	if (mapping_memcg) {
 		ret = charge_memcg(folio, mapping_memcg, gfp);
 		css_put(&mapping_memcg->css);
-		return ret;
+		return ret == -ENOMEM ? -ENOSPC : ret;
 	}

 	return mem_cgroup_charge(folio, mm, gfp);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0a7e16b16b8c3..8db500b337415 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1108,6 +1108,15 @@  bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		if (is_remote_oom(oc->memcg)) {
+			/*
+			 * For remote ooms with no killable processes, return
+			 * false here without logging the warning below as we
+			 * expect the caller to handle this as they please.
+			 */
+			return false;
+		}
+
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
