[v2] memcg: allow exiting tasks to write back data to swap

Message ID	20241212115754.38f798b3@fangorn (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Date: Thu, 12 Dec 2024 11:57:54 -0500 From: Rik van Riel <riel@surriel.com> To: Yosry Ahmed <yosryahmed@google.com> Cc: Balbir Singh <balbirs@nvidia.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, hakeel Butt <shakeel.butt@linux.dev>, Muchun Song <muchun.song@linux.dev>, Andrew Morton <akpm@linux-foundation.org>, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham <nphamcs@gmail.com> Subject: [PATCH v2] memcg: allow exiting tasks to write back data to swap Message-ID: <20241212115754.38f798b3@fangorn> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v2] memcg: allow exiting tasks to write back data to swap \| expand [v2] memcg: allow exiting tasks to write back data to swap

Rik van Riel Dec. 12, 2024, 4:57 p.m. UTC

A task already in exit can get stuck trying to allocate pages, if its
cgroup is at the memory.max limit, the cgroup is using zswap, but
zswap writeback is enabled, and the remaining memory in the cgroup is
not compressible.

This seems like an unlikely confluence of events, but it can happen
quite easily if a cgroup is OOM killed due to exceeding its memory.max
limit, and all the tasks in the cgroup are trying to exit simultaneously.

When this happens, it can sometimes take hours for tasks to exit,
as they are all trying to squeeze things into zswap to bring the group's
memory consumption below memory.max.

Allowing these exiting programs to push some memory from their own
cgroup into swap allows them to quickly bring the cgroup's memory
consumption below memory.max, and exit in seconds rather than hours.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
v2: use mm_match_cgroup as suggested by Yosry Ahmed

 mm/memcontrol.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Yosry Ahmed Dec. 12, 2024, 5:06 p.m. UTC | #1

On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
>
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is enabled, and the remaining memory in the cgroup is
> not compressible.
>
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
>
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
>
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>

Thanks for sending a v2.

I still think maybe this needs to be fixed on the memcg side, at least
by not making exiting tasks try really hard to reclaim memory to the
point where this becomes a problem. IIUC there could be other reasons
why reclaim may take too long, but maybe not as pathological as this
case to be fair. I will let the memcg maintainers chime in for this.

If there's a fundamental reason why this cannot be fixed on the memcg
side, I don't object to this change.

Nhat, any objections on your end? I think your fleet workloads were
the first users of this interface. Does this break their expectations?

> ---
> v2: use mm_match_cgroup as suggested by Yosry Ahmed
>
>  mm/memcontrol.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..ba1cd9c04a02 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,18 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>         if (!zswap_is_enabled())
>                 return true;
>
> +       /*
> +        * Always allow exiting tasks to push data to swap. A process in
> +        * the middle of exit cannot get OOM killed, but may need to push
> +        * uncompressible data to swap in order to get the cgroup memory
> +        * use below the limit, and make progress with the exit.
> +        */
> +       if (unlikely(current->flags & PF_EXITING)) {
> +               struct mm_struct *mm = READ_ONCE(current->mm);
> +               if (mm && mm_match_cgroup(mm, memcg))
> +                       return true;
> +       }
> +
>         for (; memcg; memcg = parent_mem_cgroup(memcg))
>                 if (!READ_ONCE(memcg->zswap_writeback))
>                         return false;
> --
> 2.47.0
>
>

Shakeel Butt Dec. 12, 2024, 5:51 p.m. UTC | #2

On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> >
> > A task already in exit can get stuck trying to allocate pages, if its
> > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > zswap writeback is enabled, and the remaining memory in the cgroup is
> > not compressible.
> >
> > This seems like an unlikely confluence of events, but it can happen
> > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> >
> > When this happens, it can sometimes take hours for tasks to exit,
> > as they are all trying to squeeze things into zswap to bring the group's
> > memory consumption below memory.max.
> >
> > Allowing these exiting programs to push some memory from their own
> > cgroup into swap allows them to quickly bring the cgroup's memory
> > consumption below memory.max, and exit in seconds rather than hours.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> Thanks for sending a v2.
> 
> I still think maybe this needs to be fixed on the memcg side, at least
> by not making exiting tasks try really hard to reclaim memory to the
> point where this becomes a problem. IIUC there could be other reasons
> why reclaim may take too long, but maybe not as pathological as this
> case to be fair. I will let the memcg maintainers chime in for this.
> 
> If there's a fundamental reason why this cannot be fixed on the memcg
> side, I don't object to this change.
> 
> Nhat, any objections on your end? I think your fleet workloads were
> the first users of this interface. Does this break their expectations?
> 

Let me give my personal take. This seems like a stopgap or a quick hack
to resolve the very specific situation happening in real world. I am ok
with having this solution but only temporarily. The reason why I think
this is short term fix or a quick hack is because it is not specifically
solving the fundamental issue here. The same situation can reoccur if
let's say the swap storage was slow or stuck or contended. A somewhat
similar situation is when there are lot of unreclaimable memory either
through pinning or maybe mlock.

The fundamental issue is that the exiting process (killed by oomd or
simple exit) has to allocated memory but the cgroup is at limit and the
reclaim is very very slow.

I can see attacking this issue with multiple angles. Some mixture of
reusing kernel's oom reaper and some buffer to allow the exiting process
to go over the limit. Let's brainstorm and explore this direction.

In the meantime, I think we can have this stopgap solution.

Rik van Riel Dec. 12, 2024, 6:02 p.m. UTC | #3

On Thu, 2024-12-12 at 09:51 -0800, Shakeel Butt wrote:
> 
> The fundamental issue is that the exiting process (killed by oomd or
> simple exit) has to allocated memory but the cgroup is at limit and
> the
> reclaim is very very slow.
> 
> I can see attacking this issue with multiple angles.

Besides your proposed ideas, I suppose we could also limit
the gfp_mask of an exiting reclaimer with eg. __GFP_NORETRY,
but I do not know how effective that would be, since a single
pass through the memory reclaim code was still taking dozens
of seconds when I traced the "stuck" workloads.

Nhat Pham Dec. 12, 2024, 6:11 p.m. UTC | #4

On Thu, Dec 12, 2024 at 9:07 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
>
> I still think maybe this needs to be fixed on the memcg side, at least
> by not making exiting tasks try really hard to reclaim memory to the
> point where this becomes a problem. IIUC there could be other reasons
> why reclaim may take too long, but maybe not as pathological as this
> case to be fair. I will let the memcg maintainers chime in for this.

FWIW, we did have some internal discussions regarding this. We think
that for now, this is a good-enough stopgap solution - it remains to
be seen whether other more "permanent fixes" are needed, or will not
also regress other scenarios. And they are definitely more complicated
than the solution Rik is proposing here :)

>
> If there's a fundamental reason why this cannot be fixed on the memcg
> side, I don't object to this change.
>
> Nhat, any objections on your end? I think your fleet workloads were
> the first users of this interface. Does this break their expectations?

I had similar concerns as yours, so we rolled the solution to the
hosts in trouble. AFAICS:

1. It allowed the pathological workload to make forward progress with
the exiting procedure.

2. The other workloads (who also have memory.zswap.writeback disabled)
did not observe any regression.

Nhat Pham Dec. 12, 2024, 6:18 p.m. UTC | #5

On Thu, Dec 12, 2024 at 10:03 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Thu, 2024-12-12 at 09:51 -0800, Shakeel Butt wrote:
> >
> > The fundamental issue is that the exiting process (killed by oomd or
> > simple exit) has to allocated memory but the cgroup is at limit and
> > the
> > reclaim is very very slow.
> >
> > I can see attacking this issue with multiple angles.
>
> Besides your proposed ideas, I suppose we could also limit
> the gfp_mask of an exiting reclaimer with eg. __GFP_NORETRY,
> but I do not know how effective that would be, since a single
> pass through the memory reclaim code was still taking dozens
> of seconds when I traced the "stuck" workloads.

I know we already discussed this, but it'd be nice if we can let the
exiting task go ahead with the page fault and bypass the memory
limits, if the page fault is crucial for it to make forward progress.
Not sure how feasible that is, and how to decide which page fault is
really crucial though :)

For the pathological memory.zswap.writeback disabling case in
particular, another thing we can do here is to make these
incompressible pages ineligible for further reclaim attempt, either by
putting them on a non-reclaim LRU, or putting them in the zswap LRU to
maintain total ordering of the LRUs. That way we can move on to other
sources (slab caches for example) quicker, or fail earlier? That said,
it remains to be seen what will happen if these incompressible pages
are literally all that are left...?

I'm biased to this idea though, because they have other benefits.
Maybe I'm just looking for excuses to revive the project ;)

Johannes Weiner Dec. 12, 2024, 6:30 p.m. UTC | #6

On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> >
> > A task already in exit can get stuck trying to allocate pages, if its
> > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > zswap writeback is enabled, and the remaining memory in the cgroup is
> > not compressible.
> >
> > This seems like an unlikely confluence of events, but it can happen
> > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> >
> > When this happens, it can sometimes take hours for tasks to exit,
> > as they are all trying to squeeze things into zswap to bring the group's
> > memory consumption below memory.max.
> >
> > Allowing these exiting programs to push some memory from their own
> > cgroup into swap allows them to quickly bring the cgroup's memory
> > consumption below memory.max, and exit in seconds rather than hours.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> Thanks for sending a v2.
> 
> I still think maybe this needs to be fixed on the memcg side, at least
> by not making exiting tasks try really hard to reclaim memory to the
> point where this becomes a problem. IIUC there could be other reasons
> why reclaim may take too long, but maybe not as pathological as this
> case to be fair. I will let the memcg maintainers chime in for this.
> 
> If there's a fundamental reason why this cannot be fixed on the memcg
> side, I don't object to this change.
> 
> Nhat, any objections on your end? I think your fleet workloads were
> the first users of this interface. Does this break their expectations?

Yes, I don't think we can do this, unfortunately :( There can be a
variety of reasons for why a user might want to prohibit disk swap for
a certain cgroup, and we can't assume it's okay to make exceptions.

There might also not *be* any disk swap to overflow into after Nhat's
virtual swap patches. Presumably zram would still have the issue too.

So I'm also inclined to think this needs a reclaim/memcg-side fix. We
have a somewhat tumultous history of policy in that space:

commit 7775face207922ea62a4e96b9cd45abfdc7b9840
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date:   Tue Mar 5 15:46:47 2019 -0800

    memcg: killed threads should not invoke memcg OOM killer

allowed dying tasks to simply force all charges and move on. This
turned out to be too aggressive; there were instances of exiting,
uncontained memcg tasks causing global OOMs. This lead to that:

commit a4ebf1b6ca1e011289677239a2a361fde4a88076
Author: Vasily Averin <vasily.averin@linux.dev>
Date:   Fri Nov 5 13:38:09 2021 -0700

    memcg: prohibit unconditional exceeding the limit of dying tasks

which reverted the bypass rather thoroughly. Now NO dying tasks, *not
even OOM victims*, can force charges. I am not sure this is correct,
either:

If we return -ENOMEM to an OOM victim in a fault, the fault handler
will re-trigger OOM, which will find the existing OOM victim and do
nothing, then restart the fault. This is a memory deadlock. The page
allocator gives OOM victims access to reserves for that reason.

Actually, it looks even worse. For some reason we're not triggering
OOM from dying tasks:

        ret = task_is_dying() || out_of_memory(&oc);

Even though dying tasks are in no way privileged or allowed to exit
expediently. Why shouldn't they trigger the OOM killer like anybody
else trying to allocate memory?

As it stands, it seems we have dying tasks getting trapped in an
endless fault->reclaim cycle; with no access to the OOM killer and no
access to reserves. Presumably this is what's going on here?

I think we want something like this:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53db98d2c4a1..be6b6e72bde5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_margin(memcg) >= (1 << order))
 		goto unlock;

-	/*
-	 * A few threads which were not waiting at mutex_lock_killable() can
-	 * fail to bail out. Therefore, check again after holding oom_lock.
-	 */
-	ret = task_is_dying() || out_of_memory(&oc);
+	ret = out_of_memory(&oc);

 unlock:
 	mutex_unlock(&oom_lock);
@@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (unlikely(current->flags & PF_MEMALLOC))
 		goto force;

+	if (unlikely(tsk_is_oom_victim(current)))
+		goto force;
+
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;

Roman Gushchin Dec. 12, 2024, 6:31 p.m. UTC | #7

On Thu, Dec 12, 2024 at 11:57:54AM -0500, Rik van Riel wrote:
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is enabled, and the remaining memory in the cgroup is
> not compressible.

Is it about a single task or groups of tasks or the entire cgroup?
If former, why it's a problem? A tight memcg limit can slow things down
in general and I don't see why we should treat the exit() path differently.

If it's about the entire cgroup and we have essentially a deadlock,
I feel like we need to look into the oom reaper side.

Alternatively we can allow to go over the limit in this case, but only
assuming all tasks in the cgroup are going to die and no new task can
enter it.

Thanks!

Rik van Riel Dec. 12, 2024, 8 p.m. UTC | #8

On Thu, 12 Dec 2024 18:31:57 +0000
Roman Gushchin <roman.gushchin@linux.dev> wrote:

> Is it about a single task or groups of tasks or the entire cgroup?
> If former, why it's a problem? A tight memcg limit can slow things down
> in general and I don't see why we should treat the exit() path differently.
> 
I think the exit path does need to be treated a little differently,
since this exit may be the only way such a cgroup can free up memory.

> If it's about the entire cgroup and we have essentially a deadlock,
> I feel like we need to look into the oom reaper side.

You mean something like the below?

I have not tested it yet, because we don't have any stuck
cgroups right now among the workloads that I'm monitoring.

---8<---

From c0e545fd45bd3ee24cd79b3d3e9b375e968ef460 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@surriel.com>
Date: Thu, 12 Dec 2024 14:50:49 -0500
Subject: [PATCH] memcg,oom: speed up reclaim for exiting tasks

When a memcg reaches its memory limit, and reclaim becomes unavailable
or slow for some reason, for example only zswap is available, but zswap
writeback is disabled, it can take a long time for tasks to exit, and
for the cgroup to get back to normal (or cleaned up).

Speed up memcg reclaim for exiting tasks by limiting how much work
reclaim does, and by invoking the OOM reaper if reclaim does not
free up enough memory to allow the task to make progress.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/oom.h |  8 ++++++++
 mm/memcontrol.c     | 11 +++++++++++
 mm/oom_kill.c       |  6 +-----
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e0fc6931ce9..b2d9cf936664 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -111,4 +111,12 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+#ifdef CONFIG_MMU
+extern void queue_oom_reaper(struct task_struct *tsk);
+#else
+static intern void queue_oom_reaper(struct task_struct *tsk)
+{
+}
+#endif
+
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..21f42758d430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2231,6 +2231,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
+	if (unlikely(current->flags & PF_EXITING))
+		gfp_mask |= __GFP_NORETRY;
+
 	memcg_memory_event(mem_over_limit, MEMCG_MAX);
 	raised_max_event = true;
 
@@ -2284,6 +2287,14 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto retry;
 	}
 nomem:
+	/*
+	 * We ran out of memory while inside exit. Maybe the OOM
+	 * reaper can help reduce cgroup memory use and get things
+	 * moving again?
+	 */
+	if (unlikely(current->flags & PF_EXITING))
+		queue_oom_reaper(current);
+
 	/*
 	 * Memcg doesn't have a dedicated reserve for atomic
 	 * allocations. But like the global atomic pool, we need to
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1c485beb0b93..8d5278e45c63 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -686,7 +686,7 @@ static void wake_oom_reaper(struct timer_list *timer)
  * before the exit path is able to wake the futex waiters.
  */
 #define OOM_REAPER_DELAY (2*HZ)
-static void queue_oom_reaper(struct task_struct *tsk)
+void queue_oom_reaper(struct task_struct *tsk)
 {
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -735,10 +735,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static inline void queue_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif /* CONFIG_MMU */
 
 /**

Shakeel Butt Dec. 12, 2024, 9:35 p.m. UTC | #9

On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > >
> > > A task already in exit can get stuck trying to allocate pages, if its
> > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > zswap writeback is enabled, and the remaining memory in the cgroup is
> > > not compressible.
> > >
> > > This seems like an unlikely confluence of events, but it can happen
> > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > >
> > > When this happens, it can sometimes take hours for tasks to exit,
> > > as they are all trying to squeeze things into zswap to bring the group's
> > > memory consumption below memory.max.
> > >
> > > Allowing these exiting programs to push some memory from their own
> > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > consumption below memory.max, and exit in seconds rather than hours.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > 
> > Thanks for sending a v2.
> > 
> > I still think maybe this needs to be fixed on the memcg side, at least
> > by not making exiting tasks try really hard to reclaim memory to the
> > point where this becomes a problem. IIUC there could be other reasons
> > why reclaim may take too long, but maybe not as pathological as this
> > case to be fair. I will let the memcg maintainers chime in for this.
> > 
> > If there's a fundamental reason why this cannot be fixed on the memcg
> > side, I don't object to this change.
> > 
> > Nhat, any objections on your end? I think your fleet workloads were
> > the first users of this interface. Does this break their expectations?
> 
> Yes, I don't think we can do this, unfortunately :( There can be a
> variety of reasons for why a user might want to prohibit disk swap for
> a certain cgroup, and we can't assume it's okay to make exceptions.
> 
> There might also not *be* any disk swap to overflow into after Nhat's
> virtual swap patches. Presumably zram would still have the issue too.

Very good points.

> 
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultous history of policy in that space:
> 
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date:   Tue Mar 5 15:46:47 2019 -0800
> 
>     memcg: killed threads should not invoke memcg OOM killer
> 
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This lead to that:
> 
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date:   Fri Nov 5 13:38:09 2021 -0700
> 
>     memcg: prohibit unconditional exceeding the limit of dying tasks
> 
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:
> 
> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault. This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.
> 
> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
> 
>         ret = task_is_dying() || out_of_memory(&oc);
> 
> Even though dying tasks are in no way privileged or allowed to exit
> expediently. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?

This is a very good point and actually out_of_memory() will mark the
dying process as oom victim and put it in the oom reaper's list which
should help further in such situation.

> 
> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?
> 
> I think we want something like this:

The following patch looks good to me. Let's test this out (hopefully Rik
will be able to find a live impacted machine) and move forward with this
fix.

> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_margin(memcg) >= (1 << order))
>  		goto unlock;
>  
> -	/*
> -	 * A few threads which were not waiting at mutex_lock_killable() can
> -	 * fail to bail out. Therefore, check again after holding oom_lock.
> -	 */
> -	ret = task_is_dying() || out_of_memory(&oc);
> +	ret = out_of_memory(&oc);
>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (unlikely(current->flags & PF_MEMALLOC))
>  		goto force;
>  
> +	if (unlikely(tsk_is_oom_victim(current)))
> +		goto force;
> +
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;
>

Yosry Ahmed Dec. 12, 2024, 9:41 p.m. UTC | #10

On Thu, Dec 12, 2024 at 1:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is enabled, and the remaining memory in the cgroup is
> > > > not compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > >
> > > Thanks for sending a v2.
> > >
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > >
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > >
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> >
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons for why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> >
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
>
> Very good points.
>
> >
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultous history of policy in that space:
> >
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date:   Tue Mar 5 15:46:47 2019 -0800
> >
> >     memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This lead to that:
> >
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date:   Fri Nov 5 13:38:09 2021 -0700
> >
> >     memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> >
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> >
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> >         ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expediently. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
>
> This is a very good point and actually out_of_memory() will mark the
> dying process as oom victim and put it in the oom reaper's list which
> should help further in such situation.
>
> >
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> >
> > I think we want something like this:
>
> The following patch looks good to me. Let's test this out (hopefully Rik
> will be able to find a live impacted machine) and move forward with this
> fix.

I agree with this too. As Shakeel mentioned, this seemed like a
stopgap and not an actual fix for the underlying problem. Johannes
further outlined how the stopgap can be problematic.

Let's try to fix this on the memcg/reclaim/OOM side, and properly
treat dying tasks instead of forcing them into potentially super slow
reclaim paths. Hopefully Johannes's patch fixes this.

Roman Gushchin Dec. 13, 2024, 12:32 a.m. UTC | #11

On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > >
> > > A task already in exit can get stuck trying to allocate pages, if its
> > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > zswap writeback is enabled, and the remaining memory in the cgroup is
> > > not compressible.
> > >
> > > This seems like an unlikely confluence of events, but it can happen
> > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > >
> > > When this happens, it can sometimes take hours for tasks to exit,
> > > as they are all trying to squeeze things into zswap to bring the group's
> > > memory consumption below memory.max.
> > >
> > > Allowing these exiting programs to push some memory from their own
> > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > consumption below memory.max, and exit in seconds rather than hours.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > 
> > Thanks for sending a v2.
> > 
> > I still think maybe this needs to be fixed on the memcg side, at least
> > by not making exiting tasks try really hard to reclaim memory to the
> > point where this becomes a problem. IIUC there could be other reasons
> > why reclaim may take too long, but maybe not as pathological as this
> > case to be fair. I will let the memcg maintainers chime in for this.
> > 
> > If there's a fundamental reason why this cannot be fixed on the memcg
> > side, I don't object to this change.
> > 
> > Nhat, any objections on your end? I think your fleet workloads were
> > the first users of this interface. Does this break their expectations?
> 
> Yes, I don't think we can do this, unfortunately :( There can be a
> variety of reasons for why a user might want to prohibit disk swap for
> a certain cgroup, and we can't assume it's okay to make exceptions.
> 
> There might also not *be* any disk swap to overflow into after Nhat's
> virtual swap patches. Presumably zram would still have the issue too.
> 
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultous history of policy in that space:
> 
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date:   Tue Mar 5 15:46:47 2019 -0800
> 
>     memcg: killed threads should not invoke memcg OOM killer
> 
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This lead to that:
> 
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date:   Fri Nov 5 13:38:09 2021 -0700
> 
>     memcg: prohibit unconditional exceeding the limit of dying tasks
> 
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:
> 
> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault. This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.
> 
> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
> 
>         ret = task_is_dying() || out_of_memory(&oc);
> 
> Even though dying tasks are in no way privileged or allowed to exit
> expediently. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?
> 
> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?
> 
> I think we want something like this:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_margin(memcg) >= (1 << order))
>  		goto unlock;
>  
> -	/*
> -	 * A few threads which were not waiting at mutex_lock_killable() can
> -	 * fail to bail out. Therefore, check again after holding oom_lock.
> -	 */
> -	ret = task_is_dying() || out_of_memory(&oc);
> +	ret = out_of_memory(&oc);

I like the idea, but at first glance it might reintroduce the problem
fixed by 7775face2079 ("memcg: killed threads should not invoke memcg OOM killer").

Roman Gushchin Dec. 13, 2024, 12:49 a.m. UTC | #12

On Thu, Dec 12, 2024 at 03:00:03PM -0500, Rik van Riel wrote:
> On Thu, 12 Dec 2024 18:31:57 +0000
> Roman Gushchin <roman.gushchin@linux.dev> wrote:
> 
> > Is it about a single task or groups of tasks or the entire cgroup?
> > If former, why it's a problem? A tight memcg limit can slow things down
> > in general and I don't see why we should treat the exit() path differently.
> > 
> I think the exit path does need to be treated a little differently,
> since this exit may be the only way such a cgroup can free up memory.

It's true if all tasks in a cgroup are exiting. Otherwise there are
other options (at least in theory).

> 
> > If it's about the entire cgroup and we have essentially a deadlock,
> > I feel like we need to look into the oom reaper side.
> 
> You mean something like the below?
> 
> I have not tested it yet, because we don't have any stuck
> cgroups right now among the workloads that I'm monitoring.

Yeah, something like this...

Thanks!

Balbir Singh Dec. 13, 2024, 2:54 a.m. UTC | #13

On 12/13/24 07:00, Rik van Riel wrote:
> On Thu, 12 Dec 2024 18:31:57 +0000
> Roman Gushchin <roman.gushchin@linux.dev> wrote:
> 
>> Is it about a single task or groups of tasks or the entire cgroup?
>> If former, why it's a problem? A tight memcg limit can slow things down
>> in general and I don't see why we should treat the exit() path differently.
>>
> I think the exit path does need to be treated a little differently,
> since this exit may be the only way such a cgroup can free up memory.
> 
>> If it's about the entire cgroup and we have essentially a deadlock,
>> I feel like we need to look into the oom reaper side.
> 
> You mean something like the below?
> 
> I have not tested it yet, because we don't have any stuck
> cgroups right now among the workloads that I'm monitoring.
> 
> ---8<---
> 
> From c0e545fd45bd3ee24cd79b3d3e9b375e968ef460 Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@surriel.com>
> Date: Thu, 12 Dec 2024 14:50:49 -0500
> Subject: [PATCH] memcg,oom: speed up reclaim for exiting tasks
> 
> When a memcg reaches its memory limit, and reclaim becomes unavailable
> or slow for some reason, for example only zswap is available, but zswap
> writeback is disabled, it can take a long time for tasks to exit, and
> for the cgroup to get back to normal (or cleaned up).
> 
> Speed up memcg reclaim for exiting tasks by limiting how much work
> reclaim does, and by invoking the OOM reaper if reclaim does not
> free up enough memory to allow the task to make progress.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/oom.h |  8 ++++++++
>  mm/memcontrol.c     | 11 +++++++++++
>  mm/oom_kill.c       |  6 +-----
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 1e0fc6931ce9..b2d9cf936664 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -111,4 +111,12 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +#ifdef CONFIG_MMU
> +extern void queue_oom_reaper(struct task_struct *tsk);
> +#else
> +static intern void queue_oom_reaper(struct task_struct *tsk)
> +{
> +}
> +#endif
> +
>  #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..21f42758d430 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2231,6 +2231,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!gfpflags_allow_blocking(gfp_mask))
>  		goto nomem;
>  
> +	if (unlikely(current->flags & PF_EXITING))
> +		gfp_mask |= __GFP_NORETRY;
> +

if (task_is_dying())
	gfp_mask |= __GFP_NORETRY


>  	memcg_memory_event(mem_over_limit, MEMCG_MAX);
>  	raised_max_event = true;
>  
> @@ -2284,6 +2287,14 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto retry;
>  	}
>  nomem:
> +	/*
> +	 * We ran out of memory while inside exit. Maybe the OOM
> +	 * reaper can help reduce cgroup memory use and get things
> +	 * moving again?
> +	 */
> +	if (unlikely(current->flags & PF_EXITING))
> +		queue_oom_reaper(current);
> +

I am not sure this is helpful without task_will_free_mem(), the dying
task shouldn't get sent to the OOM killer when we run out of memory.

I did notice that we have heuristics around task_is_dying() and
passed_oom, sounds like the end result of your changes would be to
ignore the heuristics of passed_oom


Balbir Singh.

Johannes Weiner Dec. 13, 2024, 4:42 a.m. UTC | #14

On Fri, Dec 13, 2024 at 12:32:11AM +0000, Roman Gushchin wrote:
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is enabled, and the remaining memory in the cgroup is
> > > > not compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > 
> > > Thanks for sending a v2.
> > > 
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > > 
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > > 
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> > 
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons for why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> > 
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
> > 
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultous history of policy in that space:
> > 
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date:   Tue Mar 5 15:46:47 2019 -0800
> > 
> >     memcg: killed threads should not invoke memcg OOM killer
> > 
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This lead to that:
> > 
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date:   Fri Nov 5 13:38:09 2021 -0700
> > 
> >     memcg: prohibit unconditional exceeding the limit of dying tasks
> > 
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> > 
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> > 
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> > 
> >         ret = task_is_dying() || out_of_memory(&oc);
> > 
> > Even though dying tasks are in no way privileged or allowed to exit
> > expediently. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
> > 
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> > 
> > I think we want something like this:
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 53db98d2c4a1..be6b6e72bde5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_margin(memcg) >= (1 << order))
> >  		goto unlock;
> >  
> > -	/*
> > -	 * A few threads which were not waiting at mutex_lock_killable() can
> > -	 * fail to bail out. Therefore, check again after holding oom_lock.
> > -	 */
> > -	ret = task_is_dying() || out_of_memory(&oc);
> > +	ret = out_of_memory(&oc);
> 
> I like the idea, but at first glance it might reintroduce the problem
> fixed by 7775face2079 ("memcg: killed threads should not invoke memcg OOM killer").

The race and warning pointed out in the changelog might have been
sufficiently mitigated by this more recent commit:

commit 1378b37d03e8147c67fde60caf0474ea879163d8
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Thu Aug 6 23:22:08 2020 -0700

    memcg, oom: check memcg margin for parallel oom

If not, another possibility would be this:

	ret = tsk_is_oom_victim(task) || out_of_memory(&oc);

But we should probably first restore reliable forward progress on
dying tasks, then worry about the spurious printk later.

Michal Hocko Dec. 16, 2024, 3:39 p.m. UTC | #15

On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
[...]
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultous history of policy in that space:
> 
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date:   Tue Mar 5 15:46:47 2019 -0800
> 
>     memcg: killed threads should not invoke memcg OOM killer
> 
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This lead to that:
> 
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date:   Fri Nov 5 13:38:09 2021 -0700
> 
>     memcg: prohibit unconditional exceeding the limit of dying tasks
> 
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:

IIRC the reason going this route was a lack of per-memcg oom reserves.
Global oom victims are getting some slack because the amount of reserves
be bound. This is not the case for memcgs though.

> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault.

IIRC the task will handle the pending SIGKILL if the #PF fails. If the
charge happens from the exit path then we rely on ENOMEM returned from
gup as a signal to back off. Do we have any caller that keeps retrying
on ENOMEM?

> This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.

> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
> 
>         ret = task_is_dying() || out_of_memory(&oc);
> 
> Even though dying tasks are in no way privileged or allowed to exit
> expediently. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?

Good question! I suspect this early bail out is based on an assumption
that a dying task will free up the memory soon so oom killer is
unnecessary.

> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?

As mentioned above this seems really surprising and it would indicate
that something in the exit path would keep retrying when getting ENOMEM
from gup or GFP_ACCOUNT allocation. GFP_NOFAIL requests are allowed to
over-consume.

> I think we want something like this:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_margin(memcg) >= (1 << order))
>  		goto unlock;
>  
> -	/*
> -	 * A few threads which were not waiting at mutex_lock_killable() can
> -	 * fail to bail out. Therefore, check again after holding oom_lock.
> -	 */
> -	ret = task_is_dying() || out_of_memory(&oc);
> +	ret = out_of_memory(&oc);

I am not against this as it would allow to do an async oom_reaper memory
reclaim in the worst case. This could potentially reintroduce the "No
victim available" case described by 7775face2079 ("memcg: killed threads
should not invoke memcg OOM killer") but that seemed to be a very
specific and artificial usecase IIRC.

>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (unlikely(current->flags & PF_MEMALLOC))
>  		goto force;
>  
> +	if (unlikely(tsk_is_oom_victim(current)))
> +		goto force;
> +
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;

This is more problematic as it doesn't cap a potential runaway and
eventual global OOM which is not really great. In the past this could be
possible through vmalloc which didn't bail out early for killed tasks.
That risk has been mitigated by dd544141b9eb ("vmalloc: back off when
the current task is OOM-killed"). I would like to keep some sort of
protection from those runaways. Whether that is a limited "reserve" for
oom victims that would be per memcg or do no let them consume above the
hard limit at all. Fundamentally a limited reserves doesn't solve the
underlying problem, it just make it less likely so the latter would be
preferred by me TBH.

Before we do that it would be really good to understand the source of
those retries. Maybe I am missing something really obvious but those
shouldn't really happen.

[v2] memcg: allow exiting tasks to write back data to swap

Commit Message

Comments

Patch