diff mbox series

[memcg,v3,2/3] mm, oom: do not trigger out_of_memory from the #PF

Message ID f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com (mailing list archive)
State New
Headers show
Series memcg: prohibit unconditional exceeding the limit of dying tasks | expand

Commit Message

Vasily Averin Oct. 23, 2021, 1:20 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.

The later is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.

Moreover if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restictions and
therefore might oom kill a task from a different NUMA domain.

This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.

[VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into
a real disaster. This has been broken since kmem accounting has been
introduced for cgroup v1 (3.8). There was no kmem specific reclaim
for the separate limit so the only way to handle kmem hard limit
was to return with ENOMEM.
In upstream the problem will be fixed by removing the outdated kmem limit,
however stable and LTS kernels cannot do it and are still affected.
This patch fixes the problem and should be backported into stable/LTS.]

Cc: stable@vger.kernel.org

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: hook with fatal_signal_pending() has beem moved into a separate patch,
    improved patch description, removed "Fixed" mark.
---
 mm/oom_kill.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

Comments

Tetsuo Handa Oct. 23, 2021, 3:01 p.m. UTC | #1
On 2021/10/23 22:20, Vasily Averin wrote:
>  /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
>   */

Excuse me for a stupid question. I consider

  if (!mutex_trylock(&oom_lock))
    return;
  out_of_memory(&oc);
  mutex_unlock(&oom_lock);

here as the last resort (safeguard) when neither __alloc_pages_may_oom()
nor mem_cgroup_out_of_memory() can make progress. This patch says

  let the oom killer triggered from the allocation context care
  about the global OOM.

but what if the OOM killer cannot be invoked from the allocation context?
Is there a guarantee that all memory allocations which might result in
VM_FAULT_OOM can invoke the OOM killer?
Vasily Averin Oct. 23, 2021, 7:15 p.m. UTC | #2
On 23.10.2021 18:01, Tetsuo Handa wrote:
> On 2021/10/23 22:20, Vasily Averin wrote:
>>  /*
>> - * The pagefault handler calls here because it is out of memory, so kill a
>> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
>> - * killing is already in progress so do nothing.
>> + * The pagefault handler calls here because some allocation has failed. We have
>> + * to take care of the memcg OOM here because this is the only safe context without
>> + * any locks held but let the oom killer triggered from the allocation context care
>> + * about the global OOM.
>>   */
> 
> Excuse me for a stupid question. I consider
> 
>   if (!mutex_trylock(&oom_lock))
>     return;
>   out_of_memory(&oc);
>   mutex_unlock(&oom_lock);
> 
> here as the last resort (safeguard) when neither __alloc_pages_may_oom()
> nor mem_cgroup_out_of_memory() can make progress. This patch says
> 
>   let the oom killer triggered from the allocation context care
>   about the global OOM.
> 
> but what if the OOM killer cannot be invoked from the allocation context?
> Is there a guarantee that all memory allocations which might result in
> VM_FAULT_OOM can invoke the OOM killer?

I don't think this question is stupid, since I asked it myself :)
You can find this discussion here:
https://lkml.org/lkml/2021/10/21/900

Let me quote it here too
:> 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
:> for exampel it can be caused by incorrect vm fault operations, 
:> (a) which can return this error without calling allocator at all.
:
:I would argue this to be a bug. How can that particular code tell
:whether the system is OOM and the oom killer is the a reasonable measure
:to take?
:
:> (b) or which can provide incorrect gfp flags and allocator can fail without execution of out_of_memory.
:
: I am not sure I can see any sensible scenario where pagefault oom killer
: would be an appropriate fix for that.
:
:> (c) This may happen on stable/LTS kernels when successful allocation was failed by hit into limit of legacy memcg-kmem contoller.
:> We'll drop it in upstream kernels, however how to handle it in old kenrels?
:
:Triggering the global oom killer for legacy kmem charge failure is
:clearly wrong. Removing oom killer from #PF would fix that problem.

I would note: (c) is not theoretical but real life problem, in this case allocation was failed without execution of OOM,
however, it is in this case that execution out_of_memory() from pagefault_out_of_memory() leads to a disaster.

Thank you,
	Vasily Averin
Michal Hocko Oct. 25, 2021, 8:04 a.m. UTC | #3
On Sun 24-10-21 00:01:07, Tetsuo Handa wrote:
> On 2021/10/23 22:20, Vasily Averin wrote:
> >  /*
> > - * The pagefault handler calls here because it is out of memory, so kill a
> > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> > - * killing is already in progress so do nothing.
> > + * The pagefault handler calls here because some allocation has failed. We have
> > + * to take care of the memcg OOM here because this is the only safe context without
> > + * any locks held but let the oom killer triggered from the allocation context care
> > + * about the global OOM.
> >   */
> 
> Excuse me for a stupid question. I consider
> 
>   if (!mutex_trylock(&oom_lock))
>     return;
>   out_of_memory(&oc);
>   mutex_unlock(&oom_lock);
> 
> here as the last resort (safeguard) when neither __alloc_pages_may_oom()
> nor mem_cgroup_out_of_memory() can make progress. This patch says
> 
>   let the oom killer triggered from the allocation context care
>   about the global OOM.
> 
> but what if the OOM killer cannot be invoked from the allocation context?
> Is there a guarantee that all memory allocations which might result in
> VM_FAULT_OOM can invoke the OOM killer?

I do not think there is any guarantee. This code has meant to be a
safeguard but it turns out to be adding more harm than a safety. There
are several scenarios mentioned in this thread where this would be
counter productive or outright wrong thing to do.

On the other hand it is hard to imagine any legitimate situation where
this would be a right thing to do. Maybe you have something more
specific in mind? What would be the legit code to rely on OOM handling
out of the line (where the details about the allocation scope is lost)?
Michal Hocko Oct. 25, 2021, 9:34 a.m. UTC | #4
On Sat 23-10-21 16:20:18, Vasily Averin wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
> 
> The later is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard to not fail
> allocations. Anyway, if the OOM killer has been already invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than allocate.
> 
> Moreover if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restictions and
> therefore might oom kill a task from a different NUMA domain.
> 
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without allocation print a
> warning that this is happening before we restart the #PF.
> 
> [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
> This is a local problem related to memcg, however, it causes unnecessary
> global OOM kills that are repeated over and over again and escalate into
> a real disaster. This has been broken since kmem accounting has been
> introduced for cgroup v1 (3.8). There was no kmem specific reclaim
> for the separate limit so the only way to handle kmem hard limit
> was to return with ENOMEM.
> In upstream the problem will be fixed by removing the outdated kmem limit,
> however stable and LTS kernels cannot do it and are still affected.
> This patch fixes the problem and should be backported into stable/LTS.]
> 
> Cc: stable@vger.kernel.org

I would be still careful about backporting to stable trees. At least
wait for a release cycle to catch potential problems before backporting.
The problem with kmem is documented and for quite a lot of time and we
haven't received a single bug report IIRC so this is likely not a real
problem people are facing out there.


> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

I think this is the right thing to do. Hopefuly we won't find some
tricky code which would depend on this behavior. If this turns out to be
the case then we will clearly learn about it by the kernel message and
we can to handle that situation more gracefully.

Maybe we will want to update the chengelog to be more specific based on
the review comments but this one should describe the problem quite well
already.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
> v2: hook with fatal_signal_pending() has beem moved into a separate patch,
>     improved patch description, removed "Fixed" mark.
> ---
>  mm/oom_kill.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1deef8c7a71b..f98954befafb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc)
>  }
>  
>  /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
>   */
>  void pagefault_out_of_memory(void)
>  {
> -	struct oom_control oc = {
> -		.zonelist = NULL,
> -		.nodemask = NULL,
> -		.memcg = NULL,
> -		.gfp_mask = 0,
> -		.order = 0,
> -	};
> +	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
>  
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
> @@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
>  	if (fatal_signal_pending(current))
>  		return;
>  
> -	if (!mutex_trylock(&oom_lock))
> -		return;
> -	out_of_memory(&oc);
> -	mutex_unlock(&oom_lock);
> +	if (__ratelimit(&pfoom_rs))
> +		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
>  }
>  
>  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> -- 
> 2.32.0
Tetsuo Handa Oct. 26, 2021, 1:56 p.m. UTC | #5
On 2021/10/25 17:04, Michal Hocko wrote:
> I do not think there is any guarantee. This code has meant to be a
> safeguard but it turns out to be adding more harm than a safety. There
> are several scenarios mentioned in this thread where this would be
> counter productive or outright wrong thing to do.

Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) + hitting legacy kmem
charge limit might be an unexpected combination?

> 
> On the other hand it is hard to imagine any legitimate situation where
> this would be a right thing to do. Maybe you have something more
> specific in mind? What would be the legit code to rely on OOM handling
> out of the line (where the details about the allocation scope is lost)?

I don't have specific scenario, but I feel that it might be a chance to
retry killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off
when the current task is killed"") was 4.5 years ago, and fuzz testing found
many bugs triggered by memory allocation fault injection. Thus, I think that
the direction is going towards "we can fail memory allocation upon SIGKILL
(rather than worrying about depleting memory reserves and/or escalating to
global OOM killer invocations)". Most memory allocation requests which
allocate memory for userspace process are willing to give up upon SIGKILL.

Like you are trying to add NOFS, NOIO, NOFAIL support to vmalloc(), you could
consider KILLABLE support as well. Of course, direct reclaim makes it difficult
to immediately give up upon SIGKILL, but killable allocation sounds still nice
even if best-effort basis.
Michal Hocko Oct. 26, 2021, 2:07 p.m. UTC | #6
On Tue 26-10-21 22:56:44, Tetsuo Handa wrote:
> On 2021/10/25 17:04, Michal Hocko wrote:
> > I do not think there is any guarantee. This code has meant to be a
> > safeguard but it turns out to be adding more harm than a safety. There
> > are several scenarios mentioned in this thread where this would be
> > counter productive or outright wrong thing to do.
> 
> Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) + hitting legacy kmem
> charge limit might be an unexpected combination?

I am not sure I follow or why PR_SET_IO_FLUSHER should be relevant. But
triggering the global OOM killer on kmem charge limit failure is
certainly not the right thing to do. Quite opposite because this would
be effectivelly a global DoS as a result of a local memory constrain.
 
> > On the other hand it is hard to imagine any legitimate situation where
> > this would be a right thing to do. Maybe you have something more
> > specific in mind? What would be the legit code to rely on OOM handling
> > out of the line (where the details about the allocation scope is lost)?
> 
> I don't have specific scenario, but I feel that it might be a chance to
> retry killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off
> when the current task is killed"") was 4.5 years ago, and fuzz testing found
> many bugs triggered by memory allocation fault injection. Thus, I think that
> the direction is going towards "we can fail memory allocation upon SIGKILL
> (rather than worrying about depleting memory reserves and/or escalating to
> global OOM killer invocations)". Most memory allocation requests which
> allocate memory for userspace process are willing to give up upon SIGKILL.
> 
> Like you are trying to add NOFS, NOIO, NOFAIL support to vmalloc(), you could
> consider KILLABLE support as well. Of course, direct reclaim makes it difficult
> to immediately give up upon SIGKILL, but killable allocation sounds still nice
> even if best-effort basis.

This is all fine but I am not sure how this is realated to this patch.
The previous patch already gives up in pagefault_out_of_memory on fatal
signal pending. So this code is not really reachable.

Also alowing more allocations to fail doesn't really suggest that we
should trigger OOM killer from #PF. I would argue that the opposite is
the case actually. Or I just haven't understood your concern?
diff mbox series

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1deef8c7a71b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,19 +1120,15 @@  bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
@@ -1140,10 +1136,8 @@  void pagefault_out_of_memory(void)
 	if (fatal_signal_pending(current))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
-		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
 
 SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)