[memcg,3/3] memcg: handle memcg oom failures

Message ID	fb33f4bd-34cd-2187-eff4-7c1c11d5ae94@virtuozzo.com (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=02Vf=PI=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 65DF86135E From: Vasily Averin <vvs@virtuozzo.com> Subject: [PATCH memcg 3/3] memcg: handle memcg oom failures To: Michal Hocko <mhocko@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>, Vladimir Davydov <vdavydov.dev@gmail.com>, Andrew Morton <akpm@linux-foundation.org> Cc: Roman Gushchin <guro@fb.com>, Uladzislau Rezki <urezki@gmail.com>, Vlastimil Babka <vbabka@suse.cz>, Shakeel Butt <shakeelb@google.com>, Mel Gorman <mgorman@techsingularity.net>, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel@openvz.org References: <YW/WoJDFM3ddHn7Y@dhcp22.suse.cz> <cover.1634730787.git.vvs@virtuozzo.com> Message-ID: <fb33f4bd-34cd-2187-eff4-7c1c11d5ae94@virtuozzo.com> Date: Wed, 20 Oct 2021 15:14:27 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: <cover.1634730787.git.vvs@virtuozzo.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	memcg: prohibit unconditional exceeding the limit of dying tasks \| expand [memcg,RFC,0/3] memcg: prohibit unconditional exceeding the limit of dying tasks [memcg,1/3] mm: do not firce global OOM from inside dying tasks [memcg,2/3] memcg: remove charge forcinig for dying tasks [memcg,3/3] memcg: handle memcg oom failures

Vasily Averin Oct. 20, 2021, 12:14 p.m. UTC

mem_cgroup_oom() can fail if current task was marked unkillable
and oom killer cannot find any victim.

Currently we force memcg charge for such allocations,
however it allow memcg-limited userspace task in to overuse assigned limits
and potentially trigger the global memory shortage.

Let's fail the memory charge in such cases.

This failure should be somehow recognised in #PF context,
so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED

ToDo: what is the best way to notify pagefault_out_of_memory() about 
    mem_cgroup_out_of_memory failure ?

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

Michal Hocko Oct. 20, 2021, 1:02 p.m. UTC | #1

On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> mem_cgroup_oom() can fail if current task was marked unkillable
> and oom killer cannot find any victim.
> 
> Currently we force memcg charge for such allocations,
> however it allow memcg-limited userspace task in to overuse assigned limits
> and potentially trigger the global memory shortage.

You should really go into more details whether that is a practical
problem to handle. OOM_FAILED means that the memcg oom killer couldn't
find any oom victim so it cannot help with a forward progress. There are
not that many situations when that can happen. Naming that would be
really useful.

> Let's fail the memory charge in such cases.
> 
> This failure should be somehow recognised in #PF context,

explain why

> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> 
> ToDo: what is the best way to notify pagefault_out_of_memory() about 
>     mem_cgroup_out_of_memory failure ?

why don't you simply remove out_of_memory from pagefault_out_of_memory
and leave it only with the blocking memcg OOM handling? Wouldn't that be a
more generic solution? Your first patch already goes that way partially.

This change is more risky than the first one. If somebody returns
VM_FAULT_OOM without invoking allocator then it can loop for ever but
invoking OOM killer in such a situation is equally wrong as the oom
killer cannot really help, right?

Vasily Averin Oct. 20, 2021, 3:46 p.m. UTC | #2

On 20.10.2021 16:02, Michal Hocko wrote:
> On Wed 20-10-21 15:14:27, Vasily Averin wrote:
>> mem_cgroup_oom() can fail if current task was marked unkillable
>> and oom killer cannot find any victim.
>>
>> Currently we force memcg charge for such allocations,
>> however it allow memcg-limited userspace task in to overuse assigned limits
>> and potentially trigger the global memory shortage.
> 
> You should really go into more details whether that is a practical
> problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> find any oom victim so it cannot help with a forward progress. There are
> not that many situations when that can happen. Naming that would be
> really useful.

I've pointed above: 
"if current task was marked unkillable and oom killer cannot find any victim."
This may happen when current task cannot be oom-killed because it was marked
unkillable i.e. it have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
and other processes in memcg are either dying, or are kernel threads or are marked unkillable 
by the same way. Or when memcg have this process only.

If we always approve such kind of allocation, it can be misused.
Process can mmap a lot of memory,
ant then touch it and generate page fault and make overcharged memory allocations.
Finally it can consume all node memory and trigger global memory shortage on the host.

>> Let's fail the memory charge in such cases.
>>
>> This failure should be somehow recognised in #PF context,
> 
> explain why

When #PF cannot allocate memory (due to reason described above), handle_mm_fault returns VM_FAULT_OOM,
then its caller executes pagefault_out_of_memory(). If last one cannot recognize the real reason of this fail,
it expect it was global memory shortage and executed global out_ouf_memory() that can kill random process 
or just crash node if sysctl vm.panic_on_oom is set to 1.

Currently pagefault_out_of_memory() knows about possible async memcg OOM and handles it correctly.
However it is not aware that memcg can reject some other allocations, do not recognize the fault
as memcg-related and allows to run global OOM.

>> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>>
>> ToDo: what is the best way to notify pagefault_out_of_memory() about 
>>     mem_cgroup_out_of_memory failure ?
> 
> why don't you simply remove out_of_memory from pagefault_out_of_memory
> and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> more generic solution? Your first patch already goes that way partially.

I clearly understand that global out_of_memory should not be trggired by memcg restrictions.
I clearly understand that dying task will release some memory soon and we can do not run global oom if current task is dying.

However I'm not sure that I can remove out_of_memory at all. At least I do not have good arguments to do it.

> This change is more risky than the first one. If somebody returns
> VM_FAULT_OOM without invoking allocator then it can loop for ever but
> invoking OOM killer in such a situation is equally wrong as the oom
> killer cannot really help, right?

I'm not ready to comment this right now and take time out to think about.

Thank you,
	Vasily Averin

Michal Hocko Oct. 21, 2021, 11:49 a.m. UTC | #3

On Wed 20-10-21 18:46:56, Vasily Averin wrote:
> On 20.10.2021 16:02, Michal Hocko wrote:
> > On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> >> mem_cgroup_oom() can fail if current task was marked unkillable
> >> and oom killer cannot find any victim.
> >>
> >> Currently we force memcg charge for such allocations,
> >> however it allow memcg-limited userspace task in to overuse assigned limits
> >> and potentially trigger the global memory shortage.
> > 
> > You should really go into more details whether that is a practical
> > problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> > find any oom victim so it cannot help with a forward progress. There are
> > not that many situations when that can happen. Naming that would be
> > really useful.
> 
> I've pointed above: 
> "if current task was marked unkillable and oom killer cannot find any victim."
> This may happen when current task cannot be oom-killed because it was marked
> unkillable i.e. it have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> and other processes in memcg are either dying, or are kernel threads or are marked unkillable 
> by the same way. Or when memcg have this process only.
> 
> If we always approve such kind of allocation, it can be misused.
> Process can mmap a lot of memory,
> ant then touch it and generate page fault and make overcharged memory allocations.
> Finally it can consume all node memory and trigger global memory shortage on the host.

Yes, this is true but a) OOM_SCORE_ADJ_MIN tasks are excluded from the
OOM handling so they have to be careful with the memory consumption and
b) is this a theoretical or a practical concern. 

This is mostly what I wanted to make sure you describe in the changelog.

> >> Let's fail the memory charge in such cases.
> >>
> >> This failure should be somehow recognised in #PF context,
> > 
> > explain why
> 
> When #PF cannot allocate memory (due to reason described above), handle_mm_fault returns VM_FAULT_OOM,
> then its caller executes pagefault_out_of_memory(). If last one cannot recognize the real reason of this fail,
> it expect it was global memory shortage and executed global out_ouf_memory() that can kill random process 
> or just crash node if sysctl vm.panic_on_oom is set to 1.
> 
> Currently pagefault_out_of_memory() knows about possible async memcg OOM and handles it correctly.
> However it is not aware that memcg can reject some other allocations, do not recognize the fault
> as memcg-related and allows to run global OOM.

Again something to be added to the changelog.

> >> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> >>
> >> ToDo: what is the best way to notify pagefault_out_of_memory() about 
> >>     mem_cgroup_out_of_memory failure ?
> > 
> > why don't you simply remove out_of_memory from pagefault_out_of_memory
> > and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> > more generic solution? Your first patch already goes that way partially.
> 
> I clearly understand that global out_of_memory should not be trggired by memcg restrictions.
> I clearly understand that dying task will release some memory soon and we can do not run global oom if current task is dying.
> 
> However I'm not sure that I can remove out_of_memory at all. At least I do not have good arguments to do it.

I do understand that handling a very specific case sounds easier but it
would be better to have a robust fix even if that requires some more
head scratching. So far we have collected several reasons why the it is
bad to trigger oom killer from the #PF path. There is no single argument
to keep it so it sounds like a viable path to pursue. Maybe there are
some very well hidden reasons but those should be documented and this is
a great opportunity to do either of the step.

Moreover if it turns out that there is a regression then this can be
easily reverted and a different, maybe memcg specific, solution can be
implemented.

Vasily Averin Oct. 21, 2021, 3:05 p.m. UTC | #4

On 21.10.2021 14:49, Michal Hocko wrote:
> I do understand that handling a very specific case sounds easier but it
> would be better to have a robust fix even if that requires some more
> head scratching. So far we have collected several reasons why the it is
> bad to trigger oom killer from the #PF path. There is no single argument
> to keep it so it sounds like a viable path to pursue. Maybe there are
> some very well hidden reasons but those should be documented and this is
> a great opportunity to do either of the step.
> 
> Moreover if it turns out that there is a regression then this can be
> easily reverted and a different, maybe memcg specific, solution can be
> implemented.

Now I'm agree,
however I still have a few open questions.

1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
for exampel it can be caused by incorrect vm fault operations, 
(a) which can return this error without calling allocator at all.
(b) or which can provide incorrect gfp flags and allocator can fail without execution of out_of_memory.
(c) This may happen on stable/LTS kernels when successful allocation was failed by hit into limit of legacy memcg-kmem contoller.
We'll drop it in upstream kernels, however how to handle it in old kenrels?

We can make sure that out_of_memory or alocator was called by set of some per-task flags.

Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?

If not -- task will be looped. 
It is much better than execution of global OOM, however it would be even better to avoid it somehow.

You said: "We cannot really kill the task if we could we would have done it by the oom killer already".
However what to do if we even not tried to use oom-killer? (see (b) and (c)) 
or if we did not used the allocator at all (see (a))

2) in your patch we just exit from pagefault_out_of_memory(). and restart new #PF.
We can call schedule_timeout() and wait some time before a new #PF restart.
Additionally we can increase this delay in each new cycle. 
It helps to save CPU time for other tasks.
What do you think about?

Thank you,
	Vasily Averin

Michal Hocko Oct. 21, 2021, 4:47 p.m. UTC | #5

On Thu 21-10-21 18:05:28, Vasily Averin wrote:
> On 21.10.2021 14:49, Michal Hocko wrote:
> > I do understand that handling a very specific case sounds easier but it
> > would be better to have a robust fix even if that requires some more
> > head scratching. So far we have collected several reasons why the it is
> > bad to trigger oom killer from the #PF path. There is no single argument
> > to keep it so it sounds like a viable path to pursue. Maybe there are
> > some very well hidden reasons but those should be documented and this is
> > a great opportunity to do either of the step.
> > 
> > Moreover if it turns out that there is a regression then this can be
> > easily reverted and a different, maybe memcg specific, solution can be
> > implemented.
> 
> Now I'm agree,
> however I still have a few open questions.
> 
> 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
> for exampel it can be caused by incorrect vm fault operations, 
> (a) which can return this error without calling allocator at all.

I would argue this to be a bug. How can that particular code tell
whether the system is OOM and the oom killer is the a reasonable measure
to take?

> (b) or which can provide incorrect gfp flags and allocator can fail without execution of out_of_memory.

I am not sure I can see any sensible scenario where pagefault oom killer
would be an appropriate fix for that.

> (c) This may happen on stable/LTS kernels when successful allocation was failed by hit into limit of legacy memcg-kmem contoller.
> We'll drop it in upstream kernels, however how to handle it in old kenrels?

Triggering the global oom killer for legacy kmem charge failure is
clearly wrong. Removing oom killer from #PF would fix that problem.

> We can make sure that out_of_memory or alocator was called by set of some per-task flags.

I am not sure I see how that would be useful other than reporting a
dubious VM_FAULT_OOM usage. I am also not sure how that would be
implemented as allocator can be called several times not to mention that
the allocation itself could have been done from a different context -
e.g. WQ.

> Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?

In principle it can as sending signal is not prohibited. I would argue
it should not though because it is just wrong thing to do in all those
cases.

> If not -- task will be looped. 

Yes, but it will be killable from userspace. So this is not an
unrecoverable situation.
> It is much better than execution of global OOM, however it would be even better to avoid it somehow.

How?

> You said: "We cannot really kill the task if we could we would have done it by the oom killer already".
> However what to do if we even not tried to use oom-killer? (see (b) and (c)) 
> or if we did not used the allocator at all (see (a))

See above

> 2) in your patch we just exit from pagefault_out_of_memory(). and restart new #PF.
> We can call schedule_timeout() and wait some time before a new #PF restart.
> Additionally we can increase this delay in each new cycle. 
> It helps to save CPU time for other tasks.
> What do you think about?

I do not have a strong opinion on this. A short sleep makes sense. I am
not sure a more complex implementation is really needed.

[memcg,3/3] memcg: handle memcg oom failures

Commit Message

Comments

Patch