
[memcg] memcg: prohibit unconditional exceeding the limit of dying tasks

Message ID 5b06a490-55bc-a6a0-6c85-690254f86fad@virtuozzo.com (mailing list archive)

Commit Message

Vasily Averin Sept. 10, 2021, 12:39 p.m. UTC
The kernel currently allows dying tasks to exceed the memcg limits.
The allocation is expected to be the last one and the occupied memory
will be freed soon.
This is not always true because the charge can be part of a huge vmalloc
allocation. Allowed once, such charges will repeat over and over again.
Moreover, the lifetime of the allocated object can differ from
the lifetime of the dying task.
Multiple such allocations running concurrently can not only overuse
the memcg limit, but can lead to a global out of memory and,
in the worst case, cause the host to panic.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

Comments

Tetsuo Handa Sept. 10, 2021, 1:04 p.m. UTC | #1
On 2021/09/10 21:39, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from
> the lifetime of the dying task.

Can't we add fatal_signal_pending(current) test to vmalloc() loop?
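
Something along these lines, as a rough sketch (the natural place would be the
page allocation loop that backs vmalloc in mm/vmalloc.c; the surrounding code
here is paraphrased from memory, not quoted):

	/* inside the loop allocating the pages that back the vmalloc area */
	while (nr_allocated < nr_pages) {
		if (fatal_signal_pending(current))
			break;	/* partial allocation, the caller fails the vmalloc */
		/* ... existing per-page allocation ... */
	}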

> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Vasily Averin Sept. 10, 2021, 1:07 p.m. UTC | #2
On 9/10/21 3:39 PM, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from
> the lifetime of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.

Btw, the should_force_charge() function name becomes wrong with this change.
Does it make sense to replace it with something like is_task_dying()?
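
For context, the helper currently looks like this (quoting mm/memcontrol.c from
memory, so treat it as a sketch rather than an exact copy):

static bool should_force_charge(void)
{
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}

After this patch it only answers "is the task dying?" and no longer forces
anything, hence the rename suggestion.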

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		return OOM_ASYNC;
>  	}
>  
> +	if (should_force_charge())
> +		return OOM_SKIPPED;
> +
>  	mem_cgroup_mark_under_oom(memcg);
>  
>  	locked = mem_cgroup_oom_trylock(memcg);
> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2688,9 +2682,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>  		goto nomem;
>  
> -	if (fatal_signal_pending(current))
> -		goto force;
> -
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
>  	 * a forward progress or bypass the charge if the oom killer
> @@ -2698,15 +2689,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> -	switch (oom_status) {
> -	case OOM_SUCCESS:
> +	if (oom_status == OOM_SUCCESS) {
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -	case OOM_FAILED:
> +	} else if (oom_status == OOM_FAILED)
>  		goto force;
> -	default:
> -		goto nomem;
> -	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
>
Vasily Averin Sept. 10, 2021, 1:20 p.m. UTC | #3
On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> On 2021/09/10 21:39, Vasily Averin wrote:
>> The kernel currently allows dying tasks to exceed the memcg limits.
>> The allocation is expected to be the last one and the occupied memory
>> will be freed soon.
>> This is not always true because it can be part of the huge vmalloc
>> allocation. Allowed once, they will repeat over and over again.
>> Moreover, the lifetime of the allocated object can differ from
>> the lifetime of the dying task.
> 
> Can't we add fatal_signal_pending(current) test to vmalloc() loop?

1) this has been done in the past but has been reverted later.
2) any vmalloc changes will affect non-memcg allocations too.
 If we're doing memcg-related checks it's better to do it in one place.
3) it is not a vmalloc-only issue. A huge number of kmalloc page allocations
from N concurrent threads will lead to the same problem.

>> Multiple such allocations running concurrently can not only overuse
>> the memcg limit, but can lead to a global out of memory and,
>> in the worst case, cause the host to panic.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Michal Hocko Sept. 10, 2021, 2:55 p.m. UTC | #4
On Fri 10-09-21 16:20:58, Vasily Averin wrote:
> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> > On 2021/09/10 21:39, Vasily Averin wrote:
> >> The kernel currently allows dying tasks to exceed the memcg limits.
> >> The allocation is expected to be the last one and the occupied memory
> >> will be freed soon.
> >> This is not always true because it can be part of the huge vmalloc
> >> allocation. Allowed once, they will repeat over and over again.
> >> Moreover, the lifetime of the allocated object can differ from
> >> the lifetime of the dying task.
> > 
> > Can't we add fatal_signal_pending(current) test to vmalloc() loop?

We can and we should.

> 1) this has been done in the past but has been reverted later.

The reason for that should be addressed IIRC.

> 2) any vmalloc changes will affect non-memcg allocations too.
>  If we're doing memcg-related checks it's better to do it in one place.

I think those two things are just orthogonal. Bailing out from vmalloc
early sounds reasonable to me on its own. Allocating a large thing that
is likely to go away with the allocating context is just a waste of
resources and a potential source of disruption to others.

> 3) it is not a vmalloc-only issue. A huge number of kmalloc page allocations
> from N concurrent threads will lead to the same problem.

Agreed. I do not think it is viable to sprinkle fatal_signal_pending or
similar checks all over the code. This should better be limited to
allocators and the charging function.

The assumption described in the code simply doesn't hold, and it
is close to impossible to check all charging paths to bail out properly,
so I think we should just drop that optimistic attitude and not
force charges unless that is absolutely necessary (e.g. __GFP_NOFAIL) or
impractical to avoid (e.g. atomic allocations), and keep such forcing bounded.

I didn't get to reviewing the patch yet. This is a tricky area with many
unobvious corner cases. I will try to find some time next week.

Thanks!
Vasily Averin Sept. 13, 2021, 7:51 a.m. UTC | #5
On 9/10/21 3:39 PM, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -

Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?
It is set inside do_exit only and (I hope) cannot trigger huge vmalloc allocations.
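
I.e. keep only something minimal like this in try_charge_memcg() (just a
sketch of the idea, not a tested patch):

	/*
	 * Tasks in do_exit() are about to release their memory and (hopefully)
	 * cannot start new huge allocations on their own; let them bypass
	 * the limit so the exit path is not disturbed.
	 */
	if (unlikely(current->flags & PF_EXITING))
		goto force;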

Thank you,
	Vasily Averin
Vasily Averin Sept. 13, 2021, 8:29 a.m. UTC | #6
On 9/10/21 5:55 PM, Michal Hocko wrote:
> On Fri 10-09-21 16:20:58, Vasily Averin wrote:
>> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
>>> Can't we add fatal_signal_pending(current) test to vmalloc() loop?
> 
> We can and we should.
> 
>> 1) this has been done in the past but has been reverted later.
> 
> The reason for that should be addressed IIRC.

I don't know the details of this, and I need some time to investigate it.

>> 2) any vmalloc changes will affect non-memcg allocations too.
>>  If we're doing memcg-related checks it's better to do it in one place.
> 
> I think those two things are just orthogonal. Bailing out from vmalloc
> early sounds reasonable to me on its own. Allocating a large thing that
> is likely to go away with the allocating context is just a waste of
> resources and potential reason to disruptions to others.

I doubt that a fatal signal should block all vmalloc allocations.
I assume there are situations where the rollback of some cancelled operation uses vmalloc,
or where saving a coredump to some remote storage uses vmalloc.

However, for me it is abnormal that even the OOM-killer cannot cancel a huge vmalloc allocation.
So I think a tsk_is_oom_victim(current) check should be added to vm_area_alloc_pages()
to break the vmalloc cycle.
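
Roughly like this (a sketch only; the exact placement inside the
vm_area_alloc_pages() allocation loop is assumed):

	/* stop the bulk page allocation once the OOM killer has chosen us */
	if (tsk_is_oom_victim(current))
		break;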

Thank you,
	Vasily Averin
Michal Hocko Sept. 13, 2021, 8:39 a.m. UTC | #7
On Mon 13-09-21 10:51:37, Vasily Averin wrote:
> On 9/10/21 3:39 PM, Vasily Averin wrote:
> > The kernel currently allows dying tasks to exceed the memcg limits.
> > The allocation is expected to be the last one and the occupied memory
> > will be freed soon.
> > This is not always true because it can be part of the huge vmalloc
> > allocation. Allowed once, they will repeat over and over again.
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 389b5766e74f..67195fcfbddf 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (gfp_mask & __GFP_ATOMIC)
> >  		goto force;
> >  
> > -	/*
> > -	 * Unlike in global OOM situations, memcg is not in a physical
> > -	 * memory shortage.  Allow dying and OOM-killed tasks to
> > -	 * bypass the last charges so that they can exit quickly and
> > -	 * free their memory.
> > -	 */
> > -	if (unlikely(should_force_charge()))
> > -		goto force;
> > -
> 
> Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?

Why?

> It is set inside do_exit only and (I hope) cannot trigger huge vmalloc allocations.

Allocations in this code path should be rare, but it is not like they are
non-existent. This is a rather hard-to-review area spread across many places,
so if we are deciding to make the existing model simpler (no bypassing)
then I would rather have no exceptions unless they are really necessary,
and document them if they are.
Michal Hocko Sept. 13, 2021, 8:42 a.m. UTC | #8
On Mon 13-09-21 11:29:37, Vasily Averin wrote:
> On 9/10/21 5:55 PM, Michal Hocko wrote:
> > On Fri 10-09-21 16:20:58, Vasily Averin wrote:
> >> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> >>> Can't we add fatal_signal_pending(current) test to vmalloc() loop?
> > 
> > We can and we should.
> > 
> >> 1) this has been done in the past but has been reverted later.
> > 
> > The reason for that should be addressed IIRC.
> 
> I don't know the details of this, and I need some time to investigate it.

b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed"")
should give a good insight and references.

> >> 2) any vmalloc changes will affect non-memcg allocations too.
> >>  If we're doing memcg-related checks it's better to do it in one place.
> > 
> > I think those two things are just orthogonal. Bailing out from vmalloc
> > early sounds reasonable to me on its own. Allocating a large thing that
> > is likely to go away with the allocating context is just a waste of
> > resources and potential reason to disruptions to others.
> 
> I doubt that a fatal signal should block all vmalloc allocations.
> I assume there are situations where the rollback of some cancelled operation uses vmalloc,
> or where saving a coredump to some remote storage uses vmalloc.

If there really are any such requirements then this should be really
documented. 

> However, for me it is abnormal that even the OOM-killer cannot cancel a huge vmalloc allocation.
> So I think a tsk_is_oom_victim(current) check should be added to vm_area_alloc_pages()
> to break the vmalloc cycle.

Why should an OOM-killed task behave any differently from any other task
killed without a way to handle the signal?
Michal Hocko Sept. 13, 2021, 8:53 a.m. UTC | #9
On Fri 10-09-21 15:39:28, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover, the lifetime of the allocated object can differ from
> the lifetime of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		return OOM_ASYNC;
>  	}
>  
> +	if (should_force_charge())
> +		return OOM_SKIPPED;

mem_cgroup_out_of_memory() already checks for the bypass; now you are
duplicating that check with a different answer to the caller. This is
really messy. One of the two has to go away.
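
For reference, the bypass I mean is this check in mem_cgroup_out_of_memory()
(quoted from memory, so take the exact wording as approximate):

	/*
	 * A few threads which were not waiting at mutex_lock_killable() can
	 * fail to bail out. Therefore, check again after holding oom_lock.
	 */
	ret = should_force_charge() || out_of_memory(&oc);

So a dying task already makes mem_cgroup_out_of_memory() return true (which
mem_cgroup_oom() treats as OOM_SUCCESS), while your new check reports
OOM_SKIPPED for the same condition.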
Vasily Averin Sept. 13, 2021, 9:37 a.m. UTC | #10
On 9/13/21 11:39 AM, Michal Hocko wrote:
> On Mon 13-09-21 10:51:37, Vasily Averin wrote:
>> On 9/10/21 3:39 PM, Vasily Averin wrote:
>>> The kernel currently allows dying tasks to exceed the memcg limits.
>>> The allocation is expected to be the last one and the occupied memory
>>> will be freed soon.
>>> This is not always true because it can be part of the huge vmalloc
>>> allocation. Allowed once, they will repeat over and over again.
>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 389b5766e74f..67195fcfbddf 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	if (gfp_mask & __GFP_ATOMIC)
>>>  		goto force;
>>>  
>>> -	/*
>>> -	 * Unlike in global OOM situations, memcg is not in a physical
>>> -	 * memory shortage.  Allow dying and OOM-killed tasks to
>>> -	 * bypass the last charges so that they can exit quickly and
>>> -	 * free their memory.
>>> -	 */
>>> -	if (unlikely(should_force_charge()))
>>> -		goto force;
>>> -
>>
>> Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?
> 
> Why?

At this stage the task really dies and mostly releases the resources it has taken.
It can still allocate though, and this allocation can hit the memcg limit due to the activity
of parallel threads in the same memcg.

Nothing bad should happen if we reject this allocation,
because the same thing can happen in the non-memcg case too.
However, I doubt misuse is possible here, and we have the possibility to allow a graceful shutdown here.

In other words: we are not obliged to allow such allocations, but we CAN do it because
we hope that it is safe and cannot be misused.

>> It is set inside do_exit only and (I hope) cannot trigger huge vmalloc allocations.
> 
> Allocations in this code path should be rare but it is not like they are
> non-existent. This is rather hard to review area spread at many places
> so if we are deciding to make the existing model simpler (no bypassing)
> then I would rather have no exceptions unless they are reaaly necessary
> and document them if they are.

Thank you,
	vasily Averin
Michal Hocko Sept. 13, 2021, 10:10 a.m. UTC | #11
On Mon 13-09-21 12:37:56, Vasily Averin wrote:
> On 9/13/21 11:39 AM, Michal Hocko wrote:
> > On Mon 13-09-21 10:51:37, Vasily Averin wrote:
> >> On 9/10/21 3:39 PM, Vasily Averin wrote:
> >>> The kernel currently allows dying tasks to exceed the memcg limits.
> >>> The allocation is expected to be the last one and the occupied memory
> >>> will be freed soon.
> >>> This is not always true because it can be part of the huge vmalloc
> >>> allocation. Allowed once, they will repeat over and over again.
> >>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 389b5766e74f..67195fcfbddf 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  	if (gfp_mask & __GFP_ATOMIC)
> >>>  		goto force;
> >>>  
> >>> -	/*
> >>> -	 * Unlike in global OOM situations, memcg is not in a physical
> >>> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> >>> -	 * bypass the last charges so that they can exit quickly and
> >>> -	 * free their memory.
> >>> -	 */
> >>> -	if (unlikely(should_force_charge()))
> >>> -		goto force;
> >>> -
> >>
> >> Should we keep current behaviour for (current->flags & PF_EXITING) case perhaps?
> > 
> > Why?
> 
> At this stage the task really dies and mostly releases the resources it has taken.
> It can still allocate though, and this allocation can hit the memcg limit due to the activity
> of parallel threads in the same memcg.
> 
> Nothing bad should happen if we reject this allocation,
> because the same thing can happen in the non-memcg case too.
> However, I doubt misuse is possible here, and we have the possibility to allow a graceful shutdown here.
> 
> In other words: we are not obliged to allow such allocations, but we CAN do it because
> we hope that it is safe and cannot be misused.

This is a lot of hoping that has turned out to be a bad strategy in the
existing code.  So let's stop hoping and if we are shown that an
exit path really benefits from special treatment then we can add it
with good reasoning rather than "we hope it's gonna be ok".
Vasily Averin Sept. 13, 2021, 10:35 a.m. UTC | #12
On 9/13/21 11:53 AM, Michal Hocko wrote:
> On Fri 10-09-21 15:39:28, Vasily Averin wrote:
>> The kernel currently allows dying tasks to exceed the memcg limits.
>> The allocation is expected to be the last one and the occupied memory
>> will be freed soon.
>> This is not always true because it can be part of the huge vmalloc
>> allocation. Allowed once, they will repeat over and over again.
>> Moreover, the lifetime of the allocated object can differ from
>> the lifetime of the dying task.
>> Multiple such allocations running concurrently can not only overuse
>> the memcg limit, but can lead to a global out of memory and,
>> in the worst case, cause the host to panic.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c | 23 +++++------------------
>>  1 file changed, 5 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 389b5766e74f..67195fcfbddf 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>>  		return OOM_ASYNC;
>>  	}
>>  
>> +	if (should_force_charge())
>> +		return OOM_SKIPPED;
> 
> mem_cgroup_out_of_memory already check for the bypass, now you are
> duplicating that check with a different answer to the caller. This is
> really messy. One of the two has to go away.

In this case mem_cgroup_out_of_memory() takes locks and mutexes but does nothing
useful, and its success causes try_charge_memcg() to repeat the loop unnecessarily.

I cannot change the mem_cgroup_out_of_memory() internals, because it is used in other places too.
The check inside mem_cgroup_out_of_memory() is still required because the situation can change after
the check added into mem_cgroup_oom().

Though I got your argument, and I will think about how to improve the patch.
Anyway, we'll need to do something about the name of the should_force_charge() function,
which will NOT lead to a forced charge anymore.

Thank you,
	Vasily Averin

Michal Hocko Sept. 13, 2021, 10:55 a.m. UTC | #13
On Mon 13-09-21 13:35:00, Vasily Averin wrote:
> On 9/13/21 11:53 AM, Michal Hocko wrote:
> > On Fri 10-09-21 15:39:28, Vasily Averin wrote:
> >> The kernel currently allows dying tasks to exceed the memcg limits.
> >> The allocation is expected to be the last one and the occupied memory
> >> will be freed soon.
> >> This is not always true because it can be part of the huge vmalloc
> >> allocation. Allowed once, they will repeat over and over again.
> >> Moreover, the lifetime of the allocated object can differ from
> >> the lifetime of the dying task.
> >> Multiple such allocations running concurrently can not only overuse
> >> the memcg limit, but can lead to a global out of memory and,
> >> in the worst case, cause the host to panic.
> >>
> >> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >> ---
> >>  mm/memcontrol.c | 23 +++++------------------
> >>  1 file changed, 5 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 389b5766e74f..67195fcfbddf 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> >>  		return OOM_ASYNC;
> >>  	}
> >>  
> >> +	if (should_force_charge())
> >> +		return OOM_SKIPPED;
> > 
> > mem_cgroup_out_of_memory already check for the bypass, now you are
> > duplicating that check with a different answer to the caller. This is
> > really messy. One of the two has to go away.
> 
> In this case mem_cgroup_out_of_memory() takes locks and mutexes but doing nothing
> useful and its success causes try_charge_memcg() to repeat the loop unnecessarily.
> 
> I cannot change mem_cgroup_out_of_memory internals, because it is used in other places too.The check inside mem_cgroup_out_of_memory is required because situation can be changed after
> check added into mem_cgroup_oom().
> 
> Though I got your argument, and will think how to improve the patch.
> Anyway we'll need to do something with name of should_force_charge() function
> that will NOT lead to forced charge.

Here is what I would do.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 702a81dfe72d..58269721d438 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2588,6 +2588,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2622,15 +2623,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,8 +2680,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && should_force_charge())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2698,14 +2691,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
Vasily Averin Sept. 14, 2021, 10:01 a.m. UTC | #14
On 9/13/21 1:55 PM, Michal Hocko wrote:
> Here is what I would do.

Your patch fixes the problem. I just want to add a rename of the should_force_charge()
helper to task_is_dying() and will submit a v2 soon.

I think this patch is not optimal; however, its improvements can be discussed later.

Thank you,
	Vasily Averin

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 389b5766e74f..67195fcfbddf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1834,6 +1834,9 @@  static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		return OOM_ASYNC;
 	}
 
+	if (should_force_charge())
+		return OOM_SKIPPED;
+
 	mem_cgroup_mark_under_oom(memcg);
 
 	locked = mem_cgroup_oom_trylock(memcg);
@@ -2622,15 +2625,6 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,9 +2682,6 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
-
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
 	 * a forward progress or bypass the charge if the oom killer
@@ -2698,15 +2689,11 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
+	} else if (oom_status == OOM_FAILED)
 		goto force;
-	default:
-		goto nomem;
-	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;