mm: memcg: Fix memcg reclaim soft lockup

Message ID	1598426822-93737-1-git-send-email-xlpang@linux.alibaba.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=dJX+=CE=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2666820825 From: Xunlei Pang <xlpang@linux.alibaba.com> To: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xunlei Pang <xlpang@linux.alibaba.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH] mm: memcg: Fix memcg reclaim soft lockup Date: Wed, 26 Aug 2020 15:27:02 +0800 Message-Id: <1598426822-93737-1-git-send-email-xlpang@linux.alibaba.com> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: memcg: Fix memcg reclaim soft lockup \| expand mm: memcg: Fix memcg reclaim soft lockup

Xunlei Pang Aug. 26, 2020, 7:27 a.m. UTC

We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
the target memcg doesn't have any reclaimable memory.

It can be easily reproduced as below:
 watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
 CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
 Call Trace:
  shrink_lruvec+0x49f/0x640
  shrink_node+0x2a6/0x6f0
  do_try_to_free_pages+0xe9/0x3e0
  try_to_free_mem_cgroup_pages+0xef/0x1f0
  try_charge+0x2c1/0x750
  mem_cgroup_charge+0xd7/0x240
  __add_to_page_cache_locked+0x2fd/0x370
  add_to_page_cache_lru+0x4a/0xc0
  pagecache_get_page+0x10b/0x2f0
  filemap_fault+0x661/0xad0
  ext4_filemap_fault+0x2c/0x40
  __do_fault+0x4d/0xf9
  handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance
for oom reaper to run to reclaim the to-be-killed process.

Add cond_resched() in such cases at the beginning of shrink_lruvec()
to give up the cpu to others.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 mm/vmscan.c | 6 ++++++
 1 file changed, 6 insertions(+)

Michal Hocko Aug. 26, 2020, 8:11 a.m. UTC | #1

On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> the target memcg doesn't have any reclaimable memory.

Do you have any scenario when this happens or is this some sort of a
test case?

> It can be easily reproduced as below:
>  watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
>  CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
>  Call Trace:
>   shrink_lruvec+0x49f/0x640
>   shrink_node+0x2a6/0x6f0
>   do_try_to_free_pages+0xe9/0x3e0
>   try_to_free_mem_cgroup_pages+0xef/0x1f0
>   try_charge+0x2c1/0x750
>   mem_cgroup_charge+0xd7/0x240
>   __add_to_page_cache_locked+0x2fd/0x370
>   add_to_page_cache_lru+0x4a/0xc0
>   pagecache_get_page+0x10b/0x2f0
>   filemap_fault+0x661/0xad0
>   ext4_filemap_fault+0x2c/0x40
>   __do_fault+0x4d/0xf9
>   handle_mm_fault+0x1080/0x1790
> 
> It only happens on our 1-vcpu instances, because there's no chance
> for oom reaper to run to reclaim the to-be-killed process.
> 
> Add cond_resched() in such cases at the beginning of shrink_lruvec()
> to give up the cpu to others.

I do agree that we need a cond_resched but I cannot say I would like
this patch. The primary reason is that it doesn't catch all cases when
the memcg is not reclaimable. For example it wouldn't reschedule if the
memcg is protected by low/min. What do you think about this instead?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796eb833..bbdc38b58cc5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2617,6 +2617,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
+		cond_resched();
+
 		if (mem_cgroup_below_min(memcg)) {
 			/*
 			 * Hard protection.

This should catch both cases. I even have a vague recollection that
somebody has proposed something in that direction but I cannot remember
what has happened with that patch.

Xunlei Pang Aug. 26, 2020, 10:41 a.m. UTC | #2

On 2020/8/26 下午4:11, Michal Hocko wrote:
> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>> the target memcg doesn't have any reclaimable memory.
> 
> Do you have any scenario when this happens or is this some sort of a
> test case?

It can happen on tiny guest scenarios.

> 
>> It can be easily reproduced as below:
>>  watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
>>  CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
>>  Call Trace:
>>   shrink_lruvec+0x49f/0x640
>>   shrink_node+0x2a6/0x6f0
>>   do_try_to_free_pages+0xe9/0x3e0
>>   try_to_free_mem_cgroup_pages+0xef/0x1f0
>>   try_charge+0x2c1/0x750
>>   mem_cgroup_charge+0xd7/0x240
>>   __add_to_page_cache_locked+0x2fd/0x370
>>   add_to_page_cache_lru+0x4a/0xc0
>>   pagecache_get_page+0x10b/0x2f0
>>   filemap_fault+0x661/0xad0
>>   ext4_filemap_fault+0x2c/0x40
>>   __do_fault+0x4d/0xf9
>>   handle_mm_fault+0x1080/0x1790
>>
>> It only happens on our 1-vcpu instances, because there's no chance
>> for oom reaper to run to reclaim the to-be-killed process.
>>
>> Add cond_resched() in such cases at the beginning of shrink_lruvec()
>> to give up the cpu to others.
> 
> I do agree that we need a cond_resched but I cannot say I would like
> this patch. The primary reason is that it doesn't catch all cases when
> the memcg is not reclaimable. For example it wouldn't reschedule if the
> memcg is protected by low/min. What do you think about this instead?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99e1796eb833..bbdc38b58cc5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2617,6 +2617,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		mem_cgroup_calculate_protection(target_memcg, memcg);
>  
> +		cond_resched();
> +
>  		if (mem_cgroup_below_min(memcg)) {
>  			/*
>  			 * Hard protection.
> 
> This should catch both cases. I even have a vague recollection that
> somebody has proposed something in that direction but I cannot remember
> what has happened with that patch.
> 

It's the endless "retry" in try_charge() that caused the softlockup, and
I think mem_cgroup_protected() will eventually return MEMCG_PROT_NONE,
and shrink_node_memcgs() will call shrink_lruvec() for memcg
self-reclaim cases, so it's not a problem here.

But adding cond_resched() at upper shrink_node_memcgs() may eliminate
potential similar issues, I have no objection with this approach.
I tested it and works well, will send v2 later.

Michal Hocko Aug. 26, 2020, 11 a.m. UTC | #3

On Wed 26-08-20 18:41:18, xunlei wrote:
> On 2020/8/26 下午4:11, Michal Hocko wrote:
> > On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> >> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> >> the target memcg doesn't have any reclaimable memory.
> > 
> > Do you have any scenario when this happens or is this some sort of a
> > test case?
> 
> It can happen on tiny guest scenarios.

OK, you made me more curious. If this is a tiny guest and this is a hard
limit reclaim path then we should trigger an oom killer which should
kill the offender and that in turn bail out from the try_charge lopp
(see should_force_charge). So how come this repeats enough in your setup
that it causes soft lockups?

Xunlei Pang Aug. 26, 2020, 11:45 a.m. UTC | #4

On 2020/8/26 下午7:00, Michal Hocko wrote:
> On Wed 26-08-20 18:41:18, xunlei wrote:
>> On 2020/8/26 下午4:11, Michal Hocko wrote:
>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>> the target memcg doesn't have any reclaimable memory.
>>>
>>> Do you have any scenario when this happens or is this some sort of a
>>> test case?
>>
>> It can happen on tiny guest scenarios.
> 
> OK, you made me more curious. If this is a tiny guest and this is a hard
> limit reclaim path then we should trigger an oom killer which should
> kill the offender and that in turn bail out from the try_charge lopp
> (see should_force_charge). So how come this repeats enough in your setup
> that it causes soft lockups?
> 

    oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
               get_order(nr_pages * PAGE_SIZE));
    switch (oom_status) {
    case OOM_SUCCESS:
        nr_retries = MAX_RECLAIM_RETRIES;
        goto retry;

It retries here endlessly, because oom reaper has no cpu to schedule.

Xunlei Pang Aug. 26, 2020, 11:54 a.m. UTC | #5

On 2020/8/26 下午7:45, xunlei wrote:
> On 2020/8/26 下午7:00, Michal Hocko wrote:
>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>> On 2020/8/26 下午4:11, Michal Hocko wrote:
>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>> the target memcg doesn't have any reclaimable memory.
>>>>
>>>> Do you have any scenario when this happens or is this some sort of a
>>>> test case?
>>>
>>> It can happen on tiny guest scenarios.
>>
>> OK, you made me more curious. If this is a tiny guest and this is a hard
>> limit reclaim path then we should trigger an oom killer which should
>> kill the offender and that in turn bail out from the try_charge lopp
>> (see should_force_charge). So how come this repeats enough in your setup
>> that it causes soft lockups?
>>
> 
>     oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>                get_order(nr_pages * PAGE_SIZE));
>     switch (oom_status) {
>     case OOM_SUCCESS:
>         nr_retries = MAX_RECLAIM_RETRIES;

Actually we can add "cond_resched()" here, but I think it's better to
have one at the memcg reclaim path to avoid other unexpected issues.

>         goto retry;
> 
> It retries here endlessly, because oom reaper has no cpu to schedule.
>

Xunlei Pang Aug. 26, 2020, noon UTC | #6

On 2020/8/26 下午7:00, Michal Hocko wrote:
> On Wed 26-08-20 18:41:18, xunlei wrote:
>> On 2020/8/26 下午4:11, Michal Hocko wrote:
>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>> the target memcg doesn't have any reclaimable memory.
>>>
>>> Do you have any scenario when this happens or is this some sort of a
>>> test case?
>>
>> It can happen on tiny guest scenarios.
> 
> OK, you made me more curious. If this is a tiny guest and this is a hard
> limit reclaim path then we should trigger an oom killer which should
> kill the offender and that in turn bail out from the try_charge lopp
> (see should_force_charge). So how come this repeats enough in your setup
> that it causes soft lockups?
> 

should_force_charge() is false, the current trapped in endless loop is
not the oom victim.

Michal Hocko Aug. 26, 2020, 12:07 p.m. UTC | #7

On Wed 26-08-20 20:00:47, xunlei wrote:
> On 2020/8/26 下午7:00, Michal Hocko wrote:
> > On Wed 26-08-20 18:41:18, xunlei wrote:
> >> On 2020/8/26 下午4:11, Michal Hocko wrote:
> >>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> >>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> >>>> the target memcg doesn't have any reclaimable memory.
> >>>
> >>> Do you have any scenario when this happens or is this some sort of a
> >>> test case?
> >>
> >> It can happen on tiny guest scenarios.
> > 
> > OK, you made me more curious. If this is a tiny guest and this is a hard
> > limit reclaim path then we should trigger an oom killer which should
> > kill the offender and that in turn bail out from the try_charge lopp
> > (see should_force_charge). So how come this repeats enough in your setup
> > that it causes soft lockups?
> > 
> 
> should_force_charge() is false, the current trapped in endless loop is
> not the oom victim.

How is that possible? If the oom killer kills a task and that doesn't
resolve the oom situation then it would go after another one until all
tasks are killed. Or is your task living outside of the memcg it tries
to charge?

Xunlei Pang Aug. 26, 2020, 12:21 p.m. UTC | #8

On 2020/8/26 下午8:07, Michal Hocko wrote:
> On Wed 26-08-20 20:00:47, xunlei wrote:
>> On 2020/8/26 下午7:00, Michal Hocko wrote:
>>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>>> On 2020/8/26 下午4:11, Michal Hocko wrote:
>>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>>> the target memcg doesn't have any reclaimable memory.
>>>>>
>>>>> Do you have any scenario when this happens or is this some sort of a
>>>>> test case?
>>>>
>>>> It can happen on tiny guest scenarios.
>>>
>>> OK, you made me more curious. If this is a tiny guest and this is a hard
>>> limit reclaim path then we should trigger an oom killer which should
>>> kill the offender and that in turn bail out from the try_charge lopp
>>> (see should_force_charge). So how come this repeats enough in your setup
>>> that it causes soft lockups?
>>>
>>
>> should_force_charge() is false, the current trapped in endless loop is
>> not the oom victim.
> 
> How is that possible? If the oom killer kills a task and that doesn't
> resolve the oom situation then it would go after another one until all
> tasks are killed. Or is your task living outside of the memcg it tries
> to charge?
> 

All tasks are in memcgs. Looks like the first oom victim is not finished
(unable to schedule), later mem_cgroup_oom()->...->oom_evaluate_task()
will set oc->chosen to -1 and abort.

Michal Hocko Aug. 26, 2020, 12:48 p.m. UTC | #9

On Wed 26-08-20 20:21:39, xunlei wrote:
> On 2020/8/26 下午8:07, Michal Hocko wrote:
> > On Wed 26-08-20 20:00:47, xunlei wrote:
> >> On 2020/8/26 下午7:00, Michal Hocko wrote:
> >>> On Wed 26-08-20 18:41:18, xunlei wrote:
> >>>> On 2020/8/26 下午4:11, Michal Hocko wrote:
> >>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> >>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> >>>>>> the target memcg doesn't have any reclaimable memory.
> >>>>>
> >>>>> Do you have any scenario when this happens or is this some sort of a
> >>>>> test case?
> >>>>
> >>>> It can happen on tiny guest scenarios.
> >>>
> >>> OK, you made me more curious. If this is a tiny guest and this is a hard
> >>> limit reclaim path then we should trigger an oom killer which should
> >>> kill the offender and that in turn bail out from the try_charge lopp
> >>> (see should_force_charge). So how come this repeats enough in your setup
> >>> that it causes soft lockups?
> >>>
> >>
> >> should_force_charge() is false, the current trapped in endless loop is
> >> not the oom victim.
> > 
> > How is that possible? If the oom killer kills a task and that doesn't
> > resolve the oom situation then it would go after another one until all
> > tasks are killed. Or is your task living outside of the memcg it tries
> > to charge?
> > 
> 
> All tasks are in memcgs. Looks like the first oom victim is not finished
> (unable to schedule), later mem_cgroup_oom()->...->oom_evaluate_task()
> will set oc->chosen to -1 and abort.

This shouldn't be possible for too long because oom_reaper would
make it invisible to the oom killer so it should proceed. Also
mem_cgroup_out_of_memory takes a mutex and that is an implicit
scheduling point already.

Which kernel version is this?

And just for the clarification. I am not against the additional
cond_resched. That sounds like a good thing in general because we do
want to have a predictable scheduling during reclaim which is
independent on reclaimability as much as possible. But I would like to
drill down to why you are seeing the lockup because those shouldn't
really happen.

Xunlei Pang Aug. 26, 2020, 1:16 p.m. UTC | #10

On 2020/8/26 下午8:48, Michal Hocko wrote:
> On Wed 26-08-20 20:21:39, xunlei wrote:
>> On 2020/8/26 下午8:07, Michal Hocko wrote:
>>> On Wed 26-08-20 20:00:47, xunlei wrote:
>>>> On 2020/8/26 下午7:00, Michal Hocko wrote:
>>>>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>>>>> On 2020/8/26 下午4:11, Michal Hocko wrote:
>>>>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>>>>> the target memcg doesn't have any reclaimable memory.
>>>>>>>
>>>>>>> Do you have any scenario when this happens or is this some sort of a
>>>>>>> test case?
>>>>>>
>>>>>> It can happen on tiny guest scenarios.
>>>>>
>>>>> OK, you made me more curious. If this is a tiny guest and this is a hard
>>>>> limit reclaim path then we should trigger an oom killer which should
>>>>> kill the offender and that in turn bail out from the try_charge lopp
>>>>> (see should_force_charge). So how come this repeats enough in your setup
>>>>> that it causes soft lockups?
>>>>>
>>>>
>>>> should_force_charge() is false, the current trapped in endless loop is
>>>> not the oom victim.
>>>
>>> How is that possible? If the oom killer kills a task and that doesn't
>>> resolve the oom situation then it would go after another one until all
>>> tasks are killed. Or is your task living outside of the memcg it tries
>>> to charge?
>>>
>>
>> All tasks are in memcgs. Looks like the first oom victim is not finished
>> (unable to schedule), later mem_cgroup_oom()->...->oom_evaluate_task()
>> will set oc->chosen to -1 and abort.
> 
> This shouldn't be possible for too long because oom_reaper would
> make it invisible to the oom killer so it should proceed. Also
> mem_cgroup_out_of_memory takes a mutex and that is an implicit
> scheduling point already.
> 
> Which kernel version is this?
> 

I reproduced it on "5.9.0-rc2".

oom_reaper also can't get scheduled because of 1-cpu, and the mutex
uses might_sleep() which is noop in case of "CONFIG_PREEMPT_VOLUNTARY is
not set" I mentioned in the commit log.

> And just for the clarification. I am not against the additional
> cond_resched. That sounds like a good thing in general because we do
> want to have a predictable scheduling during reclaim which is
> independent on reclaimability as much as possible. But I would like to
> drill down to why you are seeing the lockup because those shouldn't
> really happen.
>

Michal Hocko Aug. 26, 2020, 1:26 p.m. UTC | #11

On Wed 26-08-20 21:16:28, xunlei wrote:
[...]
> oom_reaper also can't get scheduled because of 1-cpu, and the mutex
> uses might_sleep() which is noop in case of "CONFIG_PREEMPT_VOLUNTARY is
> not set" I mentioned in the commit log.

OK, I see. I have clearly missed the 1cpu configuration. Sorry about
that.

Thanks for bearing with me.

Xunlei Pang Aug. 26, 2020, 1:48 p.m. UTC | #12

On 2020/8/26 下午9:26, Michal Hocko wrote:
> On Wed 26-08-20 21:16:28, xunlei wrote:
> [...]
>> oom_reaper also can't get scheduled because of 1-cpu, and the mutex
>> uses might_sleep() which is noop in case of "CONFIG_PREEMPT_VOLUNTARY is
>> not set" I mentioned in the commit log.
> 
> OK, I see. I have clearly missed the 1cpu configuration. Sorry about
> that.
> 
> Thanks for bearing with me.
> 

Thanks for the confirmation, has sent v2.

mm: memcg: Fix memcg reclaim soft lockup

Commit Message

Comments

Patch