Message ID | 1598426822-93737-1-git-send-email-xlpang@linux.alibaba.com
---|---
State | New, archived
Series | mm: memcg: Fix memcg reclaim soft lockup

On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> the target memcg doesn't have any reclaimable memory.

Do you have any scenario when this happens or is this some sort of a
test case?

> It can be easily reproduced as below:
> watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
> CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
> Call Trace:
> shrink_lruvec+0x49f/0x640
> shrink_node+0x2a6/0x6f0
> do_try_to_free_pages+0xe9/0x3e0
> try_to_free_mem_cgroup_pages+0xef/0x1f0
> try_charge+0x2c1/0x750
> mem_cgroup_charge+0xd7/0x240
> __add_to_page_cache_locked+0x2fd/0x370
> add_to_page_cache_lru+0x4a/0xc0
> pagecache_get_page+0x10b/0x2f0
> filemap_fault+0x661/0xad0
> ext4_filemap_fault+0x2c/0x40
> __do_fault+0x4d/0xf9
> handle_mm_fault+0x1080/0x1790
>
> It only happens on our 1-vcpu instances, because there's no chance
> for oom reaper to run to reclaim the to-be-killed process.
>
> Add cond_resched() in such cases at the beginning of shrink_lruvec()
> to give up the cpu to others.

I do agree that we need a cond_resched but I cannot say I would like
this patch. The primary reason is that it doesn't catch all cases when
the memcg is not reclaimable. For example it wouldn't reschedule if the
memcg is protected by low/min. What do you think about this instead?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796eb833..bbdc38b58cc5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2617,6 +2617,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
                 mem_cgroup_calculate_protection(target_memcg, memcg);
 
+                cond_resched();
+
                 if (mem_cgroup_below_min(memcg)) {
                         /*
                          * Hard protection.

This should catch both cases. I even have a vague recollection that
somebody has proposed something in that direction but I cannot remember
what has happened with that patch.

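For context, the hunk above sits inside the per-memcg iteration of shrink_node_memcgs(). Below is a simplified, non-verbatim sketch of that loop as it looks in 5.9-era mm/vmscan.c (slab shrinking, accounting and several details elided), with the proposed call in place; it shows why a cond_resched() at this point is reached for every memcg in the hierarchy walk, including memcgs that get skipped as min/low protected or that have nothing reclaimable on their LRUs.

/*
 * Simplified sketch of shrink_node_memcgs() (5.9-era mm/vmscan.c);
 * not the verbatim kernel source, several steps are elided.
 */
static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
        do {
                struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

                mem_cgroup_calculate_protection(target_memcg, memcg);

                /* Proposed: one scheduling point per iterated memcg. */
                cond_resched();

                if (mem_cgroup_below_min(memcg)) {
                        /* Hard protection: skip this memcg entirely. */
                        continue;
                } else if (mem_cgroup_below_low(memcg)) {
                        /* Soft protection: usually skipped as well. */
                        if (!sc->memcg_low_reclaim) {
                                sc->memcg_low_skipped = 1;
                                continue;
                        }
                }

                shrink_lruvec(lruvec, sc);
                /* shrink_slab() and vmpressure/refault accounting elided. */
        } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}
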
On 2020/8/26 4:11 PM, Michal Hocko wrote:
> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>> the target memcg doesn't have any reclaimable memory.
>
> Do you have any scenario when this happens or is this some sort of a
> test case?

It can happen on tiny guest scenarios.

>
>> It can be easily reproduced as below:
>> watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
>> CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
>> Call Trace:
>> shrink_lruvec+0x49f/0x640
>> shrink_node+0x2a6/0x6f0
>> do_try_to_free_pages+0xe9/0x3e0
>> try_to_free_mem_cgroup_pages+0xef/0x1f0
>> try_charge+0x2c1/0x750
>> mem_cgroup_charge+0xd7/0x240
>> __add_to_page_cache_locked+0x2fd/0x370
>> add_to_page_cache_lru+0x4a/0xc0
>> pagecache_get_page+0x10b/0x2f0
>> filemap_fault+0x661/0xad0
>> ext4_filemap_fault+0x2c/0x40
>> __do_fault+0x4d/0xf9
>> handle_mm_fault+0x1080/0x1790
>>
>> It only happens on our 1-vcpu instances, because there's no chance
>> for oom reaper to run to reclaim the to-be-killed process.
>>
>> Add cond_resched() in such cases at the beginning of shrink_lruvec()
>> to give up the cpu to others.
>
> I do agree that we need a cond_resched but I cannot say I would like
> this patch. The primary reason is that it doesn't catch all cases when
> the memcg is not reclaimable. For example it wouldn't reschedule if the
> memcg is protected by low/min. What do you think about this instead?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 99e1796eb833..bbdc38b58cc5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2617,6 +2617,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> 
>                  mem_cgroup_calculate_protection(target_memcg, memcg);
> 
> +                cond_resched();
> +
>                  if (mem_cgroup_below_min(memcg)) {
>                          /*
>                           * Hard protection.
> 
> This should catch both cases. I even have a vague recollection that
> somebody has proposed something in that direction but I cannot remember
> what has happened with that patch.
>

It's the endless "retry" in try_charge() that caused the softlockup, and
I think mem_cgroup_protected() will eventually return MEMCG_PROT_NONE,
and shrink_node_memcgs() will call shrink_lruvec() for memcg self-reclaim
cases, so it's not a problem here.

But adding cond_resched() at the upper level in shrink_node_memcgs() may
eliminate potential similar issues; I have no objection to this approach.

I tested it and it works well, will send v2 later.

On Wed 26-08-20 18:41:18, xunlei wrote:
> On 2020/8/26 4:11 PM, Michal Hocko wrote:
> > On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> >> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> >> the target memcg doesn't have any reclaimable memory.
> >
> > Do you have any scenario when this happens or is this some sort of a
> > test case?
>
> It can happen on tiny guest scenarios.

OK, you made me more curious. If this is a tiny guest and this is a hard
limit reclaim path then we should trigger an oom killer which should
kill the offender and that in turn bail out from the try_charge loop
(see should_force_charge). So how come this repeats enough in your setup
that it causes soft lockups?

On 2020/8/26 7:00 PM, Michal Hocko wrote:
> On Wed 26-08-20 18:41:18, xunlei wrote:
>> On 2020/8/26 4:11 PM, Michal Hocko wrote:
>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>> the target memcg doesn't have any reclaimable memory.
>>>
>>> Do you have any scenario when this happens or is this some sort of a
>>> test case?
>>
>> It can happen on tiny guest scenarios.
>
> OK, you made me more curious. If this is a tiny guest and this is a hard
> limit reclaim path then we should trigger an oom killer which should
> kill the offender and that in turn bail out from the try_charge loop
> (see should_force_charge). So how come this repeats enough in your setup
> that it causes soft lockups?
>

        oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
                       get_order(nr_pages * PAGE_SIZE));
        switch (oom_status) {
        case OOM_SUCCESS:
                nr_retries = MAX_RECLAIM_RETRIES;
                goto retry;

It retries here endlessly, because oom reaper has no cpu to schedule.

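The snippet above sits at the tail of the charge retry loop. A condensed, non-verbatim sketch of that loop (5.9-era try_charge() in mm/memcontrol.c; most of the real bail-out conditions are dropped) makes the shape of the livelock clearer: a successful memcg OOM invocation just refills the retry budget and jumps back, and nothing on this path yields the CPU on a CONFIG_PREEMPT_NONE kernel.

/*
 * Condensed sketch of the try_charge() retry loop (5.9-era mm/memcontrol.c);
 * not verbatim, most exit conditions omitted.
 */
retry:
        if (consume_stock(memcg, nr_pages))
                return 0;

        /* page_counter charge failed: mem_over_limit is over its limit */

        if (unlikely(should_force_charge()))
                goto force;     /* only OOM victims and exiting tasks */

        nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
                                                    gfp_mask, may_swap);
        if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
                goto retry;

        if (nr_retries--)
                goto retry;

        oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
                                    get_order(nr_pages * PAGE_SIZE));
        switch (oom_status) {
        case OOM_SUCCESS:
                nr_retries = MAX_RECLAIM_RETRIES;
                goto retry;     /* no explicit scheduling point on this path */
        case OOM_FAILED:
                goto force;
        default:
                goto nomem;
        }
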
On 2020/8/26 7:45 PM, xunlei wrote:
> On 2020/8/26 7:00 PM, Michal Hocko wrote:
>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>> On 2020/8/26 4:11 PM, Michal Hocko wrote:
>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>> the target memcg doesn't have any reclaimable memory.
>>>>
>>>> Do you have any scenario when this happens or is this some sort of a
>>>> test case?
>>>
>>> It can happen on tiny guest scenarios.
>>
>> OK, you made me more curious. If this is a tiny guest and this is a hard
>> limit reclaim path then we should trigger an oom killer which should
>> kill the offender and that in turn bail out from the try_charge loop
>> (see should_force_charge). So how come this repeats enough in your setup
>> that it causes soft lockups?
>>
>
>         oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>                        get_order(nr_pages * PAGE_SIZE));
>         switch (oom_status) {
>         case OOM_SUCCESS:
>                 nr_retries = MAX_RECLAIM_RETRIES;

Actually we can add "cond_resched()" here, but I think it's better to
have one in the memcg reclaim path to avoid other unexpected issues.

>                 goto retry;
>
> It retries here endlessly, because oom reaper has no cpu to schedule.
>

On 2020/8/26 7:00 PM, Michal Hocko wrote:
> On Wed 26-08-20 18:41:18, xunlei wrote:
>> On 2020/8/26 4:11 PM, Michal Hocko wrote:
>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>> the target memcg doesn't have any reclaimable memory.
>>>
>>> Do you have any scenario when this happens or is this some sort of a
>>> test case?
>>
>> It can happen on tiny guest scenarios.
>
> OK, you made me more curious. If this is a tiny guest and this is a hard
> limit reclaim path then we should trigger an oom killer which should
> kill the offender and that in turn bail out from the try_charge loop
> (see should_force_charge). So how come this repeats enough in your setup
> that it causes soft lockups?
>

should_force_charge() is false, the current trapped in endless loop is
not the oom victim.

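For reference, the bail-out mentioned above is narrow: should_force_charge() (5.9-era mm/memcontrol.c, sketched below with comments added) only returns true for a task that has itself been chosen by the OOM killer, has a fatal signal pending, or is already exiting. An unrelated task spinning in try_charge() therefore never takes that exit.

/* Sketch of should_force_charge() (5.9-era mm/memcontrol.c). */
static bool should_force_charge(void)
{
        return tsk_is_oom_victim(current) ||    /* current was picked by the OOM killer */
                fatal_signal_pending(current) ||        /* current has SIGKILL pending */
                (current->flags & PF_EXITING);  /* current is already exiting */
}
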
On Wed 26-08-20 20:00:47, xunlei wrote:
> On 2020/8/26 7:00 PM, Michal Hocko wrote:
> > On Wed 26-08-20 18:41:18, xunlei wrote:
> >> On 2020/8/26 4:11 PM, Michal Hocko wrote:
> >>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> >>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> >>>> the target memcg doesn't have any reclaimable memory.
> >>>
> >>> Do you have any scenario when this happens or is this some sort of a
> >>> test case?
> >>
> >> It can happen on tiny guest scenarios.
> >
> > OK, you made me more curious. If this is a tiny guest and this is a hard
> > limit reclaim path then we should trigger an oom killer which should
> > kill the offender and that in turn bail out from the try_charge loop
> > (see should_force_charge). So how come this repeats enough in your setup
> > that it causes soft lockups?
> >
>
> should_force_charge() is false, the current trapped in endless loop is
> not the oom victim.

How is that possible? If the oom killer kills a task and that doesn't
resolve the oom situation then it would go after another one until all
tasks are killed. Or is your task living outside of the memcg it tries
to charge?

On 2020/8/26 8:07 PM, Michal Hocko wrote:
> On Wed 26-08-20 20:00:47, xunlei wrote:
>> On 2020/8/26 7:00 PM, Michal Hocko wrote:
>>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>>> On 2020/8/26 4:11 PM, Michal Hocko wrote:
>>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>>> the target memcg doesn't have any reclaimable memory.
>>>>>
>>>>> Do you have any scenario when this happens or is this some sort of a
>>>>> test case?
>>>>
>>>> It can happen on tiny guest scenarios.
>>>
>>> OK, you made me more curious. If this is a tiny guest and this is a hard
>>> limit reclaim path then we should trigger an oom killer which should
>>> kill the offender and that in turn bail out from the try_charge loop
>>> (see should_force_charge). So how come this repeats enough in your setup
>>> that it causes soft lockups?
>>>
>>
>> should_force_charge() is false, the current trapped in endless loop is
>> not the oom victim.
>
> How is that possible? If the oom killer kills a task and that doesn't
> resolve the oom situation then it would go after another one until all
> tasks are killed. Or is your task living outside of the memcg it tries
> to charge?
>

All tasks are in memcgs. Looks like the first oom victim is not finished
(unable to schedule), later mem_cgroup_oom()->...->oom_evaluate_task()
will set oc->chosen to -1 and abort.

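The abort being described lives in oom_evaluate_task() (5.9-era mm/oom_kill.c). As long as an earlier victim has not had MMF_OOM_SKIP set on its mm (normally done by the oom reaper or by the victim's own exit path, neither of which can run here), victim selection aborts instead of picking a new task. A trimmed, non-verbatim sketch of the relevant branches:

/* Trimmed sketch of oom_evaluate_task() (5.9-era mm/oom_kill.c); not verbatim. */
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
        struct oom_control *oc = arg;

        /* ... eligibility checks elided ... */

        /*
         * An earlier victim still holds its memory: don't select another
         * task unless the reaper has marked the victim's mm MMF_OOM_SKIP.
         */
        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        goto next;
                goto abort;
        }

        /* ... badness scoring and victim selection elided ... */
next:
        return 0;
abort:
        if (oc->chosen)
                put_task_struct(oc->chosen);
        oc->chosen = (void *)-1UL;      /* the "-1" mentioned above */
        return 1;                       /* stop iterating; no new victim */
}
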
On Wed 26-08-20 20:21:39, xunlei wrote:
> On 2020/8/26 8:07 PM, Michal Hocko wrote:
> > On Wed 26-08-20 20:00:47, xunlei wrote:
> >> On 2020/8/26 7:00 PM, Michal Hocko wrote:
> >>> On Wed 26-08-20 18:41:18, xunlei wrote:
> >>>> On 2020/8/26 4:11 PM, Michal Hocko wrote:
> >>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
> >>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
> >>>>>> the target memcg doesn't have any reclaimable memory.
> >>>>>
> >>>>> Do you have any scenario when this happens or is this some sort of a
> >>>>> test case?
> >>>>
> >>>> It can happen on tiny guest scenarios.
> >>>
> >>> OK, you made me more curious. If this is a tiny guest and this is a hard
> >>> limit reclaim path then we should trigger an oom killer which should
> >>> kill the offender and that in turn bail out from the try_charge loop
> >>> (see should_force_charge). So how come this repeats enough in your setup
> >>> that it causes soft lockups?
> >>>
> >>
> >> should_force_charge() is false, the current trapped in endless loop is
> >> not the oom victim.
> >
> > How is that possible? If the oom killer kills a task and that doesn't
> > resolve the oom situation then it would go after another one until all
> > tasks are killed. Or is your task living outside of the memcg it tries
> > to charge?
> >
>
> All tasks are in memcgs. Looks like the first oom victim is not finished
> (unable to schedule), later mem_cgroup_oom()->...->oom_evaluate_task()
> will set oc->chosen to -1 and abort.

This shouldn't be possible for too long because oom_reaper would
make it invisible to the oom killer so it should proceed. Also
mem_cgroup_out_of_memory takes a mutex and that is an implicit
scheduling point already.

Which kernel version is this?

And just for the clarification. I am not against the additional
cond_resched. That sounds like a good thing in general because we do
want to have a predictable scheduling during reclaim which is
independent of reclaimability as much as possible. But I would like to
drill down to why you are seeing the lockup because those shouldn't
really happen.

On 2020/8/26 8:48 PM, Michal Hocko wrote:
> On Wed 26-08-20 20:21:39, xunlei wrote:
>> On 2020/8/26 8:07 PM, Michal Hocko wrote:
>>> On Wed 26-08-20 20:00:47, xunlei wrote:
>>>> On 2020/8/26 7:00 PM, Michal Hocko wrote:
>>>>> On Wed 26-08-20 18:41:18, xunlei wrote:
>>>>>> On 2020/8/26 4:11 PM, Michal Hocko wrote:
>>>>>>> On Wed 26-08-20 15:27:02, Xunlei Pang wrote:
>>>>>>>> We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
>>>>>>>> the target memcg doesn't have any reclaimable memory.
>>>>>>>
>>>>>>> Do you have any scenario when this happens or is this some sort of a
>>>>>>> test case?
>>>>>>
>>>>>> It can happen on tiny guest scenarios.
>>>>>
>>>>> OK, you made me more curious. If this is a tiny guest and this is a hard
>>>>> limit reclaim path then we should trigger an oom killer which should
>>>>> kill the offender and that in turn bail out from the try_charge loop
>>>>> (see should_force_charge). So how come this repeats enough in your setup
>>>>> that it causes soft lockups?
>>>>>
>>>>
>>>> should_force_charge() is false, the current trapped in endless loop is
>>>> not the oom victim.
>>>
>>> How is that possible? If the oom killer kills a task and that doesn't
>>> resolve the oom situation then it would go after another one until all
>>> tasks are killed. Or is your task living outside of the memcg it tries
>>> to charge?
>>>
>>
>> All tasks are in memcgs. Looks like the first oom victim is not finished
>> (unable to schedule), later mem_cgroup_oom()->...->oom_evaluate_task()
>> will set oc->chosen to -1 and abort.
>
> This shouldn't be possible for too long because oom_reaper would
> make it invisible to the oom killer so it should proceed. Also
> mem_cgroup_out_of_memory takes a mutex and that is an implicit
> scheduling point already.
>
> Which kernel version is this?
>

I reproduced it on "5.9.0-rc2".

oom_reaper also can't get scheduled because of 1-cpu, and the mutex
uses might_sleep() which is noop in case of "CONFIG_PREEMPT_VOLUNTARY is
not set" I mentioned in the commit log.

> And just for the clarification. I am not against the additional
> cond_resched. That sounds like a good thing in general because we do
> want to have a predictable scheduling during reclaim which is
> independent of reclaimability as much as possible. But I would like to
> drill down to why you are seeing the lockup because those shouldn't
> really happen.
>

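The might_sleep() point deserves spelling out: the annotation inside mutex_lock() only becomes a voluntary preemption point on CONFIG_PREEMPT_VOLUNTARY kernels. Roughly, from 5.9-era include/linux/kernel.h (simplified; the CONFIG_DEBUG_ATOMIC_SLEEP variant is left out):

/*
 * Simplified sketch of the relevant macros (5.9-era include/linux/kernel.h,
 * !CONFIG_DEBUG_ATOMIC_SLEEP case).
 */
#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()        /* real scheduling point */
#else
# define might_resched() do { } while (0)       /* no-op with CONFIG_PREEMPT_NONE=y */
#endif

/* Without CONFIG_DEBUG_ATOMIC_SLEEP, might_sleep() is just might_resched(). */
# define might_sleep() do { might_resched(); } while (0)
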
On Wed 26-08-20 21:16:28, xunlei wrote:
[...]
> oom_reaper also can't get scheduled because of 1-cpu, and the mutex
> uses might_sleep() which is noop in case of "CONFIG_PREEMPT_VOLUNTARY is
> not set" I mentioned in the commit log.

OK, I see. I have clearly missed the 1cpu configuration. Sorry about
that.

Thanks for bearing with me.

On 2020/8/26 9:26 PM, Michal Hocko wrote:
> On Wed 26-08-20 21:16:28, xunlei wrote:
> [...]
>> oom_reaper also can't get scheduled because of 1-cpu, and the mutex
>> uses might_sleep() which is noop in case of "CONFIG_PREEMPT_VOLUNTARY is
>> not set" I mentioned in the commit log.
>
> OK, I see. I have clearly missed the 1cpu configuration. Sorry about
> that.
>
> Thanks for bearing with me.
>

Thanks for the confirmation, I have sent v2.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796..349a88e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2449,6 +2449,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
         scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
                          sc->priority == DEF_PRIORITY);
 
+        /* memcg reclaim may run into no reclaimable lru pages */
+        if (nr[LRU_ACTIVE_FILE] == 0 &&
+            nr[LRU_INACTIVE_FILE] == 0 &&
+            nr[LRU_INACTIVE_ANON] == 0)
+                cond_resched();
+
         blk_start_plug(&plug);
         while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                nr[LRU_INACTIVE_FILE]) {

We've met softlockup with "CONFIG_PREEMPT_NONE=y", when
the target memcg doesn't have any reclaimable memory.

It can be easily reproduced as below:

watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
Call Trace:
 shrink_lruvec+0x49f/0x640
 shrink_node+0x2a6/0x6f0
 do_try_to_free_pages+0xe9/0x3e0
 try_to_free_mem_cgroup_pages+0xef/0x1f0
 try_charge+0x2c1/0x750
 mem_cgroup_charge+0xd7/0x240
 __add_to_page_cache_locked+0x2fd/0x370
 add_to_page_cache_lru+0x4a/0xc0
 pagecache_get_page+0x10b/0x2f0
 filemap_fault+0x661/0xad0
 ext4_filemap_fault+0x2c/0x40
 __do_fault+0x4d/0xf9
 handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance
for oom reaper to run to reclaim the to-be-killed process.

Add cond_resched() in such cases at the beginning of shrink_lruvec()
to give up the cpu to others.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
 mm/vmscan.c | 6 ++++++
 1 file changed, 6 insertions(+)

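The trace above corresponds to page-cache faults being charged to a constrained memcg. A hypothetical userspace reproducer along these lines (the file name, sizes, and the memcg setup are illustrative assumptions, not taken from the report) is to pin the memcg's quota with mlocked anonymous memory and then fault in a large file, so that every charge forces memcg reclaim with nothing reclaimable; run inside a small memcg on a 1-vcpu guest.

/*
 * Hypothetical reproducer sketch; assumes the process has already been moved
 * into a memcg with a small hard limit on a 1-vCPU guest, and that ./bigfile
 * exists and is larger than that limit. Illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        /* Pin most of the memcg limit with unreclaimable (mlocked) anon memory. */
        size_t anon_sz = 200 << 20;     /* illustrative size */
        char *anon = mmap(NULL, anon_sz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (anon == MAP_FAILED) {
                perror("mmap anon");
                return 1;
        }
        if (mlock(anon, anon_sz))
                perror("mlock");        /* needs a suitable RLIMIT_MEMLOCK */
        memset(anon, 1, anon_sz);

        /* Fault in file-backed pages: filemap_fault -> mem_cgroup_charge -> try_charge. */
        int fd = open("bigfile", O_RDONLY);
        if (fd < 0) {
                perror("open bigfile");
                return 1;
        }
        struct stat st;
        if (fstat(fd, &st))
                return 1;
        char *file = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (file == MAP_FAILED) {
                perror("mmap file");
                return 1;
        }

        volatile char sum = 0;
        for (off_t off = 0; off < st.st_size; off += 4096)
                sum += file[off];       /* each miss charges the memcg */
        return sum;
}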