Message ID | 1594437481-11144-1-git-send-email-laoar.shao@gmail.com (mailing list archive) |
---|---
State | New, archived |
Series | mm, oom: don't invoke oom killer if current has been reaped
On Sat, Jul 11, 2020 at 11:18 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space. So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.
>
> There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,
>
> CPU-1                CPU-2          ...      CPU-n
> thread-1             thread-2       ...      thread-n
>
> wait oom_lock        wait oom_lock  ...      hold oom_lock
>
>                                              (sigkill received)
>
>                                              select current as victim
>                                              and wakeup oom reaper
>
>                                              release oom_lock
>
>                                              (MMF_OOM_SKIP set by oom reaper)
>
>                                              (lots of pages are freed)
> hold oom_lock
>
> because MMF_OOM_SKIP
> is set, kill others
>
> The thread running on CPU-n received sigkill and will select current as the victim and wake up the oom reaper. The oom reaper will then reap its rss and free lots of pages; as a result, there will be many free pages. Although the multi-threaded task is exiting, the other threads will continue to kill others because of the check of MMF_OOM_SKIP in task_will_free_mem().
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/oom_kill.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 6e94962..a8a155a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -825,13 +825,6 @@ static bool task_will_free_mem(struct task_struct *task)
>          if (!__task_will_free_mem(task))
>                  return false;
>
> -        /*
> -         * This task has already been drained by the oom reaper so there are
> -         * only small chances it will free some more
> -         */
> -        if (test_bit(MMF_OOM_SKIP, &mm->flags))
> -                return false;
> -
>          if (atomic_read(&mm->mm_users) <= 1)
>                  return true;
>
> @@ -963,7 +956,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>           * so it can die quickly
>           */
>          task_lock(victim);
> -        if (task_will_free_mem(victim)) {
> +        if (!test_bit(MMF_OOM_SKIP, &victim->mm->flags) &&
> +            task_will_free_mem(victim)) {
>                  mark_oom_victim(victim);
>                  wake_oom_reaper(victim);
>                  task_unlock(victim);
> @@ -1056,6 +1050,10 @@ bool out_of_memory(struct oom_control *oc)
>                  return true;
>          }
>
> +        /* current has already been reaped */
> +        if (test_bit(MMF_OOM_SKIP, &current->mm->flags))
> +                return true;
> +

Oops. Should check whether mm is NULL first:

        if (mm && test_bit(MMF_OOM_SKIP, &mm->flags))

>          /*
>           * If current has a pending SIGKILL or is exiting, then automatically
>           * select it.  The goal is to allow it to allocate so that it may
> --
> 1.8.3.1
>
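For reference, a minimal sketch of that last hunk with the NULL check folded in might look as follows. This is only an illustration of the review comment, not the posted patch; it assumes current->mm can legitimately be NULL at this point (e.g. for a task that has already passed exit_mm()):

        struct mm_struct *mm = current->mm;

        /* current has already been reaped; no need to invoke the oom killer again */
        if (mm && test_bit(MMF_OOM_SKIP, &mm->flags))
                return true;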
On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space.

That is not actually true. The primary reason for this flag is to tell that the task is no longer relevant for the oom victim selection because most of its memory has been released. But the task might be stuck in many places and waiting for it to terminate might easily lock up the system. The design of the oom reaper is to guarantee forward progress when the victim cannot make forward progress on its own. For that to work the oom killer cannot rely on the victim's state or on it finishing. If you remove this fundamental assumption then the oom killer can lock up again.

> So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.

Please see the above.

> There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,
>
> CPU-1                CPU-2          ...      CPU-n
> thread-1             thread-2       ...      thread-n
>
> wait oom_lock        wait oom_lock  ...      hold oom_lock
>
>                                              (sigkill received)
>
>                                              select current as victim
>                                              and wakeup oom reaper
>
>                                              release oom_lock
>
>                                              (MMF_OOM_SKIP set by oom reaper)
>
>                                              (lots of pages are freed)
> hold oom_lock

Could you be more specific please? The page allocator never waits for the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom tries to allocate with the lock held.

Could you provide the oom reports please?
On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> On Fri 10-07-20 23:18:01, Yafang Shao wrote:
[...]
> > There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,
> >
> > CPU-1                CPU-2          ...      CPU-n
> > thread-1             thread-2       ...      thread-n
> >
> > wait oom_lock        wait oom_lock  ...      hold oom_lock
> >
> >                                              (sigkill received)
> >
> >                                              select current as victim
> >                                              and wakeup oom reaper
> >
> >                                              release oom_lock
> >
> >                                              (MMF_OOM_SKIP set by oom reaper)
> >
> >                                              (lots of pages are freed)
> > hold oom_lock
>
> Could you be more specific please? The page allocator never waits for the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom tries to allocate with the lock held.

I suspect that you are looking at the memcg oom killer. Because we do not do trylock there for some reason I do not immediately remember from the top of my head. If this is really the case then I would recommend looking into how the page allocator implements this and following the same pattern for memcg as well.
On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> [...]
> > > There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,
> > >
> > > CPU-1                CPU-2          ...      CPU-n
> > > thread-1             thread-2       ...      thread-n
> > >
> > > wait oom_lock        wait oom_lock  ...      hold oom_lock
> > >
> > >                                              (sigkill received)
> > >
> > >                                              select current as victim
> > >                                              and wakeup oom reaper
> > >
> > >                                              release oom_lock
> > >
> > >                                              (MMF_OOM_SKIP set by oom reaper)
> > >
> > >                                              (lots of pages are freed)
> > > hold oom_lock
> >
> > Could you be more specific please? The page allocator never waits for the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom tries to allocate with the lock held.
>
> I suspect that you are looking at the memcg oom killer.

Right, these threads were waiting for the oom_lock in mem_cgroup_out_of_memory().

> Because we do not do trylock there for some reason I do not immediately remember from the top of my head. If this is really the case then I would recommend looking into how the page allocator implements this and following the same pattern for memcg as well.

That is a good suggestion.
But we can't just try-lock the global oom_lock here, because a task ooming in memcg foo may not be able to help the tasks in memcg bar. IOW, we need to introduce a per-memcg oom_lock, like below:

mem_cgroup_out_of_memory
+        if (!mutex_trylock(&memcg->oom_lock))
+                return true;

        if (mutex_lock_killable(&oom_lock))
                return true;

And the memcg tree should also be considered.
On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> > [...]
> > > > There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,
> > > >
> > > > CPU-1                CPU-2          ...      CPU-n
> > > > thread-1             thread-2       ...      thread-n
> > > >
> > > > wait oom_lock        wait oom_lock  ...      hold oom_lock
> > > >
> > > >                                              (sigkill received)
> > > >
> > > >                                              select current as victim
> > > >                                              and wakeup oom reaper
> > > >
> > > >                                              release oom_lock
> > > >
> > > >                                              (MMF_OOM_SKIP set by oom reaper)
> > > >
> > > >                                              (lots of pages are freed)
> > > > hold oom_lock
> > >
> > > Could you be more specific please? The page allocator never waits for the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom tries to allocate with the lock held.
> >
> > I suspect that you are looking at the memcg oom killer.
>
> Right, these threads were waiting for the oom_lock in mem_cgroup_out_of_memory().
>
> > Because we do not do trylock there for some reason I do not immediately remember from the top of my head. If this is really the case then I would recommend looking into how the page allocator implements this and following the same pattern for memcg as well.
>
> That is a good suggestion.
> But we can't just try-lock the global oom_lock here, because a task ooming in memcg foo may not be able to help the tasks in memcg bar.

I do not follow. oom_lock is not about fwd progress. It is a big lock to synchronize against the oom_disable logic.

I have this in mind:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 248e6cad0095..29d1f8c2d968 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
         };
         bool ret;

-        if (mutex_lock_killable(&oom_lock))
+        if (!mutex_trylock(&oom_lock))
                 return true;
+
+
         /*
          * A few threads which were not waiting at mutex_lock_killable() can
          * fail to bail out. Therefore, check again after holding oom_lock.

But as I've said I would need to double check the history on why we differ here. Btw. I suspect that the mem_cgroup_out_of_memory call in mem_cgroup_oom_synchronize is bogus and can no longer trigger after 29ef680ae7c21 but this needs double checking as well.

> IOW, we need to introduce a per-memcg oom_lock, like below:

I do not see why. Besides that we already do have a per-oom memcg hierarchy lock.
On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> > On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > > > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> > > [...]
> > > > > There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,
> > > > >
> > > > > CPU-1                CPU-2          ...      CPU-n
> > > > > thread-1             thread-2       ...      thread-n
> > > > >
> > > > > wait oom_lock        wait oom_lock  ...      hold oom_lock
> > > > >
> > > > >                                              (sigkill received)
> > > > >
> > > > >                                              select current as victim
> > > > >                                              and wakeup oom reaper
> > > > >
> > > > >                                              release oom_lock
> > > > >
> > > > >                                              (MMF_OOM_SKIP set by oom reaper)
> > > > >
> > > > >                                              (lots of pages are freed)
> > > > > hold oom_lock
> > > >
> > > > Could you be more specific please? The page allocator never waits for the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom tries to allocate with the lock held.
> > >
> > > I suspect that you are looking at the memcg oom killer.
> >
> > Right, these threads were waiting for the oom_lock in mem_cgroup_out_of_memory().
> >
> > > Because we do not do trylock there for some reason I do not immediately remember from the top of my head. If this is really the case then I would recommend looking into how the page allocator implements this and following the same pattern for memcg as well.
> >
> > That is a good suggestion.
> > But we can't just try-lock the global oom_lock here, because a task ooming in memcg foo may not be able to help the tasks in memcg bar.
>
> I do not follow. oom_lock is not about fwd progress. It is a big lock to synchronize against the oom_disable logic.
>
> I have this in mind:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248e6cad0095..29d1f8c2d968 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>          };
>          bool ret;
>
> -        if (mutex_lock_killable(&oom_lock))
> +        if (!mutex_trylock(&oom_lock))
>                  return true;

                 root_mem_cgroup
                /               \
       memcg_a (16G)        memcg_b (32G)
            |                    |
   process a_1 (reach       process b_1 (reach
   memcg_a limit)           memcg_b limit)
   hold oom_lock            wait oom_lock

So we can find that process a_1 will try to kill a process in memcg_a, while process b_1 needs to try to kill a process in memcg_b. IOW, the process killed in memcg_a can't help the processes in memcg_b, so process b_1 should not trylock the oom_lock here.

While if the memcg tree is:

             target mem_cgroup (16G)
                /               \
                |                |
   process a_1 (reach       process a_2 (reach
   memcg_a limit)           memcg_a limit)
   hold oom_lock            wait oom_lock

Then process a_2 can trylock the oom_lock here. IOW, these processes should be in the same memcg.

That's why I said that we should introduce a per-memcg oom_lock.

> +
> +
>          /*
>           * A few threads which were not waiting at mutex_lock_killable() can
>           * fail to bail out. Therefore, check again after holding oom_lock.
>
> But as I've said I would need to double check the history on why we differ here. Btw. I suspect that the mem_cgroup_out_of_memory call in mem_cgroup_oom_synchronize is bogus and can no longer trigger after 29ef680ae7c21 but this needs double checking as well.
>
> > IOW, we need to introduce a per-memcg oom_lock, like below:
>
> I do not see why. Besides that we already do have a per-oom memcg hierarchy lock.
On Mon 13-07-20 21:11:50, Yafang Shao wrote:
> On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 13-07-20 20:24:07, Yafang Shao wrote:
[...]
> > > But we can't just try-lock the global oom_lock here, because a task ooming in memcg foo may not be able to help the tasks in memcg bar.
> >
> > I do not follow. oom_lock is not about fwd progress. It is a big lock to synchronize against the oom_disable logic.
> >
> > I have this in mind:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 248e6cad0095..29d1f8c2d968 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >          };
> >          bool ret;
> >
> > -        if (mutex_lock_killable(&oom_lock))
> > +        if (!mutex_trylock(&oom_lock))
> >                  return true;
>
>                  root_mem_cgroup
>                 /               \
>        memcg_a (16G)        memcg_b (32G)
>             |                    |
>    process a_1 (reach       process b_1 (reach
>    memcg_a limit)           memcg_b limit)
>    hold oom_lock            wait oom_lock
>
> So we can find that process a_1 will try to kill a process in memcg_a, while process b_1 needs to try to kill a process in memcg_b. IOW, the process killed in memcg_a can't help the processes in memcg_b, so process b_1 should not trylock the oom_lock here.
>
> While if the memcg tree is:
>
>              target mem_cgroup (16G)
>                 /               \
>                 |                |
>    process a_1 (reach       process a_2 (reach
>    memcg_a limit)           memcg_a limit)
>    hold oom_lock            wait oom_lock
>
> Then process a_2 can trylock the oom_lock here. IOW, these processes should be in the same memcg.
>
> That's why I said that we should introduce a per-memcg oom_lock.

I still fail to understand your reasoning. Sure, the oom lock is global so it doesn't have a per-oom-hierarchy resolution pretty much by definition. But that is not really important. The whole point of the trylock is to remove the ordering between the oom selection, the oom reaper and potential charge consumers which trigger the oom in parallel. With the blocking lock they would pile up in the order they have hit the OOM situation. With the trylock they would simply keep retrying until the oom is done. That would reduce the race window considerably. This is what the global oom is doing.

Another alternative would be to check mem_cgroup_margin after the lock is taken but it would be better to keep in sync with the global case as much as possible unless there is a good reason to differ.
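To make the two alternatives concrete, a rough sketch of mem_cgroup_out_of_memory() with both ideas applied (the trylock plus a margin re-check under the lock) might look like the following. This assumes the v5.8-era shape of mem_cgroup_out_of_memory() in mm/memcontrol.c and is only an illustration of the discussion, not the fix that was eventually merged:

static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
                                     int order)
{
        struct oom_control oc = {
                .zonelist = NULL,
                .nodemask = NULL,
                .memcg = memcg,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        bool ret = true;

        /*
         * Do not queue up behind a parallel oom kill; keep retrying the
         * charge instead, as the page allocator does.
         */
        if (!mutex_trylock(&oom_lock))
                return true;

        /*
         * A parallel oom kill (or the oom reaper) may already have freed
         * enough memory, so re-check the margin before killing anything.
         */
        if (mem_cgroup_margin(memcg) >= (1 << order))
                goto unlock;

        ret = should_force_charge() || out_of_memory(&oc);

unlock:
        mutex_unlock(&oom_lock);
        return ret;
}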
On 2020/07/11 12:18, Yafang Shao wrote:
> If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space. So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.

What kernel version are you using?

Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer") should solve this problem.
On 2020/07/14 4:05, Michal Hocko wrote:
> Another alternative would be to check mem_cgroup_margin after the lock is taken but it would be better to keep in sync with the global case as much as possible unless there is a good reason to differ.

If mem_cgroup_margin() is something like what zone_watermark_ok() from get_page_from_freelist() does, then checking mem_cgroup_margin() with oom_lock held will keep things more in sync with the global case (i.e. calling __alloc_pages_may_oom() with oom_lock held).
On 2020/07/14 9:15, Tetsuo Handa wrote:
> On 2020/07/14 4:05, Michal Hocko wrote:
>> Another alternative would be to check mem_cgroup_margin after the lock is taken but it would be better to keep in sync with the global case as much as possible unless there is a good reason to differ.
>
> If mem_cgroup_margin() is something like what zone_watermark_ok() from get_page_from_freelist() does, then checking mem_cgroup_margin() with oom_lock held will keep things more in sync with the global case (i.e. calling __alloc_pages_may_oom() with oom_lock held).

I meant:

(i.e. calling get_page_from_freelist() from __alloc_pages_may_oom() with oom_lock held).
On Tue, Jul 14, 2020 at 3:05 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 13-07-20 21:11:50, Yafang Shao wrote:
> > On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> [...]
> > > > But we can't just try-lock the global oom_lock here, because a task ooming in memcg foo may not be able to help the tasks in memcg bar.
> > >
> > > I do not follow. oom_lock is not about fwd progress. It is a big lock to synchronize against the oom_disable logic.
> > >
> > > I have this in mind:
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 248e6cad0095..29d1f8c2d968 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >          };
> > >          bool ret;
> > >
> > > -        if (mutex_lock_killable(&oom_lock))
> > > +        if (!mutex_trylock(&oom_lock))
> > >                  return true;
> >
> >                  root_mem_cgroup
> >                 /               \
> >        memcg_a (16G)        memcg_b (32G)
> >             |                    |
> >    process a_1 (reach       process b_1 (reach
> >    memcg_a limit)           memcg_b limit)
> >    hold oom_lock            wait oom_lock
> >
> > So we can find that process a_1 will try to kill a process in memcg_a, while process b_1 needs to try to kill a process in memcg_b. IOW, the process killed in memcg_a can't help the processes in memcg_b, so process b_1 should not trylock the oom_lock here.
> >
> > While if the memcg tree is:
> >
> >              target mem_cgroup (16G)
> >                 /               \
> >                 |                |
> >    process a_1 (reach       process a_2 (reach
> >    memcg_a limit)           memcg_a limit)
> >    hold oom_lock            wait oom_lock
> >
> > Then process a_2 can trylock the oom_lock here. IOW, these processes should be in the same memcg.
> >
> > That's why I said that we should introduce a per-memcg oom_lock.
>
> I still fail to understand your reasoning. Sure, the oom lock is global so it doesn't have a per-oom-hierarchy resolution pretty much by definition. But that is not really important. The whole point of the trylock is to remove the ordering between the oom selection, the oom reaper and potential charge consumers which trigger the oom in parallel. With the blocking lock they would pile up in the order they have hit the OOM situation. With the trylock they would simply keep retrying until the oom is done. That would reduce the race window considerably. This is what the global oom is doing.

Thanks for the explanation. Seems trylock can work.

> Another alternative would be to check mem_cgroup_margin after the lock is taken but it would be better to keep in sync with the global case as much as possible unless there is a good reason to differ.
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Jul 14, 2020 at 7:50 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/07/11 12:18, Yafang Shao wrote:
> > If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space. So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.
>
> What kernel version are you using?

The kernel version is 4.18.

> Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer") should solve this problem.

Yes, this commit should fix this issue. Thanks for the information.

But it seems the proposal of using trylock in mem_cgroup_out_of_memory() would be better? The trylock could also solve the problem of different processes doing oom at the same time.
On 2020/07/14 11:13, Yafang Shao wrote:
> But it seems the proposal of using trylock in mem_cgroup_out_of_memory() would be better? The trylock could also solve the problem of different processes doing oom at the same time.

I think trylock is worse. The trylock needlessly wastes CPU time which could have been utilized by the OOM killer/reaper for reclaiming memory. If concurrent OOM on multiple non-overlapping memcg domains is a real problem, a killable lock on a per-memcg oom_lock will be better than a trylock on the global oom_lock.
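A rough sketch of that direction follows, purely for illustration: the memcg->oom_mutex used below is hypothetical (it does not exist upstream), and the sketch deliberately ignores both the memcg hierarchy and the global oom_lock that oom_killer_disable() synchronizes against:

static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
                                     int order)
{
        struct oom_control oc = {
                .memcg = memcg,
                .gfp_mask = gfp_mask,
                .order = order,
        };
        bool ret;

        /* sleep killably, but only behind OOM kills in the same memcg */
        if (mutex_lock_killable(&memcg->oom_mutex))
                return true;

        ret = should_force_charge() || out_of_memory(&oc);
        mutex_unlock(&memcg->oom_mutex);
        return ret;
}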
On Tue, Jul 14, 2020 at 10:42 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/07/14 11:13, Yafang Shao wrote:
> > But it seems the proposal of using trylock in mem_cgroup_out_of_memory() would be better? The trylock could also solve the problem of different processes doing oom at the same time.
>
> I think trylock is worse. The trylock needlessly wastes CPU time which could have been utilized by the OOM killer/reaper for reclaiming memory.

If it wastes CPU time, we can sched it out for 1 second like what it does in __alloc_pages_may_oom():

__alloc_pages_may_oom
        if (!mutex_trylock(&oom_lock)) {
                schedule_timeout_uninterruptible(1); // to avoid wasting CPU time
                return;
        }

But I find that we don't sched it out in the pagefault path:

pagefault_out_of_memory
        if (!mutex_trylock(&oom_lock))
                return;

I haven't thought deeply about what the difference is ...

> If concurrent OOM on multiple non-overlapping memcg domains is a real problem, a killable lock on a per-memcg oom_lock will be better than a trylock on the global oom_lock.

I am wondering why we select the process per memcg, for example,

select_bad_process
        if (is_memcg_oom(oc))
                mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
        else
                mem_cgroup_scan_tasks(root_mem_cgroup, oom_evaluate_task, oc);

Then we can put the lock here to lock each memcg, and avoid the oom contention between different memcgs.
On 2020/07/14 11:58, Yafang Shao wrote:
> On Tue, Jul 14, 2020 at 10:42 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> On 2020/07/14 11:13, Yafang Shao wrote:
>>> But it seems the proposal of using trylock in mem_cgroup_out_of_memory() would be better? The trylock could also solve the problem of different processes doing oom at the same time.
>>
>> I think trylock is worse. The trylock needlessly wastes CPU time which could have been utilized by the OOM killer/reaper for reclaiming memory.
>
> If it wastes CPU time, we can sched it out for 1 second like what it does in __alloc_pages_may_oom():
>
> __alloc_pages_may_oom
>         if (!mutex_trylock(&oom_lock)) {
>                 schedule_timeout_uninterruptible(1); // to avoid wasting CPU time

1 second is HZ jiffies. 1 means 1 millisecond if CONFIG_HZ=1000. :-)

>                 return;
>         }
>
> But I find that we don't sched it out in the pagefault path:
>
> pagefault_out_of_memory
>         if (!mutex_trylock(&oom_lock))
>                 return;
>
> I haven't thought deeply about what the difference is ...

David Rientjes is proposing it for avoiding soft lockups, and Michal Hocko is refusing it. How to give the OOM killer/reaper enough CPU time for reclaiming memory is a dogfight. :-(

https://lkml.kernel.org/r/alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com
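To spell out the units: the argument to schedule_timeout_uninterruptible() is in jiffies, so the snippets below only illustrate the difference (they are not a proposed change):

        /* one jiffy: ~1 ms with CONFIG_HZ=1000, ~10 ms with CONFIG_HZ=100 */
        schedule_timeout_uninterruptible(1);

        /* roughly one second, independent of CONFIG_HZ */
        schedule_timeout_uninterruptible(HZ);
        /* or equivalently */
        schedule_timeout_uninterruptible(msecs_to_jiffies(1000));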
On Tue, Jul 14, 2020 at 12:06 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/07/14 11:58, Yafang Shao wrote:
> > On Tue, Jul 14, 2020 at 10:42 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >>
> >> On 2020/07/14 11:13, Yafang Shao wrote:
> >>> But it seems the proposal of using trylock in mem_cgroup_out_of_memory() would be better? The trylock could also solve the problem of different processes doing oom at the same time.
> >>
> >> I think trylock is worse. The trylock needlessly wastes CPU time which could have been utilized by the OOM killer/reaper for reclaiming memory.
> >
> > If it wastes CPU time, we can sched it out for 1 second like what it does in __alloc_pages_may_oom():
> >
> > __alloc_pages_may_oom
> >         if (!mutex_trylock(&oom_lock)) {
> >                 schedule_timeout_uninterruptible(1); // to avoid wasting CPU time
>
> 1 second is HZ jiffies. 1 means 1 millisecond if CONFIG_HZ=1000. :-)

Right. Thanks for pointing it out :)

> >                 return;
> >         }
> >
> > But I find that we don't sched it out in the pagefault path:
> >
> > pagefault_out_of_memory
> >         if (!mutex_trylock(&oom_lock))
> >                 return;
> >
> > I haven't thought deeply about what the difference is ...
>
> David Rientjes is proposing it for avoiding soft lockups, and Michal Hocko is refusing it. How to give the OOM killer/reaper enough CPU time for reclaiming memory is a dogfight. :-(
>
> https://lkml.kernel.org/r/alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com

OK. I will take a look at the discussion in that thread.
On Tue 14-07-20 08:50:57, Tetsuo Handa wrote:
> On 2020/07/11 12:18, Yafang Shao wrote:
> > If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space. So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.
>
> What kernel version are you using?
>
> Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer") should solve this problem.

I haven't really seen the oom report but this would solve only half of the problem. It is still possible that other tasks from the oom memcg hierarchy can be blocked on the oom lock and then observe the oom-reaped tasks. The trylock or the check for the margin is needed to address the other half of the problem.
On Tue 14-07-20 13:06:48, Tetsuo Handa wrote:
> On 2020/07/14 11:58, Yafang Shao wrote:
[...]
> > But I find that we don't sched it out in the pagefault path:
> >
> > pagefault_out_of_memory
> >         if (!mutex_trylock(&oom_lock))
> >                 return;
> >
> > I haven't thought deeply about what the difference is ...
>
> David Rientjes is proposing it for avoiding soft lockups, and Michal Hocko is refusing it. How to give the OOM killer/reaper enough CPU time for reclaiming memory is a dogfight. :-(
>
> https://lkml.kernel.org/r/alpine.DEB.2.21.2003181458100.70237@chino.kir.corp.google.com

I believe there was review feedback to be addressed and it was never followed up on.
On Tue, Jul 14, 2020 at 2:43 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 14-07-20 08:50:57, Tetsuo Handa wrote:
> > On 2020/07/11 12:18, Yafang Shao wrote:
> > > If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space. So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.
> >
> > What kernel version are you using?
> >
> > Commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer") should solve this problem.
>
> I haven't really seen the oom report but this would solve only half of the problem. It is still possible that other tasks from the oom memcg hierarchy can be blocked on the oom lock and then observe the oom-reaped tasks. The trylock or the check for the margin is needed to address the other half of the problem.

Agreed. The case of different tasks doing memcg oom in the same memcg should also be resolved. I will send a patch for it.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962..a8a155a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -825,13 +825,6 @@ static bool task_will_free_mem(struct task_struct *task)
         if (!__task_will_free_mem(task))
                 return false;

-        /*
-         * This task has already been drained by the oom reaper so there are
-         * only small chances it will free some more
-         */
-        if (test_bit(MMF_OOM_SKIP, &mm->flags))
-                return false;
-
         if (atomic_read(&mm->mm_users) <= 1)
                 return true;

@@ -963,7 +956,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
          * so it can die quickly
          */
         task_lock(victim);
-        if (task_will_free_mem(victim)) {
+        if (!test_bit(MMF_OOM_SKIP, &victim->mm->flags) &&
+            task_will_free_mem(victim)) {
                 mark_oom_victim(victim);
                 wake_oom_reaper(victim);
                 task_unlock(victim);
@@ -1056,6 +1050,10 @@ bool out_of_memory(struct oom_control *oc)
                 return true;
         }

+        /* current has already been reaped */
+        if (test_bit(MMF_OOM_SKIP, &current->mm->flags))
+                return true;
+
         /*
          * If current has a pending SIGKILL or is exiting, then automatically
          * select it.  The goal is to allow it to allocate so that it may
If current's MMF_OOM_SKIP is set, it means that current is exiting or dying and is likely to release its address space. So we don't need to invoke the oom killer again. Otherwise that may cause some unexpected issues; for example, below is an issue found in our production environment.

There are many threads of a multi-threaded task running in parallel in a container on many cpus. Then many threads triggered OOM at the same time,

CPU-1                CPU-2          ...      CPU-n
thread-1             thread-2       ...      thread-n

wait oom_lock        wait oom_lock  ...      hold oom_lock

                                             (sigkill received)

                                             select current as victim
                                             and wakeup oom reaper

                                             release oom_lock

                                             (MMF_OOM_SKIP set by oom reaper)

                                             (lots of pages are freed)
hold oom_lock

because MMF_OOM_SKIP
is set, kill others

The thread running on CPU-n received sigkill and will select current as the victim and wake up the oom reaper. The oom reaper will then reap its rss and free lots of pages; as a result, there will be many free pages. Although the multi-threaded task is exiting, the other threads will continue to kill others because of the check of MMF_OOM_SKIP in task_will_free_mem().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/oom_kill.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)