Message ID: 201805122318.HJG81246.MFVFLFJOOQtSHO@I-love.SAKURA.ne.jp
On Sat 12-05-18 23:18:24, Tetsuo Handa wrote:
[...]
> @@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		/*
> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> +		 */
> +		if (!tsk_is_oom_victim(current))
> +			schedule_timeout_uninterruptible(1);
>  		goto retry;

We already do have that sleep for PF_WQ_WORKER in should_reclaim_retry.
Why do we need it here as well?
Michal Hocko wrote:
> On Sat 12-05-18 23:18:24, Tetsuo Handa wrote:
> [...]
> > @@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  	/* Retry as long as the OOM killer is making progress */
> >  	if (did_some_progress) {
> >  		no_progress_loops = 0;
> > +		/*
> > +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> > +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> > +		 */
> > +		if (!tsk_is_oom_victim(current))
> > +			schedule_timeout_uninterruptible(1);
> >  		goto retry;
>
> We already do have that sleep for PF_WQ_WORKER in should_reclaim_retry.
> Why do we need it here as well?

Because that path depends on __zone_watermark_ok() == true which is not
guaranteed to be executed.

I consider that this "goto retry;" is a good location for a short sleep.
The current code is so conditional that there are cases which needlessly
retry without sleeping (e.g. the current thread finds an OOM victim at
select_bad_process() and immediately retries the allocation attempt rather
than giving the OOM victim CPU resources for releasing memory) and cases
which needlessly sleep (e.g. the current thread was selected as an OOM
victim but mutex_trylock(&oom_lock) in __alloc_pages_may_oom() failed).
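[For readers following the argument: the should_reclaim_retry() sleep that
both sides refer to sits inside the watermark-check success path, which is
why it is skipped whenever __zone_watermark_ok() fails for every zone. Below
is a condensed sketch of the v4.17-era logic; dirty/writeback throttling,
tracing, and highatomic-reserve handling are elided, so treat it as an
illustration rather than verbatim kernel source.]

/*
 * Condensed sketch of v4.17-era should_reclaim_retry() (mm/page_alloc.c).
 * The guaranteed PF_WQ_WORKER sleep only runs on the "retrying is
 * worthwhile" path, i.e. only after __zone_watermark_ok() succeeded for
 * some zone.
 */
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned int order,
		     struct alloc_context *ac, int alloc_flags,
		     bool did_some_progress, int *no_progress_loops)
{
	struct zone *zone;
	struct zoneref *z;

	/* Costly orders never reset the counter. */
	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	/* Converge to OOM after too many fruitless retries (details elided). */
	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;

	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		unsigned long available = zone_reclaimable_pages(zone) +
			zone_page_state_snapshot(zone, NR_FREE_PAGES);

		/* Could the request succeed if all reclaimable pages were freed? */
		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
					ac_classzone_idx(ac), alloc_flags,
					available)) {
			/* The guaranteed sleep the discussion refers to. */
			if (current->flags & PF_WQ_WORKER)
				schedule_timeout_uninterruptible(1);
			else
				cond_resched();
			return true;
		}
	}

	/* __zone_watermark_ok() == false everywhere: no sleep happens here. */
	return false;
}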
On Fri 18-05-18 19:14:12, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 12-05-18 23:18:24, Tetsuo Handa wrote:
> > [...]
> > > @@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  	/* Retry as long as the OOM killer is making progress */
> > >  	if (did_some_progress) {
> > >  		no_progress_loops = 0;
> > > +		/*
> > > +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> > > +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> > > +		 */
> > > +		if (!tsk_is_oom_victim(current))
> > > +			schedule_timeout_uninterruptible(1);
> > >  		goto retry;
> >
> > We already do have that sleep for PF_WQ_WORKER in should_reclaim_retry.
> > Why do we need it here as well?
>
> Because that path depends on __zone_watermark_ok() == true which is not
> guaranteed to be executed.

Is there any reason we cannot do the special cased sleep for
PF_WQ_WORKER in should_reclaim_retry? The current code is complex enough
to make it even more so. If we need a hack for PF_WQ_WORKER case then we
definitely want to have a single place to do so.
Michal Hocko wrote:
> On Fri 18-05-18 19:14:12, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 12-05-18 23:18:24, Tetsuo Handa wrote:
> > > [...]
> > > > @@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > > >  	/* Retry as long as the OOM killer is making progress */
> > > >  	if (did_some_progress) {
> > > >  		no_progress_loops = 0;
> > > > +		/*
> > > > +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> > > > +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> > > > +		 */
> > > > +		if (!tsk_is_oom_victim(current))
> > > > +			schedule_timeout_uninterruptible(1);
> > > >  		goto retry;
> > >
> > > We already do have that sleep for PF_WQ_WORKER in should_reclaim_retry.
> > > Why do we need it here as well?
> >
> > Because that path depends on __zone_watermark_ok() == true which is not
> > guaranteed to be executed.
>
> Is there any reason we cannot do the special cased sleep for
> PF_WQ_WORKER in should_reclaim_retry? The current code is complex enough
> to make it even more so. If we need a hack for PF_WQ_WORKER case then we
> definitely want to have a single place to do so.

I don't understand why you are talking about PF_WQ_WORKER case.
This sleep is not only for PF_WQ_WORKER case but also !PF_KTHREAD case.
I added this comment because you suggested simply removing any sleep which
waits for the OOM victim.

Making special cased sleep for PF_WQ_WORKER in should_reclaim_retry() cannot
become a reason to block this patch. You can propose it after this patch is
applied. This patch is for mitigating lockup problem caused by forever holding
oom_lock.
On Mon 21-05-18 00:56:05, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 18-05-18 19:14:12, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sat 12-05-18 23:18:24, Tetsuo Handa wrote:
> > > > [...]
> > > > > @@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > > > >  	/* Retry as long as the OOM killer is making progress */
> > > > >  	if (did_some_progress) {
> > > > >  		no_progress_loops = 0;
> > > > > +		/*
> > > > > +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> > > > > +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> > > > > +		 */
> > > > > +		if (!tsk_is_oom_victim(current))
> > > > > +			schedule_timeout_uninterruptible(1);
> > > > >  		goto retry;
> > > >
> > > > We already do have that sleep for PF_WQ_WORKER in should_reclaim_retry.
> > > > Why do we need it here as well?
> > >
> > > Because that path depends on __zone_watermark_ok() == true which is not
> > > guaranteed to be executed.
> >
> > Is there any reason we cannot do the special cased sleep for
> > PF_WQ_WORKER in should_reclaim_retry? The current code is complex enough
> > to make it even more so. If we need a hack for PF_WQ_WORKER case then we
> > definitely want to have a single place to do so.
>
> I don't understand why you are talking about PF_WQ_WORKER case.

Because that seems to be the reason to have it there as per your
comment.

> This sleep is not only for PF_WQ_WORKER case but also !PF_KTHREAD case.
> I added this comment because you suggested simply removing any sleep which
> waits for the OOM victim.

And now you have made the comment misleading and I suspect it is just
not really needed as well.

> Making special cased sleep for PF_WQ_WORKER in should_reclaim_retry() cannot
> become a reason to block this patch. You can propose it after this patch is
> applied. This patch is for mitigating lockup problem caused by forever holding
> oom_lock.

You are fiddling with other code paths at the same time so I _do_ care.
Spilling random code without a proper explanation is just not going to fly.
Michal Hocko wrote:
> > I don't understand why you are talking about PF_WQ_WORKER case.
>
> Because that seems to be the reason to have it there as per your
> comment.

OK. Then, I will fold the change below into my patch.

 	if (did_some_progress) {
 		no_progress_loops = 0;
 +		/*
-+		 * This schedule_timeout_*() serves as a guaranteed sleep for
-+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
++		 * Try to give the OOM killer/reaper/victims some time for
++		 * releasing memory.
 +		 */
 +		if (!tsk_is_oom_victim(current))
 +			schedule_timeout_uninterruptible(1);

But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM killer"
patch in linux-next. And it seems to me that your patch contains a bug,
explained below, which leads to premature memory allocation failure.

@@ -1029,6 +1050,7 @@ bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
+	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1073,27 +1095,39 @@ bool out_of_memory(struct oom_control *oc)
 	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
-		oc->chosen = current;
+		oc->chosen_task = current;
 		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
 
+	if (mem_cgroup_select_oom_victim(oc)) {

	/* mem_cgroup_select_oom_victim() returns true if select_victim_memcg()
	   made oc->chosen_memcg != NULL. select_victim_memcg() makes
	   oc->chosen_memcg = INFLIGHT_VICTIM if there is an inflight memcg.
	   But oc->chosen_task remains NULL because it did not call
	   oom_evaluate_task(), didn't it? (And if it called
	   oom_evaluate_task(), put_task_struct() is missing here.) */

+		if (oom_kill_memcg_victim(oc)) {

	/* oom_kill_memcg_victim() returns true if
	   oc->chosen_memcg == INFLIGHT_VICTIM. */

+			delay = true;
+			goto out;
+		}
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
-	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+	if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
 
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen_task && oc->chosen_task != (void *)-1UL) {
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
+		delay = true;
 	}
-	return !!oc->chosen;
+
+out:
+	/*
+	 * Give the killed process a good chance to exit before trying
+	 * to allocate memory again.
+	 */
+	if (delay)
+		schedule_timeout_killable(1);

	/* out_of_memory() returns false because oc->chosen_task remains NULL. */

+	return !!oc->chosen_task;
 }

Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer" patch
(which eliminates "delay" and "out:" from your patch) so that people can
easily backport my patch? Or, do you want to apply a fix (which eliminates
"delay" and "out:" from linux-next) prior to my patch?
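[Why a false return from out_of_memory() matters here: in
__alloc_pages_may_oom(), *did_some_progress is only set when out_of_memory()
reports success, and the page allocator's slow path gives up instead of
retrying when no progress is reported. A simplified excerpt of the v4.17-era
caller, for illustration:]

	/*
	 * Simplified excerpt from __alloc_pages_may_oom() (mm/page_alloc.c,
	 * circa v4.17). If out_of_memory() returns false even though a memcg
	 * victim was just killed, *did_some_progress stays 0, the slow path
	 * does not take its "goto retry", and the allocation can fail
	 * prematurely.
	 */
	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
		*did_some_progress = 1;

		/*
		 * Help non-failing allocations by giving them access to
		 * memory reserves.
		 */
		if (gfp_mask & __GFP_NOFAIL)
			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
					ALLOC_NO_WATERMARKS, ac);
	}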
On Wed 23-05-18 19:24:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I don't understand why you are talking about PF_WQ_WORKER case.
> >
> > Because that seems to be the reason to have it there as per your
> > comment.
>
> OK. Then, I will fold the change below into my patch.
>
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
>  +		/*
> -+		 * This schedule_timeout_*() serves as a guaranteed sleep for
> -+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> ++		 * Try to give the OOM killer/reaper/victims some time for
> ++		 * releasing memory.
>  +		 */
>  +		if (!tsk_is_oom_victim(current))
>  +			schedule_timeout_uninterruptible(1);

Do you really need this? You are still fiddling with this path at all? I
see how removing the timeout might be reasonable after recent changes
but why do you insist on adding it outside of the lock.
Michal Hocko wrote:
> On Wed 23-05-18 19:24:48, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > I don't understand why you are talking about PF_WQ_WORKER case.
> > >
> > > Because that seems to be the reason to have it there as per your
> > > comment.
> >
> > OK. Then, I will fold the change below into my patch.
> >
> >  	if (did_some_progress) {
> >  		no_progress_loops = 0;
> >  +		/*
> > -+		 * This schedule_timeout_*() serves as a guaranteed sleep for
> > -+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> > ++		 * Try to give the OOM killer/reaper/victims some time for
> > ++		 * releasing memory.
> >  +		 */
> >  +		if (!tsk_is_oom_victim(current))
> >  +			schedule_timeout_uninterruptible(1);
>
> Do you really need this? You are still fiddling with this path at all? I
> see how removing the timeout might be reasonable after recent changes
> but why do you insist on adding it outside of the lock.

Sigh... We can't remove this sleep without further changes. That's why I added

	* This schedule_timeout_*() serves as a guaranteed sleep for
	* PF_WQ_WORKER threads when __zone_watermark_ok() == false.

so that we won't by mistake remove this sleep without further changes.

This sleep is not only for waiting for OOM victims. Any thread that holds
oom_lock needs CPU resources in order to make forward progress. If
oom_notify_list callbacks are registered, this sleep helps the owner of
oom_lock to reclaim memory by processing the callbacks. If oom_notify_list
callbacks did not release memory, this sleep still helps the owner of
oom_lock to check whether there are inflight OOM victims. If there are no
inflight OOM victims, this sleep still helps the owner of oom_lock to select
a new OOM victim and call printk(). If there are already inflight OOM
victims, this sleep still helps the OOM reaper and the OOM victims to
release memory. Printing messages to consoles and reclaiming memory need
CPU resources.

A more reliable way is to use mutex_lock_killable(&oom_lock) instead of
mutex_trylock(&oom_lock) in __alloc_pages_may_oom(), but I'm giving way for
now. There is no valid reason for removing this sleep now.
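[A minimal sketch of the mutex_lock_killable() alternative mentioned above,
assuming the v4.17 __alloc_pages_may_oom() structure; an illustration of the
idea, not a posted patch:]

	/*
	 * Sketch: block on oom_lock instead of the opportunistic trylock.
	 * The lock owner is then guaranteed to get the CPU time it needs,
	 * at the cost of making every contending allocator sleep for the
	 * whole duration of the owner's OOM handling (including the
	 * dump_header() printk flood). The bail-out below mirrors the
	 * existing trylock-failure path.
	 */
	if (mutex_lock_killable(&oom_lock)) {
		/* Fatal signal received while waiting; give up this attempt. */
		*did_some_progress = 1;
		return NULL;
	}
	/* ... ALLOC_WMARK_HIGH retry and out_of_memory() call as before ... */
	mutex_unlock(&oom_lock);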
On Wed 23-05-18 22:45:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 23-05-18 19:24:48, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > I don't understand why you are talking about PF_WQ_WORKER case.
> > > >
> > > > Because that seems to be the reason to have it there as per your
> > > > comment.
> > >
> > > OK. Then, I will fold the change below into my patch.
> > >
> > >  	if (did_some_progress) {
> > >  		no_progress_loops = 0;
> > >  +		/*
> > > -+		 * This schedule_timeout_*() serves as a guaranteed sleep for
> > > -+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> > > ++		 * Try to give the OOM killer/reaper/victims some time for
> > > ++		 * releasing memory.
> > >  +		 */
> > >  +		if (!tsk_is_oom_victim(current))
> > >  +			schedule_timeout_uninterruptible(1);
> >
> > Do you really need this? You are still fiddling with this path at all? I
> > see how removing the timeout might be reasonable after recent changes
> > but why do you insist on adding it outside of the lock.
>
> Sigh... We can't remove this sleep without further changes. That's why I added
>
> 	* This schedule_timeout_*() serves as a guaranteed sleep for
> 	* PF_WQ_WORKER threads when __zone_watermark_ok() == false.
>
> so that we won't by mistake remove this sleep without further changes.

Look. I am fed up with this discussion. You are fiddling with the code
and moving hacks around with a lot of hand waving, rather than trying to
look at the underlying problem. Your patch completely ignores PREEMPT as
I've mentioned in previous versions.

I do admit that the underlying problem is non-trivial to handle and it
requires a deeper consideration. Fair enough. You can spend that time on
the matter and come up with something clever. That would be great. But
moving a sleep around because of some yada yada yada is not a way we want
to treat this code.

I would be OK with removing the sleep from the out_of_memory path based
on your argumentation that we have a _proper_ synchronization with the
exit path now. That would be a patch that actually has a solid background
behind it. Is it possible that something would wait longer or wouldn't
preempt etc.? Yes, possible, but those need to be analyzed and thought
through properly. See the difference from "we may need it because we've
always been doing that and there is here and there that might happen".
This cargo cult way of programming will only grow more and more hacks
nobody can reason about long term.
Michal Hocko wrote:
> Look. I am fed up with this discussion. You are fiddling with the code
> and moving hacks around with a lot of hand waving, rather than trying to
> look at the underlying problem. Your patch completely ignores PREEMPT as
> I've mentioned in previous versions.

I'm not ignoring PREEMPT. To fix this OOM lockup problem properly, as much
effort as fixing the Spectre/Meltdown problems will be required. This patch
is a mitigation for a regression introduced by fixing CVE-2018-1000200.
Nothing good comes from deferring this patch.

> I would be OK with removing the sleep from the out_of_memory path based
> on your argumentation that we have a _proper_ synchronization with the
> exit path now.

Such an attempt should be made in a separate patch.

You suggested removing this sleep from my patch without realizing that
we need explicit schedule_timeout_*() for PF_WQ_WORKER threads. My patch
is trying to be as conservative/safe as possible (for easier backport)
while reducing the risk of falling into OOM lockup.

I worry that you are completely overlooking the

	@@ -3478,7 +3478,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
	 	 */
	 	if (!mutex_trylock(&oom_lock)) {
	 		*did_some_progress = 1;
	-		schedule_timeout_uninterruptible(1);
	 		return NULL;
	 	}

part in this patch.

Currently, the short sleep is so random/inconsistent that
schedule_timeout_uninterruptible(1) is called when we failed to grab
oom_lock (even if the current thread was already marked as an OOM victim),
schedule_timeout_killable(1) is called when we killed a new OOM victim, and
no sleep happens at all if we found that there are inflight OOM victims.

This patch centralizes the location calling
schedule_timeout_uninterruptible(1) to the "goto retry;" path so that the
current thread surely yields CPU resources to the owner of oom_lock.

You are free to propose removing this centralized sleep after my change is
applied. Of course, you will be responsible for convincing people that
removing this centralized sleep (except for PF_WQ_WORKER threads) does not
negatively affect the owner of oom_lock (e.g. a SCHED_IDLE thread holding
oom_lock getting blocked for longer than with my patch).
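[To make the "random/inconsistent" claim concrete, the three pre-patch
behaviours being described are, condensed from the v4.17-era sources and the
diffs quoted in this thread:]

	/*
	 * 1. __alloc_pages_may_oom(): oom_lock is contended, so sleep,
	 *    even when current was already selected as an OOM victim.
	 */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * 2. out_of_memory(): a new victim was just killed, so sleep.
	 */
	if (oc->chosen && oc->chosen != (void *)-1UL) {
		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
				 "Memory cgroup out of memory");
		schedule_timeout_killable(1);
	}

	/*
	 * 3. oom_evaluate_task() found an inflight victim and aborted the
	 *    scan: out_of_memory() returns without any sleep and the
	 *    caller immediately retries the allocation.
	 */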
On Thu 24-05-18 19:51:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Look. I am fed up with this discussion. You are fiddling with the code
> > and moving hacks around with a lot of hand waving, rather than trying to
> > look at the underlying problem. Your patch completely ignores PREEMPT as
> > I've mentioned in previous versions.
>
> I'm not ignoring PREEMPT. To fix this OOM lockup problem properly, as much
> effort as fixing the Spectre/Meltdown problems will be required. This patch
> is a mitigation for a regression introduced by fixing CVE-2018-1000200.
> Nothing good comes from deferring this patch.
>
> > I would be OK with removing the sleep from the out_of_memory path based
> > on your argumentation that we have a _proper_ synchronization with the
> > exit path now.
>
> Such an attempt should be made in a separate patch.
>
> You suggested removing this sleep from my patch without realizing that
> we need explicit schedule_timeout_*() for PF_WQ_WORKER threads.

And that sleep is in should_reclaim_retry. If that is not sufficient it
should be addressed rather than spilling more of that crud all over the
place.

> My patch
> is trying to be as conservative/safe as possible (for easier backport)
> while reducing the risk of falling into OOM lockup.

And it adds more crud.

> I worry that you are completely overlooking the
>
> 	@@ -3478,7 +3478,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
> 	 	 */
> 	 	if (!mutex_trylock(&oom_lock)) {
> 	 		*did_some_progress = 1;
> 	-		schedule_timeout_uninterruptible(1);
> 	 		return NULL;
> 	 	}
>
> part in this patch.

I am not. But it doesn't really make much sense to try to convince you if
you are not reading what I am writing. I am done with this thread.
Discussion is just a waste of time.
Michal Hocko wrote:
> On Thu 24-05-18 19:51:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Look. I am fed up with this discussion. You are fiddling with the code
> > > and moving hacks around with a lot of hand waving, rather than trying to
> > > look at the underlying problem. Your patch completely ignores PREEMPT as
> > > I've mentioned in previous versions.
> >
> > I'm not ignoring PREEMPT. To fix this OOM lockup problem properly, as much
> > effort as fixing the Spectre/Meltdown problems will be required. This patch
> > is a mitigation for a regression introduced by fixing CVE-2018-1000200.
> > Nothing good comes from deferring this patch.
> >
> > > I would be OK with removing the sleep from the out_of_memory path based
> > > on your argumentation that we have a _proper_ synchronization with the
> > > exit path now.
> >
> > Such an attempt should be made in a separate patch.
> >
> > You suggested removing this sleep from my patch without realizing that
> > we need explicit schedule_timeout_*() for PF_WQ_WORKER threads.
>
> And that sleep is in should_reclaim_retry. If that is not sufficient it
> should be addressed rather than spilling more of that crud all over the
> place.

Then, please show me (by writing a patch yourself) how to tell whether we
should sleep there. What I can come up with is shown below.

-@@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
- 	/* Retry as long as the OOM killer is making progress */
- 	if (did_some_progress) {
- 		no_progress_loops = 0;
-+		/*
-+		 * This schedule_timeout_*() serves as a guaranteed sleep for
-+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
-+		 */
-+		if (!tsk_is_oom_victim(current))
-+			schedule_timeout_uninterruptible(1);
- 		goto retry;
- 	}
+@@ -3927,6 +3926,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
+ 	(*no_progress_loops)++;
+ 
+ 	/*
++	 * We do a short sleep here if the OOM killer/reaper/victims are
++	 * holding oom_lock, in order to try to give them some CPU resources
++	 * for releasing memory.
++	 */
++	if (mutex_is_locked(&oom_lock) && !tsk_is_oom_victim(current))
++		schedule_timeout_uninterruptible(1);
++
++	/*
+ 	 * Make sure we converge to OOM if we cannot make any progress
+ 	 * several times in the row.
+ 	 */

As far as I know, whether the memory domain that the current thread belongs
to is already OOM is not known as of should_reclaim_retry(). Therefore,
sleeping there can become a pointless delay if the domain that the current
thread belongs to and the domain that the owner of oom_lock (which can be a
random thread inside out_of_memory() or exit_mmap()) belongs to differ.

Does your insistence on sleeping there mean that you don't care about such
a pointless delay?
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..23ce67f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,6 +479,21 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+/*
+ * We have to make sure not to cause premature new oom victim selection.
+ *
+ * __alloc_pages_may_oom()                   oom_reap_task_mm()/exit_mmap()
+ *   mutex_trylock(&oom_lock)
+ *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+ *                                             unmap_page_range() # frees some memory
+ *                                             set_bit(MMF_OOM_SKIP)
+ *   out_of_memory()
+ *     select_bad_process()
+ *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+ *   mutex_unlock(&oom_lock)
+ *
+ * Therefore, the callers hold oom_lock when calling this function.
+ */
 void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -523,20 +538,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
@@ -1077,15 +1078,9 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
-	}
 	return !!oc->chosen;
 }
 
@@ -1111,4 +1106,5 @@ void pagefault_out_of_memory(void)
 		return;
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	schedule_timeout_killable(1);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d..458ed32 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3478,7 +3478,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -4241,6 +4240,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		/*
+		 * This schedule_timeout_*() serves as a guaranteed sleep for
+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
+		 */
+		if (!tsk_is_oom_victim(current))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}