
[1/4] mm,oom: Don't call schedule_timeout_killable() with oom_lock held.

Message ID 1528369223-7571-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived

Commit Message

Tetsuo Handa June 7, 2018, 11 a.m. UTC
When I was examining a bug which occurs under CPU + memory pressure, I
observed that a thread which called out_of_memory() can sleep for minutes
at schedule_timeout_killable(1) with oom_lock held when many threads are
doing direct reclaim.

The whole point of the sleep is to give the OOM victim some time to exit.
But since commit 27ae357fa82be5ab ("mm, oom: fix concurrent munlock and
oom reaper unmap, v3") changed the OOM victim to wait for oom_lock in order
to close the race window at exit_mmap(), that point is now lost. We need to
make sure that the thread which called out_of_memory() releases oom_lock
shortly. Therefore, this patch moves the sleep outside of the OOM path.
Whether it is safe to remove the sleep entirely will be tested by a future
patch.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/oom_kill.c   | 38 +++++++++++++++++---------------------
 mm/page_alloc.c |  7 ++++++-
 2 files changed, 23 insertions(+), 22 deletions(-)

Comments

Michal Hocko June 7, 2018, 11:11 a.m. UTC | #1
On Thu 07-06-18 20:00:20, Tetsuo Handa wrote:
[...]
> @@ -4238,6 +4237,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
>  		no_progress_loops = 0;
> +		/*
> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
> +		 */
> +		if (!tsk_is_oom_victim(current))
> +			schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}

Nacked-by: Michal Hocko <mhocko@suse.com>

as explained several times already. Moving code around just to preserve
the current logic, without any arguments to back it, must finally stop.
We have way too much of this "just in case" code that nobody really
understands and others just pile on top. Seriously, this is not how
development should work.
Tetsuo Handa June 8, 2018, 10:47 a.m. UTC | #2
On 2018/06/07 20:28, Michal Hocko wrote:
> On Thu 07-06-18 20:00:23, Tetsuo Handa wrote:
> OK, this looks like a nice shortcut. I am quite surprised that all your
> NOMMU concerns are gone now, while you clearly regress that case because
> inflight victims are not detected anymore, AFAICS. Not that I care all
> that much, just sayin'.
> 
> Anyway, I would suggest splitting this into two patches. One to add an
> early check for inflight oom victims and one removing the detection from
> oom_evaluate_task. Just to make it easier to revert if somebody on nommu
> actually notices a regression.

Sure. Making it easier to revert is a good thing.
But this patch comes after PATCH 3/4, so the problem with the earlier patch
needs to be solved first.

On 2018/06/07 20:16, Michal Hocko wrote:
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> with a minor nit
> 
>> ---
>>  mm/oom_kill.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 23ce67f..5a6f1b1 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1073,15 +1073,18 @@ bool out_of_memory(struct oom_control *oc)
>>  	}
>>  
>>  	select_bad_process(oc);
>> +	if (oc->chosen == (void *)-1UL)
> 
> I think this one deserves a comment.
> 	/* There is an inflight oom victim. */
> 
>> +		return true;

OK, though this change will be reverted by PATCH 4/4.
But this patch comes after PATCH 1/4, so the problem with the earlier patch
needs to be solved first.
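
For reference, with the suggested comment folded in, the resulting code
would read roughly as follows (illustrative only, not the posted patch):

	select_bad_process(oc);
	/* There is an inflight oom victim. */
	if (oc->chosen == (void *)-1UL)
		return true;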

On 2018/06/07 20:13, Michal Hocko wrote:
> On Thu 07-06-18 20:00:21, Tetsuo Handa wrote:
> 
> Your s-o-b is missing here. And I suspect this should be From: /me
> but I do not care all that much.

How can I do that? Forge the From: line (assuming that the mail server does
not reject a forged From: line)?

But I am quite surprised that you did not respond to PATCH 2/4 with a
Nacked-by:, because

On 2018/06/07 20:11, Michal Hocko wrote:
> On Thu 07-06-18 20:00:20, Tetsuo Handa wrote:
> [...]
>> @@ -4238,6 +4237,12 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>>  	/* Retry as long as the OOM killer is making progress */
>>  	if (did_some_progress) {
>>  		no_progress_loops = 0;
>> +		/*
>> +		 * This schedule_timeout_*() serves as a guaranteed sleep for
>> +		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
>> +		 */
>> +		if (!tsk_is_oom_victim(current))
>> +			schedule_timeout_uninterruptible(1);
>>  		goto retry;
>>  	}
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>
> 
> as explained several times already. Moving code around just to preserve
> the current logic, without any arguments to back it, must finally stop.
> We have way too much of this "just in case" code that nobody really
> understands and others just pile on top. Seriously, this is not how
> development should work.
> 

I am purposely splitting this into PATCH 1/4 and PATCH 2/4 in order to make
it easier to revert (as you suggested for PATCH 4/4) in case somebody
actually notices an unexpected side effect.

PATCH 1/4 is doing the logically correct thing, no matter how much you hate
the short sleep, which will be removed by PATCH 2/4.

PATCH 1/4 is proven to be safe, but PATCH 2/4 has not been tested to be safe.
PATCH 1/4 is safe for stable, but PATCH 2/4 might not be safe for stable.
Therefore, I insist on two separate patches.

If you can accept what PATCH 1/4 + PATCH 2/4 are doing, you are free to post
one squashed patch.

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..23ce67f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,6 +479,21 @@  bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+/*
+ * We have to make sure not to cause premature new oom victim selection.
+ *
+ * __alloc_pages_may_oom()     oom_reap_task_mm()/exit_mmap()
+ *   mutex_trylock(&oom_lock)
+ *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+ *                               unmap_page_range() # frees some memory
+ *                               set_bit(MMF_OOM_SKIP)
+ *   out_of_memory()
+ *     select_bad_process()
+ *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+ *   mutex_unlock(&oom_lock)
+ *
+ * Therefore, callers must hold oom_lock when calling this function.
+ */
 void __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
@@ -523,20 +538,6 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
@@ -1077,15 +1078,9 @@  bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen && oc->chosen != (void *)-1UL) {
+	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
-	}
 	return !!oc->chosen;
 }
 
@@ -1111,4 +1106,5 @@  void pagefault_out_of_memory(void)
 		return;
 	out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	schedule_timeout_killable(1);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27..e90f152 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3471,7 +3471,6 @@  void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -4238,6 +4237,12 @@  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
 		no_progress_loops = 0;
+		/*
+		 * This schedule_timeout_*() serves as a guaranteed sleep for
+		 * PF_WQ_WORKER threads when __zone_watermark_ok() == false.
+		 */
+		if (!tsk_is_oom_victim(current))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}