diff mbox

mm,oom: Don't call schedule_timeout_killable() with oom_lock held.

Message ID 20180529081639.GM27180@dhcp22.suse.cz (mailing list archive)
State New, archived
Headers show

Commit Message

Michal Hocko May 29, 2018, 8:16 a.m. UTC
With the full changelog. This can be either folded into the respective
patch or applied on top.

From 0bd619e7a68337c97bdaed288e813e96a14ba339 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 29 May 2018 10:09:33 +0200
Subject: [PATCH] mm, memcg, oom: fix pre-mature allocation failures

Tetsuo has noticed that "mm, oom: cgroup-aware OOM killer" can lead to a
pre-mature allocation failure if the cgroup aware oom killer is enabled
and select_victim_memcg doesn't pick up any memcg to kill because there
is a memcg already being killed. oc->chosen_memcg will become INFLIGHT_VICTIM
and oom_kill_memcg_victim will bail out early. oc->chosen_task will
stay NULL, however, and out_of_memory will therefore return false which
forces __alloc_pages_may_oom to not set did_some_progress and the page
allocator backs out and fails the allocation.
U
Fix this by checking both chosen_task and chosen_memcg in out_of_memory
and return false only when _both_ are NULL.

Fixes: "mm, oom: cgroup-aware OOM killer"
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Tetsuo Handa May 29, 2018, 2:33 p.m. UTC | #1
On 2018/05/29 17:16, Michal Hocko wrote:
> With the full changelog. This can be either folded into the respective
> patch or applied on top.
> 
>>From 0bd619e7a68337c97bdaed288e813e96a14ba339 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Tue, 29 May 2018 10:09:33 +0200
> Subject: [PATCH] mm, memcg, oom: fix pre-mature allocation failures
> 
> Tetsuo has noticed that "mm, oom: cgroup-aware OOM killer" can lead to a
> pre-mature allocation failure if the cgroup aware oom killer is enabled
> and select_victim_memcg doesn't pick up any memcg to kill because there
> is a memcg already being killed. oc->chosen_memcg will become INFLIGHT_VICTIM
> and oom_kill_memcg_victim will bail out early. oc->chosen_task will
> stay NULL, however, and out_of_memory will therefore return false which
> forces __alloc_pages_may_oom to not set did_some_progress and the page
> allocator backs out and fails the allocation.
> U
> Fix this by checking both chosen_task and chosen_memcg in out_of_memory
> and return false only when _both_ are NULL.

I don't like this patch. It is not easy to understand and is fragile to
future changes. Currently the only case !!oc->chosen can become false is that
there was no eligible tasks when SysRq-f was requested or memcg OOM occurred.

	/* Found nothing?!?! Either we hang forever, or we panic. */
	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {

With this patch applied, what happens if
mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc) forgot to set
oc->chosen_memcg to NULL and called select_bad_process(oc) and reached

        /* Found nothing?!?! Either we hang forever, or we panic. */
        if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {

but did not trigger panic() because of is_sysrq_oom(oc) || is_memcg_oom(oc)
and reached the last "!!(oc->chosen_task | oc->chosen_memcg)" line?
It will by error return "true" when no eligible tasks found...

Don't make return conditions complicated.
The appropriate fix is to kill "delay" and "goto out;" now! My patch does it!!
Michal Hocko May 29, 2018, 5:18 p.m. UTC | #2
On Tue 29-05-18 23:33:13, Tetsuo Handa wrote:
> On 2018/05/29 17:16, Michal Hocko wrote:
> > With the full changelog. This can be either folded into the respective
> > patch or applied on top.
> > 
> >>From 0bd619e7a68337c97bdaed288e813e96a14ba339 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Tue, 29 May 2018 10:09:33 +0200
> > Subject: [PATCH] mm, memcg, oom: fix pre-mature allocation failures
> > 
> > Tetsuo has noticed that "mm, oom: cgroup-aware OOM killer" can lead to a
> > pre-mature allocation failure if the cgroup aware oom killer is enabled
> > and select_victim_memcg doesn't pick up any memcg to kill because there
> > is a memcg already being killed. oc->chosen_memcg will become INFLIGHT_VICTIM
> > and oom_kill_memcg_victim will bail out early. oc->chosen_task will
> > stay NULL, however, and out_of_memory will therefore return false which
> > forces __alloc_pages_may_oom to not set did_some_progress and the page
> > allocator backs out and fails the allocation.
> > U
> > Fix this by checking both chosen_task and chosen_memcg in out_of_memory
> > and return false only when _both_ are NULL.
> 
> I don't like this patch. It is not easy to understand and is fragile to
> future changes. Currently the only case !!oc->chosen can become false is that
> there was no eligible tasks when SysRq-f was requested or memcg OOM occurred.

Well, the current contract is not easy unfortunatelly. We have two
different modes of operation. We are either killing whole cgroups or a
task from a cgroup. In any case, the contract says that if we have any
killable entity then at least one of chosen* is set to INFLIGHT_VICTIM.
Other than that one of them has to be !NULL or we have no eligible
killable entity. The return value reflects all these cases.

> 	/* Found nothing?!?! Either we hang forever, or we panic. */
> 	if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> 
> With this patch applied, what happens if
> mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc) forgot to set
> oc->chosen_memcg to NULL and called select_bad_process(oc) and reached

Well, this would be a clear bug and breaking the expected contract. I do
not see such a bug. Do you?

>         /* Found nothing?!?! Either we hang forever, or we panic. */
>         if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> 
> but did not trigger panic() because of is_sysrq_oom(oc) || is_memcg_oom(oc)
> and reached the last "!!(oc->chosen_task | oc->chosen_memcg)" line?
> It will by error return "true" when no eligible tasks found...

What are you talking about?

> Don't make return conditions complicated.
> The appropriate fix is to kill "delay" and "goto out;" now! My patch does it!!

I will leave the decision about which fix to take to Roman.
Michal Hocko May 29, 2018, 5:28 p.m. UTC | #3
On Tue 29-05-18 19:18:33, Michal Hocko wrote:
> On Tue 29-05-18 23:33:13, Tetsuo Handa wrote:
> > On 2018/05/29 17:16, Michal Hocko wrote:
> > > With the full changelog. This can be either folded into the respective
> > > patch or applied on top.
> > > 
> > >>From 0bd619e7a68337c97bdaed288e813e96a14ba339 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Tue, 29 May 2018 10:09:33 +0200
> > > Subject: [PATCH] mm, memcg, oom: fix pre-mature allocation failures
> > > 
> > > Tetsuo has noticed that "mm, oom: cgroup-aware OOM killer" can lead to a
> > > pre-mature allocation failure if the cgroup aware oom killer is enabled
> > > and select_victim_memcg doesn't pick up any memcg to kill because there
> > > is a memcg already being killed. oc->chosen_memcg will become INFLIGHT_VICTIM
> > > and oom_kill_memcg_victim will bail out early. oc->chosen_task will
> > > stay NULL, however, and out_of_memory will therefore return false which
> > > forces __alloc_pages_may_oom to not set did_some_progress and the page
> > > allocator backs out and fails the allocation.
> > > U
> > > Fix this by checking both chosen_task and chosen_memcg in out_of_memory
> > > and return false only when _both_ are NULL.
> > 
> > I don't like this patch. It is not easy to understand and is fragile to
> > future changes. Currently the only case !!oc->chosen can become false is that
> > there was no eligible tasks when SysRq-f was requested or memcg OOM occurred.
> 
> Well, the current contract is not easy unfortunatelly. We have two
> different modes of operation. We are either killing whole cgroups or a
> task from a cgroup. In any case, the contract says that if we have any
> killable entity then at least one of chosen* is set to INFLIGHT_VICTIM.
> Other than that one of them has to be !NULL or we have no eligible
> killable entity. The return value reflects all these cases.

Btw. if your concern is the readability then we can add a helper and
decsribe all the above in the comment.
kernel test robot May 31, 2018, 4:31 p.m. UTC | #4
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20180531]
[cannot apply to v4.17-rc7]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-oom-fix-pre-mature-allocation-failures/20180531-220120
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x078-201821 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   mm/oom_kill.c: In function 'out_of_memory':
>> mm/oom_kill.c:1157:28: error: invalid operands to binary | (have 'struct task_struct *' and 'struct mem_cgroup *')
     return !!(oc->chosen_task | oc->chosen_memcg);
               ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~
>> mm/oom_kill.c:1158:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^

vim +1157 mm/oom_kill.c

  1068	
  1069	/**
  1070	 * out_of_memory - kill the "best" process when we run out of memory
  1071	 * @oc: pointer to struct oom_control
  1072	 *
  1073	 * If we run out of memory, we have the choice between either
  1074	 * killing a random task (bad), letting the system crash (worse)
  1075	 * OR try to be smart about which process to kill. Note that we
  1076	 * don't have to be perfect here, we just have to be good.
  1077	 */
  1078	bool out_of_memory(struct oom_control *oc)
  1079	{
  1080		unsigned long freed = 0;
  1081		enum oom_constraint constraint = CONSTRAINT_NONE;
  1082		bool delay = false; /* if set, delay next allocation attempt */
  1083	
  1084		if (oom_killer_disabled)
  1085			return false;
  1086	
  1087		if (!is_memcg_oom(oc)) {
  1088			blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
  1089			if (freed > 0)
  1090				/* Got some memory back in the last second. */
  1091				return true;
  1092		}
  1093	
  1094		/*
  1095		 * If current has a pending SIGKILL or is exiting, then automatically
  1096		 * select it.  The goal is to allow it to allocate so that it may
  1097		 * quickly exit and free its memory.
  1098		 */
  1099		if (task_will_free_mem(current)) {
  1100			mark_oom_victim(current);
  1101			wake_oom_reaper(current);
  1102			return true;
  1103		}
  1104	
  1105		/*
  1106		 * The OOM killer does not compensate for IO-less reclaim.
  1107		 * pagefault_out_of_memory lost its gfp context so we have to
  1108		 * make sure exclude 0 mask - all other users should have at least
  1109		 * ___GFP_DIRECT_RECLAIM to get here.
  1110		 */
  1111		if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
  1112			return true;
  1113	
  1114		/*
  1115		 * Check if there were limitations on the allocation (only relevant for
  1116		 * NUMA and memcg) that may require different handling.
  1117		 */
  1118		constraint = constrained_alloc(oc);
  1119		if (constraint != CONSTRAINT_MEMORY_POLICY)
  1120			oc->nodemask = NULL;
  1121		check_panic_on_oom(oc, constraint);
  1122	
  1123		if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
  1124		    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
  1125		    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
  1126			get_task_struct(current);
  1127			oc->chosen_task = current;
  1128			oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
  1129			return true;
  1130		}
  1131	
  1132		if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
  1133			delay = true;
  1134			goto out;
  1135		}
  1136	
  1137		select_bad_process(oc);
  1138		/* Found nothing?!?! Either we hang forever, or we panic. */
  1139		if (!oc->chosen_task && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
  1140			dump_header(oc, NULL);
  1141			panic("Out of memory and no killable processes...\n");
  1142		}
  1143		if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
  1144			oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
  1145					 "Memory cgroup out of memory");
  1146			delay = true;
  1147		}
  1148	
  1149	out:
  1150		/*
  1151		 * Give the killed process a good chance to exit before trying
  1152		 * to allocate memory again.
  1153		 */
  1154		if (delay)
  1155			schedule_timeout_killable(1);
  1156	
> 1157		return !!(oc->chosen_task | oc->chosen_memcg);
> 1158	}
  1159	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
diff mbox

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 565e7da55318..fc06af041447 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1174,7 +1174,7 @@  bool out_of_memory(struct oom_control *oc)
 	if (delay)
 		schedule_timeout_killable(1);
 
-	return !!oc->chosen_task;
+	return !!(oc->chosen_task | oc->chosen_memcg);
 }
 
 /*