[v3] mm,page_alloc: wait for oom_lock before retrying.
diff mbox series

Message ID 20900d89-b06d-2ec6-0ae0-beffc5874f26@I-love.SAKURA.ne.jp
State New
Headers show
Series
  • [v3] mm,page_alloc: wait for oom_lock before retrying.
Related show

Commit Message

Tetsuo Handa Feb. 13, 2019, 4:30 p.m. UTC
This is resume of https://lkml.kernel.org/r/1500202791-5427-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .



Reproducer:
----------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	static int pipe_fd[2] = { EOF, EOF };
	char *buf = NULL;
	unsigned long size = 0;
	unsigned int i;
	int fd;
	char buffer[4096];
	pipe(pipe_fd);
	signal(SIGCLD, SIG_IGN);
	if (fork() == 0) {
		prctl(PR_SET_NAME, (unsigned long) "first-victim", 0, 0, 0);
		while (1)
			pause();
	}
	close(pipe_fd[1]);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			char c;
			/* Wait until the first-victim is OOM-killed. */
			read(pipe_fd[0], &c, 1);
			/* Try to consume CPU time via page fault. */
			memset(buffer, 0, sizeof(buffer));
			_exit(0);
		}
	close(pipe_fd[0]);
	fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	while (size) {
		int ret = read(fd, buf, size); /* Will cause OOM due to overcommit */
		if (ret <= 0)
			break;
		buf += ret;
		size -= ret;
	}
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
----------



Before this patch: http://I-love.SAKURA.ne.jp/tmp/serial-20190212.txt.xz
Numbers from grep'ing of SysRq-t part inside the stall:

  $ grep -F 'Call Trace:' serial-20190212.txt | wc -l
  1234
  $ grep -F 'locks held by' serial-20190212.txt | wc -l
  1046
  $ grep -F '__alloc_pages_nodemask' serial-20190212.txt | wc -l
  1046
  $ grep -F '__alloc_pages_slowpath+0x16f8/0x2350' serial-20190212.txt | wc -l
  946

90% of allocating threads are sleeping at

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

and almost all of them are simply waiting for CPU time (indicated by a
'locks held by' line without lock information due to TASK_RUNNING state).
That is, many hundreds of allocating threads are ready to hold
the owner of oom_lock preempted.

[  504.760909] normal-priority invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  513.650210] CPU: 0 PID: 17881 Comm: normal-priority Kdump: loaded Not tainted 5.0.0-rc6 #826
[  513.653799] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  513.657968] Call Trace:
[  513.660026]  dump_stack+0x86/0xca
[  513.662292]  dump_header+0x10a/0x9d0
[  513.664673]  ? _raw_spin_unlock_irqrestore+0x3d/0x60
[  513.667319]  ? ___ratelimit+0x1d1/0x3c5
[  513.669682]  oom_kill_process.cold.32+0xb/0x5b9
[  513.672218]  ? check_flags.part.40+0x420/0x420
[  513.675347]  ? rcu_read_unlock_special+0x87/0x100
[  513.678734]  out_of_memory+0x287/0x7f0
[  513.681146]  ? oom_killer_disable+0x1f0/0x1f0
[  513.683629]  ? mutex_trylock+0x191/0x1e0
[  513.685983]  ? __alloc_pages_slowpath+0xa03/0x2350
[  513.688692]  __alloc_pages_slowpath+0x1cdf/0x2350
[  513.692541]  ? release_pages+0x8d6/0x12d0
[  513.696140]  ? warn_alloc+0x120/0x120
[  513.699669]  ? __lock_is_held+0xbc/0x140
[  513.703204]  ? __might_sleep+0x95/0x190
[  513.706554]  __alloc_pages_nodemask+0x510/0x5f0

[  717.991658] normal-priority R  running task    23432 17881   9439 0x80000080
[  717.994203] Call Trace:
[  717.995530]  __schedule+0x69a/0x1890
[  717.997116]  ? pci_mmcfg_check_reserved+0x120/0x120
[  717.999020]  ? __this_cpu_preempt_check+0x13/0x20
[  718.001299]  ? lockdep_hardirqs_on+0x347/0x5a0
[  718.003175]  ? preempt_schedule_irq+0x35/0x80
[  718.004966]  ? trace_hardirqs_on+0x28/0x170
[  718.006704]  preempt_schedule_irq+0x40/0x80
[  718.008440]  retint_kernel+0x1b/0x2d
[  718.010167] RIP: 0010:dump_stack+0xbc/0xca
[  718.011880] Code: c7 c0 ed 66 96 e8 7e d5 e2 fe c7 05 34 bc ed 00 ff ff ff ff 0f ba e3 09 72 09 53 9d e8 87 03 c4 fe eb 07 e8 10 02 c4 fe 53 9d <5b> 41 5c 41 5d 5d c3 90 90 90 90 90 90 90 55 48 89 e5 41 57 49 89
[  718.018262] RSP: 0000:ffff888111a672e0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[  718.020947] RAX: 0000000000000007 RBX: 0000000000000286 RCX: 1ffff1101563db64
[  718.023530] RDX: 0000000000000000 RSI: ffffffff95c6ff40 RDI: ffff8880ab1edab4
[  718.026597] RBP: ffff888111a672f8 R08: ffff8880ab1edab8 R09: 0000000000000006
[  718.029478] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  718.032224] R13: 00000000ffffffff R14: ffff888111a67628 R15: ffff888111a67628
[  718.035339]  dump_header+0x10a/0x9d0
[  718.037231]  ? _raw_spin_unlock_irqrestore+0x3d/0x60
[  718.039349]  ? ___ratelimit+0x1d1/0x3c5
[  718.041380]  oom_kill_process.cold.32+0xb/0x5b9
[  718.043451]  ? check_flags.part.40+0x420/0x420
[  718.045418]  ? rcu_read_unlock_special+0x87/0x100
[  718.047453]  out_of_memory+0x287/0x7f0
[  718.049245]  ? oom_killer_disable+0x1f0/0x1f0
[  718.051527]  ? mutex_trylock+0x191/0x1e0
[  718.053398]  ? __alloc_pages_slowpath+0xa03/0x2350
[  718.055478]  __alloc_pages_slowpath+0x1cdf/0x2350
[  718.057978]  ? release_pages+0x8d6/0x12d0
[  718.060245]  ? warn_alloc+0x120/0x120
[  718.062836]  ? __lock_is_held+0xbc/0x140
[  718.065815]  ? __might_sleep+0x95/0x190
[  718.068060]  __alloc_pages_nodemask+0x510/0x5f0



After this patch: http://I-love.SAKURA.ne.jp/tmp/serial-20190212-2.txt.xz
The OOM killer is smoothly invoked, though the system after all got stuck
due to a different problem.



While this patch cannot avoid delays caused by unlimited concurrent direct
reclaim, let's stop telling the lie

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */

because many of allocating threads are preventing the owner of oom_lock from
making progress. Therefore, here again is a patch.



From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 12 Feb 2019 20:12:35 +0900
Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.

When many hundreds of threads concurrently triggered a page fault, and
one of them invoked the global OOM killer, the owner of oom_lock is
preempted for minutes because they are rather depriving the owner of
oom_lock of CPU time rather than waiting for the owner of oom_lock to
make progress. We don't want to disable preemption while holding oom_lock
but we want the owner of oom_lock to complete as soon as possible.

Thus, this patch kills the dangerous assumption that sleeping for one
jiffy is sufficient for allowing the owner of oom_lock to make progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Michal Hocko Feb. 13, 2019, 4:56 p.m. UTC | #1
On Thu 14-02-19 01:30:28, Tetsuo Handa wrote:
[...]
> >From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 12 Feb 2019 20:12:35 +0900
> Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
> 
> When many hundreds of threads concurrently triggered a page fault, and
> one of them invoked the global OOM killer, the owner of oom_lock is
> preempted for minutes because they are rather depriving the owner of
> oom_lock of CPU time rather than waiting for the owner of oom_lock to
> make progress. We don't want to disable preemption while holding oom_lock
> but we want the owner of oom_lock to complete as soon as possible.
> 
> Thus, this patch kills the dangerous assumption that sleeping for one
> jiffy is sufficient for allowing the owner of oom_lock to make progress.

What does this prevent any _other_ kernel path or even high priority
userspace to preempt the oom killer path? This was the essential
question the last time around and I do not see it covered here. I
strongly suspect that all these games with the locking is just a
pointless tunning for an insane workload without fixing the underlying
issue.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 35fdde0..c867513 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3618,7 +3618,10 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	 */
>  	if (!mutex_trylock(&oom_lock)) {
>  		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> +		if (mutex_lock_killable(&oom_lock) == 0)
> +			mutex_unlock(&oom_lock);
> +		else if (!tsk_is_oom_victim(current))
> +			schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
>  
> -- 
> 1.8.3.1
Tetsuo Handa Feb. 15, 2019, 10:42 a.m. UTC | #2
On 2019/02/14 1:56, Michal Hocko wrote:
> On Thu 14-02-19 01:30:28, Tetsuo Handa wrote:
> [...]
>> >From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Tue, 12 Feb 2019 20:12:35 +0900
>> Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
>>
>> When many hundreds of threads concurrently triggered a page fault, and
>> one of them invoked the global OOM killer, the owner of oom_lock is
>> preempted for minutes because they are rather depriving the owner of
>> oom_lock of CPU time rather than waiting for the owner of oom_lock to
>> make progress. We don't want to disable preemption while holding oom_lock
>> but we want the owner of oom_lock to complete as soon as possible.
>>
>> Thus, this patch kills the dangerous assumption that sleeping for one
>> jiffy is sufficient for allowing the owner of oom_lock to make progress.
> 
> What does this prevent any _other_ kernel path or even high priority
> userspace to preempt the oom killer path? This was the essential
> question the last time around and I do not see it covered here. I

Since you already NACKed disabling preemption at
https://marc.info/?i=20180322114002.GC23100@dhcp22.suse.cz , pointing out
"even high priority userspace to preempt the oom killer path" is invalid.

Since printk() is very slow, dump_header() can become slow, especially when
dump_tasks() is called. And changing dump_tasks() to use rcu_lock_break()
does not solve this problem, for this is a problem that once current thread
released CPU, current thread might be kept preempted for minutes.

Allowing OOM path to be preempted is what you prefer, isn't it? Then,
spending CPU time for something (what you call "any _other_ kernel path") is
accountable for delaying OOM path. But wasting CPU time when allocating
threads can do nothing but wait for the owner of oom_lock to complete OOM
path is not accountable for delaying OOM path.

Thus, there is nothing to cover for your "I do not see it covered here"
response, except how to avoid "wasting CPU time when allocating threads
can do nothing but wait for the owner of oom_lock to complete OOM path".

> strongly suspect that all these games with the locking is just a
> pointless tunning for an insane workload without fixing the underlying
> issue.

We could even change oom_lock to a local lock inside oom_kill_process(), for
all threads in a same allocating context will select the same OOM victim
(unless oom_kill_allocating_task case), and many threads already inside
oom_kill_process() will prevent themselves from selecting next OOM victim.
Although this approach wastes some CPU resources for needlessly selecting
same OOM victim for many times, this approach also can solve this problem.

It seems that you don't want to admit that "wasting CPU time when allocating
threads can do nothing but wait for the owner of oom_lock to complete OOM path"
as the underlying issue. But we can't fix it without throttling direct reclaim
paths. That's the evidence that this problem is not fixed for many years.
Please stop NACKing proposals without alternative implementations.

---
 drivers/tty/sysrq.c |   2 -
 include/linux/oom.h |   2 -
 mm/memcontrol.c     |  11 +-----
 mm/oom_kill.c       | 106 ++++++++++++++++++++++++++++++++++++++--------------
 mm/page_alloc.c     |  18 +--------
 5 files changed, 81 insertions(+), 58 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index fa0ce7d..aa7f857 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -369,10 +369,8 @@ static void moom_callback(struct work_struct *ignored)
 		.order = -1,
 	};
 
-	mutex_lock(&oom_lock);
 	if (!out_of_memory(&oc))
 		pr_info("OOM request ignored. No task eligible\n");
-	mutex_unlock(&oom_lock);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index d079920..865fddf 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -54,8 +54,6 @@ struct oom_control {
 	enum oom_constraint constraint;
 };
 
-extern struct mutex oom_lock;
-
 static inline void set_current_oom_origin(void)
 {
 	current->signal->oom_flag_origin = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fc2e1a..31d3314 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1399,17 +1399,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
 
-	if (mutex_lock_killable(&oom_lock))
-		return true;
-	/*
-	 * A few threads which were not waiting at mutex_lock_killable() can
-	 * fail to bail out. Therefore, check again after holding oom_lock.
-	 */
-	ret = should_force_charge() || out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
-	return ret;
+	return should_force_charge() || out_of_memory(&oc);
 }
 
 #if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a24848..db804fa 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -54,14 +54,10 @@
 int sysctl_oom_dump_tasks = 1;
 
 /*
- * Serializes oom killer invocations (out_of_memory()) from all contexts to
- * prevent from over eager oom killing (e.g. when the oom killer is invoked
- * from different domains).
- *
  * oom_killer_disable() relies on this lock to stabilize oom_killer_disabled
  * and mark_oom_victim
  */
-DEFINE_MUTEX(oom_lock);
+static DEFINE_SPINLOCK(oom_disable_lock);
 
 #ifdef CONFIG_NUMA
 /**
@@ -430,7 +426,10 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 
 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
 {
+	static DEFINE_SPINLOCK(lock);
+
 	/* one line summary of the oom killer context. */
+	spin_lock(&lock);
 	pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
 			oom_constraint_text[oc->constraint],
 			nodemask_pr_args(oc->nodemask));
@@ -438,6 +437,7 @@ static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
 	mem_cgroup_print_oom_context(oc->memcg, victim);
 	pr_cont(",task=%s,pid=%d,uid=%d\n", victim->comm, victim->pid,
 		from_kuid(&init_user_ns, task_uid(victim)));
+	spin_unlock(&lock);
 }
 
 static void dump_header(struct oom_control *oc, struct task_struct *p)
@@ -677,9 +677,6 @@ static inline void wake_oom_reaper(struct task_struct *tsk)
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
  *
- * Has to be called with oom_lock held and never after
- * oom has been disabled already.
- *
  * tsk->mm has to be non NULL and caller has to guarantee it is stable (either
  * under task_lock or operate on the current).
  */
@@ -687,11 +684,6 @@ static void mark_oom_victim(struct task_struct *tsk)
 {
 	struct mm_struct *mm = tsk->mm;
 
-	WARN_ON(oom_killer_disabled);
-	/* OOM killer might race with memcg OOM */
-	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
-
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
 		mmgrab(tsk->signal->oom_mm);
@@ -704,8 +696,14 @@ static void mark_oom_victim(struct task_struct *tsk)
 	 * any memory and livelock. freezing_slow_path will tell the freezer
 	 * that TIF_MEMDIE tasks should be ignored.
 	 */
-	__thaw_task(tsk);
-	atomic_inc(&oom_victims);
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+	spin_lock(&oom_disable_lock);
+	if (!oom_killer_disabled) {
+		__thaw_task(tsk);
+		atomic_inc(&oom_victims);
+	}
+	spin_unlock(&oom_disable_lock);
 	trace_mark_victim(tsk->pid);
 }
 
@@ -752,10 +750,9 @@ bool oom_killer_disable(signed long timeout)
 	 * Make sure to not race with an ongoing OOM killer. Check that the
 	 * current is not killed (possibly due to sharing the victim's memory).
 	 */
-	if (mutex_lock_killable(&oom_lock))
-		return false;
+	spin_lock(&oom_disable_lock);
 	oom_killer_disabled = true;
-	mutex_unlock(&oom_lock);
+	spin_unlock(&oom_disable_lock);
 
 	ret = wait_event_interruptible_timeout(oom_victims_wait,
 			!atomic_read(&oom_victims), timeout);
@@ -942,7 +939,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	struct task_struct *victim = oc->chosen;
 	struct mem_cgroup *oom_group;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
+				      DEFAULT_RATELIMIT_BURST);
+	static DEFINE_MUTEX(oom_lock);
+	bool locked;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -959,6 +958,20 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(victim);
 
+	/*
+	 * Try to yield CPU time for dump_header(). But tolerate mixed printk()
+	 * output if current thread was killed, for exiting quickly is better.
+	 */
+	locked = !mutex_lock_killable(&oom_lock);
+	/*
+	 * Try to avoid reporting same OOM victim for multiple times, for
+	 * concurrently allocating threads are waiting (after selecting the
+	 * same OOM victim) at mutex_lock_killable() above.
+	 */
+	if (tsk_is_oom_victim(victim)) {
+		put_task_struct(victim);
+		goto done;
+	}
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, victim);
 
@@ -980,6 +993,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 				      (void*)message);
 		mem_cgroup_put(oom_group);
 	}
+ done:
+	if (locked)
+		mutex_unlock(&oom_lock);
 }
 
 /*
@@ -1032,16 +1048,54 @@ int unregister_oom_notifier(struct notifier_block *nb)
  */
 bool out_of_memory(struct oom_control *oc)
 {
-	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
 	if (oom_killer_disabled)
 		return false;
 
 	if (!is_memcg_oom(oc)) {
-		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
-		if (freed > 0)
-			/* Got some memory back in the last second. */
+		/*
+		 * This mutex makes sure that concurrently allocating threads
+		 * yield CPU time for the OOM notifier reclaim attempt.
+		 *
+		 * Since currently we are not doing the last second allocation
+		 * attempt after the OOM notifier reclaim attempt, we need to
+		 * use heuristically decreasing last_freed for concurrently
+		 * allocating threads (which can become dozens to hundreds).
+		 */
+		static DEFINE_MUTEX(lock);
+		static unsigned long last_freed;
+		unsigned long freed;
+
+		if (mutex_trylock(&lock)) {
+			last_freed /= 2;
+			freed = last_freed;
+force_notify:
+			blocking_notifier_call_chain(&oom_notify_list, 0,
+						     &freed);
+			last_freed = freed;
+			mutex_unlock(&lock);
+		} else if (mutex_lock_killable(&lock) == 0) {
+			/*
+			 * If someone is waiting at mutex_lock_killable() here,
+			 * mutex_trylock() above can't succeed. Therefore,
+			 * although this path should be fast enough, let's be
+			 * prepared for the worst case where nobody can update
+			 * last_freed from mutex_trylock() path above.
+			 */
+			if (!last_freed)
+				goto force_notify;
+			freed = last_freed--;
+			mutex_unlock(&lock);
+		} else {
+			/*
+			 * If killable wait failed, task_will_free_mem(current)
+			 * below likely returns true. Therefore, just fall
+			 * through here.
+			 */
+			freed = 0;
+		}
+		if (freed)
 			return true;
 	}
 
@@ -1104,8 +1158,7 @@ bool out_of_memory(struct oom_control *oc)
 
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * memory-hogging task.
  */
 void pagefault_out_of_memory(void)
 {
@@ -1120,8 +1173,5 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
-		return;
 	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5ff93ad..dce737e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3775,24 +3775,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
-	 */
-	if (!mutex_trylock(&oom_lock)) {
-		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
-		return NULL;
-	}
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure. But make sure that this reclaim
-	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
-	 * allocation which will never fail due to oom_lock already held.
+	 * we're still under heavy pressure.
 	 */
-	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
-				      ~__GFP_DIRECT_RECLAIM, order,
+	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL), order,
 				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
 	if (page)
 		goto out;
@@ -3843,7 +3830,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 					ALLOC_NO_WATERMARKS, ac);
 	}
 out:
-	mutex_unlock(&oom_lock);
 	return page;
 }
Michal Hocko Feb. 15, 2019, 12:18 p.m. UTC | #3
On Fri 15-02-19 19:42:41, Tetsuo Handa wrote:
> On 2019/02/14 1:56, Michal Hocko wrote:
> > On Thu 14-02-19 01:30:28, Tetsuo Handa wrote:
> > [...]
> >> >From 63c5c8ee7910fa9ef1c4067f1cb35a779e9d582c Mon Sep 17 00:00:00 2001
> >> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >> Date: Tue, 12 Feb 2019 20:12:35 +0900
> >> Subject: [PATCH v3] mm,page_alloc: wait for oom_lock before retrying.
> >>
> >> When many hundreds of threads concurrently triggered a page fault, and
> >> one of them invoked the global OOM killer, the owner of oom_lock is
> >> preempted for minutes because they are rather depriving the owner of
> >> oom_lock of CPU time rather than waiting for the owner of oom_lock to
> >> make progress. We don't want to disable preemption while holding oom_lock
> >> but we want the owner of oom_lock to complete as soon as possible.
> >>
> >> Thus, this patch kills the dangerous assumption that sleeping for one
> >> jiffy is sufficient for allowing the owner of oom_lock to make progress.
> > 
> > What does this prevent any _other_ kernel path or even high priority
> > userspace to preempt the oom killer path? This was the essential
> > question the last time around and I do not see it covered here. I
> 
> Since you already NACKed disabling preemption at
> https://marc.info/?i=20180322114002.GC23100@dhcp22.suse.cz , pointing out
> "even high priority userspace to preempt the oom killer path" is invalid.

Why?

> Since printk() is very slow, dump_header() can become slow, especially when
> dump_tasks() is called. And changing dump_tasks() to use rcu_lock_break()
> does not solve this problem, for this is a problem that once current thread
> released CPU, current thread might be kept preempted for minutes.

dump_tasks might be disabled for those who are concerned about the
overhead but I do not see what your actual point here is.

> Allowing OOM path to be preempted is what you prefer, isn't it?

No, it is just practicality. If you disable preemption you are
immediatelly going to fight with soft lockups.

> Then,
> spending CPU time for something (what you call "any _other_ kernel path") is
> accountable for delaying OOM path. But wasting CPU time when allocating
> threads can do nothing but wait for the owner of oom_lock to complete OOM
> path is not accountable for delaying OOM path.
> 
> Thus, there is nothing to cover for your "I do not see it covered here"
> response, except how to avoid "wasting CPU time when allocating threads
> can do nothing but wait for the owner of oom_lock to complete OOM path".
> 
> > strongly suspect that all these games with the locking is just a
> > pointless tunning for an insane workload without fixing the underlying
> > issue.
> 
> We could even change oom_lock to a local lock inside oom_kill_process(), for
> all threads in a same allocating context will select the same OOM victim
> (unless oom_kill_allocating_task case), and many threads already inside
> oom_kill_process() will prevent themselves from selecting next OOM victim.
> Although this approach wastes some CPU resources for needlessly selecting
> same OOM victim for many times, this approach also can solve this problem.
> 
> It seems that you don't want to admit that "wasting CPU time when allocating
> threads can do nothing but wait for the owner of oom_lock to complete OOM path"
> as the underlying issue. But we can't fix it without throttling direct reclaim
> paths. That's the evidence that this problem is not fixed for many years.

And yet I do not rememeber any _single_ bug report for a real life
workload that would be suffering from this. I am all for a better
throttling on the OOM conditions but what you have been proposing are
hacks at best without any real world workload backing them.

Please try to understand that the OOM path in the current form is quite
complex already and adding more on top without addressing a problem
which real workloads do care about is not really all that attractive.
I have no objections to simple and obviously correct changes in this
area but playing with locking with hard to evaluate side effects is not
something I will ack.

Patch
diff mbox series

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde0..c867513 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3618,7 +3618,10 @@  void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
+		if (mutex_lock_killable(&oom_lock) == 0)
+			mutex_unlock(&oom_lock);
+		else if (!tsk_is_oom_victim(current))
+			schedule_timeout_uninterruptible(1);
 		return NULL;
 	}