[RFC,v2,0/3] oom: rework oom_reaper vs. exit_mmap handoff

Message ID: 20181025082403.3806-1-mhocko@kernel.org


Michal Hocko Oct. 25, 2018, 8:24 a.m. UTC
The previous version of this RFC has been posted here [1]. I have fixed a
few issues spotted during the review and by the 0day bot. I have also
reworked patch 2 to be ratio-based rather than based on an absolute number.

With this series applied the locking protocol between the oom_reaper and
the exit path is as follows.

All parts which cannot race should use the exclusive lock on the exit
path. Once the exit path has passed the point after which no blocking
locks are taken, it clears mm->mmap under the exclusive lock. The
oom_reaper checks for this and sets MMF_OOM_SKIP only if the exit path is
not guaranteed to finish the job. This is patch 3; see its changelog for
all the details.

I would really appreciate it if David could give this a try and see how
it behaves in workloads where the oom_reaper currently falls flat. I have
been playing with sparsely allocated memory with a high pte/real memory
ratio and large mlocked processes, and it worked reasonably well.

There is still some room for tuning here, of course. We can change the
number of retries for the oom_reaper as well as the threshold for when to
keep retrying.

Michal Hocko (3):
      mm, oom: rework mmap_exit vs. oom_reaper synchronization
      mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left
      mm, oom: hand over MMF_OOM_SKIP to exit path if it is guaranteed to finish

Diffstat:
 include/linux/oom.h |  2 --
 mm/internal.h       |  3 +++
 mm/memory.c         | 28 ++++++++++++++--------
 mm/mmap.c           | 69 +++++++++++++++++++++++++++++++++--------------------
 mm/oom_kill.c       | 45 ++++++++++++++++++++++++----------
 5 files changed, 97 insertions(+), 50 deletions(-)

[1] http://lkml.kernel.org/r/20180910125513.311-1-mhocko@kernel.org

Comments

Michal Hocko Nov. 8, 2018, 9:32 a.m. UTC | #1
On Thu 25-10-18 10:24:00, Michal Hocko wrote:
> The previous version of this RFC has been posted here [1]. I have fixed a
> few issues spotted during the review and by the 0day bot. I have also
> reworked patch 2 to be ratio-based rather than based on an absolute number.
> 
> With this series applied the locking protocol between the oom_reaper and
> the exit path is as follows.
> 
> All parts which cannot race should use the exclusive lock on the exit
> path. Once the exit path has passed the point after which no blocking
> locks are taken, it clears mm->mmap under the exclusive lock. The
> oom_reaper checks for this and sets MMF_OOM_SKIP only if the exit path is
> not guaranteed to finish the job. This is patch 3; see its changelog for
> all the details.
> 
> I would really appreciate it if David could give this a try and see how
> it behaves in workloads where the oom_reaper currently falls flat. I have
> been playing with sparsely allocated memory with a high pte/real memory
> ratio and large mlocked processes, and it worked reasonably well.

Does this help the workloads you were referring to earlier, David?

> There is still some room for tuning here, of course. We can change the
> number of retries for the oom_reaper as well as the threshold for when to
> keep retrying.
> 
> Michal Hocko (3):
>       mm, oom: rework mmap_exit vs. oom_reaper synchronization
>       mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left
>       mm, oom: hand over MMF_OOM_SKIP to exit path if it is guaranteed to finish
> 
> Diffstat:
>  include/linux/oom.h |  2 --
>  mm/internal.h       |  3 +++
>  mm/memory.c         | 28 ++++++++++++++--------
>  mm/mmap.c           | 69 +++++++++++++++++++++++++++++++++--------------------
>  mm/oom_kill.c       | 45 ++++++++++++++++++++++++----------
>  5 files changed, 97 insertions(+), 50 deletions(-)
> 
> [1] http://lkml.kernel.org/r/20180910125513.311-1-mhocko@kernel.org
>
Tetsuo Handa Nov. 14, 2018, 9:46 a.m. UTC | #2
On 2018/11/08 18:32, Michal Hocko wrote:
> On Thu 25-10-18 10:24:00, Michal Hocko wrote:
>> The previous version of this RFC has been posted here [1]. I have fixed a
>> few issues spotted during the review and by the 0day bot. I have also
>> reworked patch 2 to be ratio-based rather than based on an absolute number.
>>
>> With this series applied the locking protocol between the oom_reaper and
>> the exit path is as follows.
>>
>> All parts which cannot race should use the exclusive lock on the exit
>> path. Once the exit path has passed the point after which no blocking
>> locks are taken, it clears mm->mmap under the exclusive lock. The
>> oom_reaper checks for this and sets MMF_OOM_SKIP only if the exit path is
>> not guaranteed to finish the job. This is patch 3; see its changelog for
>> all the details.
>>
>> I would really appreciate it if David could give this a try and see how
>> it behaves in workloads where the oom_reaper currently falls flat. I have
>> been playing with sparsely allocated memory with a high pte/real memory
>> ratio and large mlocked processes, and it worked reasonably well.
> 
> Does this help the workloads you were referring to earlier, David?

I still refuse this patch, because it allows OOM lockups in the exit path
triggered by CPU starvation.



First, run the stressor shown below

----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
	cpu_set_t cpu = { { 1 } };	/* CPU 0 only */
	struct sched_param sp = { 99 };
	FILE *fp;
	int i;
	const unsigned long size = 1048576 * 200;	/* 200MB */
	char *buf = malloc(size);
	/* Limit the memcg to half of "size" so that memset() below hits OOM. */
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	/* Make this process and its children the preferred OOM victims. */
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "1000\n");
	fclose(fp);
	/* Pin everything to CPU 0 and start as a high-priority RT task. */
	sched_setaffinity(0, sizeof(cpu), &cpu);
	sched_setscheduler(0, SCHED_FIFO, &sp);
	nice(-20);
	sp.sched_priority = 0;
	for (i = 0; i < 64; i++)
		if (fork() == 0) {
			/* Half of the children fault memory in at idle priority. */
			if (i < 32) {
				sched_setscheduler(0, SCHED_IDLE, &sp);
				nice(19);
			}
			memset(buf, 0, size);
			_exit(0);
		}
	while (wait(NULL) > 0);
	return 0;
}
----------------------------------------

together with the small additional printk() patch shown below.

----------------------------------------
 kernel/locking/lockdep.c | 6 ------
 mm/oom_kill.c            | 9 +++++++--
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 1efada2..822b412 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -564,12 +564,6 @@ static void lockdep_print_held_locks(struct task_struct *p)
 	else
 		printk("%d lock%s held by %s/%d:\n", depth,
 		       depth > 1 ? "s" : "", p->comm, task_pid_nr(p));
-	/*
-	 * It's not reliable to print a task's held locks if it's not sleeping
-	 * and it's not the current task.
-	 */
-	if (p->state == TASK_RUNNING && p != current)
-		return;
 	for (i = 0; i < depth; i++) {
 		printk(" #%d: ", i);
 		print_lock(p->held_locks + i);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 503d24d..b9e5a61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -613,8 +613,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
+	pr_info("oom_reaper: unable to reap pid:%d (%s), anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+		task_pid_nr(tsk), tsk->comm,
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	debug_show_all_locks();
 
 done:
@@ -628,6 +631,8 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------------------------------------

You will find that the "oom_reaper: unable to reap" message is trivially
printed (and MMF_OOM_SKIP is set) while the OOM victim is in the middle of
a sched_setscheduler() request in the stressor shown above.

  1 lock held by a.out/1331:
   #0: 00000000065062b5 (rcu_read_lock){....}, at: do_sched_setscheduler+0x54/0x190

This is because

	if (oom_badness_pages(mm) > (original_badness >> 2))
		ret = false;

made the OOM reaper print the "oom_reaper: unable to reap" message even
though reaping had succeeded, because the OOM victim was consuming little
memory to begin with. Yes, we could suppress the "oom_reaper: unable to
reap" message when reaping succeeded.



Next, let's consider the lucky cases where the "Handed over" message is printed.
Remove

	if (i < 32) {
		sched_setscheduler(0, SCHED_IDLE, &sp);
		nice(19);
	}

lines from the stressor, and retry with the patch shown below, which does
a similar thing from the exit path.

----------------------------------------
 kernel/sys.c | 14 ++++++++++++++
 mm/mmap.c    |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..a1798bb 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -190,6 +190,20 @@ static int set_one_prio(struct task_struct *p, int niceval, int error)
 	return error;
 }
 
+struct sched_param {
+	int sched_priority;
+};
+
+void my_setpriority(void)
+{
+	struct sched_param sp = { };
+
+	sched_setscheduler(current, SCHED_IDLE, &sp);
+	rcu_read_lock();
+	set_one_prio(current, MAX_NICE, -ESRCH);
+	rcu_read_unlock();
+}
+
 SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 {
 	struct task_struct *g, *p;
diff --git a/mm/mmap.c b/mm/mmap.c
index 9063fdc..516cfc9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3140,6 +3140,7 @@ void exit_mmap(struct mm_struct *mm)
 	 * tear down when it is safe to do so
 	 */
 	if (oom) {
+		extern void my_setpriority(void);
 		down_write(&mm->mmap_sem);
 		__unlink_vmas(vma);
 		/*
@@ -3149,6 +3150,7 @@ void exit_mmap(struct mm_struct *mm)
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
+		my_setpriority();
 		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	} else {
 		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
----------------------------------------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20181114-1.txt.xz .
----------------------------------------
[   43.961735] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[   43.964369] kmem: usage 6528kB, limit 9007199254740988kB, failcnt 0
[   43.967410] Memory cgroup stats for /test1: cache:0KB rss:95832KB rss_huge:94208KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:95876KB inactive_file:0KB active_file:0KB unevictable:0KB
[   43.976150] Memory cgroup out of memory: Kill process 1079 (a.out) score 990 or sacrifice child
[   43.980391] Killed process 1083 (a.out) total-vm:209156kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB
[   44.090703] Handed over 1083 to exit path.
[   76.209031] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 33s!
[   76.212351] Showing busy workqueues and worker pools:
[   76.215308] workqueue events: flags=0x0
[   76.217832]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[   76.217860]     pending: vmstat_shepherd, vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[   76.217878] workqueue events_highpri: flags=0x10
[   76.217882]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
[   76.217887]     pending: flush_backlog BAR(436)
[   76.217941] workqueue mm_percpu_wq: flags=0x8
[   76.218004]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   76.218007]     pending: vmstat_update
[   76.218011] workqueue netns: flags=0xe000a
[   76.218039]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/1
[   76.218068]     in-flight: 436:cleanup_net
[   76.218081] workqueue memcg_kmem_cache: flags=0x0
[   76.218085]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[   76.218088]     pending: memcg_kmem_cache_create_func
[   76.218091]     delayed: memcg_kmem_cache_create_func,(...snipped...), memcg_kmem_cache_create_func
[   77.511224] pool 16: cpus=0-7 flags=0x4 nice=0 hung=4s workers=32 idle: 435 437 434 433 432 431 430 429 428 427 426 425 424 423 422 421 420 419 417 418 416 414 413 412 143 411 75 7 145 439 438
[  141.216183] rcu: INFO: rcu_preempt self-detected stall on CPU
[  141.219379] rcu:     0-....: (66161 ticks this GP) idle=11e/1/0x4000000000000002 softirq=3290/3292 fqs=15874
[  141.223795] rcu:      (t=65000 jiffies g=10501 q=1115)
[  141.226815] NMI backtrace for cpu 0
[  141.229437] CPU: 0 PID: 1084 Comm: a.out Not tainted 4.20.0-rc2+ #778
[  141.232891] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  141.237363] Call Trace:
(...snipped...)
[  141.313476]  ? mem_cgroup_iter+0x156/0x7c0
[  141.315721]  mem_cgroup_iter+0x17e/0x7c0
[  141.318885]  ? mem_cgroup_iter+0x156/0x7c0
[  141.321488]  snapshot_refaults+0x5c/0x90
[  141.323609]  do_try_to_free_pages+0x2dd/0x3b0
[  141.326132]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  141.329223]  try_charge+0x28b/0x830
[  141.331761]  memcg_kmem_charge_memcg+0x35/0x90
[  141.334568]  ? get_mem_cgroup_from_mm+0x239/0x2e0
[  141.336693]  memcg_kmem_charge+0x85/0x260
[  141.338605]  __alloc_pages_nodemask+0x218/0x360
[  141.340693]  pte_alloc_one+0x16/0xb0
[  141.342394]  __handle_mm_fault+0x873/0x1580
[  141.344355]  handle_mm_fault+0x1b2/0x3a0
[  141.346048]  ? handle_mm_fault+0x47/0x3a0
[  141.348501]  __do_page_fault+0x28c/0x530
[  141.350389]  do_page_fault+0x28/0x260
[  141.352133]  ? page_fault+0x8/0x30
[  141.353785]  page_fault+0x1e/0x30
[  141.355602] RIP: 0033:0x7f4bc7c92682
[  141.357336] Code: Bad RIP value.
[  141.358787] RSP: 002b:00007ffe682fc4a0 EFLAGS: 00010246
[  141.361110] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f4bc7c92682
[  141.363863] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  141.366464] RBP: 00007ffe682fc4c0 R08: 00007f4bc81a2740 R09: 0000000000000000
[  141.369074] R10: 00007f4bc81a2a10 R11: 0000000000000246 R12: 0000000000000000
[  141.371704] R13: 00007ffe682fc650 R14: 0000000000000000 R15: 0000000000000000
[  248.241504] INFO: task kworker/u16:28:436 blocked for more than 120 seconds.
[  248.244263]       Not tainted 4.20.0-rc2+ #778
[  248.246033] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  248.248718] kworker/u16:28  D10872   436      2 0x80000000
[  248.250768] Workqueue: netns cleanup_net
(...snipped...)
[  290.559855] a.out           R  running task    12792  1083   1079 0x80100084
[  290.562480] Call Trace:
[  290.563633]  __schedule+0x246/0x990
[  290.565031]  ? _raw_spin_unlock_irqrestore+0x55/0x70
[  290.566873]  preempt_schedule_common+0x34/0x54
[  290.568782]  preempt_schedule+0x1f/0x30
[  290.570896]  ___preempt_schedule+0x16/0x18
[  290.573639]  __sched_setscheduler+0x722/0x7c0
[  290.576363]  _sched_setscheduler+0x70/0x90
[  290.578785]  sched_setscheduler+0xe/0x10
[  290.580403]  my_setpriority+0x35/0x130
[  290.582277]  exit_mmap+0x1b8/0x1e0
[  290.583990]  mmput+0x63/0x130
[  290.585624]  do_exit+0x29d/0xcf0
[  290.586963]  ? _raw_spin_unlock_irq+0x27/0x50
[  290.588673]  ? __this_cpu_preempt_check+0x13/0x20
[  290.590600]  do_group_exit+0x47/0xc0
[  290.592098]  get_signal+0x329/0x920
[  290.594028]  do_signal+0x32/0x6e0
[  290.595620]  ? exit_to_usermode_loop+0x26/0x95
[  290.597268]  ? prepare_exit_to_usermode+0xa8/0xd0
[  290.599154]  exit_to_usermode_loop+0x3e/0x95
[  290.600794]  prepare_exit_to_usermode+0xa8/0xd0
[  290.602524]  ? page_fault+0x8/0x30
[  290.603928]  retint_user+0x8/0x18
(...snipped...)
[  336.219361] rcu: INFO: rcu_preempt self-detected stall on CPU
[  336.221340] rcu:     0-....: (259718 ticks this GP) idle=11e/1/0x4000000000000002 softirq=3290/3292 fqs=63254
[  336.224387] rcu:      (t=260003 jiffies g=10501 q=26578)
[  336.226119] NMI backtrace for cpu 0
[  336.227544] CPU: 0 PID: 1084 Comm: a.out Not tainted 4.20.0-rc2+ #778
[  336.229729] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  336.233056] Call Trace:
[  336.234645]  <IRQ>
[  336.235696]  dump_stack+0x67/0x95
[  336.226119] NMI backtrace for cpu 0
[  336.227544] CPU: 0 PID: 1084 Comm: a.out Not tainted 4.20.0-rc2+ #778
[  336.229729] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  336.233056] Call Trace:
(...snipped...)
[  336.287781]  shrink_node_memcg+0xa6/0x430
[  336.289621]  ? mem_cgroup_iter+0x21d/0x7c0
[  336.291382]  ? mem_cgroup_iter+0x156/0x7c0
[  336.293093]  shrink_node+0xd1/0x450
[  336.294734]  do_try_to_free_pages+0x103/0x3b0
[  336.296729]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  336.298955]  try_charge+0x28b/0x830
[  336.300523]  memcg_kmem_charge_memcg+0x35/0x90
[  336.302258]  ? get_mem_cgroup_from_mm+0x239/0x2e0
[  336.304054]  memcg_kmem_charge+0x85/0x260
[  336.305688]  __alloc_pages_nodemask+0x218/0x360
[  336.307404]  pte_alloc_one+0x16/0xb0
[  336.308900]  __handle_mm_fault+0x873/0x1580
[  336.310577]  handle_mm_fault+0x1b2/0x3a0
[  336.312123]  ? handle_mm_fault+0x47/0x3a0
[  336.313768]  __do_page_fault+0x28c/0x530
[  336.315351]  do_page_fault+0x28/0x260
[  336.316870]  ? page_fault+0x8/0x30
[  336.318251]  page_fault+0x1e/0x30
----------------------------------------

Since the OOM victim thread is still reachable from the task list at this
point, it is possible that somebody changes the scheduling priority of the
OOM victim.



You might think that it is unfair to change the scheduling priority from
the exit path. Then retry with the patch shown below, which does a similar
thing from outside the exit path (in the OOM reaper kernel thread),
instead of the patch shown above.

----------------------------------------
 kernel/sys.c  | 14 ++++++++++++++
 mm/oom_kill.c |  5 ++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..acf24f7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -190,6 +190,20 @@ static int set_one_prio(struct task_struct *p, int niceval, int error)
 	return error;
 }
 
+struct sched_param {
+	int sched_priority;
+};
+
+void my_setpriority(struct task_struct *p)
+{
+	struct sched_param sp = { };
+
+	sched_setscheduler(p, SCHED_IDLE, &sp);
+	rcu_read_lock();
+	set_one_prio(p, MAX_NICE, -ESRCH);
+	rcu_read_unlock();
+}
+
 SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 {
 	struct task_struct *g, *p;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b9e5a61..5841522 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -631,8 +631,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		extern void my_setpriority(struct task_struct *p);
+		my_setpriority(tsk);
 		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------------------------------------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20181114-2.txt.xz .
----------------------------------------
[   63.713061] Memory cgroup out of memory: Kill process 1236 (a.out) score 994 or sacrifice child
[   63.716303] Killed process 1236 (a.out) total-vm:209156kB, anon-rss:1344kB, file-rss:440kB, shmem-rss:0kB
[   63.720719] Handed over 1236 to exit path.
[  121.588872] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 62s!
[  121.591898] Showing busy workqueues and worker pools:
[  121.594011] workqueue events: flags=0x0
[  121.596230]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  121.596258]     pending: vmstat_shepherd, vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], free_work, check_corruption
[  121.596331] workqueue events_power_efficient: flags=0x80
[  121.596336]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  121.596339]     pending: neigh_periodic_work, neigh_periodic_work
[  121.596347] workqueue mm_percpu_wq: flags=0x8
[  121.596350]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.596353]     pending: vmstat_update
[  186.594586] rcu: INFO: rcu_preempt self-detected stall on CPU
[  186.597329] rcu:     0-....: (67217 ticks this GP) idle=55e/1/0x4000000000000002 softirq=4231/4233 fqs=15984
[  186.601230] rcu:      (t=65000 jiffies g=9133 q=905)
[  186.603531] NMI backtrace for cpu 0
[  186.605611] CPU: 0 PID: 1237 Comm: a.out Not tainted 4.20.0-rc2+ #779
[  186.608291] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  186.612320] Call Trace:
(...snipped...)
[  186.674690]  mem_cgroup_iter+0x21d/0x7c0
[  186.676609]  ? mem_cgroup_iter+0x156/0x7c0
[  186.678443]  shrink_node+0xa9/0x450
[  186.680183]  do_try_to_free_pages+0x103/0x3b0
[  186.682154]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  186.684358]  try_charge+0x28b/0x830
[  186.686018]  mem_cgroup_try_charge+0x42/0x1d0
[  186.687931]  mem_cgroup_try_charge_delay+0x11/0x30
[  186.689889]  __handle_mm_fault+0xa88/0x1580
[  186.691851]  handle_mm_fault+0x1b2/0x3a0
[  186.693686]  ? handle_mm_fault+0x47/0x3a0
[  186.695662]  __do_page_fault+0x28c/0x530
[  186.697859]  do_page_fault+0x28/0x260
[  186.699598]  ? page_fault+0x8/0x30
[  186.701223]  page_fault+0x1e/0x30
(...snipped...)
[  381.597229] rcu: INFO: rcu_preempt self-detected stall on CPU
[  381.599916] rcu:     0-....: (261318 ticks this GP) idle=55e/1/0x4000000000000002 softirq=4231/4233 fqs=63272
[  381.603762] rcu:      (t=260003 jiffies g=9133 q=4712)
[  381.605823] NMI backtrace for cpu 0
[  381.607551] CPU: 0 PID: 1237 Comm: a.out Not tainted 4.20.0-rc2+ #779
[  381.610201] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  381.614034] Call Trace:
(...snipped...)
[  381.672523]  ? mem_cgroup_iter+0x579/0x7c0
[  381.674461]  mem_cgroup_iter+0x5a6/0x7c0
[  381.676437]  ? mem_cgroup_iter+0x579/0x7c0
[  381.678375]  shrink_node+0x125/0x450
[  381.680013]  do_try_to_free_pages+0x103/0x3b0
[  381.682017]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  381.684460]  try_charge+0x28b/0x830
[  381.686290]  mem_cgroup_try_charge+0x42/0x1d0
[  381.688454]  mem_cgroup_try_charge_delay+0x11/0x30
[  381.690929]  __handle_mm_fault+0xa88/0x1580
[  381.692860]  handle_mm_fault+0x1b2/0x3a0
[  381.694599]  ? handle_mm_fault+0x47/0x3a0
[  381.696269]  __do_page_fault+0x28c/0x530
[  381.697970]  do_page_fault+0x28/0x260
[  381.699650]  ? page_fault+0x8/0x30
[  381.701160]  page_fault+0x1e/0x30
(...snipped...)
[  446.166354] a.out           R  running task    13560  1236   1212 0x80100084
[  446.169179] Call Trace:
[  446.170469]  __schedule+0x246/0x990
[  446.172061]  preempt_schedule_common+0x34/0x54
[  446.173849]  preempt_schedule+0x1f/0x30
[  446.175514]  ___preempt_schedule+0x16/0x18
[  446.177187]  _raw_spin_unlock_irqrestore+0x5e/0x70
[  446.179277]  __debug_check_no_obj_freed+0x10b/0x1c0
[  446.181315]  ? kmem_cache_free+0x7f/0x2a0
[  446.183000]  ? vm_area_free+0x13/0x20
[  446.184646]  ? vm_area_free+0x13/0x20
[  446.186144]  debug_check_no_obj_freed+0x14/0x16
[  446.187950]  kmem_cache_free+0x15d/0x2a0
[  446.189556]  vm_area_free+0x13/0x20
[  446.190968]  remove_vma+0x54/0x60
[  446.192498]  exit_mmap+0x13d/0x1d0
[  446.193941]  mmput+0x63/0x130
[  446.195628]  do_exit+0x29d/0xcf0
[  446.197013]  ? _raw_spin_unlock_irq+0x27/0x50
[  446.198981]  ? __this_cpu_preempt_check+0x13/0x20
[  446.201033]  do_group_exit+0x47/0xc0
[  446.202715]  get_signal+0x329/0x920
[  446.204375]  do_signal+0x32/0x6e0
[  446.205731]  ? exit_to_usermode_loop+0x26/0x95
[  446.207487]  ? prepare_exit_to_usermode+0xa8/0xd0
[  446.209288]  exit_to_usermode_loop+0x3e/0x95
[  446.210924]  prepare_exit_to_usermode+0xa8/0xd0
[  446.212968]  ? page_fault+0x8/0x30
[  446.214627]  retint_user+0x8/0x18
(...snipped...)
[  446.234391] a.out           R  running task    13680  1237   1212 0x80000080
[  446.236896] Call Trace:
[  446.238079]  ? mem_cgroup_iter+0x156/0x7c0
[  446.239737]  ? find_held_lock+0x44/0xb0
[  446.241264]  ? debug_smp_processor_id+0x17/0x20
[  446.243088]  ? css_next_descendant_pre+0x45/0xb0
[  446.245025]  ? delayacct_end+0x1e/0x50
[  446.246640]  ? mem_cgroup_iter+0x2d2/0x7c0
[  446.248231]  ? shrink_node+0x125/0x450
[  446.249815]  ? do_try_to_free_pages+0x103/0x3b0
[  446.251724]  ? try_to_free_mem_cgroup_pages+0x10d/0x390
[  446.253703]  ? try_charge+0x28b/0x830
[  446.255270]  ? mem_cgroup_try_charge+0x42/0x1d0
[  446.257133]  ? mem_cgroup_try_charge_delay+0x11/0x30
[  446.259433]  ? __handle_mm_fault+0xa88/0x1580
[  446.261252]  ? handle_mm_fault+0x1b2/0x3a0
[  446.263071]  ? handle_mm_fault+0x47/0x3a0
[  446.264739]  ? __do_page_fault+0x28c/0x530
[  446.266376]  ? do_page_fault+0x28/0x260
[  446.267936]  ? page_fault+0x8/0x30
[  446.269478]  ? page_fault+0x1e/0x30
(...snipped...)
[  583.605227] a.out           R  running task    13560  1236   1212 0x80100084
[  583.608637] Call Trace:
[  583.609968]  __schedule+0x246/0x990
[  583.611682]  preempt_schedule_common+0x34/0x54
[  583.613472]  preempt_schedule+0x1f/0x30
[  583.615140]  ___preempt_schedule+0x16/0x18
[  583.616803]  _raw_spin_unlock_irqrestore+0x5e/0x70
[  583.618908]  __debug_check_no_obj_freed+0x10b/0x1c0
[  583.620844]  ? kmem_cache_free+0x7f/0x2a0
[  583.622691]  ? vm_area_free+0x13/0x20
[  583.624268]  ? vm_area_free+0x13/0x20
[  583.625765]  debug_check_no_obj_freed+0x14/0x16
[  583.627572]  kmem_cache_free+0x15d/0x2a0
[  583.629095]  vm_area_free+0x13/0x20
[  583.630631]  remove_vma+0x54/0x60
[  583.632533]  exit_mmap+0x13d/0x1d0
[  583.633993]  mmput+0x63/0x130
[  583.635500]  do_exit+0x29d/0xcf0
[  583.637010]  ? _raw_spin_unlock_irq+0x27/0x50
[  583.638776]  ? __this_cpu_preempt_check+0x13/0x20
[  583.640580]  do_group_exit+0x47/0xc0
[  583.642026]  get_signal+0x329/0x920
[  583.643908]  do_signal+0x32/0x6e0
[  583.645447]  ? exit_to_usermode_loop+0x26/0x95
[  583.647548]  ? prepare_exit_to_usermode+0xa8/0xd0
[  583.649457]  exit_to_usermode_loop+0x3e/0x95
[  583.651466]  prepare_exit_to_usermode+0xa8/0xd0
[  583.653244]  ? page_fault+0x8/0x30
[  583.654681]  retint_user+0x8/0x18
(...snipped...)
[  583.674642] a.out           R  running task    13680  1237   1212 0x80000080
[  583.677898] Call Trace:
[  583.679677]  ? find_held_lock+0x44/0xb0
[  583.681522]  ? lock_acquire+0xd3/0x210
[  583.685214]  ? rcu_is_watching+0x11/0x50
[  583.687452]  ? vmpressure+0x100/0x110
[  583.689069]  ? mem_cgroup_iter+0x5a6/0x7c0
[  583.690787]  ? lockdep_hardirqs_on+0xe5/0x1c0
[  583.692476]  ? shrink_node+0x125/0x450
[  583.694018]  ? css_task_iter_next+0x2b/0x70
[  583.695883]  ? do_try_to_free_pages+0x2fe/0x3b0
[  583.697702]  ? try_to_free_mem_cgroup_pages+0x10d/0x390
[  583.699611]  ? cgroup_file_notify+0x16/0x70
[  583.701252]  ? try_charge+0x28b/0x830
[  583.702954]  ? mem_cgroup_try_charge+0x42/0x1d0
[  583.704958]  ? mem_cgroup_try_charge_delay+0x11/0x30
[  583.706993]  ? __handle_mm_fault+0xa88/0x1580
[  583.708833]  ? handle_mm_fault+0x1b2/0x3a0
[  583.710502]  ? handle_mm_fault+0x47/0x3a0
[  583.712154]  ? __do_page_fault+0x28c/0x530
[  583.713806]  ? do_page_fault+0x28/0x260
[  583.715439]  ? page_fault+0x8/0x30
[  583.716973]  ? page_fault+0x1e/0x30
----------------------------------------

Since the allocating threads call only cond_resched(), and cond_resched()
is a no-op for realtime-priority threads, allocating threads with realtime
priority will wait forever for the OOM victim running at idle priority.



There is always an invisible lock called "scheduling priority". You can't
hand MMF_OOM_SKIP over to the exit path. Your approach is not ready for
handling the worst case.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Michal Hocko Nov. 14, 2018, 10:16 a.m. UTC | #3
On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
[...]
> There is always an invisible lock called "scheduling priority". You can't
> hand MMF_OOM_SKIP over to the exit path. Your approach is not ready for
> handling the worst case.

And that problem is all over the memory reclaim. You can get starved
to death and block other resources. And the memory reclaim is not the
only one. This is a fundamental issue of locking without priority
inheritance and other real-time techniques.

> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

OK, if your whole point is to block any changes in this area unless it is
your solution that gets merged, then I guess I will just push the patches
through with your explicit Nack on them. Your feedback has been
far-fetched many times and has distracted the discussion way too often.
This is especially sad because your testing and review were really helpful
at times. I do not really have the energy to argue the same set of
arguments over and over again.

You have expressed unwillingness to understand the overall picture
several times. You do not care about the long-term maintenance burden of
this code, which is quite tricky already, and refuse to understand the
cost/benefit part.

If this series works for the workload reported by David, I will simply
push it through and let Andrew decide. If there is a lack of feedback, I
will just keep it around, because it seems that most users do not care
about these corner cases anyway.
Tetsuo Handa Nov. 15, 2018, 9:54 a.m. UTC | #4
On 2018/11/14 19:16, Michal Hocko wrote:
> On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> [...]
> > There is always an invisible lock called "scheduling priority". You can't
> > hand MMF_OOM_SKIP over to the exit path. Your approach is not ready for
> > handling the worst case.
> 
> And that problem is all over the memory reclaim. You can get starved
> to death and block other resources. And the memory reclaim is not the
> only one.

I think that it is a basic rule for kernel developers that no thread keeps
consuming CPU resources forever. In the kernel world, doing

  while (1);

is not permitted. Likewise, doing

  for (i = 0; i < very_large_value; i++)
      do_something_which_does_not_yield_CPU_to_others();

has to be avoided, in order to avoid lockup problems. We are required to
yield the CPU to others when we are waiting for somebody else to make
progress. It is the page allocator that is refusing to yield the CPU to
those who need it.

Since the OOM reaper kernel thread "has normal priority" and "can run on
any CPU", it is less likely to fail to run than an OOM victim thread which
"has idle priority" and "can run only on limited CPUs". You are trying to
add a dependency on such a thread, and I'm saying that adding a dependency
on such a thread increases the possibility of lockup.

Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
busy with realtime threads waiting for the OOM reaper kernel thread to
make progress. In that case, we had better stop relying on asynchronous
memory reclaim and switch to direct OOM reaping by the allocating threads.

But what I demonstrated is that

        /*
         * the exit path is guaranteed to finish the memory tear down
         * without any unbound blocking at this stage so make it clear
         * to the oom_reaper
         */

becomes a lie even when only one CPU was busy with realtime threads
waiting for an idle-priority thread to make progress. If the page
allocator stops telling the lie that "an OOM victim is making progress on
behalf of me", we can avoid the lockup.

>           This is a fundamental issue of locking without priority
> inheritance and other real-time techniques.

That is nothing but evidence that you are refusing to address the possibility
of lockup, and will keep piling up stupid lockup bugs. OOM handling does not
use locks when waiting for somebody else to make progress. Blaming "the locking
without priority inheritance" is wrong.

Each subsystem is responsible for avoiding the lockup. If one subsystem is
triggering a lockup due to printk() flooding, that subsystem avoids the lockup
by ceasing to abuse CPU resources, i.e. by reducing its printk() messages.
That's all we can do for now. MM is not privileged enough to lock up the system.

> 
> > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> OK, if your whole point is to block any changes into this area unless
> it is your solution to be merged then I guess I will just push patches
> through with your explicit Nack on them. Your feedback was far-fetched
> many times and has distracted the discussion way too often. This is
> especially sad because your testing and review was really helpful at
> times. I do not really have energy to argue the same set of arguments
> over and over again.
> 
> You have expressed unwillingness to understand the overall
> picture several times. You do not care about a long term maintenance
> burden of this code which is quite tricky already and refuse to
> understand the cost/benefit part.
> 
> If this series works for the workload reported by David I will simply
> push it through and let Andrew decide.

I'm skeptical about this approach. Given that exit_mmap() has to do more
things than what the OOM reaper kernel thread can do, how likely is it that
the OOM reaper kernel thread can find that exit_mmap() has completed both
the full unmap_vmas() and __unlink_vmas() within (at most) one second after
the OOM reaper completed only a subset of unmap_vmas()? If exit_mmap() is
preempted (due to scheduling priority), the OOM reaper will likely fail to
find it. Anyway, if you insist on this approach, I expect that exit_mmap()
asks the OOM reaper kernel thread to call __free_pgtables(), the OOM reaper
kernel thread sets MMF_OOM_SKIP, and exit_mmap() resumes with remove_vma()
etc., for the OOM reaper kernel thread ("has normal priority" and "can run
on any CPU") is more reliable than a random thread ("has idle priority" and
"can run on only a limited set of CPUs").

Even assuming the OOM reaper can likely find it, there is a lack of review
from the arch maintainers (regarding whether this approach is really safe,
and whether it constrains future improvements).

>                                        If there is a lack of feedback
> I will just keep it around because it seems that most users do not care
> about these corner cases anyway.

No way. We are always failing to get attention regarding OOM handling.  ;-)

Even the discussion of how to avoid the flooding caused by

    dump_header(oc, NULL);
    pr_warn("Out of memory and no killable processes...\n");

for memcg OOM is failing to make progress.

People make changes without thinking about OOM handling.
If you push this approach, address the lack of review first.
Michal Hocko Nov. 15, 2018, 11:36 a.m. UTC | #5
On Thu 15-11-18 18:54:15, Tetsuo Handa wrote:
> On 2018/11/14 19:16, Michal Hocko wrote:
> > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> > [...]
> > > There is always an invisible lock called "scheduling priority". You can't
> > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > > handling the worst case.
> > 
> > And that problem is all over the memory reclaim. You can get starved
> > to death and block other resources. And the memory reclaim is not the
> > only one.
> 
> I think it is an unwritten rule among kernel developers that no thread keeps
> consuming CPU resources forever. In the kernel world, doing
> 
>   while (1);
> 
> is not permitted. Likewise, doing
> 
>   for (i = 0; i < very_large_value; i++)
>       do_something_which_does_not_yield_CPU_to_others();

There is nothing like that proposed in this series.

> has to be avoided in order to prevent lockup problems. We are required to
> yield the CPU to others when we are waiting for somebody else to make progress.
> It is the page allocator that is refusing to yield the CPU to those who need it.

And we do that in the reclaim path.

> Since the OOM reaper kernel thread "has normal priority" and "can run on any
> CPU", the possibility of it failing to run is lower than that of an OOM victim
> thread which "has idle priority" and "can run on only a limited set of CPUs".
> You are trying to add a dependency on such a thread, and I'm saying that adding
> a dependency on such a thread increases the possibility of lockup.

Sigh. No, this is not the case. All this patch series does is that we
hand over to the exiting task once it doesn't block on any locks
anymore. If the thread is low priority then it is quite likely that the
oom reaper is done by the time the victim even reaches the exit path.

> Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
> busy with realtime threads waiting for the OOM reaper kernel thread to make
> progress. In that case, we had better stop relying on asynchronous memory
> reclaim, and switch to direct OOM reaping by allocating threads.
> 
> But what I demonstrated is that
> 
>         /*
>          * the exit path is guaranteed to finish the memory tear down
>          * without any unbound blocking at this stage so make it clear
>          * to the oom_reaper
>          */
> 
> becomes a lie even when only one CPU was busy with realtime threads waiting
> for an idle thread to make progress. If the page allocator stops telling a
> lie that "an OOM victim is making progress on behalf of me", we can avoid
> the lockup.

OK, I stopped reading right here. This discussion is pointless. Once you
busy-loop all CPUs you are screwed. Are you going to blame a filesystem
when a code path holding an important lock is preempted by high priority
work and no further progress can be made? This is just ridiculous. What
you are arguing here is not fixable with the current upstream kernel.
Even your so beloved timeout based solution doesn't cope with that because
the oom reaper can be preempted for an unbounded amount of time. Your
argument just doesn't make much sense in the context of the current
kernel. Full stop.
Tetsuo Handa Nov. 16, 2018, 10:06 a.m. UTC | #6
On 2018/11/15 20:36, Michal Hocko wrote:
> On Thu 15-11-18 18:54:15, Tetsuo Handa wrote:
> > On 2018/11/14 19:16, Michal Hocko wrote:
> > > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> > > [...]
> > > > There is always an invisible lock called "scheduling priority". You can't
> > > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > > > handling the worst case.
> > > 
> > > And that problem is all over the memory reclaim. You can get starved
> > > to death and block other resources. And the memory reclaim is not the
> > > only one.
> > 
> > I think it is an unwritten rule among kernel developers that no thread keeps
> > consuming CPU resources forever. In the kernel world, doing
> > 
> >   while (1);
> > 
> > is not permitted. Likewise, doing
> > 
> >   for (i = 0; i < very_large_value; i++)
> >       do_something_which_does_not_yield_CPU_to_others();
> 
> There is nothing like that proposed in this series.
> 
> > has to be avoided in order to prevent lockup problems. We are required to
> > yield the CPU to others when we are waiting for somebody else to make progress.
> > It is the page allocator that is refusing to yield the CPU to those who need it.
> 
> And we do that in the reclaim path.
> 

No, we don't. Please explain why holding oom_lock versus not holding oom_lock,
as in the patch shown below, makes a difference (i.e. why holding oom_lock can
avoid lockups while not holding oom_lock fails to avoid them).

----------
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3137,50 +3137,53 @@ void exit_mmap(struct mm_struct *mm)
 	/*
 	 * oom_reaper cannot race with the page tables teardown but we
 	 * want to make sure that the exit path can take over the full
 	 * tear down when it is safe to do so
 	 */
 	if (oom) {
 		extern void my_setpriority(void);
 		down_write(&mm->mmap_sem);
 		__unlink_vmas(vma);
+		mutex_lock(&oom_lock);
 		/*
 		 * the exit path is guaranteed to finish the memory tear down
 		 * without any unbound blocking at this stage so make it clear
 		 * to the oom_reaper
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
 		my_setpriority();
 		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	} else {
 		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	}
 
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
+	if (oom) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
+	}
+
+	/*
 	 * Walk the list again, actually closing and freeing it,
 	 * with preemption enabled, without holding any MM locks.
 	 */
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
-
-	/*
-	 * Now that the full address space is torn down, make sure the
-	 * OOM killer skips over this task
-	 */
-	if (oom)
-		set_bit(MMF_OOM_SKIP, &mm->flags);
 }
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
  * then i_mmap_rwsem is taken here.
  */
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;

----------

> > Since the OOM reaper kernel thread "has normal priority" and "can run on any
> > CPU", the possibility of failing to run is lower than an OOM victim thread
> > which "has idle priority" and "can run on only limited CPU". You are trying
> > to add a dependency on such thread, and I'm saying that adding a dependency
> > on such thread increases possibility of lockup.
> 
> Sigh. No, this is not the case. All this patch series does is that we
> hand over to the exiting task once it doesn't block on any locks
> anymore. If the thread is low priority then it is quite likely that the
> oom reaper is done by the time the victim even reaches the exit path.

Not true. A thread executing exit_mmap() can change its priority at any moment.

Also, the OOM reaper kernel thread can fail to complete OOM reaping before
exit_mmap() does mm->mmap = NULL due to e.g. a down_read_trylock(&mm->mmap_sem)
failure, doing OOM reaping on other OOM victims in different OOM domains, or
being preempted due to scheduling priority. You will notice it by trying both

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and checking the frequency of printing "Handed over " messages.

> 
> > Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
> > busy with realtime threads waiting for the OOM reaper kernel thread to make
> > progress. In that case, we had better stop relying on asynchronous memory
> > reclaim, and switch to direct OOM reaping by allocating threads.
> > 
> > But what I demonstrated is that
> > 
> >         /*
> >          * the exit path is guaranteed to finish the memory tear down
> >          * without any unbound blocking at this stage so make it clear
> >          * to the oom_reaper
> >          */
> > 
> > becomes a lie even when only one CPU was busy with realtime threads waiting
> > for an idle thread to make progress. If the page allocator stops telling a
> > lie that "an OOM victim is making progress on behalf of me", we can avoid
> > the lockup.
> 
> OK, I stopped reading right here. This discussion is pointless. Once you
> busy loop all CPUs you are screwed.

Look at the log files. I made only 1 CPU (out of 8 CPUs) busy.

>                                     Are you going to blame a filesystem
> when a code path holding an important lock is preempted by high priority
> work and no further progress can be made?

I don't blame that case because it is doing something which is not a kernel bug.

>       This is just ridiculous. What you are arguing here is not fixable
> with the current upstream kernel.

I do blame the memory allocation case because it is doing something which is
a kernel bug, one which can be avoided if we stop telling a lie. No future
kernel is fixable as long as we keep telling that lie.

>                                   Even your so beloved timeout based
> solution doesn't cope with that because the oom reaper can be preempted
> for an unbounded amount of time.

Yes, that's the reason I suggest direct OOM reaping.

>                         Your argument just doesn't make much sense in
> the context of the current kernel. Full stop.

It won't come to a full stop at all. What I'm saying is "don't rely on a busy
polling loop". The reclaim path becomes a no-op for a thread executing
__free_pgtables() when a memcg OOM is waiting for that thread to complete
__free_pgtables() and set MMF_OOM_SKIP.