
mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

Message ID ca3da8b8-1bb5-c302-b190-fa6cebab58ca@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived
Series mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

Commit Message

Tetsuo Handa July 26, 2018, 11:06 a.m. UTC
Before applying "an OOM lockup mitigation patch", I want to apply this
"another OOM lockup avoidance" patch.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
(which was captured with

  --- a/mm/oom_kill.c
  +++ b/mm/oom_kill.c
  @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
   {
   	unsigned long freed = 0;
   	bool delay = false; /* if set, delay next allocation attempt */
  +	static unsigned long last_warned;
  +	if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
  +		pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
  +			current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
  +		last_warned = jiffies;
  +	}
   
   	oc->constraint = CONSTRAINT_NONE;
   	if (oom_killer_disabled)

in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is
calling out_of_memory() rather than erroneously failing to give up direct reclaim).

[  258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
[  268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
[  278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
[  288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
[  298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
[  308.527975] sysrq: SysRq : Show Memory
[  308.529713] Mem-Info:
[  308.530930] active_anon:855844 inactive_anon:2123 isolated_anon:0
[  308.530930]  active_file:7 inactive_file:12 isolated_file:0
[  308.530930]  unevictable:0 dirty:0 writeback:0 unstable:0
[  308.530930]  slab_reclaimable:3444 slab_unreclaimable:23008
[  308.530930]  mapped:1743 shmem:2272 pagetables:3991 bounce:0
[  308.530930]  free:21206 free_pcp:165 free_cma:0
[  308.542309] Node 0 active_anon:3423376kB inactive_anon:8492kB active_file:28kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:6972kB dirty:0kB writeback:0kB shmem:9088kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3227648kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  308.550495] Node 0 DMA free:14712kB min:288kB low:360kB high:432kB active_anon:1128kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  308.558420] lowmem_reserve[]: 0 2717 3607 3607
[  308.560197] Node 0 DMA32 free:53860kB min:50684kB low:63352kB high:76020kB active_anon:2727108kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2782536kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  308.568640] lowmem_reserve[]: 0 0 890 890
[  308.570396] Node 0 Normal free:16252kB min:16608kB low:20760kB high:24912kB active_anon:694864kB inactive_anon:8492kB active_file:44kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:911820kB mlocked:0kB kernel_stack:8080kB pagetables:15956kB bounce:0kB free_pcp:660kB local_pcp:660kB free_cma:0kB
[  308.580075] lowmem_reserve[]: 0 0 0 0
[  308.581827] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (M) 1*32kB (U) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14712kB
[  308.586271] Node 0 DMA32: 5*4kB (UM) 3*8kB (U) 5*16kB (U) 5*32kB (U) 5*64kB (U) 2*128kB (UM) 2*256kB (UM) 7*512kB (M) 4*1024kB (M) 2*2048kB (UM) 10*4096kB (UM) = 54108kB
[  308.591900] Node 0 Normal: 13*4kB (UM) 5*8kB (UM) 2*16kB (U) 74*32kB (UME) 23*64kB (UME) 6*128kB (UME) 5*256kB (U) 2*512kB (UM) 9*1024kB (M) 0*2048kB 0*4096kB = 16252kB
[  308.597637] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  308.600764] 2273 total pagecache pages
[  308.602712] 0 pages in swap cache
[  308.604532] Swap cache stats: add 0, delete 0, find 0/0
[  308.606843] Free swap  = 0kB
[  308.608632] Total swap = 0kB
[  308.610357] 1048422 pages RAM
[  308.612153] 0 pages HighMem/MovableOnly
[  308.614173] 120864 pages reserved
[  308.615994] 0 pages cma reserved
[  308.617811] 0 pages hwpoisoned

[  310.383005] kworker/0:0     R  running task    13504     5      2 0x80000000
[  310.385328] Workqueue: events_freezable_power_ disk_events_workfn
[  310.387578] Call Trace:
[  310.475050]  ? shrink_node+0xca/0x460
[  310.476614]  shrink_node+0xca/0x460
[  310.478129]  do_try_to_free_pages+0xcb/0x380
[  310.479848]  try_to_free_pages+0xbb/0xf0
[  310.481481]  __alloc_pages_slowpath+0x3c1/0xc50
[  310.483332]  __alloc_pages_nodemask+0x2a6/0x2c0
[  310.485130]  bio_copy_kern+0xcd/0x200
[  310.486710]  blk_rq_map_kern+0xb6/0x130
[  310.488317]  scsi_execute+0x64/0x250
[  310.489859]  sr_check_events+0x9a/0x2b0 [sr_mod]
[  310.491669]  ? __mutex_unlock_slowpath+0x46/0x2b0
[  310.493581]  cdrom_check_events+0xf/0x30 [cdrom]
[  310.495435]  sr_block_check_events+0x7c/0xb0 [sr_mod]
[  310.497434]  disk_check_events+0x5e/0x150
[  310.499172]  process_one_work+0x290/0x4a0
[  310.500878]  ? process_one_work+0x227/0x4a0
[  310.502591]  worker_thread+0x28/0x3d0
[  310.504184]  ? process_one_work+0x4a0/0x4a0
[  310.505916]  kthread+0x107/0x120
[  310.507384]  ? kthread_create_worker_on_cpu+0x70/0x70
[  310.509333]  ret_from_fork+0x24/0x30

[  324.960731] Showing busy workqueues and worker pools:
[  324.962577] workqueue events: flags=0x0
[  324.964137]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  324.966231]     pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, mmdrop_async_fn, check_corruption, console_callback
[  324.973425] workqueue events_freezable: flags=0x4
[  324.975247]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.977393]     pending: vmballoon_work [vmw_balloon]
[  324.979310] workqueue events_power_efficient: flags=0x80
[  324.981298]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  324.983543]     pending: gc_worker [nf_conntrack], fb_flashcursor, neigh_periodic_work, neigh_periodic_work, check_lifetime
[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482]     in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363]     pending: vmstat_update, drain_local_pages_wq BAR(498)
[  324.999977] workqueue ipv6_addrconf: flags=0x40008
[  325.001899]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  325.004092]     pending: addrconf_verify_work
[  325.005911] workqueue mpt_poll_0: flags=0x8
[  325.007686]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.009914]     pending: mpt_fault_reset_work [mptbase]
[  325.012044] workqueue xfs-cil/sda1: flags=0xc
[  325.013897]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.016190]     pending: xlog_cil_push_work [xfs] BAR(2344)
[  325.018354] workqueue xfs-reclaim/sda1: flags=0xc
[  325.020293]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.022549]     pending: xfs_reclaim_worker [xfs]
[  325.024540] workqueue xfs-sync/sda1: flags=0x4
[  325.026425]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.028691]     pending: xfs_log_worker [xfs]
[  325.030546] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=189s workers=4 idle: 977 65 13

[  427.593034] sysrq: SysRq : Show Memory
[  427.594680] Mem-Info:
[  427.595882] active_anon:855844 inactive_anon:2123 isolated_anon:0
[  427.595882]  active_file:7 inactive_file:12 isolated_file:0
[  427.595882]  unevictable:0 dirty:0 writeback:0 unstable:0
[  427.595882]  slab_reclaimable:3444 slab_unreclaimable:22960
[  427.595882]  mapped:1743 shmem:2272 pagetables:3991 bounce:0
[  427.595882]  free:21254 free_pcp:165 free_cma:0
[  427.607487] Node 0 active_anon:3423376kB inactive_anon:8492kB active_file:28kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:6972kB dirty:0kB writeback:0kB shmem:9088kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3227648kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  427.615694] Node 0 DMA free:14712kB min:288kB low:360kB high:432kB active_anon:1128kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  427.623632] lowmem_reserve[]: 0 2717 3607 3607
[  427.625423] Node 0 DMA32 free:53860kB min:50684kB low:63352kB high:76020kB active_anon:2727108kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2782536kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  427.634066] lowmem_reserve[]: 0 0 890 890
[  427.635829] Node 0 Normal free:16444kB min:16608kB low:20760kB high:24912kB active_anon:694864kB inactive_anon:8492kB active_file:44kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:911820kB mlocked:0kB kernel_stack:7444kB pagetables:15956kB bounce:0kB free_pcp:660kB local_pcp:660kB free_cma:0kB
[  427.645560] lowmem_reserve[]: 0 0 0 0
[  427.647320] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (M) 1*32kB (U) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14712kB
[  427.651757] Node 0 DMA32: 5*4kB (UM) 3*8kB (U) 5*16kB (U) 5*32kB (U) 5*64kB (U) 2*128kB (UM) 2*256kB (UM) 7*512kB (M) 4*1024kB (M) 2*2048kB (UM) 10*4096kB (UM) = 54108kB
[  427.657428] Node 0 Normal: 13*4kB (UM) 5*8kB (UM) 2*16kB (U) 81*32kB (UME) 23*64kB (UME) 6*128kB (UME) 5*256kB (U) 2*512kB (UM) 9*1024kB (M) 0*2048kB 0*4096kB = 16476kB
[  427.663144] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  427.666283] 2273 total pagecache pages
[  427.668249] 0 pages in swap cache
[  427.670085] Swap cache stats: add 0, delete 0, find 0/0
[  427.672416] Free swap  = 0kB
[  427.674256] Total swap = 0kB
[  427.676011] 1048422 pages RAM
[  427.677746] 0 pages HighMem/MovableOnly
[  427.679704] 120864 pages reserved
[  427.681526] 0 pages cma reserved
[  427.683371] 0 pages hwpoisoned

[  430.083584] kworker/0:0     R  running task    13504     5      2 0x80000000
[  430.085990] Workqueue: events_freezable_power_ disk_events_workfn
[  430.088175] Call Trace:
[  430.175214]  ? shrink_slab+0x240/0x2c0
[  430.176861]  shrink_node+0xe3/0x460
[  430.178402]  do_try_to_free_pages+0xcb/0x380
[  430.180110]  try_to_free_pages+0xbb/0xf0
[  430.181733]  __alloc_pages_slowpath+0x3c1/0xc50
[  430.183516]  __alloc_pages_nodemask+0x2a6/0x2c0
[  430.185292]  bio_copy_kern+0xcd/0x200
[  430.186847]  blk_rq_map_kern+0xb6/0x130
[  430.188475]  scsi_execute+0x64/0x250
[  430.190027]  sr_check_events+0x9a/0x2b0 [sr_mod]
[  430.191844]  ? __mutex_unlock_slowpath+0x46/0x2b0
[  430.193668]  cdrom_check_events+0xf/0x30 [cdrom]
[  430.195466]  sr_block_check_events+0x7c/0xb0 [sr_mod]
[  430.197383]  disk_check_events+0x5e/0x150
[  430.199038]  process_one_work+0x290/0x4a0
[  430.200712]  ? process_one_work+0x227/0x4a0
[  430.202413]  worker_thread+0x28/0x3d0
[  430.204003]  ? process_one_work+0x4a0/0x4a0
[  430.205757]  kthread+0x107/0x120
[  430.207282]  ? kthread_create_worker_on_cpu+0x70/0x70
[  430.209345]  ret_from_fork+0x24/0x30

[  444.206334] Showing busy workqueues and worker pools:
[  444.208472] workqueue events: flags=0x0
[  444.210193]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=15/256
[  444.212389]     pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, mmdrop_async_fn, check_corruption, console_callback, sysrq_reinject_alt_sysrq, moom_callback
[  444.220547] workqueue events_freezable: flags=0x4
[  444.222562]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.224852]     pending: vmballoon_work [vmw_balloon]
[  444.227022] workqueue events_power_efficient: flags=0x80
[  444.229103]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  444.231271]     pending: gc_worker [nf_conntrack], fb_flashcursor, neigh_periodic_work, neigh_periodic_work, check_lifetime
[  444.234824] workqueue events_freezable_power_: flags=0x84
[  444.236937]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.239138]     in-flight: 5:disk_events_workfn
[  444.241022] workqueue mm_percpu_wq: flags=0x8
[  444.242829]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  444.245057]     pending: vmstat_update, drain_local_pages_wq BAR(498)
[  444.247646] workqueue ipv6_addrconf: flags=0x40008
[  444.249582]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  444.251784]     pending: addrconf_verify_work
[  444.253620] workqueue mpt_poll_0: flags=0x8
[  444.255427]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.257666]     pending: mpt_fault_reset_work [mptbase]
[  444.259800] workqueue xfs-cil/sda1: flags=0xc
[  444.261646]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.263903]     pending: xlog_cil_push_work [xfs] BAR(2344)
[  444.266101] workqueue xfs-reclaim/sda1: flags=0xc
[  444.268104]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.270454]     pending: xfs_reclaim_worker [xfs]
[  444.272425] workqueue xfs-eofblocks/sda1: flags=0xc
[  444.274432]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.276729]     pending: xfs_eofblocks_worker [xfs]
[  444.278739] workqueue xfs-sync/sda1: flags=0x4
[  444.280641]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.282967]     pending: xfs_log_worker [xfs]
[  444.285195] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=309s workers=3 idle: 977 65

Since the patch shown below was suggested by Michal Hocko at
https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.

From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 Jul 2018 14:40:03 +0900
Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
only when there are too many dirty and writeback pages") and lock up the
system if OOM.

This is because we are implicitly counting on falling back to
schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
schedule_timeout_uninterruptible() in should_reclaim_retry() was not
called due to __zone_watermark_ok() == false.

However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
__GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
thread is doing __GFP_FS allocation. Tetsuo is observing that GFP_NOIO
allocation request from disk_events_workfn() is preventing other pending
works from starting.

Since should_reclaim_retry() should be a natural reschedule point,
let's do the short sleep for PF_WQ_WORKER threads unconditionally
in order to guarantee that other pending works are started.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/page_alloc.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

Comments

Michal Hocko July 26, 2018, 11:39 a.m. UTC | #1
On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
> Before applying "an OOM lockup mitigation patch", I want to apply this
> "another OOM lockup avoidance" patch.

I do not really see why. All these are borderline interesting as the
system is basically dead by the time you reach this state.

> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
> (which was captured with
> 
>   --- a/mm/oom_kill.c
>   +++ b/mm/oom_kill.c
>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
>    {
>    	unsigned long freed = 0;
>    	bool delay = false; /* if set, delay next allocation attempt */
>   +	static unsigned long last_warned;
>   +	if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
>   +		pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
>   +			current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
>   +		last_warned = jiffies;
>   +	}
>    
>    	oc->constraint = CONSTRAINT_NONE;
>    	if (oom_killer_disabled)
> 
> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is
> calling out_of_memory() rather than erroneously failing to give up direct reclaim).
> 
> [  258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> [  268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> [  278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> [  288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> [  298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0

Hmm, so there is no other memory allocation to trigger the oom or they
all just back off on the oom_lock trylock? In other words, what is
preventing the oom killer invocation?
 
[...]

> Since the patch shown below was suggested by Michal Hocko at
> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
> 
> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 26 Jul 2018 14:40:03 +0900
> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
> 
> Tetsuo Handa has reported that it is possible to bypass the short sleep
> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
> only when there are too many dirty and writeback pages") and lock up the
> system if OOM.
> 
> This is because we are implicitly counting on falling back to
> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
> called due to __zone_watermark_ok() == false.

How do we rely on that?

> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
> thread is doing __GFP_FS allocation.

I have a really hard time parsing and understanding this.

> Tetsuo is observing that GFP_NOIO
> allocation request from disk_events_workfn() is preventing other pending
> works from starting.

What about any other allocation from !PF_WQ_WORKER context? Why do those
not jump in?

> Since should_reclaim_retry() should be a natural reschedule point,
> let's do the short sleep for PF_WQ_WORKER threads unconditionally
> in order to guarantee that other pending works are started.

OK, this finally makes some sense. But it doesn't explain why it
handles the live lock.

> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Your s-o-b is missing again. I have already told you that previously
when you were posting the patch.

I do not mind this change per se but I am not happy about _your_ changelog.
It doesn't explain the underlying problem IMHO. Having a natural and
unconditional scheduling point in should_reclaim_retry is a reasonable
thing. But how the heck does it relate to the livelock you are seeing? So
namely the changelog should explain
1) why nobody is able to make forward progress during direct reclaim
2) why nobody is able to trigger oom killer as the last resort

> Cc: Roman Gushchin <guro@fb.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  mm/page_alloc.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a790ef4..0c2c0a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3922,6 +3922,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> +	bool ret = false;
>  
>  	/*
>  	 * Costly allocations might have made a progress but this doesn't mean
> @@ -3985,25 +3986,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  				}
>  			}
>  
> -			/*
> -			 * Memory allocation/reclaim might be called from a WQ
> -			 * context and the current implementation of the WQ
> -			 * concurrency control doesn't recognize that
> -			 * a particular WQ is congested if the worker thread is
> -			 * looping without ever sleeping. Therefore we have to
> -			 * do a short sleep here rather than calling
> -			 * cond_resched().
> -			 */
> -			if (current->flags & PF_WQ_WORKER)
> -				schedule_timeout_uninterruptible(1);
> -			else
> -				cond_resched();
> -
> -			return true;
> +			ret = true;
> +			goto out;
>  		}
>  	}
>  
> -	return false;
> +out:
> +	/*
> +	 * Memory allocation/reclaim might be called from a WQ
> +	 * context and the current implementation of the WQ
> +	 * concurrency control doesn't recognize that
> +	 * a particular WQ is congested if the worker thread is
> +	 * looping without ever sleeping. Therefore we have to
> +	 * do a short sleep here rather than calling
> +	 * cond_resched().
> +	 */
> +	if (current->flags & PF_WQ_WORKER)
> +		schedule_timeout_uninterruptible(1);
> +	else
> +		cond_resched();
> +	return ret;
>  }
>  
>  static inline bool
> -- 
> 1.8.3.1
>
Tetsuo Handa July 27, 2018, 3:47 p.m. UTC | #2
On 2018/07/26 20:39, Michal Hocko wrote:
> On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
>> Before applying "an OOM lockup mitigation patch", I want to apply this
>> "another OOM lockup avoidance" patch.
> 
> I do not really see why. All these are borderline interesting as the
> system is basically dead by the time you reach this state.

I don't like your "borderline interesting". We still don't have a watchdog
which tells us that something went wrong. Thus, "borderline interesting" has to be
examined and fixed.

> 
>> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
>> (which was captured with
>>
>>   --- a/mm/oom_kill.c
>>   +++ b/mm/oom_kill.c
>>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
>>    {
>>    	unsigned long freed = 0;
>>    	bool delay = false; /* if set, delay next allocation attempt */
>>   +	static unsigned long last_warned;
>>   +	if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
>>   +		pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
>>   +			current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
>>   +		last_warned = jiffies;
>>   +	}
>>    
>>    	oc->constraint = CONSTRAINT_NONE;
>>    	if (oom_killer_disabled)
>>
>> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is
>> calling out_of_memory() rather than erroneously failing to give up direct reclaim).
>>
>> [  258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> 
> Hmm, so there is no other memory allocation to trigger the oom or they
> all just back off on the oom_lock trylock? In other words, what is
> preventing the oom killer invocation?

All __GFP_FS allocations got stuck at direct reclaim or workqueue.

[  310.265293] systemd         D12240     1      0 0x00000000
[  310.268118] Call Trace:
[  310.269894]  ? __schedule+0x245/0x7f0
[  310.271901]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.274187]  schedule+0x23/0x80
[  310.275965]  schedule_preempt_disabled+0x5/0x10
[  310.278107]  __mutex_lock+0x3f5/0x9b0
[  310.280070]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.282451]  xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.284730]  ? lock_acquire+0x51/0x70
[  310.286678]  ? xfs_ail_push_all+0x13/0x60 [xfs]
[  310.288844]  xfs_reclaim_inodes_nr+0x2c/0x40 [xfs]
[  310.291092]  super_cache_scan+0x14b/0x1a0
[  310.293131]  do_shrink_slab+0xfc/0x190
[  310.295082]  shrink_slab+0x240/0x2c0
[  310.297028]  shrink_node+0xe3/0x460
[  310.298899]  do_try_to_free_pages+0xcb/0x380
[  310.300945]  try_to_free_pages+0xbb/0xf0
[  310.302874]  __alloc_pages_slowpath+0x3c1/0xc50
[  310.304931]  ? lock_acquire+0x51/0x70
[  310.306753]  ? set_page_refcounted+0x40/0x40
[  310.308686]  __alloc_pages_nodemask+0x2a6/0x2c0
[  310.310663]  filemap_fault+0x437/0x8e0
[  310.312446]  ? lock_acquire+0x51/0x70
[  310.314150]  ? xfs_ilock+0x86/0x190 [xfs]
[  310.315915]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  310.318089]  __do_fault+0x13/0x126
[  310.319696]  __handle_mm_fault+0xc8d/0x11c0
[  310.321493]  handle_mm_fault+0x17a/0x390
[  310.323156]  __do_page_fault+0x24c/0x4d0
[  310.324782]  do_page_fault+0x2a/0x70
[  310.326289]  ? page_fault+0x8/0x30
[  310.327745]  page_fault+0x1e/0x30

[  313.714537] systemd-journal D12600   498      1 0x00000000
[  313.716538] Call Trace:
[  313.717683]  ? __schedule+0x245/0x7f0
[  313.719221]  schedule+0x23/0x80
[  313.720586]  schedule_timeout+0x21f/0x400
[  313.722163]  wait_for_completion+0xb2/0x130
[  313.723750]  ? wake_up_q+0x70/0x70
[  313.725134]  flush_work+0x1db/0x2b0
[  313.726535]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  313.728336]  ? page_alloc_cpu_dead+0x30/0x30
[  313.729936]  drain_all_pages+0x174/0x1e0
[  313.731446]  __alloc_pages_slowpath+0x428/0xc50
[  313.733120]  __alloc_pages_nodemask+0x2a6/0x2c0
[  313.734826]  filemap_fault+0x437/0x8e0
[  313.736296]  ? lock_acquire+0x51/0x70
[  313.737769]  ? xfs_ilock+0x86/0x190 [xfs]
[  313.739309]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  313.741291]  __do_fault+0x13/0x126
[  313.742667]  __handle_mm_fault+0xc8d/0x11c0
[  313.744245]  handle_mm_fault+0x17a/0x390
[  313.745755]  __do_page_fault+0x24c/0x4d0
[  313.747290]  do_page_fault+0x2a/0x70
[  313.748728]  ? page_fault+0x8/0x30
[  313.750148]  page_fault+0x1e/0x30

[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482]     in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363]     pending: vmstat_update, drain_local_pages_wq BAR(498)

[  378.984241] INFO: task systemd-journal:498 blocked for more than 120 seconds.
[  378.986657]       Not tainted 4.18.0-rc6-next-20180724+ #248
[  378.988741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  378.991343] systemd-journal D12600   498      1 0x00000000
[  378.993434] Call Trace:
[  378.994643]  ? __schedule+0x245/0x7f0
[  378.996158]  schedule+0x23/0x80
[  378.997528]  schedule_timeout+0x21f/0x400
[  378.999112]  wait_for_completion+0xb2/0x130
[  379.000728]  ? wake_up_q+0x70/0x70
[  379.002170]  flush_work+0x1db/0x2b0
[  379.003667]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  379.005507]  ? page_alloc_cpu_dead+0x30/0x30
[  379.007176]  drain_all_pages+0x174/0x1e0
[  379.008836]  __alloc_pages_slowpath+0x428/0xc50
[  379.010625]  __alloc_pages_nodemask+0x2a6/0x2c0
[  379.012364]  filemap_fault+0x437/0x8e0
[  379.013881]  ? lock_acquire+0x51/0x70
[  379.015373]  ? xfs_ilock+0x86/0x190 [xfs]
[  379.016950]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  379.019019]  __do_fault+0x13/0x126
[  379.020439]  __handle_mm_fault+0xc8d/0x11c0
[  379.022055]  handle_mm_fault+0x17a/0x390
[  379.023623]  __do_page_fault+0x24c/0x4d0
[  379.025167]  do_page_fault+0x2a/0x70
[  379.026621]  ? page_fault+0x8/0x30
[  379.028022]  page_fault+0x1e/0x30

[  432.759743] systemd-journal D12600   498      1 0x00000000
[  432.761748] Call Trace:
[  432.762892]  ? __schedule+0x245/0x7f0
[  432.764352]  schedule+0x23/0x80
[  432.765673]  schedule_timeout+0x21f/0x400
[  432.767428]  wait_for_completion+0xb2/0x130
[  432.769207]  ? wake_up_q+0x70/0x70
[  432.770764]  flush_work+0x1db/0x2b0
[  432.772223]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  432.774125]  ? page_alloc_cpu_dead+0x30/0x30
[  432.775873]  drain_all_pages+0x174/0x1e0
[  432.777528]  __alloc_pages_slowpath+0x428/0xc50
[  432.779306]  __alloc_pages_nodemask+0x2a6/0x2c0
[  432.780977]  filemap_fault+0x437/0x8e0
[  432.782503]  ? lock_acquire+0x51/0x70
[  432.784012]  ? xfs_ilock+0x86/0x190 [xfs]
[  432.785556]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  432.787537]  __do_fault+0x13/0x126
[  432.788930]  __handle_mm_fault+0xc8d/0x11c0
[  432.790570]  handle_mm_fault+0x17a/0x390
[  432.792162]  __do_page_fault+0x24c/0x4d0
[  432.793773]  do_page_fault+0x2a/0x70
[  432.795196]  ? page_fault+0x8/0x30
[  432.796572]  page_fault+0x1e/0x30

[  444.234824] workqueue events_freezable_power_: flags=0x84
[  444.236937]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.239138]     in-flight: 5:disk_events_workfn
[  444.241022] workqueue mm_percpu_wq: flags=0x8
[  444.242829]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  444.245057]     pending: vmstat_update, drain_local_pages_wq BAR(498)

>  
> [...]
> 
>> Since the patch shown below was suggested by Michal Hocko at
>> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
>>
>> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Thu, 26 Jul 2018 14:40:03 +0900
>> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
>>
>> Tetsuo Handa has reported that it is possible to bypass the short sleep
>> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
>> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
>> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
>> only when there are too many dirty and writeback pages") and lock up the
>> system if OOM.
>>
>> This is because we are implicitly counting on falling back to
>> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
>> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
>> called due to __zone_watermark_ok() == false.
> 
> How do we rely on that?

Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible()
in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS
allocation (PID=498) is waiting for cannot be started.
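
For reference, the sleep being relied on is the oom_lock trylock path in
__alloc_pages_may_oom(). A simplified sketch, paraphrasing mm/page_alloc.c
of this era (not the exact source):

  static inline struct page *
  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
                        const struct alloc_context *ac,
                        unsigned long *did_some_progress)
  {
          struct page *page = NULL;

          *did_some_progress = 0;

          /*
           * If somebody else already holds oom_lock, they are presumably
           * handling the OOM for us: sleep for one jiffy instead of
           * invoking the OOM killer here.  PID=5 never reaches this
           * sleep, because its own trylock always succeeds.
           */
          if (!mutex_trylock(&oom_lock)) {
                  *did_some_progress = 1;
                  schedule_timeout_uninterruptible(1);
                  return NULL;
          }

          /* ... last-ditch get_page_from_freelist(), then out_of_memory() ... */

          mutex_unlock(&oom_lock);
          return page;
  }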

> 
>> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
>> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
>> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
>> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
>> thread is doing __GFP_FS allocation.
> 
> I have a really hard time parsing and understanding this.

You can write as you like.

> 
>> Tetsuo is observing that GFP_NOIO
>> allocation request from disk_events_workfn() is preventing other pending
>> works from starting.
> 
> What about any other allocation from !PF_WQ_WORKER context? Why do those
> not jump in?

All __GFP_FS allocations got stuck at direct reclaim or workqueue.

> 
>> Since should_reclaim_retry() should be a natural reschedule point,
>> let's do the short sleep for PF_WQ_WORKER threads unconditionally
>> in order to guarantee that other pending works are started.
> 
> OK, this finally makes some sense. But it doesn't explain why it
> handles the live lock.

As explained above, if disk_events_workfn() (PID=5) explicitly sleeps,
vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will
start, and unblock PID=498 which is waiting for drain_local_pages_wq and
allow PID=498 to invoke the OOM killer.
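
The dependency is visible in drain_all_pages(). A simplified sketch,
paraphrasing mm/page_alloc.c of this era (not the exact source):

  void drain_all_pages(struct zone *zone)
  {
          int cpu;

          /* ... compute cpus_with_pcps: the CPUs with pages on pcp lists ... */

          /* Queue one drain work item per CPU on mm_percpu_wq. */
          for_each_cpu(cpu, &cpus_with_pcps) {
                  struct work_struct *work = per_cpu_ptr(&pcpu_drain, cpu);

                  INIT_WORK(work, drain_local_pages_wq);
                  queue_work_on(cpu, mm_percpu_wq, work);
          }

          /*
           * flush_work() sleeps until drain_local_pages_wq has run on
           * every queued CPU.  This is where PID=498 waits: the work
           * cannot run on CPU 0 while the only running worker there
           * (PID=5) loops without ever sleeping.
           */
          for_each_cpu(cpu, &cpus_with_pcps)
                  flush_work(per_cpu_ptr(&pcpu_drain, cpu));
  }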

> 
>> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Your s-o-b is missing again. I have already told you that previously
> when you were posting the patch.

I'm waiting for you to post this change with your wording.

> 
> I do not mind this change per se but I am not happy about _your_ changelog.
> It doesn't explain the underlying problem IMHO. Having a natural and
> unconditional scheduling point in should_reclaim_retry is a reasonable
> thing. But how the heck does it relate to the livelock you are seeing? So
> namely the changelog should explain
> 1) why nobody is able to make forward progress during direct reclaim

Because a GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM
workqueues from starting: it never called schedule_timeout_*(), because
mutex_trylock(&oom_lock) never failed, and the trylock never failed
because nobody else could call __alloc_pages_may_oom().

> 2) why nobody is able to trigger oom killer as the last resort

Because only one !__GFP_FS allocating thread which did not get stuck at
direct reclaim was able to call __alloc_pages_may_oom().
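
For reference, the no-op case is the !__GFP_FS bailout near the top of
out_of_memory(). A simplified sketch, paraphrasing mm/oom_kill.c of this
era (not the exact source):

  bool out_of_memory(struct oom_control *oc)
  {
          /* ... */

          /*
           * The OOM killer does not compensate for IO-less reclaim.
           * A !__GFP_FS request bails out here without killing anything,
           * so the GFP_NOIO allocation from disk_events_workfn() can take
           * oom_lock again and again without ever freeing memory.
           */
          if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
                  return true;

          /* ... otherwise select and kill a victim ... */
          return true;
  }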
Michal Hocko July 30, 2018, 9:32 a.m. UTC | #3
On Sat 28-07-18 00:47:40, Tetsuo Handa wrote:
> On 2018/07/26 20:39, Michal Hocko wrote:
> > On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
> >> Before applying "an OOM lockup mitigation patch", I want to apply this
> >> "another OOM lockup avoidance" patch.
> > 
> > I do not really see why. All these are borderline interesting as the
> > system is basically dead by the time you reach this state.
> 
> I don't like your "borderline interesting". We still don't have a watchdog
> which tells us that something went wrong. Thus, "borderline interesting" has to be
> examined and fixed.

No question about that. Bugs should be fixed. But this doesn't look like
something we should panic about and rush half-baked or not fully
understood fixes.
 
> >> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
> >> (which was captured with
> >>
> >>   --- a/mm/oom_kill.c
> >>   +++ b/mm/oom_kill.c
> >>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
> >>    {
> >>    	unsigned long freed = 0;
> >>    	bool delay = false; /* if set, delay next allocation attempt */
> >>   +	static unsigned long last_warned;
> >>   +	if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
> >>   +		pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
> >>   +			current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
> >>   +		last_warned = jiffies;
> >>   +	}
> >>    
> >>    	oc->constraint = CONSTRAINT_NONE;
> >>    	if (oom_killer_disabled)
> >>
> >> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is
> >> calling out_of_memory() rather than erroneously failing to give up direct reclaim).
> >>
> >> [  258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> >> [  268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> >> [  278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> >> [  288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> >> [  298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> > 
> > Hmm, so there is no other memory allocation to trigger the oom or they
> > all just back off on the oom_lock trylock? In other words, what is
> > preventing the oom killer invocation?
> 
> All __GFP_FS allocations got stuck at direct reclaim or workqueue.

OK, I see. This is important information which was missing in the
previous examination.

[...]

> >> Since the patch shown below was suggested by Michal Hocko at
> >> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
> >>
> >> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
> >> From: Michal Hocko <mhocko@suse.com>
> >> Date: Thu, 26 Jul 2018 14:40:03 +0900
> >> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
> >>
> >> Tetsuo Handa has reported that it is possible to bypass the short sleep
> >> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
> >> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
> >> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
> >> only when there are too many dirty and writeback pages") and lock up the
> >> system if OOM.
> >>
> >> This is because we are implicitly counting on falling back to
> >> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
> >> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
> >> called due to __zone_watermark_ok() == false.
> > 
> > How do we rely on that?
> 
> Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible()
> in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS
> allocation (PID=498) is waiting for cannot be started.

Now you are losing me again. What is going on with
drain_local_pages_wq? I can see that all __GFP_FS allocations are
waiting for a completion which depends on the kworker context to run but
I fail to see why drain_local_pages_wq matters. The memory on the pcp
lists is not accounted to NR_FREE_PAGES IIRC but that should be a
relatively small amount of memory so this cannot be a fundamental
problem here.

> >> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
> >> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
> >> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
> >> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
> >> thread is doing __GFP_FS allocation.
> > 
> > I have a really hard time parsing and understanding this.
> 
> You can write as you like.

I can as soon as I understand what is going on.

> >> Tetsuo is observing that GFP_NOIO
> >> allocation request from disk_events_workfn() is preventing other pending
> >> works from starting.
> > 
> > What about any other allocation from !PF_WQ_WORKER context? Why do those
> > not jump in?
> 
> All __GFP_FS allocations got stuck at direct reclaim or workqueue.
> 
> > 
> >> Since should_reclaim_retry() should be a natural reschedule point,
> >> let's do the short sleep for PF_WQ_WORKER threads unconditionally
> >> in order to guarantee that other pending works are started.
> > 
> > OK, this finally makes some sense. But it doesn't explain why it
> > handles the live lock.
> 
> As explained above, if disk_events_workfn() (PID=5) explicitly sleeps,
> vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will
> start, and unblock PID=498 which is waiting for drain_local_pages_wq and
> allow PID=498 to invoke the OOM killer.

[  313.714537] systemd-journal D12600   498      1 0x00000000
[  313.716538] Call Trace:
[  313.717683]  ? __schedule+0x245/0x7f0
[  313.719221]  schedule+0x23/0x80
[  313.720586]  schedule_timeout+0x21f/0x400
[  313.722163]  wait_for_completion+0xb2/0x130
[  313.723750]  ? wake_up_q+0x70/0x70
[  313.725134]  flush_work+0x1db/0x2b0
[  313.726535]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  313.728336]  ? page_alloc_cpu_dead+0x30/0x30
[  313.729936]  drain_all_pages+0x174/0x1e0
[  313.731446]  __alloc_pages_slowpath+0x428/0xc50
[  313.733120]  __alloc_pages_nodemask+0x2a6/0x2c0
[  313.734826]  filemap_fault+0x437/0x8e0
[  313.736296]  ? lock_acquire+0x51/0x70
[  313.737769]  ? xfs_ilock+0x86/0x190 [xfs]
[  313.739309]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  313.741291]  __do_fault+0x13/0x126
[  313.742667]  __handle_mm_fault+0xc8d/0x11c0
[  313.744245]  handle_mm_fault+0x17a/0x390
[  313.745755]  __do_page_fault+0x24c/0x4d0
[  313.747290]  do_page_fault+0x2a/0x70
[  313.748728]  ? page_fault+0x8/0x30
[  313.750148]  page_fault+0x1e/0x30

This one is waiting for draining and we are in mm_percpu_wq WQ context
which has its rescuer so no other activity can block us for ever. So
this certainly shouldn't deadlock. It can be dead slow but well, this is
what you will get when you shoot your system to death.

So there must be something more going on...
 
> >> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >> Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > Your s-o-b is missing again. I have already told you that previously
> > when you were posting the patch.
> 
> I'm waiting for you to post this change with your wording.
> 
> > 
> > I do not mind this change per se but I am not happy about _your_ changelog.
> > It doesn't explain the underlying problem IMHO. Having a natural and
> > unconditional scheduling point in should_reclaim_retry is a reasonable
> > thing. But how the heck does it relate to the livelock you are seeing? So
> > namely the changelog should explain
> > 1) why nobody is able to make forward progress during direct reclaim
> 
> Because a GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM
> workqueues from starting: it never called schedule_timeout_*(), because
> mutex_trylock(&oom_lock) never failed, and the trylock never failed
> because nobody else could call __alloc_pages_may_oom().

Why hasn't the rescuer helped here?

> > 2) why nobody is able to trigger oom killer as the last resort
> 
> Because only one !__GFP_FS allocating thread which did not get stuck at
> direct reclaim was able to call __alloc_pages_may_oom().

All the remaining allocations got stuck waiting for a completion which
depends on !WQ_MEM_RECLAIM workers, right?
Tetsuo Handa July 30, 2018, 2:34 p.m. UTC | #4
On 2018/07/30 18:32, Michal Hocko wrote:
>>>> Since the patch shown below was suggested by Michal Hocko at
>>>> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
>>>>
>>>> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
>>>> From: Michal Hocko <mhocko@suse.com>
>>>> Date: Thu, 26 Jul 2018 14:40:03 +0900
>>>> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
>>>>
>>>> Tetsuo Handa has reported that it is possible to bypass the short sleep
>>>> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
>>>> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
>>>> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
>>>> only when there are too many dirty and writeback pages") and lock up the
>>>> system if OOM.
>>>>
>>>> This is because we are implicitly counting on falling back to
>>>> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
>>>> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
>>>> called due to __zone_watermark_ok() == false.
>>>
>>> How do we rely on that?
>>
>> Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible()
>> in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS
>> allocation (PID=498) is waiting for cannot be started.
> 
> Now you are losing me again. What is going on with
> drain_local_pages_wq? I can see that all __GFP_FS allocations are
> waiting for a completion which depends on the kworker context to run but
> I fail to see why drain_local_pages_wq matters. The memory on the pcp
> lists is not accounted to NR_FREE_PAGES IIRC but that should be a
> relatively small amount of memory so this cannot be a fundamental
> problem here.

If you look at "busy workqueues and worker pools", you will find that not only
vmstat_update and drain_local_pages_wq (which is WQ_MEM_RECLAIM) but also
other works such as xlog_cil_push_work and xfs_reclaim_worker (which is also
WQ_MEM_RECLAIM) remain in "pending:" state.

[  324.960731] Showing busy workqueues and worker pools:
[  324.962577] workqueue events: flags=0x0
[  324.964137]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  324.966231]     pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, mmdrop_async_fn, check_corruption, console_callback
[  324.973425] workqueue events_freezable: flags=0x4
[  324.975247]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.977393]     pending: vmballoon_work [vmw_balloon]
[  324.979310] workqueue events_power_efficient: flags=0x80
[  324.981298]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  324.983543]     pending: gc_worker [nf_conntrack], fb_flashcursor, neigh_periodic_work, neigh_periodic_work, check_lifetime
[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482]     in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363]     pending: vmstat_update, drain_local_pages_wq BAR(498)
[  324.999977] workqueue ipv6_addrconf: flags=0x40008
[  325.001899]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  325.004092]     pending: addrconf_verify_work
[  325.005911] workqueue mpt_poll_0: flags=0x8
[  325.007686]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.009914]     pending: mpt_fault_reset_work [mptbase]
[  325.012044] workqueue xfs-cil/sda1: flags=0xc
[  325.013897]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.016190]     pending: xlog_cil_push_work [xfs] BAR(2344)
[  325.018354] workqueue xfs-reclaim/sda1: flags=0xc
[  325.020293]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.022549]     pending: xfs_reclaim_worker [xfs]
[  325.024540] workqueue xfs-sync/sda1: flags=0x4
[  325.026425]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.028691]     pending: xfs_log_worker [xfs]
[  325.030546] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=189s workers=4 idle: 977 65 13

[  444.206334] Showing busy workqueues and worker pools:
[  444.208472] workqueue events: flags=0x0
[  444.210193]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=15/256
[  444.212389]     pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, mmdrop_async_fn, check_corruption, console_callback, sysrq_reinject_alt_sysrq, moom_callback
[  444.220547] workqueue events_freezable: flags=0x4
[  444.222562]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.224852]     pending: vmballoon_work [vmw_balloon]
[  444.227022] workqueue events_power_efficient: flags=0x80
[  444.229103]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  444.231271]     pending: gc_worker [nf_conntrack], fb_flashcursor, neigh_periodic_work, neigh_periodic_work, check_lifetime
[  444.234824] workqueue events_freezable_power_: flags=0x84
[  444.236937]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.239138]     in-flight: 5:disk_events_workfn
[  444.241022] workqueue mm_percpu_wq: flags=0x8
[  444.242829]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  444.245057]     pending: vmstat_update, drain_local_pages_wq BAR(498)
[  444.247646] workqueue ipv6_addrconf: flags=0x40008
[  444.249582]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  444.251784]     pending: addrconf_verify_work
[  444.253620] workqueue mpt_poll_0: flags=0x8
[  444.255427]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.257666]     pending: mpt_fault_reset_work [mptbase]
[  444.259800] workqueue xfs-cil/sda1: flags=0xc
[  444.261646]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.263903]     pending: xlog_cil_push_work [xfs] BAR(2344)
[  444.266101] workqueue xfs-reclaim/sda1: flags=0xc
[  444.268104]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.270454]     pending: xfs_reclaim_worker [xfs]
[  444.272425] workqueue xfs-eofblocks/sda1: flags=0xc
[  444.274432]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.276729]     pending: xfs_eofblocks_worker [xfs]
[  444.278739] workqueue xfs-sync/sda1: flags=0x4
[  444.280641]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.282967]     pending: xfs_log_worker [xfs]
[  444.285195] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=309s workers=3 idle: 977 65

>>>> Tetsuo is observing that GFP_NOIO
>>>> allocation request from disk_events_workfn() is preventing other pending
>>>> works from starting.
>>>
>>> What about any other allocation from !PF_WQ_WORKER context? Why do those
>>> not jump in?
>>
>> All __GFP_FS allocations got stuck at direct reclaim or workqueue.
>>
>>>
>>>> Since should_reclaim_retry() should be a natural reschedule point,
>>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally
>>>> in order to guarantee that other pending works are started.
>>>
>>> OK, this finally makes some sense. But it doesn't explain why it
>>> handles the live lock.
>>
>> As explained above, if disk_events_workfn() (PID=5) explicitly sleeps,
>> vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will
>> start, and unblock PID=498 which is waiting for drain_local_pages_wq and
>> allow PID=498 to invoke the OOM killer.
> 
> [  313.714537] systemd-journal D12600   498      1 0x00000000
> [  313.716538] Call Trace:
> [  313.717683]  ? __schedule+0x245/0x7f0
> [  313.719221]  schedule+0x23/0x80
> [  313.720586]  schedule_timeout+0x21f/0x400
> [  313.722163]  wait_for_completion+0xb2/0x130
> [  313.723750]  ? wake_up_q+0x70/0x70
> [  313.725134]  flush_work+0x1db/0x2b0
> [  313.726535]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> [  313.728336]  ? page_alloc_cpu_dead+0x30/0x30
> [  313.729936]  drain_all_pages+0x174/0x1e0
> [  313.731446]  __alloc_pages_slowpath+0x428/0xc50
> [  313.733120]  __alloc_pages_nodemask+0x2a6/0x2c0
> [  313.734826]  filemap_fault+0x437/0x8e0
> [  313.736296]  ? lock_acquire+0x51/0x70
> [  313.737769]  ? xfs_ilock+0x86/0x190 [xfs]
> [  313.739309]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
> [  313.741291]  __do_fault+0x13/0x126
> [  313.742667]  __handle_mm_fault+0xc8d/0x11c0
> [  313.744245]  handle_mm_fault+0x17a/0x390
> [  313.745755]  __do_page_fault+0x24c/0x4d0
> [  313.747290]  do_page_fault+0x2a/0x70
> [  313.748728]  ? page_fault+0x8/0x30
> [  313.750148]  page_fault+0x1e/0x30
> 
> This one is waiting for draining and we are in mm_percpu_wq WQ context
> which has its rescuer so no other activity can block us for ever. So
> this certainly shouldn't deadlock. It can be dead slow but well, this is
> what you will get when you shoot your system to death.

We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun,
is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM
workqueues forever.
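
My understanding is that the workqueue concurrency management only notices
a stalled worker through a scheduler hook that fires on a real sleep. A
simplified sketch, paraphrasing kernel/workqueue.c of this era (not the
exact source):

  /* Called from __schedule() when a busy worker actually goes to sleep. */
  struct task_struct *wq_worker_sleeping(struct task_struct *task)
  {
          struct worker *worker = kthread_data(task);
          struct worker_pool *pool = worker->pool;

          /*
           * cond_resched() does not put the task to sleep, so this hook
           * never runs and nr_running never drops.  Only a real sleep
           * such as schedule_timeout_uninterruptible(1) lets the pool
           * wake an idle worker to run the pending work items.
           */
          if (atomic_dec_and_test(&pool->nr_running) &&
              !list_empty(&pool->worklist)) {
                  struct worker *next = first_idle_worker(pool);

                  return next ? next->task : NULL;
          }
          return NULL;
  }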

Did you count how many threads were running in the SysRq-t output?
If you had, you wouldn't say "when you shoot your system to death". :-(

(Tester processes stuck in D state are omitted from the excerpt below.
You can see that the system is almost idle.)

[  310.265293] systemd         D12240     1      0 0x00000000
[  310.345487] kthreadd        S14024     2      0 0x80000000
[  310.355813] rcu_gp          I15176     3      2 0x80000000
[  310.369297] rcu_par_gp      I15176     4      2 0x80000000
[  310.383005] kworker/0:0     R  running task    13504     5      2 0x80000000
[  310.385328] Workqueue: events_freezable_power_ disk_events_workfn
[  310.510953] kworker/0:0H    I14648     6      2 0x80000000
[  310.513024] Workqueue:            (null) (xfs-log/sda1)
[  310.527289] kworker/u256:0  I12816     7      2 0x80000000
[  310.529282] Workqueue:            (null) (events_unbound)
[  310.543584] mm_percpu_wq    I15176     8      2 0x80000000
[  310.557527] ksoftirqd/0     S14664     9      2 0x80000000
[  310.573041] rcu_sched       I14360    10      2 0x80000000
[  310.586582] rcu_bh          I15192    11      2 0x80000000
[  310.603372] migration/0     S14664    12      2 0x80000000
[  310.617362] kworker/0:1     I13312    13      2 0x80000000
[  310.619297] Workqueue:            (null) (cgroup_pidlist_destroy)
[  310.633055] cpuhp/0         S13920    14      2 0x80000000
[  310.647134] cpuhp/1         P13992    15      2 0x80000000
[  310.662731] migration/1     P14664    16      2 0x80000000
[  310.678559] ksoftirqd/1     P14656    17      2 0x80000000
[  310.694278] kworker/1:0     I13808    18      2 0x80000000
[  310.696221] Workqueue:            (null) (events)
[  310.709753] kworker/1:0H    I14648    19      2 0x80000000
[  310.711701] Workqueue:            (null) (kblockd)
[  310.725166] cpuhp/2         P13992    20      2 0x80000000
[  310.740708] migration/2     P14664    21      2 0x80000000
[  310.756255] ksoftirqd/2     P14664    22      2 0x80000000
[  310.771832] kworker/2:0     I14648    23      2 0x80000000
[  310.773789] Workqueue:            (null) (events)
[  310.787215] kworker/2:0H    I14480    24      2 0x80000000
[  310.789156] Workqueue:            (null) (xfs-log/sda1)
[  310.802735] cpuhp/3         P14056    25      2 0x80000000
[  310.818282] migration/3     P14664    26      2 0x80000000
[  310.833806] ksoftirqd/3     P14600    27      2 0x80000000
[  310.849330] kworker/3:0     I14648    28      2 0x80000000
[  310.851269] Workqueue:            (null) (events)
[  310.864855] kworker/3:0H    I15192    29      2 0x80000000
[  310.878479] cpuhp/4         P13696    30      2 0x80000000
[  310.894006] migration/4     P14664    31      2 0x80000000
[  310.909697] ksoftirqd/4     P14664    32      2 0x80000000
[  310.925546] kworker/4:0     I14464    33      2 0x80000000
[  310.927482] Workqueue:            (null) (events)
[  310.940917] kworker/4:0H    I14808    34      2 0x80000000
[  310.942872] Workqueue:            (null) (kblockd)
[  310.956699] cpuhp/5         P13960    35      2 0x80000000
[  310.972274] migration/5     P14664    36      2 0x80000000
[  310.987850] ksoftirqd/5     P14664    37      2 0x80000000
[  311.003382] kworker/5:0     I14400    38      2 0x80000000
[  311.005322] Workqueue:            (null) (events)
[  311.018758] kworker/5:0H    I15184    39      2 0x80000000
[  311.032629] cpuhp/6         P14024    40      2 0x80000000
[  311.048609] migration/6     P14664    41      2 0x80000000
[  311.064377] ksoftirqd/6     P14600    42      2 0x80000000
[  311.079858] kworker/6:0     I14616    43      2 0x80000000
[  311.081800] Workqueue:            (null) (cgroup_pidlist_destroy)
[  311.095610] kworker/6:0H    I14616    44      2 0x80000000
[  311.097549] Workqueue:            (null) (kblockd)
[  311.111033] cpuhp/7         P13992    45      2 0x80000000
[  311.127639] migration/7     P14664    46      2 0x80000000
[  311.143797] ksoftirqd/7     P14856    47      2 0x80000000
[  311.159383] kworker/7:0     I14576    48      2 0x80000000
[  311.161324] Workqueue:            (null) (events)
[  311.174750] kworker/7:0H    I14808    49      2 0x80000000
[  311.176690] Workqueue:            (null) (xfs-log/sda1)
[  311.190305] kdevtmpfs       S13552    50      2 0x80000000
[  311.205199] netns           I15152    51      2 0x80000000
[  311.218718] kworker/1:1     I14808    52      2 0x80000000
[  311.220630] Workqueue:            (null) (ata_sff)
[  311.234003] kauditd         S14640    53      2 0x80000000
[  311.251581] khungtaskd      S14368    54      2 0x80000000
[  311.268573] oom_reaper      S14464    55      2 0x80000000
[  311.283432] writeback       I15152    56      2 0x80000000
[  311.296825] kcompactd0      S14856    57      2 0x80000000
[  311.311451] ksmd            S15032    58      2 0x80000000
[  311.331290] khugepaged      D13832    59      2 0x80000000
[  311.402468] crypto          I15192    60      2 0x80000000
[  311.421397] kintegrityd     I15152    61      2 0x80000000
[  311.435436] kblockd         I15192    62      2 0x80000000
[  311.449503] kworker/3:1     I14264    63      2 0x80000000
[  311.451420] Workqueue:            (null) (events)
[  311.464883] kworker/5:1     I13568    64      2 0x80000000
[  311.466846] Workqueue:            (null) (events)
[  311.480784] kworker/0:2     I13104    65      2 0x80000000
[  311.482780] Workqueue:            (null) (cgroup_pidlist_destroy)
[  311.497643] md              I15152    66      2 0x80000000
[  311.513651] edac-poller     I15192    67      2 0x80000000
[  311.528223] devfreq_wq      I15152    68      2 0x80000000
[  311.542243] watchdogd       S15192    69      2 0x80000000
[  311.557627] kworker/2:1     I13848    70      2 0x80000000
[  311.559557] Workqueue:            (null) (cgroup_pidlist_destroy)
[  311.573158] kswapd0         S11528    71      2 0x80000000
[  311.588482] kworker/4:1     I13976    82      2 0x80000000
[  311.590418] Workqueue:            (null) (events)
[  311.603683] kthrotld        I14712    88      2 0x80000000
[  311.617936] kworker/6:1     I14280    89      2 0x80000000
[  311.619855] Workqueue:            (null) (events)
[  311.633386] kworker/7:1     I13976    90      2 0x80000000
[  311.635309] Workqueue:            (null) (events_power_efficient)
[  311.649001] acpi_thermal_pm I14856    91      2 0x80000000
[  311.662606] kworker/u256:1  I13168    92      2 0x80000000
[  311.664551] Workqueue:            (null) (events_unbound)
[  311.678115] kmpath_rdacd    I15176    93      2 0x80000000
[  311.691756] kaluad          I14712    94      2 0x80000000
[  311.705839] nvme-wq         I14712    95      2 0x80000000
[  311.720293] nvme-reset-wq   I14712    96      2 0x80000000
[  311.734252] nvme-delete-wq  I15176    97      2 0x80000000
[  311.747801] kworker/u256:2  I14088    98      2 0x80000000
[  311.749766] Workqueue:            (null) (events_unbound)
[  311.763307] kworker/u256:3  I13560    99      2 0x80000000
[  311.765273] Workqueue:            (null) (events_unbound)
[  311.778791] ipv6_addrconf   I15176   100      2 0x80000000
[  311.792413] kworker/5:2     I14728   134      2 0x80000000
[  311.794381] Workqueue:            (null) (events)
[  311.809119] kworker/2:2     I13952   156      2 0x80000000
[  311.811554] Workqueue:            (null) (events)
[  311.829185] kworker/6:2     I14264   158      2 0x80000000
[  311.831332] Workqueue:            (null) (events)
[  311.844858] kworker/4:2     I14648   200      2 0x80000000
[  311.846876] Workqueue:            (null) (cgroup_pidlist_destroy)
[  311.860651] kworker/3:2     I14728   251      2 0x80000000
[  311.862598] Workqueue:            (null) (cgroup_pidlist_destroy)
[  311.876388] kworker/7:2     I13520   282      2 0x80000000
[  311.878336] Workqueue:            (null) (events)
[  311.891701] ata_sff         I15152   284      2 0x80000000
[  311.905381] scsi_eh_0       S14504   285      2 0x80000000
[  311.922693] scsi_tmf_0      I14952   286      2 0x80000000
[  311.936327] scsi_eh_1       S13560   287      2 0x80000000
[  311.953512] scsi_tmf_1      I14952   288      2 0x80000000
[  311.967137] mpt_poll_0      I15152   289      2 0x80000000
[  311.980810] mpt/0           I14952   290      2 0x80000000
[  311.994421] scsi_eh_2       S13320   291      2 0x80000000
[  312.011558] scsi_tmf_2      I15152   292      2 0x80000000
[  312.026613] scsi_eh_3       S13656   293      2 0x80000000
[  312.046380] scsi_tmf_3      I15152   294      2 0x80000000
[  312.063536] scsi_eh_4       S13656   295      2 0x80000000
[  312.082705] scsi_tmf_4      I15176   296      2 0x80000000
[  312.097973] scsi_eh_5       S13320   297      2 0x80000000
[  312.117232] scsi_tmf_5      I15152   298      2 0x80000000
[  312.132573] scsi_eh_6       S13320   299      2 0x80000000
[  312.151843] scsi_tmf_6      I15152   300      2 0x80000000
[  312.167109] ttm_swap        I15176   301      2 0x80000000
[  312.182415] scsi_eh_7       S13320   302      2 0x80000000
[  312.201595] scsi_tmf_7      I15064   303      2 0x80000000
[  312.216695] scsi_eh_8       S13320   304      2 0x80000000
[  312.236470] scsi_tmf_8      I14856   305      2 0x80000000
[  312.251403] scsi_eh_9       S13656   306      2 0x80000000
[  312.270443] scsi_tmf_9      I15152   307      2 0x80000000
[  312.284642] irq/16-vmwgfx   S14784   308      2 0x80000000
[  312.301877] scsi_eh_10      S13320   309      2 0x80000000
[  312.319679] scsi_tmf_10     I15152   310      2 0x80000000
[  312.333461] scsi_eh_11      S13320   311      2 0x80000000
[  312.350681] scsi_tmf_11     I15176   312      2 0x80000000
[  312.364444] scsi_eh_12      S13656   313      2 0x80000000
[  312.382093] scsi_tmf_12     I15152   314      2 0x80000000
[  312.395952] scsi_eh_13      S13656   315      2 0x80000000
[  312.413378] scsi_tmf_13     I15032   316      2 0x80000000
[  312.427103] scsi_eh_14      S13624   317      2 0x80000000
[  312.444286] scsi_tmf_14     I15176   318      2 0x80000000
[  312.457978] scsi_eh_15      S15192   319      2 0x80000000
[  312.477516] scsi_tmf_15     I15176   320      2 0x80000000
[  312.491641] scsi_eh_16      S13656   321      2 0x80000000
[  312.509443] scsi_tmf_16     I15032   322      2 0x80000000
[  312.523573] scsi_eh_17      S13320   323      2 0x80000000
[  312.540983] scsi_tmf_17     I15152   324      2 0x80000000
[  312.554847] scsi_eh_18      S13320   325      2 0x80000000
[  312.572062] scsi_tmf_18     I15152   326      2 0x80000000
[  312.585722] scsi_eh_19      S13320   327      2 0x80000000
[  312.603012] scsi_tmf_19     I14952   328      2 0x80000000
[  312.616683] scsi_eh_20      S13320   329      2 0x80000000
[  312.633915] scsi_tmf_20     I14856   330      2 0x80000000
[  312.647594] scsi_eh_21      S13624   331      2 0x80000000
[  312.664791] scsi_tmf_21     I15152   332      2 0x80000000
[  312.678495] scsi_eh_22      S13320   333      2 0x80000000
[  312.695697] scsi_tmf_22     I15176   334      2 0x80000000
[  312.709366] scsi_eh_23      S13656   335      2 0x80000000
[  312.738358] scsi_tmf_23     I14984   336      2 0x80000000
[  312.752608] scsi_eh_24      S13320   337      2 0x80000000
[  312.770582] scsi_tmf_24     I15176   338      2 0x80000000
[  312.784878] scsi_eh_25      S13624   339      2 0x80000000
[  312.802234] scsi_tmf_25     I14984   340      2 0x80000000
[  312.815906] scsi_eh_26      S13624   341      2 0x80000000
[  312.833209] scsi_tmf_26     I14984   342      2 0x80000000
[  312.847120] scsi_eh_27      S13656   343      2 0x80000000
[  312.864431] scsi_tmf_27     I14984   344      2 0x80000000
[  312.878111] scsi_eh_28      S13656   345      2 0x80000000
[  312.895298] scsi_tmf_28     I15152   346      2 0x80000000
[  312.908979] scsi_eh_29      S13656   348      2 0x80000000
[  312.926273] scsi_tmf_29     I15152   349      2 0x80000000
[  312.939917] scsi_eh_30      S13656   350      2 0x80000000
[  312.957305] scsi_tmf_30     I15152   351      2 0x80000000
[  312.970959] scsi_eh_31      S13008   352      2 0x80000000
[  312.988336] scsi_tmf_31     I14856   353      2 0x80000000
[  313.002016] scsi_eh_32      S13624   354      2 0x80000000
[  313.019231] scsi_tmf_32     I15152   355      2 0x80000000
[  313.033116] kworker/u256:4  I14616   356      2 0x80000000
[  313.035122] Workqueue:            (null) (events_unbound)
[  313.048958] kworker/u256:5  I14408   357      2 0x80000000
[  313.050911] Workqueue:            (null) (events_unbound)
[  313.064605] kworker/u256:6  I14616   358      2 0x80000000
[  313.066588] Workqueue:            (null) (events_unbound)
[  313.080184] kworker/u256:7  I14392   359      2 0x80000000
[  313.082139] Workqueue:            (null) (events_unbound)
[  313.095693] kworker/u256:8  I14616   360      2 0x80000000
[  313.097625] Workqueue:            (null) (events_unbound)
[  313.111206] kworker/u256:9  I14088   362      2 0x80000000
[  313.113130] Workqueue:            (null) (events_unbound)
[  313.126701] kworker/u256:10 I14440   364      2 0x80000000
[  313.128630] Workqueue:            (null) (events_unbound)
[  313.142204] kworker/u256:11 I14440   365      2 0x80000000
[  313.144133] Workqueue:            (null) (events_unbound)
[  313.157836] kworker/u256:12 I14440   366      2 0x80000000
[  313.159834] Workqueue:            (null) (events_unbound)
[  313.173426] kworker/u256:13 I14440   367      2 0x80000000
[  313.175360] Workqueue:            (null) (events_unbound)
[  313.188953] kworker/u256:14 I14440   368      2 0x80000000
[  313.190883] Workqueue:            (null) (events_unbound)
[  313.204529] kworker/u256:15 I14440   369      2 0x80000000
[  313.206524] Workqueue:            (null) (events_unbound)
[  313.220106] kworker/u256:16 I14440   370      2 0x80000000
[  313.222040] Workqueue:            (null) (events_unbound)
[  313.235845] kworker/u256:17 I14440   371      2 0x80000000
[  313.238007] Workqueue:            (null) (events_unbound)
[  313.251879] kworker/u256:18 I14440   372      2 0x80000000
[  313.253880] Workqueue:            (null) (events_unbound)
[  313.268276] kworker/u256:19 I14440   373      2 0x80000000
[  313.270312] Workqueue:            (null) (events_unbound)
[  313.284299] kworker/u256:20 I14440   374      2 0x80000000
[  313.286254] Workqueue:            (null) (events_unbound)
[  313.300548] kworker/u256:21 I14440   375      2 0x80000000
[  313.302563] Workqueue:            (null) (events_unbound)
[  313.316925] kworker/u256:22 I14440   376      2 0x80000000
[  313.318857] Workqueue:            (null) (events_unbound)
[  313.333454] kworker/u256:23 I14440   377      2 0x80000000
[  313.335460] Workqueue:            (null) (events_unbound)
[  313.349943] kworker/u256:24 I14392   378      2 0x80000000
[  313.351872] Workqueue:            (null) (events_unbound)
[  313.366316] kworker/u256:25 I14392   379      2 0x80000000
[  313.368369] Workqueue:            (null) (events_unbound)
[  313.382706] kworker/u256:26 I14440   380      2 0x80000000
[  313.384638] Workqueue:            (null) (events_unbound)
[  313.398913] kworker/u256:27 I14440   381      2 0x80000000
[  313.400995] Workqueue:            (null) (events_unbound)
[  313.415395] kworker/u256:28 I14440   382      2 0x80000000
[  313.417325] Workqueue:            (null) (events_unbound)
[  313.431863] kworker/u256:29 I14440   383      2 0x80000000
[  313.433880] Workqueue:            (null) (events_unbound)
[  313.448121] kworker/u256:30 I12872   384      2 0x80000000
[  313.450050] Workqueue:            (null) (events_unbound)
[  313.464862] kworker/u256:31 I10848   385      2 0x80000000
[  313.467082] Workqueue:            (null) (flush-8:0)
[  313.481995] kworker/u256:32 I14440   386      2 0x80000000
[  313.484013] Workqueue:            (null) (events_unbound)
[  313.498207] kworker/u256:33 I15032   387      2 0x80000000
[  313.500272] Workqueue:            (null) (events_unbound)
[  313.514300] kworker/1:2     I14728   411      2 0x80000000
[  313.516350] Workqueue:            (null) (cgroup_pidlist_destroy)
[  313.530409] kworker/6:1H    I15144   414      2 0x80000000
[  313.532422] Workqueue:            (null) (kblockd)
[  313.545957] xfsalloc        I15176   419      2 0x80000000
[  313.559714] xfs_mru_cache   I15176   420      2 0x80000000
[  313.573541] xfs-buf/sda1    I15176   421      2 0x80000000
[  313.587184] xfs-data/sda1   I15176   422      2 0x80000000
[  313.600830] xfs-conv/sda1   I15176   423      2 0x80000000
[  313.614464] xfs-cil/sda1    I15176   424      2 0x80000000
[  313.628114] xfs-reclaim/sda I15152   425      2 0x80000000
[  313.641772] xfs-log/sda1    I15176   426      2 0x80000000
[  313.655372] xfs-eofblocks/s I15176   427      2 0x80000000
[  313.668954] xfsaild/sda1    S13560   428      2 0x80000000
[  313.685728] kworker/7:1H    I14648   429      2 0x80000000
[  313.687693] Workqueue:            (null) (kblockd)
[  313.701018] kworker/1:1H    I14648   497      2 0x80000000
[  313.714537] systemd-journal D12600   498      1 0x00000000
[  313.767803] systemd-udevd   S12648   529      1 0x00000000
[  313.804733] auditd          D12808   546      1 0x00000000
[  313.874023] auditd          S13584   547      1 0x00000000
[  313.910947] kworker/4:1H    I15184   552      2 0x80000080
[  313.912888] Workqueue:            (null) (kblockd)
[  313.927331] polkitd         S12536   593      1 0x00000080
[  313.981730] gmain           S14032   604      1 0x00000080
[  314.041923] gdbus           S13760   605      1 0x00000080
[  314.094085] JS GC Helper    S14696   611      1 0x00000080
[  314.136686] JS Sour~ Thread S14696   616      1 0x00000080
[  314.180020] polkitd         S14200   621      1 0x00000080
[  314.238165] irqbalance      D12904   594      1 0x80000080
[  314.307023] avahi-daemon    S12632   597      1 0x00000080
[  314.357108] dbus-daemon     S12632   598      1 0x00000080
[  314.390871] avahi-daemon    S12888   603    597 0x00000080
[  314.439229] abrtd           S12488   606      1 0x00000080
[  314.503323] systemd-logind  D12904   608      1 0x00000080
[  314.572804] atd             S12728   612      1 0x00000080
[  314.608670] crond           D12536   614      1 0x00000080
[  314.677690] ksmtuned        S13312   619      1 0x00000080
[  314.714501] firewalld       S12456   623      1 0x00000080
[  314.767120] gmain           S14248   756      1 0x00000080
[  314.824569] NetworkManager  S11816   662      1 0x00000080
[  314.881313] gmain           D12544   669      1 0x00000080
[  314.949419] gdbus           S13432   672      1 0x00000080
[  315.000239] kworker/0:1H    I14096   742      2 0x80000080
[  315.002178] Workqueue:            (null) (kblockd)
[  315.015633] dhclient        S12328   759    662 0x00000080
[  315.082950] kworker/2:1H    I14480   768      2 0x80000080
[  315.084945] Workqueue:            (null) (xfs-log/sda1)
[  315.098501] tuned           S12552   949      1 0x00000080
[  315.165944] gmain           S14248   995      1 0x00000080
[  315.224133] tuned           D12328   996      1 0x80000080
[  315.292809] tuned           S14408   997      1 0x00000080
[  315.330274] tuned           S14248  1011      1 0x00000080
[  315.389366] smbd            S11288   951      1 0x00000080
[  315.427197] sshd            S12552   952      1 0x00000080
[  315.509092] rsyslogd        S12696   954      1 0x00000080
[  315.578028] in:imjournal    D12048   965      1 0x80000080
[  315.645481] rs:main Q:Reg   S12792   966      1 0x00000080
[  315.681641] sendmail        D13176   967      1 0x00000080
[  315.751526] kworker/0:3     I13376   977      2 0x80000080
[  315.753526] Workqueue:            (null) (ata_sff)
[  315.766860] agetty          S12904   989      1 0x00000080
[  315.810543] login           S12488   990      1 0x00000080
[  315.849297] sendmail        S14256   991      1 0x00000080
[  315.877525] smbd-notifyd    D12792  1012    951 0x00000080
[  315.947463] cleanupd        D13144  1013    951 0x00000080
[  316.016683] lpqd            S12888  1016    951 0x00000080
[  316.054304] bash            S12656  1019    990 0x00000080
[  316.092765] sleep           D12808  1614    619 0x00000080

[  429.965904] systemd         D12240     1      0 0x00000000
[  430.045240] kthreadd        S14024     2      0 0x80000000
[  430.055553] rcu_gp          I15176     3      2 0x80000000
[  430.069575] rcu_par_gp      I15176     4      2 0x80000000
[  430.083584] kworker/0:0     R  running task    13504     5      2 0x80000000
[  430.085990] Workqueue: events_freezable_power_ disk_events_workfn
[  430.210920] kworker/0:0H    I14648     6      2 0x80000000
[  430.212997] Workqueue:            (null) (xfs-log/sda1)
[  430.227747] mm_percpu_wq    I15176     8      2 0x80000000
[  430.242131] ksoftirqd/0     S14664     9      2 0x80000000
[  430.257645] rcu_sched       I14360    10      2 0x80000000
[  430.271300] rcu_bh          I15192    11      2 0x80000000
[  430.288804] migration/0     S14664    12      2 0x80000000
[  430.302736] cpuhp/0         S13920    14      2 0x80000000
[  430.316785] cpuhp/1         P13992    15      2 0x80000000
[  430.332349] migration/1     P14664    16      2 0x80000000
[  430.347965] ksoftirqd/1     P14656    17      2 0x80000000
[  430.363616] kworker/1:0     I13808    18      2 0x80000000
[  430.365571] Workqueue:            (null) (events)
[  430.379215] kworker/1:0H    I14648    19      2 0x80000000
[  430.381183] Workqueue:            (null) (kblockd)
[  430.394637] cpuhp/2         P13992    20      2 0x80000000
[  430.410289] migration/2     P14664    21      2 0x80000000
[  430.425871] ksoftirqd/2     P14664    22      2 0x80000000
[  430.441401] kworker/2:0H    I14480    24      2 0x80000000
[  430.443354] Workqueue:            (null) (xfs-log/sda1)
[  430.456923] cpuhp/3         P14056    25      2 0x80000000
[  430.472692] migration/3     P14664    26      2 0x80000000
[  430.488439] ksoftirqd/3     P14600    27      2 0x80000000
[  430.504382] kworker/3:0H    I15192    29      2 0x80000000
[  430.518385] cpuhp/4         P13696    30      2 0x80000000
[  430.534771] migration/4     P14664    31      2 0x80000000
[  430.550345] ksoftirqd/4     P14664    32      2 0x80000000
[  430.565951] kworker/4:0H    I14808    34      2 0x80000000
[  430.567894] Workqueue:            (null) (kblockd)
[  430.581313] cpuhp/5         P13960    35      2 0x80000000
[  430.596904] migration/5     P14664    36      2 0x80000000
[  430.612501] ksoftirqd/5     P14664    37      2 0x80000000
[  430.628142] kworker/5:0H    I15184    39      2 0x80000000
[  430.641797] cpuhp/6         P14024    40      2 0x80000000
[  430.657356] migration/6     P14664    41      2 0x80000000
[  430.672944] ksoftirqd/6     P14600    42      2 0x80000000
[  430.688512] kworker/6:0H    I14616    44      2 0x80000000
[  430.690475] Workqueue:            (null) (kblockd)
[  430.704778] cpuhp/7         P13992    45      2 0x80000000
[  430.721362] migration/7     P14664    46      2 0x80000000
[  430.737498] ksoftirqd/7     P14856    47      2 0x80000000
[  430.753602] kworker/7:0H    I14808    49      2 0x80000000
[  430.755618] Workqueue:            (null) (xfs-log/sda1)
[  430.769558] kdevtmpfs       S13552    50      2 0x80000000
[  430.784541] netns           I15152    51      2 0x80000000
[  430.798092] kauditd         S14640    53      2 0x80000000
[  430.816061] khungtaskd      S14368    54      2 0x80000000
[  430.832568] oom_reaper      S14464    55      2 0x80000000
[  430.847090] writeback       I15152    56      2 0x80000000
[  430.860445] kcompactd0      S14856    57      2 0x80000000
[  430.875004] ksmd            S15032    58      2 0x80000000
[  430.894691] khugepaged      D13832    59      2 0x80000000
[  430.941897] crypto          I15192    60      2 0x80000000
[  430.955462] kintegrityd     I15152    61      2 0x80000000
[  430.970103] kblockd         I15192    62      2 0x80000000
[  430.985679] kworker/3:1     I14264    63      2 0x80000000
[  430.987830] Workqueue:            (null) (events)
[  431.006463] kworker/5:1     I13568    64      2 0x80000000
[  431.008721] Workqueue:            (null) (events)
[  431.022691] kworker/0:2     I13104    65      2 0x80000000
[  431.024683] Workqueue:            (null) (cgroup_pidlist_destroy)
[  431.039039] md              I15152    66      2 0x80000000
[  431.052855] edac-poller     I15192    67      2 0x80000000
[  431.066622] devfreq_wq      I15152    68      2 0x80000000
[  431.082179] watchdogd       S15192    69      2 0x80000000
[  431.099273] kworker/2:1     I13848    70      2 0x80000000
[  431.101428] Workqueue:            (null) (cgroup_pidlist_destroy)
[  431.115249] kswapd0         S11528    71      2 0x80000000
[  431.130196] kworker/4:1     I13976    82      2 0x80000000
[  431.132110] Workqueue:            (null) (events)
[  431.145382] kthrotld        I14712    88      2 0x80000000
[  431.159018] kworker/6:1     I14280    89      2 0x80000000
[  431.160998] Workqueue:            (null) (events)
[  431.174300] kworker/7:1     I13976    90      2 0x80000000
[  431.176275] Workqueue:            (null) (events_power_efficient)
[  431.190169] acpi_thermal_pm I14856    91      2 0x80000000
[  431.204078] kmpath_rdacd    I15176    93      2 0x80000000
[  431.218355] kaluad          I14712    94      2 0x80000000
[  431.232512] nvme-wq         I14712    95      2 0x80000000
[  431.246066] nvme-reset-wq   I14712    96      2 0x80000000
[  431.259655] nvme-delete-wq  I15176    97      2 0x80000000
[  431.273200] ipv6_addrconf   I15176   100      2 0x80000000
[  431.287563] kworker/5:2     I14728   134      2 0x80000000
[  431.289707] Workqueue:            (null) (events)
[  431.303527] kworker/2:2     I13952   156      2 0x80000000
[  431.305447] Workqueue:            (null) (events)
[  431.318965] kworker/6:2     I14264   158      2 0x80000000
[  431.320882] Workqueue:            (null) (events)
[  431.334236] kworker/4:2     I14648   200      2 0x80000000
[  431.336153] Workqueue:            (null) (cgroup_pidlist_destroy)
[  431.349824] kworker/3:2     I14728   251      2 0x80000000
[  431.351745] Workqueue:            (null) (cgroup_pidlist_destroy)
[  431.365438] kworker/7:2     I13520   282      2 0x80000000
[  431.367360] Workqueue:            (null) (events)
[  431.383306] ata_sff         I15152   284      2 0x80000000
[  431.397949] scsi_eh_0       S14504   285      2 0x80000000
[  431.418707] scsi_tmf_0      I14952   286      2 0x80000000
[  431.432417] scsi_eh_1       S13560   287      2 0x80000000
[  431.449499] scsi_tmf_1      I14952   288      2 0x80000000
[  431.463159] mpt_poll_0      I15152   289      2 0x80000000
[  431.477194] mpt/0           I14952   290      2 0x80000000
[  431.491369] scsi_eh_2       S13320   291      2 0x80000000
[  431.509148] scsi_tmf_2      I15152   292      2 0x80000000
[  431.523419] scsi_eh_3       S13656   293      2 0x80000000
[  431.540836] scsi_tmf_3      I15152   294      2 0x80000000
[  431.554526] scsi_eh_4       S13656   295      2 0x80000000
[  431.573951] scsi_tmf_4      I15176   296      2 0x80000000
[  431.590636] scsi_eh_5       S13320   297      2 0x80000000
[  431.610073] scsi_tmf_5      I15152   298      2 0x80000000
[  431.625309] scsi_eh_6       S13320   299      2 0x80000000
[  431.645255] scsi_tmf_6      I15152   300      2 0x80000000
[  431.660850] ttm_swap        I15176   301      2 0x80000000
[  431.676590] scsi_eh_7       S13320   302      2 0x80000000
[  431.696192] scsi_tmf_7      I15064   303      2 0x80000000
[  431.711927] scsi_eh_8       S13320   304      2 0x80000000
[  431.731901] scsi_tmf_8      I14856   305      2 0x80000000
[  431.747401] scsi_eh_9       S13656   306      2 0x80000000
[  431.767158] scsi_tmf_9      I15152   307      2 0x80000000
[  431.782141] irq/16-vmwgfx   S14784   308      2 0x80000000
[  431.801432] scsi_eh_10      S13320   309      2 0x80000000
[  431.819869] scsi_tmf_10     I15152   310      2 0x80000000
[  431.834047] scsi_eh_11      S13320   311      2 0x80000000
[  431.851641] scsi_tmf_11     I15176   312      2 0x80000000
[  431.865316] scsi_eh_12      S13656   313      2 0x80000000
[  431.882579] scsi_tmf_12     I15152   314      2 0x80000000
[  431.896275] scsi_eh_13      S13656   315      2 0x80000000
[  431.913545] scsi_tmf_13     I15032   316      2 0x80000000
[  431.927339] scsi_eh_14      S13624   317      2 0x80000000
[  431.944787] scsi_tmf_14     I15176   318      2 0x80000000
[  431.958953] scsi_eh_15      S15192   319      2 0x80000000
[  431.977842] scsi_tmf_15     I15176   320      2 0x80000000
[  431.991927] scsi_eh_16      S13656   321      2 0x80000000
[  432.009353] scsi_tmf_16     I15032   322      2 0x80000000
[  432.023012] scsi_eh_17      S13320   323      2 0x80000000
[  432.040219] scsi_tmf_17     I15152   324      2 0x80000000
[  432.053920] scsi_eh_18      S13320   325      2 0x80000000
[  432.071121] scsi_tmf_18     I15152   326      2 0x80000000
[  432.084940] scsi_eh_19      S13320   327      2 0x80000000
[  432.102554] scsi_tmf_19     I14952   328      2 0x80000000
[  432.116239] scsi_eh_20      S13320   329      2 0x80000000
[  432.133434] scsi_tmf_20     I14856   330      2 0x80000000
[  432.147092] scsi_eh_21      S13624   331      2 0x80000000
[  432.164292] scsi_tmf_21     I15152   332      2 0x80000000
[  432.179198] scsi_eh_22      S13320   333      2 0x80000000
[  432.197105] scsi_tmf_22     I15176   334      2 0x80000000
[  432.211388] scsi_eh_23      S13656   335      2 0x80000000
[  432.228948] scsi_tmf_23     I14984   336      2 0x80000000
[  432.242745] scsi_eh_24      S13320   337      2 0x80000000
[  432.260114] scsi_tmf_24     I15176   338      2 0x80000000
[  432.273909] scsi_eh_25      S13624   339      2 0x80000000
[  432.292257] scsi_tmf_25     I14984   340      2 0x80000000
[  432.306667] scsi_eh_26      S13624   341      2 0x80000000
[  432.324014] scsi_tmf_26     I14984   342      2 0x80000000
[  432.337927] scsi_eh_27      S13656   343      2 0x80000000
[  432.355175] scsi_tmf_27     I14984   344      2 0x80000000
[  432.368970] scsi_eh_28      S13656   345      2 0x80000000
[  432.386277] scsi_tmf_28     I15152   346      2 0x80000000
[  432.400001] scsi_eh_29      S13656   348      2 0x80000000
[  432.417628] scsi_tmf_29     I15152   349      2 0x80000000
[  432.431300] scsi_eh_30      S13656   350      2 0x80000000
[  432.448517] scsi_tmf_30     I15152   351      2 0x80000000
[  432.462367] scsi_eh_31      S13008   352      2 0x80000000
[  432.480180] scsi_tmf_31     I14856   353      2 0x80000000
[  432.494334] scsi_eh_32      S13624   354      2 0x80000000
[  432.512183] scsi_tmf_32     I15152   355      2 0x80000000
[  432.526539] kworker/u256:30 I12872   384      2 0x80000000
[  432.528477] Workqueue:            (null) (events_unbound)
[  432.542276] kworker/u256:31 I10848   385      2 0x80000000
[  432.544216] Workqueue:            (null) (flush-8:0)
[  432.557869] kworker/1:2     I14728   411      2 0x80000000
[  432.559806] Workqueue:            (null) (cgroup_pidlist_destroy)
[  432.573637] kworker/6:1H    I15144   414      2 0x80000000
[  432.575578] Workqueue:            (null) (kblockd)
[  432.589008] xfsalloc        I15176   419      2 0x80000000
[  432.602765] xfs_mru_cache   I15176   420      2 0x80000000
[  432.616433] xfs-buf/sda1    I15176   421      2 0x80000000
[  432.630080] xfs-data/sda1   I15176   422      2 0x80000000
[  432.643719] xfs-conv/sda1   I15176   423      2 0x80000000
[  432.657334] xfs-cil/sda1    I15176   424      2 0x80000000
[  432.670935] xfs-reclaim/sda I15152   425      2 0x80000000
[  432.684559] xfs-log/sda1    I15176   426      2 0x80000000
[  432.698196] xfs-eofblocks/s I15176   427      2 0x80000000
[  432.712330] xfsaild/sda1    S13560   428      2 0x80000000
[  432.729342] kworker/7:1H    I14648   429      2 0x80000000
[  432.731277] Workqueue:            (null) (kblockd)
[  432.745674] kworker/1:1H    I14648   497      2 0x80000000
[  432.759743] systemd-journal D12600   498      1 0x00000000
[  432.815707] systemd-udevd   S12648   529      1 0x00000000
[  432.854484] auditd          D12808   546      1 0x00000000
[  432.928413] auditd          S13584   547      1 0x00000000
[  432.967315] kworker/4:1H    I15184   552      2 0x80000080
[  432.969349] Workqueue:            (null) (kblockd)
[  432.983507] polkitd         S12536   593      1 0x00000080
[  433.040141] gmain           S14032   604      1 0x00000080
[  433.103106] gdbus           S13760   605      1 0x00000080
[  433.154604] JS GC Helper    S14696   611      1 0x00000080
[  433.196161] JS Sour~ Thread S14696   616      1 0x00000080
[  433.240104] polkitd         S14200   621      1 0x00000080
[  433.299587] irqbalance      D12904   594      1 0x80000080
[  433.367588] avahi-daemon    S12632   597      1 0x00000080
[  433.417623] dbus-daemon     S12632   598      1 0x00000080
[  433.451459] avahi-daemon    S12888   603    597 0x00000080
[  433.499393] abrtd           S12488   606      1 0x00000080
[  433.562968] systemd-logind  D12904   608      1 0x00000080
[  433.632571] atd             S12728   612      1 0x00000080
[  433.668235] crond           D12536   614      1 0x00000080
[  433.737767] ksmtuned        S13312   619      1 0x00000080
[  433.775017] firewalld       S12456   623      1 0x00000080
[  433.828076] gmain           S14248   756      1 0x00000080
[  433.885072] NetworkManager  D11816   662      1 0x00000080
[  433.954132] gmain           D12544   669      1 0x00000080
[  434.023109] gdbus           S13432   672      1 0x00000080
[  434.073500] kworker/0:1H    I14096   742      2 0x80000080
[  434.075444] Workqueue:            (null) (kblockd)
[  434.088969] dhclient        S12328   759    662 0x00000080
[  434.157035] kworker/2:1H    I14480   768      2 0x80000080
[  434.159035] Workqueue:            (null) (xfs-log/sda1)
[  434.172628] tuned           S12552   949      1 0x00000080
[  434.240578] gmain           S14248   995      1 0x00000080
[  434.300681] tuned           D12328   996      1 0x80000080
[  434.369863] tuned           S14408   997      1 0x00000080
[  434.407096] tuned           S14248  1011      1 0x00000080
[  434.464385] smbd            S11288   951      1 0x00000080
[  434.502579] sshd            S12552   952      1 0x00000080
[  434.574892] rsyslogd        S12696   954      1 0x00000080
[  434.644123] in:imjournal    D12048   965      1 0x80000080
[  434.711494] rs:main Q:Reg   S12792   966      1 0x00000080
[  434.746928] sendmail        D13176   967      1 0x00000080
[  434.816872] kworker/0:3     I13376   977      2 0x80000080
[  434.818821] Workqueue:            (null) (ata_sff)
[  434.832423] agetty          S12904   989      1 0x00000080
[  434.877203] login           S12488   990      1 0x00000080
[  434.917384] sendmail        S14256   991      1 0x00000080
[  434.947588] smbd-notifyd    D12792  1012    951 0x00000080
[  435.021389] cleanupd        D13144  1013    951 0x00000080
[  435.094050] lpqd            S12888  1016    951 0x00000080
[  435.133174] bash            S12656  1019    990 0x00000080
[  435.172224] sleep           D12808  1614    619 0x00000080

>>>
>>> I do not mind this change per se but I am not happy about _your_ changelog.
>>> It doesn't explain the underlying problem IMHO. Having a natural and
>>> unconditional scheduling point in should_reclaim_retry is a reasonable
>>> thing. But how the hack it relates to the livelock you are seeing. So
>>> namely the changelog should explain
>>> 1) why nobody is able to make forward progress during direct reclaim
>>
>> Because a GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM
>> workqueues from starting: it never called schedule_timeout_*(), since
>> mutex_trylock(&oom_lock) never failed, since nobody else could call
>> __alloc_pages_may_oom().
> 
> Why hasn't the rescuer helped here?

The rescuer does not get a chance to run without schedule_timeout_*().

> 
>>> 2) why nobody is able to trigger oom killer as the last resort
>>
>> Because only the one !__GFP_FS allocating thread which did not get stuck in
>> direct reclaim was able to call __alloc_pages_may_oom().
> 
> All the remaining allocations got stuck waiting for completion which
> depends on !WQ_MEM_RECLAIM workers, right?
> 

A !WQ_MEM_RECLAIM worker that did not call schedule_timeout_*() is preventing
everyone who is waiting for WQ_MEM_RECLAIM workers from making progress.
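
For concreteness, the whole dependency chain boils down to something like the
sketch below (hogging_work_fn() is a made-up stand-in for disk_events_workfn();
this is not the actual reproducer):

  #include <linux/workqueue.h>
  #include <linux/gfp.h>

  /*
   * Queued on an ordinary (!WQ_MEM_RECLAIM) workqueue.  Under OOM the
   * single allocation below can loop inside the page allocator
   * indefinitely; with only cond_resched() in the retry path this
   * kworker never sleeps, the per-cpu pool stays busy, and pending
   * WQ_MEM_RECLAIM work items (e.g. drain_local_pages_wq) never start.
   */
  static void hogging_work_fn(struct work_struct *work)
  {
          struct page *page = alloc_page(GFP_NOIO);

          if (page)
                  __free_page(page);
  }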
Michal Hocko July 30, 2018, 2:46 p.m. UTC | #5
On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
> On 2018/07/30 18:32, Michal Hocko wrote:
[...]
> > This one is waiting for draining and we are in mm_percpu_wq WQ context
> > which has its rescuer so no other activity can block us for ever. So
> > this certainly shouldn't deadlock. It can be dead slow but well, this is
> > what you will get when you shoot your system to death.
> 
> We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun,
> is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM
> workqueues forever.

Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
guarantees. If you are right then the whole thing sounds quite fragile
to me TBH.

Anyway we would at least have an explanation for what you are seeing.
Tejun Heo July 30, 2018, 2:54 p.m. UTC | #6
Hello,

On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
> > On 2018/07/30 18:32, Michal Hocko wrote:
> [...]
> > > This one is waiting for draining and we are in mm_percpu_wq WQ context
> > > which has its rescuer so no other activity can block us for ever. So
> > > this certainly shouldn't deadlock. It can be dead slow but well, this is
> > > what you will get when you shoot your system to death.
> > 
> > We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun,
> > is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM
> > workqueues forever.
> 
> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
> guarantees. If you are right then the whole thing sounds quite fragile
> to me TBH.

Workqueue doesn't think the CPU is stalled as long as one of the
per-cpu kworkers is running.  The assumption is that kernel threads
are not supposed to be busy-looping indefinitely (and they really
shouldn't).  We can add a timeout mechanism to workqueue so that it
kicks off other kworkers if one of them is in the running state for too
long, but if there's an indefinite busy-loop condition in kernel
threads, we really should get rid of them, and the hung task watchdog is
pretty effective at finding these cases (at least with preemption
disabled).

Thanks.
Tetsuo Handa July 30, 2018, 3:25 p.m. UTC | #7
On 2018/07/30 23:54, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
>> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
>>> On 2018/07/30 18:32, Michal Hocko wrote:
>> [...]
>>>> This one is waiting for draining and we are in mm_percpu_wq WQ context
>>>> which has its rescuer so no other activity can block us for ever. So
>>>> this certainly shouldn't deadlock. It can be dead slow but well, this is
>>>> what you will get when you shoot your system to death.
>>>
>>> We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun,
>>> is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM
>>> workqueues forever.
>>
>> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
>> guarantees. If you are right then the whole thing sounds quite fragile
>> to me TBH.
> 
> Workqueue doesn't think the CPU is stalled as long as one of the
> per-cpu kworkers is running.  The assumption is that kernel threads
> are not supposed to be busy-looping indefinitely (and they really
> shouldn't).

WQ_MEM_RECLAIM guarantees that a "struct task_struct" is preallocated. But
WQ_MEM_RECLAIM does not guarantee that pending work is started as soon as
an item is queued. The same rule applies to both WQ_MEM_RECLAIM and
!WQ_MEM_RECLAIM workqueues regarding when a pending work item is started
(i.e. when schedule_timeout_*() is called).

Is this correct?

>              We can add a timeout mechanism to workqueue so that it
> kicks off other kworkers if one of them is in the running state for too
> long, but if there's an indefinite busy-loop condition in kernel
> threads, we really should get rid of them, and the hung task watchdog is
> pretty effective at finding these cases (at least with preemption
> disabled).

Currently the page allocator has a path which can loop forever with
only cond_resched().

> 
> Thanks.
>
Tejun Heo July 30, 2018, 3:44 p.m. UTC | #8
Hello,

On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM guarantees that a "struct task_struct" is preallocated. But
> WQ_MEM_RECLAIM does not guarantee that pending work is started as soon as
> an item is queued. The same rule applies to both WQ_MEM_RECLAIM and
> !WQ_MEM_RECLAIM workqueues regarding when a pending work item is started
> (i.e. when schedule_timeout_*() is called).
> 
> Is this correct?

WQ_MEM_RECLAIM guarantees that there's always going to be at least one
kworker running the workqueue.  But all per-cpu kworkers are subject
to concurrency-limited execution - i.e. if any per-cpu kworker is
actively running on a CPU, no further kworkers will be scheduled there.
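
In code terms, roughly (a sketch; "my_reclaim_wq" is just an example name,
not an existing workqueue):

  #include <linux/workqueue.h>

  /*
   * WQ_MEM_RECLAIM makes alloc_workqueue() preallocate a rescuer thread,
   * so this workqueue can always field at least one worker even under
   * memory pressure.  It does not lift the per-cpu concurrency limit:
   * if an already-running kworker on the target CPU never sleeps,
   * queued work items still wait.
   */
  static struct workqueue_struct *my_reclaim_wq;

  static int my_reclaim_wq_init(void)
  {
          my_reclaim_wq = alloc_workqueue("my_reclaim_wq",
                                          WQ_MEM_RECLAIM, 0);
          return my_reclaim_wq ? 0 : -ENOMEM;
  }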

> >              We can add a timeout mechanism to workqueue so that it
> > kicks off other kworkers if one of them is in the running state for too
> > long, but if there's an indefinite busy-loop condition in kernel
> > threads, we really should get rid of them, and the hung task watchdog is
> > pretty effective at finding these cases (at least with preemption
> > disabled).
> 
> Currently the page allocator has a path which can loop forever with
> only cond_resched().

Yeah, workqueue can choke on things like that and kthread indefinitely
busy looping doesn't do anybody any good.

Thanks.
Michal Hocko July 30, 2018, 6:51 p.m. UTC | #9
On Mon 30-07-18 08:44:24, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote:
> > WQ_MEM_RECLAIM guarantees that a "struct task_struct" is preallocated. But
> > WQ_MEM_RECLAIM does not guarantee that pending work is started as soon as
> > an item is queued. The same rule applies to both WQ_MEM_RECLAIM and
> > !WQ_MEM_RECLAIM workqueues regarding when a pending work item is started
> > (i.e. when schedule_timeout_*() is called).
> > 
> > Is this correct?
> 
> WQ_MEM_RECLAIM guarantees that there's always going to be at least one
> kworker running the workqueue.  But all per-cpu kworkers are subject
> to concurrency-limited execution - i.e. if any per-cpu kworker is
> actively running on a CPU, no further kworkers will be scheduled there.

Well, in an ideal world we would _use_ that pre-allocated kworker when
no others are available because they are all doing something that takes
a long time to accomplish. The page allocator can spend a lot of time
when we are struggling to death to get some memory.

> > >              We can add a timeout mechanism to workqueue so that it
> > > kicks off other kworkers if one of them is in the running state for too
> > > long, but if there's an indefinite busy-loop condition in kernel
> > > threads, we really should get rid of them, and the hung task watchdog is
> > > pretty effective at finding these cases (at least with preemption
> > > disabled).
> > 
> > Currently the page allocator has a path which can loop forever with
> > only cond_resched().
> 
Yeah, workqueue can choke on things like that, and a kthread busy-looping
indefinitely doesn't do anybody any good.

Yeah, I do agree. But this is much easier said than done ;) Sure,
we have that hack that sleeps rather than calling cond_resched() in the
page allocator. We can and will "fix" it to be unconditional in
should_reclaim_retry [1], but this whole thing is really subtle. It just
takes one misbehaving worker and something which is really important to
run will get stuck.

That being said I will post the patch with updated changelog recording
this.

[1] http://lkml.kernel.org/r/ca3da8b8-1bb5-c302-b190-fa6cebab58ca@I-love.SAKURA.ne.jp
Michal Hocko July 30, 2018, 7:10 p.m. UTC | #10
This change has been posted several times with some concerns about the
changelog. Originally I thought it is more of a "nice to have" thing
rather than a bug fix, later Tetsuo has taken over it but the changelog
was not really comprehensible so I reworded it. Let's see if this is
better.


From 9bbea6516bb99615aff5ba5699865aa2d48333cc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 Jul 2018 14:40:03 +0900
Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
 should_reclaim_retry().

Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and lock up the system if OOM.

The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to
run even when they have a rescuer available. Those workers might be
essential for reclaim to make forward progress, however. If we are
too unlucky, all the allocation requests can get stuck waiting for a
WQ_MEM_RECLAIM work item and the system is essentially stuck in an OOM
condition without much hope of moving on. Tetsuo has seen the reclaim
stuck on drain_local_pages_wq or xlog_cil_push_work (xfs). There might
be others.

Since should_reclaim_retry() should be a natural reschedule point,
let's do the short sleep for PF_WQ_WORKER threads unconditionally in
order to guarantee that other pending work items are started. This will
work around the problem and it is less fragile than hunting down when
the sleep is missed. E.g. we used to have a sleeping point in the oom
path but this has been removed recently because it caused other issues.
Having a single sleeping point is more robust.

Reported-and-debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/page_alloc.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..f56cc0958d09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3922,6 +3922,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	bool ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3986,26 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			goto out;
 		}
 	}
 
-	return false;
+out:
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+	return ret;
 }
 
 static inline bool
Tejun Heo July 30, 2018, 7:14 p.m. UTC | #11
Hello, Michal.

On Mon, Jul 30, 2018 at 08:51:10PM +0200, Michal Hocko wrote:
> > Yeah, workqueue can choke on things like that, and a kthread busy-looping
> > indefinitely doesn't do anybody any good.
> 
> Yeah, I do agree. But this is much easier said than done ;) Sure,
> we have that hack that sleeps rather than calling cond_resched() in the
> page allocator. We can and will "fix" it to be unconditional in
> should_reclaim_retry [1], but this whole thing is really subtle. It just
> takes one misbehaving worker and something which is really important to
> run will get stuck.

Oh yeah, I'm not saying the current behavior is ideal or anything, but
since the behavior was put in many years ago, it has only become a
problem a couple of times, and all cases were rather easy and obvious
fixes on the wq user side.  It shouldn't be difficult to add a timer
mechanism on top.  We might be able to simply extend the hang
detection mechanism to kick off all pending rescuers after detecting a
wq stall.  I'm wary about making it a part of normal operation
(i.e. a silent timeout).  Per-cpu kworkers really shouldn't busy-loop for
an extended period of time.
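
Very roughly, something like the sketch below (all names are hypothetical;
this is only an illustration of the idea, not actual workqueue code):

  #include <linux/timer.h>
  #include <linux/jiffies.h>

  struct pool_watchdog {
          struct timer_list timer;
          unsigned long last_activity;    /* stamped whenever a kworker sleeps */
  };

  static void pool_watchdog_fn(struct timer_list *t)
  {
          struct pool_watchdog *wd = from_timer(wd, t, timer);

          if (time_after(jiffies, wd->last_activity + 30 * HZ)) {
                  /*
                   * One kworker has monopolized the pool for too long;
                   * this is where the pending rescuers would be kicked.
                   */
          }
          mod_timer(&wd->timer, jiffies + 10 * HZ);
  }

  static void pool_watchdog_start(struct pool_watchdog *wd)
  {
          timer_setup(&wd->timer, pool_watchdog_fn, 0);
          wd->last_activity = jiffies;
          mod_timer(&wd->timer, jiffies + 10 * HZ);
  }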

Thanks.
Tetsuo Handa July 30, 2018, 9:01 p.m. UTC | #12
On 2018/07/31 4:10, Michal Hocko wrote:
> Since should_reclaim_retry() should be a natural reschedule point,
> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> order to guarantee that other pending work items are started. This will
> work around the problem and it is less fragile than hunting down when
> the sleep is missed. E.g. we used to have a sleeping point in the oom
> path but this has been removed recently because it caused other issues.
> Having a single sleeping point is more robust.

linux.git has not removed the sleeping point in the OOM path yet. Since removing the
sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
(And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
should be easy to rebase.)
Michal Hocko July 31, 2018, 5:09 a.m. UTC | #13
On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> On 2018/07/31 4:10, Michal Hocko wrote:
> > Since should_reclaim_retry() should be a natural reschedule point,
> > let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> > order to guarantee that other pending work items are started. This will
> > work around the problem and it is less fragile than hunting down when
> > the sleep is missed. E.g. we used to have a sleeping point in the oom
> > path but this has been removed recently because it caused other issues.
> > Having a single sleeping point is more robust.
> 
> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.

is this an {Acked,Reviewed,Tested}-by?

I will send the patch to Andrew if the patch is ok. 

> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
> should be easy to rebase.)

That is still a WIP so I would not lose sleep over it.
Tetsuo Handa July 31, 2018, 10:47 a.m. UTC | #14
On 2018/07/31 14:09, Michal Hocko wrote:
> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>> On 2018/07/31 4:10, Michal Hocko wrote:
>>> Since should_reclaim_retry() should be a natural reschedule point,
>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>> order to guarantee that other pending work items are started. This will
>>> work around the problem and it is less fragile than hunting down when
>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>> path but this has been removed recently because it caused other issues.
>>> Having a single sleeping point is more robust.
>>
>> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
> 
> is this an {Acked,Reviewed,Tested}-by?

I'm saying that "we used to have a sleeping point in the oom path but this has been
removed recently" is not true. You need to send that patch to linux.git first if you
want to refer to that patch in this patch.

> 
> I will send the patch to Andrew if the patch is ok. 

Andrew, can we send the "we used to have a sleeping point in the oom path but this has
been removed recently" patch to linux.git?

> 
>> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
>> should be easy to rebase.)
> 
> That is still a WIP so I would lose sleep over it.
>
Michal Hocko July 31, 2018, 11:15 a.m. UTC | #15
On Tue 31-07-18 19:47:45, Tetsuo Handa wrote:
> On 2018/07/31 14:09, Michal Hocko wrote:
> > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> >> On 2018/07/31 4:10, Michal Hocko wrote:
> >>> Since should_reclaim_retry() should be a natural reschedule point,
> >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> >>> order to guarantee that other pending work items are started. This will
> >>> work around the problem and it is less fragile than hunting down when
> >>> the sleep is missed. E.g. we used to have a sleeping point in the oom
> >>> path but this has been removed recently because it caused other issues.
> >>> Having a single sleeping point is more robust.
> >>
> >> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
> >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
> > 
> > is this an {Acked,Reviewed,Tested}-by?
> 
> I'm saying that "we used to have a sleeping point in the oom path but this has been
> removed recently" is not true. You need to send that patch to linux.git first if you
> want to refer to that patch in this patch.

That patch is already sitting in the mmotm tree and this one will go on top.
I do not really see any reason to rush it into Linus' tree. A dubious CVE
doesn't really raise the priority if you ask me.

On the other hand, having a confirmation, either of the above tags would
help to raise the credibility of the change.

> > I will send the patch to Andrew if the patch is ok. 
> 
> Andrew, can we send the "we used to have a sleeping point in the oom path but this has
> > been removed recently" patch to linux.git?

This can really wait for the next merge window IMHO.
Tetsuo Handa July 31, 2018, 11:30 a.m. UTC | #16
On 2018/07/31 20:15, Michal Hocko wrote:
>>> I will send the patch to Andrew if the patch is ok. 
>>
>> Andrew, can we send the "we used to have a sleeping point in the oom path but this has
>> been removed recently" patch to linux.git?
> 
> This can really wait for the next merge window IMHO.
> 

"mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that sleeping point.
Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in the next
merge window? I'm confused...
Michal Hocko July 31, 2018, 11:55 a.m. UTC | #17
On Tue 31-07-18 20:30:08, Tetsuo Handa wrote:
> On 2018/07/31 20:15, Michal Hocko wrote:
> >>> I will send the patch to Andrew if the patch is ok. 
> >>
> >> Andrew, can we send the "we used to have a sleeping point in the oom path but this has
> >> been removed recently" patch to linux.git?
> > 
> > This can really wait for the next merge window IMHO.
> > 
> 
> "mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that sleeping point.
> Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in the next
> merge window? I'm confused...

This has nothing to do with cgroup-aware OOM killer.
Tetsuo Handa Aug. 2, 2018, 10:05 p.m. UTC | #18
On 2018/07/31 14:09, Michal Hocko wrote:
> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>> On 2018/07/31 4:10, Michal Hocko wrote:
>>> Since should_reclaim_retry() should be a natural reschedule point,
>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>> order to guarantee that other pending work items are started. This will
>>> work around the problem and it is less fragile than hunting down when
>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>> path but this has been removed recently because it caused other issues.
>>> Having a single sleeping point is more robust.
>>
>> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
> 
> is this an {Acked,Reviewed,Tested}-by?
> 
> I will send the patch to Andrew if the patch is ok. 
> 
>> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
>> should be easy to rebase.)
> 
> That is still a WIP so I would not lose sleep over it.
> 

Now that Roman's cgroup-aware OOM killer patchset will be dropped from linux-next.git,
linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git.
Michal Hocko Aug. 3, 2018, 6:16 a.m. UTC | #19
On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
> On 2018/07/31 14:09, Michal Hocko wrote:
> > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> >> On 2018/07/31 4:10, Michal Hocko wrote:
> >>> Since should_reclaim_retry() should be a natural reschedule point,
> >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> >>> order to guarantee that other pending work items are started. This will
> >>> work around the problem and it is less fragile than hunting down when
> >>> the sleep is missed. E.g. we used to have a sleeping point in the oom
> >>> path but this has been removed recently because it caused other issues.
> >>> Having a single sleeping point is more robust.
> >>
> >> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
> >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
> > 
> > is this an {Acked,Reviewed,Tested}-by?
> > 
> > I will send the patch to Andrew if the patch is ok. 
> > 
> >> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
> >> should be easy to rebase.)
> > 
> > That is still a WIP so I would not lose sleep over it.
> > 
> 
> Now that Roman's cgroup-aware OOM killer patchset will be dropped from linux-next.git,
> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git.

I still haven't heard any explicit confirmation that the patch works for
your workload. Should I beg for it? Or do you simply not want to have
your stamp on the patch? If so, I can live with that, but this game of
hide and seek is not really a lot of fun.
Tetsuo Handa Aug. 21, 2018, 9:07 p.m. UTC | #20
On 2018/08/03 15:16, Michal Hocko wrote:
> On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
>> On 2018/07/31 14:09, Michal Hocko wrote:
>>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>>>> On 2018/07/31 4:10, Michal Hocko wrote:
>>>>> Since should_reclaim_retry() should be a natural reschedule point,
>>>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>>>> order to guarantee that other pending work items are started. This will
>>>>> work around the problem and it is less fragile than hunting down when
>>>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>>>> path but this has been removed recently because it caused other issues.
>>>>> Having a single sleeping point is more robust.
>>>>
>>>> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
>>>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
>>>
>>> is this an {Acked,Reviewed,Tested}-by?
>>>
>>> I will send the patch to Andrew if the patch is ok. 
>>>
>>>> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
>>>> should be easy to rebase.)
>>>
> >>> That is still a WIP so I would not lose sleep over it.
>>>
>>
>> Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git,
>> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git.
> 
> I still haven't heard any explicit confirmation that the patch works for
> your workload. Should I beg for it? Or do you simply not want to have
> your stamp on the patch? If so, I can live with that, but this game of
> hide and seek is not really a lot of fun.
> 

I noticed that the patch has not been sent to linux-next.git yet.
Please send it to linux-next.git without my stamp on the patch.

[   44.863590] Out of memory: Kill process 1071 (a.out) score 865 or sacrifice child
[   44.867666] Killed process 1817 (a.out) total-vm:5244kB, anon-rss:1040kB, file-rss:0kB, shmem-rss:0kB
[   44.872176] oom_reaper: reaped process 1817 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   91.698761] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 48s!
[   91.702313] Showing busy workqueues and worker pools:
[   91.705011] workqueue events: flags=0x0
[   91.707482]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[   91.710524]     pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, mmdrop_async_fn, check_corruption
[   91.717439] workqueue events_freezable: flags=0x4
[   91.720161]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.723304]     pending: vmballoon_work [vmw_balloon]
[   91.726167] workqueue events_power_efficient: flags=0x80
[   91.729139]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[   91.732253]     pending: fb_flashcursor, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work
[   91.736471] workqueue events_freezable_power_: flags=0x84
[   91.739546]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.742696]     in-flight: 2097:disk_events_workfn
[   91.745517] workqueue mm_percpu_wq: flags=0x8
[   91.748069]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[   91.751182]     pending: drain_local_pages_wq BAR(1830), vmstat_update
[   91.754661] workqueue mpt_poll_0: flags=0x8
[   91.757161]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.759958]     pending: mpt_fault_reset_work [mptbase]
[   91.762696] workqueue xfs-data/sda1: flags=0xc
[   91.765353]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[   91.768248]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[   91.771589] workqueue xfs-cil/sda1: flags=0xc
[   91.774009]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.776800]     pending: xlog_cil_push_work [xfs] BAR(703)
[   91.779464] workqueue xfs-reclaim/sda1: flags=0xc
[   91.782017]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.784599]     pending: xfs_reclaim_worker [xfs]
[   91.786930] workqueue xfs-sync/sda1: flags=0x4
[   91.789289]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.792075]     pending: xfs_log_worker [xfs]
[   91.794213] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=48s workers=4 idle: 52 13 5
[  121.906640] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 78s!
[  121.909572] Showing busy workqueues and worker pools:
[  121.911703] workqueue events: flags=0x0
[  121.913531]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[  121.915873]     pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, mmdrop_async_fn, check_corruption
[  121.921962] workqueue events_freezable: flags=0x4
[  121.924336]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.926941]     pending: vmballoon_work [vmw_balloon]
[  121.929226] workqueue events_power_efficient: flags=0x80
[  121.931554]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[  121.933983]     pending: fb_flashcursor, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work
[  121.937415] workqueue events_freezable_power_: flags=0x84
[  121.939719]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.942166]     in-flight: 2097:disk_events_workfn
[  121.944385] workqueue mm_percpu_wq: flags=0x8
[  121.946468]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  121.948975]     pending: drain_local_pages_wq BAR(1830), vmstat_update
[  121.951808] workqueue mpt_poll_0: flags=0x8
[  121.953864]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.956245]     pending: mpt_fault_reset_work [mptbase]
[  121.958505] workqueue xfs-data/sda1: flags=0xc
[  121.960514]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  121.962896]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  121.965682] workqueue xfs-cil/sda1: flags=0xc
[  121.967639]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.970004]     pending: xlog_cil_push_work [xfs] BAR(703)
[  121.972285] workqueue xfs-reclaim/sda1: flags=0xc
[  121.974339]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.976779]     pending: xfs_reclaim_worker [xfs]
[  121.978926] workqueue xfs-sync/sda1: flags=0x4
[  121.980997]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.983364]     pending: xfs_log_worker [xfs]
[  121.985326] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=78s workers=4 idle: 52 13 5
[  147.872620] sysrq: SysRq : Terminate All Tasks
Michal Hocko Aug. 22, 2018, 7:32 a.m. UTC | #21
On Wed 22-08-18 06:07:40, Tetsuo Handa wrote:
> On 2018/08/03 15:16, Michal Hocko wrote:
[...]
> >> Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git,
> >> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git.
> > 
> > I still haven't heard any explicit confirmation that the patch works for
> > your workload. Should I beg for it? Or do you simply not want to have
> > your stamp on the patch? If so, I can live with that, but this game of
> > hide and seek is not really a lot of fun.
> > 
> 
> I noticed that the patch has not been sent to linux-next.git yet.
> Please send it to linux-next.git without my stamp on the patch.

I plan to do so after the merge window closes.
David Rientjes Aug. 23, 2018, 8:06 p.m. UTC | #22
On Wed, 22 Aug 2018, Tetsuo Handa wrote:

> On 2018/08/03 15:16, Michal Hocko wrote:
> > On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
> >> On 2018/07/31 14:09, Michal Hocko wrote:
> >>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> >>>> On 2018/07/31 4:10, Michal Hocko wrote:
> >>>>> Since should_reclaim_retry() should be a natural reschedule point,
> >>>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> >>>>> order to guarantee that other pending work items are started. This will
> >>>>> work around this problem and it is less fragile than hunting down when
> >>>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
> >>>>> path but this has been removed recently because it caused other issues.
> >>>>> Having a single sleeping point is more robust.
> >>>>
> >>>> linux.git has not removed the sleeping point in the OOM path yet. Since removing the
> >>>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately.
> >>>
> >>> is this an {Acked,Reviewed,Tested}-by?
> >>>
> >>> I will send the patch to Andrew if the patch is ok. 
> >>>
> >>>> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it
> >>>> should be easy to rebase.)
> >>>
> >>> That is still a WIP so I would not lose sleep over it.
> >>>
> >>
> >> Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git,
> >> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git.
> > 
> > I still haven't heard any explicit confirmation that the patch works for
> > your workload. Should I beg for it? Or do you simply not want to have
> > your stamp on the patch? If so, I can live with that, but this game of
> > hide and seek is not really a lot of fun.
> > 
> 
> I noticed that the patch has not been sent to linux-next.git yet.
> Please send it to linux-next.git without my stamp on the patch.
> 

For those of us who are tracking CVE-2016-10723, which has persistently been 
labeled as "disputed" with no clear indication of what patches address 
it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
under oom_lock") and this patch are the intended mitigations?

A list of SHA1s for merged fixes and links to proposed patches to address 
this issue would be appreciated.
Tetsuo Handa Aug. 23, 2018, 9 p.m. UTC | #23
On 2018/08/24 5:06, David Rientjes wrote:
> For those of us who are tracking CVE-2016-10723, which has persistently been 
> labeled as "disputed" with no clear indication of what patches address 
> it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> under oom_lock") and this patch are the intended mitigations?
> 
> A list of SHA1s for merged fixes and links to proposed patches to address 
> this issue would be appreciated.
> 

Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
mitigation for CVE-2016-10723.

"[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
should_reclaim_retry()." is independent from CVE-2016-10723.

We haven't made sure that the OOM reaper / exit_mmap() will get enough CPU
time. For example, under a load of concurrently allocating realtime-priority
threads, the OOM reaper takes about 1800 milliseconds, whereas direct OOM
reaping takes only a few milliseconds.
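
For reference, that kind of load can be approximated with the sketch below.
This is hypothetical (the thread count and priority are arbitrary, SCHED_FIFO
needs CAP_SYS_NICE, and it is not the exact program I measured with); it only
illustrates how realtime eaters starve SCHED_NORMAL kernel threads such as
the OOM reaper:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Allocate and touch memory in a tight loop, never yielding the CPU. */
static void *rt_eater(void *unused)
{
	while (1) {
		char *p = malloc(1048576);
		if (p)
			memset(p, 0, 1048576);
	}
	return NULL;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 1 };
	pthread_attr_t attr;
	pthread_t th;
	int i;

	pthread_attr_init(&attr);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	/* SCHED_FIFO eaters keep every CPU busy (needs CAP_SYS_NICE). */
	for (i = 0; i < 16; i++)
		pthread_create(&th, &attr, rt_eater, NULL);
	pause();
	return 0;
}

With the default RT throttling (kernel.sched_rt_runtime_us), SCHED_NORMAL
tasks are left only about 5% of CPU time, which is consistent with reaping
taking on the order of seconds rather than milliseconds.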

Regards.
David Rientjes Aug. 23, 2018, 10:45 p.m. UTC | #24
On Fri, 24 Aug 2018, Tetsuo Handa wrote:

> > For those of us who are tracking CVE-2016-10723, which has persistently been 
> > labeled as "disputed" with no clear indication of what patches address 
> > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> > under oom_lock") and this patch are the intended mitigations?
> > 
> > A list of SHA1s for merged fixes and links to proposed patches to address 
> > this issue would be appreciated.
> > 
> 
> Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
> mitigation for CVE-2016-10723.
> 
> "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
> should_reclaim_retry()." is independent of CVE-2016-10723.
> 

Thanks, Tetsuo.  Should commit af5679fbc669 ("mm, oom: remove oom_lock 
from oom_reaper") also be added to the list for CVE-2016-10723?
Tetsuo Handa Aug. 24, 2018, 12:31 a.m. UTC | #25
David Rientjes wrote:
> On Fri, 24 Aug 2018, Tetsuo Handa wrote:
> 
> > > For those of us who are tracking CVE-2016-10723, which has persistently been 
> > > labeled as "disputed" with no clear indication of what patches address 
> > > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> > > under oom_lock") and this patch are the intended mitigations?
> > > 
> > > A list of SHA1s for merged fixes and links to proposed patches to address 
> > > this issue would be appreciated.
> > > 
> > 
> > Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
> > mitigation for CVE-2016-10723.
> > 
> > "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
> > should_reclaim_retry()." is independent of CVE-2016-10723.
> > 
> 
> Thanks, Tetsuo.  Should commit af5679fbc669 ("mm, oom: remove oom_lock 
> from oom_reaper") also be added to the list for CVE-2016-10723?
> 

Commit af5679fbc669f31f ("mm, oom: remove oom_lock from oom_reaper")
negated one of the two rationales for commit 9bfe5ded054b8e28 ("mm, oom:
remove sleep from under oom_lock"). If we had not applied af5679fbc669f31f,
we could make sure that CPU time is given to the owner of oom_lock by
replacing mutex_trylock() in __alloc_pages_may_oom() with mutex_lock().
But now that af5679fbc669f31f has been applied, we don't know how to give
CPU time to the OOM reaper / exit_mmap(). We might eventually arrive at
direct OOM reaping, but we haven't gotten there yet...

For now, I don't think we need to add af5679fbc669f31f to the list for
CVE-2016-10723, because af5679fbc669f31f might cause premature selection
of the next OOM victim (especially with CONFIG_PREEMPT=y kernels) due to
the following

   __alloc_pages_may_oom():               oom_reap_task():

     mutex_trylock(&oom_lock) succeeds.
     get_page_from_freelist() fails.
     Preempted to other process.
                                            oom_reap_task_mm() succeeds.
                                            Sets MMF_OOM_SKIP.
     Returned from preemption.
     Finds that MMF_OOM_SKIP was already set.
     Selects next OOM victim and kills it.
     mutex_unlock(&oom_lock) is called.

race window, which is the one described as

    Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
    to prevent from races when the page allocator didn't manage to get the
    freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
    on and move on to another victim.  Although this is possible in principle
    let's wait for it to actually happen in real life before we make the
    locking more complex again.

in that commit.
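
For completeness, "setting MMF_OOM_SKIP under the lock" would look roughly
like the untested sketch below, at the point in oom_reap_task() where the
flag is currently set (illustrative only; the rest of the function is
omitted):

	/* After the reaping attempts have succeeded: */
	mutex_lock(&oom_lock);
	/*
	 * The allocator tests MMF_OOM_SKIP while holding oom_lock, so
	 * serializing the flag update means that a task which took
	 * oom_lock before the reap completed still sees the flag clear,
	 * aborts victim selection, and retries its allocation, where it
	 * finds the memory that was just reaped instead of killing a
	 * second victim.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	mutex_unlock(&oom_lock);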
Tetsuo Handa Sept. 5, 2018, 1:20 p.m. UTC | #26
On 2018/08/24 9:31, Tetsuo Handa wrote:
> For now, I don't think we need to add af5679fbc669f31f to the list for
> CVE-2016-10723, because af5679fbc669f31f might cause premature selection
> of the next OOM victim (especially with CONFIG_PREEMPT=y kernels) due to
> the following
> 
>    __alloc_pages_may_oom():               oom_reap_task():
> 
>      mutex_trylock(&oom_lock) succeeds.
>      get_page_from_freelist() fails.
>      Preempted to other process.
>                                             oom_reap_task_mm() succeeds.
>                                             Sets MMF_OOM_SKIP.
>      Returned from preemption.
>      Finds that MMF_OOM_SKIP was already set.
>      Selects next OOM victim and kills it.
>      mutex_unlock(&oom_lock) is called.
> 
> race window, which is the one described as
> 
>     Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
>     to prevent from races when the page allocator didn't manage to get the
>     freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
>     on and move on to another victim.  Although this is possible in principle
>     let's wait for it to actually happen in real life before we make the
>     locking more complex again.
> 
> in that commit.
> 

Yes, that race window is real. We can needlessly select the next OOM victim.
I think that af5679fbc669f31f was too optimistic.

[  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice child
[  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
[  278.151586] vmtoolsd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[  278.156642] vmtoolsd cpuset=/ mems_allowed=0
[  278.158884] CPU: 2 PID: 8916 Comm: vmtoolsd Kdump: loaded Not tainted 4.19.0-rc2+ #465
[  278.162252] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  278.165499] Call Trace:
[  278.166693]  dump_stack+0x99/0xdc
[  278.167922]  dump_header+0x70/0x2fa
[  278.169414] oom_reaper: reaped process 9943 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  278.169629]  ? _raw_spin_unlock_irqrestore+0x6a/0x8c
[  278.169633]  oom_kill_process+0x2ee/0x380
[  278.169635]  out_of_memory+0x136/0x540
[  278.169636]  ? out_of_memory+0x1fc/0x540
[  278.169640]  __alloc_pages_slowpath+0x986/0xce4
[  278.169641]  ? get_page_from_freelist+0x16b/0x1600
[  278.169646]  __alloc_pages_nodemask+0x398/0x3d0
[  278.180594]  alloc_pages_current+0x65/0xb0
[  278.182173]  __page_cache_alloc+0x154/0x190
[  278.184200]  ? pagecache_get_page+0x27/0x250
[  278.185410]  filemap_fault+0x4df/0x880
[  278.186282]  ? filemap_fault+0x31b/0x880
[  278.187395]  ? xfs_ilock+0x1bd/0x220
[  278.188264]  ? __xfs_filemap_fault+0x76/0x270
[  278.189268]  ? down_read_nested+0x48/0x80
[  278.190229]  ? xfs_ilock+0x1bd/0x220
[  278.191061]  __xfs_filemap_fault+0x89/0x270
[  278.192059]  xfs_filemap_fault+0x27/0x30
[  278.192967]  __do_fault+0x1f/0x70
[  278.193777]  __handle_mm_fault+0xfbd/0x1470
[  278.194743]  handle_mm_fault+0x1f2/0x400
[  278.195679]  ? handle_mm_fault+0x47/0x400
[  278.196618]  __do_page_fault+0x217/0x4b0
[  278.197504]  do_page_fault+0x3c/0x21e
[  278.198303]  ? page_fault+0x8/0x30
[  278.199092]  page_fault+0x1e/0x30
[  278.199821] RIP: 0033:0x7f2322ebbfb0
[  278.200605] Code: Bad RIP value.
[  278.201370] RSP: 002b:00007ffda96e7648 EFLAGS: 00010246
[  278.202518] RAX: 0000000000000000 RBX: 00007f23220f26b0 RCX: 0000000000000010
[  278.204280] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f2321ecfb5b
[  278.205838] RBP: 0000000002504b70 R08: 00007f2321ecfb60 R09: 000000000250bd20
[  278.207426] R10: 383836312d646c69 R11: 0000000000000000 R12: 00007ffda96e76b0
[  278.208982] R13: 00007f2322ea8540 R14: 000000000250ba90 R15: 00007f2323173920
[  278.210840] Mem-Info:
[  278.211462] active_anon:18629 inactive_anon:2390 isolated_anon:0
[  278.211462]  active_file:19 inactive_file:1565 isolated_file:0
[  278.211462]  unevictable:0 dirty:0 writeback:0 unstable:0
[  278.211462]  slab_reclaimable:5820 slab_unreclaimable:8964
[  278.211462]  mapped:2128 shmem:2493 pagetables:1826 bounce:0
[  278.211462]  free:878043 free_pcp:909 free_cma:0
[  278.218830] Node 0 active_anon:74516kB inactive_anon:9560kB active_file:76kB inactive_file:6260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:8512kB dirty:0kB writeback:0kB shmem:9972kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 43008kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  278.224997] Node 0 DMA free:15888kB min:288kB low:360kB high:432kB active_anon:32kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  278.230602] lowmem_reserve[]: 0 2663 3610 3610
[  278.231887] Node 0 DMA32 free:2746332kB min:49636kB low:62044kB high:74452kB active_anon:2536kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2749920kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:1500kB local_pcp:0kB free_cma:0kB
[  278.238291] lowmem_reserve[]: 0 0 947 947
[  278.239270] Node 0 Normal free:749952kB min:17652kB low:22064kB high:26476kB active_anon:72816kB inactive_anon:9560kB active_file:264kB inactive_file:5556kB unevictable:0kB writepending:4kB present:1048576kB managed:969932kB mlocked:0kB kernel_stack:5328kB pagetables:7092kB bounce:0kB free_pcp:2132kB local_pcp:64kB free_cma:0kB
[  278.245895] lowmem_reserve[]: 0 0 0 0
[  278.246820] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB
[  278.249659] Node 0 DMA32: 7*4kB (U) 8*8kB (UM) 8*16kB (U) 8*32kB (U) 8*64kB (U) 6*128kB (U) 7*256kB (UM) 7*512kB (UM) 3*1024kB (UM) 2*2048kB (M) 667*4096kB (UM) = 2746332kB
[  278.253054] Node 0 Normal: 4727*4kB (UME) 3423*8kB (UME) 1679*16kB (UME) 704*32kB (UME) 253*64kB (UME) 107*128kB (UME) 38*256kB (M) 16*512kB (M) 10*1024kB (M) 9*2048kB (M) 141*4096kB (M) = 749700kB
[  278.257158] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  278.259018] 4125 total pagecache pages
[  278.259896] 0 pages in swap cache
[  278.260745] Swap cache stats: add 0, delete 0, find 0/0
[  278.261934] Free swap  = 0kB
[  278.262750] Total swap = 0kB
[  278.263483] 1048445 pages RAM
[  278.264216] 0 pages HighMem/MovableOnly
[  278.265077] 114506 pages reserved
[  278.265971] Tasks state (memory values in pages):
[  278.267118] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  278.269090] [   2846]     0  2846      267      171    32768        0             0 none
[  278.271221] [   4891]     0  4891     9772      856   110592        0             0 systemd-journal
[  278.273253] [   6610]     0  6610    10811      261   114688        0         -1000 systemd-udevd
[  278.275197] [   6709]     0  6709    13880      112   131072        0         -1000 auditd
[  278.277067] [   7388]     0  7388     3307       52    69632        0             0 rngd
[  278.278880] [   7393]     0  7393    24917      403   184320        0             0 VGAuthService
[  278.280864] [   7510]    70  7510    15043      102   155648        0             0 avahi-daemon
[  278.282898] [   7555]     0  7555     5420      372    81920        0             0 irqbalance
[  278.284836] [   7563]     0  7563     6597       83   102400        0             0 systemd-logind
[  278.286959] [   7565]    81  7565    14553      157   167936        0          -900 dbus-daemon
[  278.288985] [   8286]    70  8286    15010       98   151552        0             0 avahi-daemon
[  278.290958] [   8731]     0  8731    74697      999   270336        0             0 vmtoolsd
[  278.293008] [   8732]   999  8732   134787     1730   274432        0             0 polkitd
[  278.294906] [   8733]     0  8733    55931      467   274432        0             0 abrtd
[  278.296774] [   8734]     0  8734    55311      354   266240        0             0 abrt-watch-log
[  278.298839] [   8774]     0  8774    31573      155   106496        0             0 crond
[  278.300810] [   8790]     0  8790    89503     5482   421888        0             0 firewalld
[  278.302727] [   8916]     0  8916    45262      211   204800        0             0 vmtoolsd
[  278.304841] [   9230]     0  9230    26877      507   229376        0             0 dhclient
[  278.306733] [   9333]     0  9333    87236      451   528384        0             0 nmbd
[  278.308554] [   9334]     0  9334    28206      257   253952        0         -1000 sshd
[  278.310431] [   9335]     0  9335   143457     3260   430080        0             0 tuned
[  278.312278] [   9337]     0  9337    55682     2442   200704        0             0 rsyslogd
[  278.314188] [   9497]     0  9497    24276      170   233472        0             0 login
[  278.316038] [   9498]     0  9498    27525       33    73728        0             0 agetty
[  278.317918] [   9539]     0  9539   104864      581   659456        0             0 smbd
[  278.319738] [   9590]     0  9590   103799      555   610304        0             0 smbd-notifyd
[  278.321918] [   9591]     0  9591   103797      555   602112        0             0 cleanupd
[  278.323935] [   9592]     0  9592   104864      580   610304        0             0 lpqd
[  278.325835] [   9596]     0  9596    28894      129    90112        0             0 bash
[  278.327663] [   9639]     0  9639    28833      474   249856        0             0 sendmail
[  278.329550] [   9773]    51  9773    26644      411   229376        0             0 sendmail
[  278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or sacrifice child
[  278.333267] Killed process 8790 (firewalld) total-vm:358012kB, anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
[  278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Michal Hocko Sept. 5, 2018, 1:40 p.m. UTC | #27
On Wed 05-09-18 22:20:58, Tetsuo Handa wrote:
> On 2018/08/24 9:31, Tetsuo Handa wrote:
> > For now, I don't think we need to add af5679fbc669f31f to the list for
> > CVE-2016-10723, because af5679fbc669f31f might cause premature selection
> > of the next OOM victim (especially with CONFIG_PREEMPT=y kernels) due to
> > the following
> > 
> >    __alloc_pages_may_oom():               oom_reap_task():
> > 
> >      mutex_trylock(&oom_lock) succeeds.
> >      get_page_from_freelist() fails.
> >      Preempted to other process.
> >                                             oom_reap_task_mm() succeeds.
> >                                             Sets MMF_OOM_SKIP.
> >      Returned from preemption.
> >      Finds that MMF_OOM_SKIP was already set.
> >      Selects next OOM victim and kills it.
> >      mutex_unlock(&oom_lock) is called.
> > 
> > race window, which is the one described as
> > 
> >     Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
> >     to prevent from races when the page allocator didn't manage to get the
> >     freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
> >     on and move on to another victim.  Although this is possible in principle
> >     let's wait for it to actually happen in real life before we make the
> >     locking more complex again.
> > 
> > in that commit.
> > 
> 
> Yes, that race window is real. We can needlessly select the next OOM victim.
> I think that af5679fbc669f31f was too optimistic.

The changelog said:

"Although this is possible in principle let's wait for it to actually
happen in real life before we make the locking more complex again."

So what is the real life workload that hits it? The log you have pasted
below doesn't tell much.

> [  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice child
> [  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
> [  278.151586] vmtoolsd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[...]
> [  278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or sacrifice child
> [  278.333267] Killed process 8790 (firewalld) total-vm:358012kB, anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
> [  278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Tetsuo Handa Sept. 5, 2018, 1:53 p.m. UTC | #28
On 2018/09/05 22:40, Michal Hocko wrote:
> The changelog said:
> 
> "Although this is possible in principle let's wait for it to actually
> happen in real life before we make the locking more complex again."
> 
> So what is the real life workload that hits it? The log you have pasted
> below doesn't tell much.

Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.
The OOM reaper succeeded in reclaiming all memory, but the OOM killer still triggered
just because somebody was inside that race window.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/mman.h>

/* Pinned to CPU 0; repeatedly mmap()s and munmap()s this executable. */
static int memory_eater(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static cpu_set_t set = { { 1 } }; /* CPU 0 */
	sched_setaffinity(0, sizeof(set), &set);
	srand(getpid());
	while (1) {
		int size = rand() % 1048576;
		void *ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, size);
	}
	return 0;
}

static int child(void *unused)
{
	static char *stack[256] = { };
	char buf[32] = { };
	int i;
	sleep(1);
	snprintf(buf, sizeof(buf), "tgid=%u", getpid());
	prctl(PR_SET_NAME, (unsigned long) buf, 0, 0, 0);
	for (i = 0; i < 256; i++)
		stack[i] = malloc(4096 * 2);
	/* 127 eaters in this thread group... */
	for (i = 1; i < 128; i++)
		if (clone(memory_eater, stack[i] + 8192, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			_exit(1);
	/* ...and 126 more sharing the mm but in their own thread groups. */
	for (; i < 254; i++)
		if (clone(memory_eater, stack[i] + 8192, CLONE_VM, NULL) == -1)
			_exit(1);
	return 0;
}

int main(int argc, char *argv[])
{
	char *stack = malloc(1048576);
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	if (clone(child, stack + 1048576, CLONE_VM, NULL) == -1)
		_exit(1);
	/* Grab as much virtual memory as realloc() will give us. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	while (1)
		pause();
	return 0;
}
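
(The program is self-contained; building it with plain gcc and running the
resulting ./a.out as an ordinary user is enough, because the final loop in
main() touches more anonymous memory than the machine can back.)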

> 
>> [  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice child
>> [  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
>> [  278.151586] vmtoolsd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
> [...]
>> [  278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or sacrifice child
>> [  278.333267] Killed process 8790 (firewalld) total-vm:358012kB, anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
>> [  278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 

Another example from a different machine.

[  765.523676] a.out invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[  765.528172] a.out cpuset=/ mems_allowed=0
[  765.530603] CPU: 5 PID: 4540 Comm: a.out Tainted: G                T 4.19.0-rc2+ #689
[  765.534307] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  765.538975] Call Trace:
[  765.540920]  dump_stack+0x85/0xcb
[  765.543088]  dump_header+0x69/0x2fe
[  765.545375]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  765.547983]  oom_kill_process+0x307/0x390
[  765.550394]  out_of_memory+0x2f5/0x5b0
[  765.552818]  __alloc_pages_slowpath+0xc01/0x1030
[  765.555442]  __alloc_pages_nodemask+0x333/0x390
[  765.557966]  alloc_pages_vma+0x77/0x1f0
[  765.560292]  __handle_mm_fault+0x81c/0xf40
[  765.562736]  handle_mm_fault+0x1b7/0x3c0
[  765.565038]  __do_page_fault+0x2a6/0x580
[  765.567420]  do_page_fault+0x32/0x270
[  765.569670]  ? page_fault+0x8/0x30
[  765.571833]  page_fault+0x1e/0x30
[  765.573934] RIP: 0033:0x4008d8
[  765.575924] Code: Bad RIP value.
[  765.577842] RSP: 002b:00007ffec6f7d420 EFLAGS: 00010206
[  765.580221] RAX: 00007f6a201f4010 RBX: 0000000100000000 RCX: 0000000000000000
[  765.583253] RDX: 00000000bde23000 RSI: 0000000000020000 RDI: 0000000200000050
[  765.586207] RBP: 00007f6a201f4010 R08: 0000000200001000 R09: 0000000000021000
[  765.589047] R10: 0000000000000022 R11: 0000000000001000 R12: 0000000000000006
[  765.591996] R13: 00007ffec6f7d510 R14: 0000000000000000 R15: 0000000000000000
[  765.595732] Mem-Info:
[  765.597580] active_anon:794622 inactive_anon:2126 isolated_anon:0
[  765.597580]  active_file:7 inactive_file:11 isolated_file:0
[  765.597580]  unevictable:0 dirty:0 writeback:0 unstable:0
[  765.597580]  slab_reclaimable:7612 slab_unreclaimable:22840
[  765.597580]  mapped:1432 shmem:2280 pagetables:3230 bounce:0
[  765.597580]  free:20847 free_pcp:0 free_cma:0
[  765.611408] Node 0 active_anon:3178488kB inactive_anon:8504kB active_file:28kB inactive_file:44kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5728kB dirty:0kB writeback:0kB shmem:9120kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3004416kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  765.620811] Node 0 DMA free:13792kB min:308kB low:384kB high:460kB active_anon:2016kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  765.630835] lowmem_reserve[]: 0 2674 3378 3378
[  765.633372] Node 0 DMA32 free:55732kB min:53260kB low:66572kB high:79884kB active_anon:2680196kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738556kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  765.643494] lowmem_reserve[]: 0 0 703 703
[  765.645712] Node 0 Normal free:13912kB min:14012kB low:17512kB high:21012kB active_anon:495920kB inactive_anon:8504kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:6144kB pagetables:12924kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  765.656802] lowmem_reserve[]: 0 0 0 0
[  765.658884] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13792kB
[  765.663518] Node 0 DMA32: 22*4kB (U) 26*8kB (UM) 50*16kB (U) 55*32kB (U) 46*64kB (U) 34*128kB (UM) 33*256kB (UM) 21*512kB (UM) 12*1024kB (UM) 3*2048kB (UM) 2*4096kB (M) = 55976kB
[  765.671041] Node 0 Normal: 74*4kB (UM) 22*8kB (UM) 22*16kB (UM) 147*32kB (UE) 31*64kB (UME) 8*128kB (UME) 3*256kB (UME) 1*512kB (M) 4*1024kB (UM) 0*2048kB 0*4096kB = 13912kB
[  765.677850] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  765.681352] 2295 total pagecache pages
[  765.683550] 0 pages in swap cache
[  765.685731] Swap cache stats: add 0, delete 0, find 0/0
[  765.688430] Free swap  = 0kB
[  765.690419] Total swap = 0kB
[  765.692375] 1048422 pages RAM
[  765.694388] 0 pages HighMem/MovableOnly
[  765.696637] 179653 pages reserved
[  765.698618] 0 pages cma reserved
[  765.700601] 0 pages hwpoisoned
[  765.702651] Out of memory: Kill process 4540 (a.out) score 897 or sacrifice child
[  765.706876] Killed process 4540 (a.out) total-vm:4267252kB, anon-rss:3111084kB, file-rss:0kB, shmem-rss:0kB
[  765.711735] in:imjournal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[  765.719957] in:imjournal cpuset=/ mems_allowed=0
[  765.723302] CPU: 6 PID: 1012 Comm: in:imjournal Tainted: G                T 4.19.0-rc2+ #689
[  765.731014] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  765.747405] Call Trace:
[  765.749579]  dump_stack+0x85/0xcb
[  765.752179]  dump_header+0x69/0x2fe
[  765.754475]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  765.757307]  oom_kill_process+0x307/0x390
[  765.759873]  out_of_memory+0x2f5/0x5b0
[  765.762464]  __alloc_pages_slowpath+0xc01/0x1030
[  765.765176]  __alloc_pages_nodemask+0x333/0x390
[  765.767832]  filemap_fault+0x465/0x910
[  765.770399]  ? xfs_ilock+0xbf/0x2b0 [xfs]
[  765.770514] oom_reaper: reaped process 4540 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  765.773047]  ? __xfs_filemap_fault+0x7d/0x2c0 [xfs]
[  765.773054]  ? down_read_nested+0x66/0xa0
[  765.783474]  __xfs_filemap_fault+0x8e/0x2c0 [xfs]
[  765.786436]  __do_fault+0x11/0x133
[  765.788649]  __handle_mm_fault+0xa57/0xf40
[  765.791022]  handle_mm_fault+0x1b7/0x3c0
[  765.793307]  __do_page_fault+0x2a6/0x580
[  765.795566]  do_page_fault+0x32/0x270
[  765.797708]  ? page_fault+0x8/0x30
[  765.799735]  page_fault+0x1e/0x30
[  765.801707] RIP: 0033:0x7f9a91e00fcf
[  765.803724] Code: Bad RIP value.
[  765.805569] RSP: 002b:00007f9a8ebe6c60 EFLAGS: 00010293
[  765.808021] RAX: 0000000000000000 RBX: 0000562b41445bb0 RCX: 00007f9a91e00fcf
[  765.810969] RDX: 00007f9a8ebe6c80 RSI: 0000000000000001 RDI: 00007f9a8ebe6ca0
[  765.813911] RBP: 00000000000dbba0 R08: 0000000000000008 R09: 0000000000000000
[  765.816861] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f9a8ebe6d90
[  765.819762] R13: 00007f9a80029af0 R14: 00007f9a8002fdf0 R15: 00007f9a8001ea60
[  765.823619] Mem-Info:
[  765.825340] active_anon:16863 inactive_anon:2126 isolated_anon:0
[  765.825340]  active_file:35 inactive_file:3002 isolated_file:0
[  765.825340]  unevictable:0 dirty:0 writeback:0 unstable:0
[  765.825340]  slab_reclaimable:7612 slab_unreclaimable:22764
[  765.825340]  mapped:2597 shmem:2280 pagetables:1686 bounce:0
[  765.825340]  free:796988 free_pcp:423 free_cma:0
[  765.839674] Node 0 active_anon:67452kB inactive_anon:8504kB active_file:140kB inactive_file:12008kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:10388kB dirty:0kB writeback:0kB shmem:9120kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 24576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  765.848867] Node 0 DMA free:15840kB min:308kB low:384kB high:460kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  765.859189] lowmem_reserve[]: 0 2674 3378 3378
[  765.861638] Node 0 DMA32 free:2735512kB min:53260kB low:66572kB high:79884kB active_anon:12kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738556kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:1276kB local_pcp:0kB free_cma:0kB
[  765.871163] lowmem_reserve[]: 0 0 703 703
[  765.873345] Node 0 Normal free:436320kB min:14012kB low:17512kB high:21012kB active_anon:67460kB inactive_anon:8504kB active_file:140kB inactive_file:12112kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:6128kB pagetables:6844kB bounce:0kB free_pcp:416kB local_pcp:0kB free_cma:0kB
[  765.884108] lowmem_reserve[]: 0 0 0 0
[  765.886080] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15844kB
[  765.891129] Node 0 DMA32: 28*4kB (UM) 33*8kB (UM) 58*16kB (UM) 60*32kB (UM) 48*64kB (UM) 34*128kB (UM) 34*256kB (UM) 21*512kB (UM) 12*1024kB (UM) 9*2048kB (UM) 653*4096kB (M) = 2735512kB
[  765.898614] Node 0 Normal: 3528*4kB (UM) 1178*8kB (UM) 1371*16kB (UM) 898*32kB (UME) 362*64kB (UME) 108*128kB (UME) 46*256kB (UME) 22*512kB (M) 7*1024kB (UM) 10*2048kB (M) 67*4096kB (M) = 436320kB
[  765.906220] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  765.909852] 5357 total pagecache pages
[  765.912036] 0 pages in swap cache
[  765.914092] Swap cache stats: add 0, delete 0, find 0/0
[  765.916680] Free swap  = 0kB
[  765.918560] Total swap = 0kB
[  765.920583] 1048422 pages RAM
[  765.922676] 0 pages HighMem/MovableOnly
[  765.924973] 179653 pages reserved
[  765.927004] 0 pages cma reserved
[  765.928919] 0 pages hwpoisoned
[  765.930821] Unreclaimable slab info:
[  765.932934] Name                      Used          Total
[  765.936489] nf_conntrack               0KB         31KB
[  765.939582] xfs_buf                   22KB        127KB
[  765.942322] xfs_ili                   20KB        127KB
[  765.944942] xfs_efi_item               0KB         31KB
[  765.947427] xfs_efd_item               0KB         31KB
[  765.949957] xfs_buf_item               0KB         62KB
[  765.952490] xfs_trans                  0KB         63KB
[  765.955268] xfs_ifork                 27KB        158KB
[  765.957913] xfs_da_state               0KB         31KB
[  765.960483] xfs_btree_cur              0KB         63KB
[  765.962943] xfs_bmap_free_item          0KB         15KB
[  765.965415] xfs_log_ticket             0KB         63KB
[  765.967880] bio-2                     24KB         63KB
[  765.970409] sd_ext_cdb                 0KB         15KB
[  765.973778] scsi_sense_cache           7KB        128KB
[  765.976603] fib6_nodes                11KB         32KB
[  765.979072] ip6_dst_cache              4KB         31KB
[  765.981504] RAWv6                    266KB        384KB
[  765.983897] UDPv6                      0KB         31KB
[  765.986243] TCPv6                     10KB         30KB
[  765.988850] sgpool-128                 8KB         62KB
[  765.991320] sgpool-64                  4KB         31KB
[  765.993554] sgpool-32                  2KB         31KB
[  765.995733] sgpool-16                  1KB         31KB
[  765.997852] sgpool-8                   1KB         63KB
[  766.000043] mqueue_inode_cache          1KB         31KB
[  766.002313] bio-1                      2KB         31KB
[  766.004549] fasync_cache               0KB         15KB
[  766.006757] posix_timers_cache          0KB         31KB
[  766.009787] UNIX                     187KB        240KB
[  766.012175] tcp_bind_bucket            1KB         32KB
[  766.014163] ip_fib_trie                2KB         15KB
[  766.016080] ip_fib_alias               3KB         15KB
[  766.017988] ip_dst_cache               6KB         31KB
[  766.019808] RAW                      464KB        585KB
[  766.021751] UDP                        8KB         30KB
[  766.023730] request_sock_TCP           0KB         31KB
[  766.025684] TCP                       12KB         31KB
[  766.027458] hugetlbfs_inode_cache          2KB         30KB
[  766.029314] eventpoll_pwq             37KB         63KB
[  766.031131] eventpoll_epi             53KB         94KB
[  766.032938] inotify_inode_mark         27KB         31KB
[  766.034774] request_queue             16KB         56KB
[  766.036598] blkdev_ioc                31KB        126KB
[  766.038508] bio-0                      6KB        127KB
[  766.040390] biovec-max               405KB        506KB
[  766.042199] biovec-128                 0KB         31KB
[  766.044857] biovec-64                  0KB         94KB
[  766.047009] biovec-16                  0KB         63KB
[  766.048856] bio_integrity_payload          1KB         31KB
[  766.050745] khugepaged_mm_slot          1KB         15KB
[  766.052602] uid_cache                  2KB         31KB
[  766.054537] dmaengine-unmap-2          0KB         15KB
[  766.056523] audit_buffer               0KB         15KB
[  766.058506] skbuff_fclone_cache          0KB         31KB
[  766.060521] skbuff_head_cache          0KB        127KB
[  766.062391] configfs_dir_cache          1KB         15KB
[  766.064226] file_lock_cache           13KB         31KB
[  766.066061] file_lock_ctx             15KB         31KB
[  766.067883] fsnotify_mark_connector         18KB         31KB
[  766.069874] net_namespace              0KB         32KB
[  766.071801] shmem_inode_cache       1546KB       1595KB
[  766.073838] task_delay_info          177KB        318KB
[  766.075820] taskstats                  2KB         63KB
[  766.077664] proc_dir_entry           498KB        540KB
[  766.080418] pde_opener                 0KB         31KB
[  766.082620] seq_file                   1KB        127KB
[  766.084526] sigqueue                   0KB         63KB
[  766.086374] kernfs_node_cache      29487KB      29536KB
[  766.088403] mnt_cache                 90KB        126KB
[  766.090442] filp                    1195KB       1466KB
[  766.092412] names_cache                0KB        155KB
[  766.094412] key_jar                    4KB         31KB
[  766.096428] nsproxy                    1KB         15KB
[  766.098448] vm_area_struct          3312KB       3480KB
[  766.100378] mm_struct                 68KB        128KB
[  766.102292] fs_cache                  90KB        189KB
[  766.104179] files_cache              200KB        281KB
[  766.106192] signal_cache             690KB        892KB
[  766.108158] sighand_cache            943KB       1137KB
[  766.110116] task_struct             3590KB       3825KB
[  766.112005] cred_jar                 249KB        378KB
[  766.113912] anon_vma_chain          1634KB       1742KB
[  766.116711] anon_vma                1214KB       1309KB
[  766.119046] pid                      129KB        256KB
[  766.121108] Acpi-Operand            3713KB       3761KB
[  766.123251] Acpi-ParseExt              0KB         15KB
[  766.125261] Acpi-Parse                 0KB         31KB
[  766.127259] Acpi-State                 0KB         47KB
[  766.129256] Acpi-Namespace          2913KB       2936KB
[  766.131118] numa_policy               74KB         94KB
[  766.132954] trace_event_file         455KB        456KB
[  766.134815] ftrace_event_field        777KB        787KB
[  766.136762] pool_workqueue            72KB         94KB
[  766.138859] task_group                35KB         63KB
[  766.140890] page->ptl                725KB        828KB
[  766.142857] dma-kmalloc-512            0KB         31KB
[  766.144751] kmalloc-8192             441KB        499KB
[  766.146793] kmalloc-4096             965KB       1090KB
[  766.148619] kmalloc-2048            4570KB       4691KB
[  766.150479] kmalloc-1024             881KB        988KB
[  766.153265] kmalloc-512             3343KB       3430KB
[  766.155491] kmalloc-256             3368KB       3434KB
[  766.157383] kmalloc-192              276KB        447KB
[  766.159287] kmalloc-128              981KB       1078KB
[  766.161152] kmalloc-96              2288KB       2294KB
[  766.163011] kmalloc-64              1675KB       1921KB
[  766.164862] kmalloc-32              1395KB       1533KB
[  766.166635] kmalloc-16              3359KB       3795KB
[  766.168413] kmalloc-8               2344KB       2384KB
[  766.170249] kmem_cache_node          102KB        126KB
[  766.172232] kmem_cache               147KB        158KB
[  766.174204] Out of memory: Kill process 679 (firewalld) score 6 or sacrifice child
[  766.176757] Killed process 679 (firewalld) total-vm:357864kB, anon-rss:21920kB, file-rss:0kB, shmem-rss:0kB
[  766.183011] oom_reaper: reaped process 679 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Michal Hocko Sept. 5, 2018, 2:04 p.m. UTC | #29
On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> On 2018/09/05 22:40, Michal Hocko wrote:
> > The changelog said:
> > 
> > "Although this is possible in principle let's wait for it to actually
> > happen in real life before we make the locking more complex again."
> > 
> > So what is the real life workload that hits it? The log you have pasted
> > below doesn't tell much.
> 
> Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.

I strongly suspect that your test doesn't really represent or simulate
any real and useful workload. Sure it triggers a rare race and we kill
another oom victim. Does this warrant making the code more complex?
Well, I am not convinced, as I've said countless times.
Tetsuo Handa Sept. 6, 2018, 1 a.m. UTC | #30
Michal Hocko wrote:
> On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > On 2018/09/05 22:40, Michal Hocko wrote:
> > > The changelog said:
> > > 
> > > "Although this is possible in principle let's wait for it to actually
> > > happen in real life before we make the locking more complex again."
> > > 
> > > So what is the real life workload that hits it? The log you have pasted
> > > below doesn't tell much.
> > 
> > Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.
> 
> I strongly suspect that your test doesn't really represent or simulate
> any real and useful workload. Sure it triggers a rare race and we kill
> another oom victim. Does this warrant making the code more complex?
> Well, I am not convinced, as I've said countless times.

Yes. Below is an example from a machine running an Apache Web server, a Tomcat
application server and a PostgreSQL DB server. A memory eater needlessly killed
Tomcat due to this race. I assert that we should fix af5679fbc669f31f.



Before:

systemd(1)-+-NetworkManager(693)-+-dhclient(791)
           |                     |-{NetworkManager}(698)
           |                     `-{NetworkManager}(702)
           |-abrtd(630)
           |-agetty(1007)
           |-atd(653)
           |-auditd(600)---{auditd}(601)
           |-avahi-daemon(625)---avahi-daemon(631)
           |-crond(657)
           |-dbus-daemon(638)
           |-firewalld(661)---{firewalld}(788)
           |-httpd(1169)-+-httpd(1170)
           |             |-httpd(1171)
           |             |-httpd(1172)
           |             |-httpd(1173)
           |             `-httpd(1174)
           |-irqbalance(628)
           |-java(1074)-+-{java}(1092)
           |            |-{java}(1093)
           |            |-{java}(1094)
           |            |-{java}(1095)
           |            |-{java}(1096)
           |            |-{java}(1097)
           |            |-{java}(1098)
           |            |-{java}(1099)
           |            |-{java}(1100)
           |            |-{java}(1101)
           |            |-{java}(1102)
           |            |-{java}(1103)
           |            |-{java}(1104)
           |            |-{java}(1105)
           |            |-{java}(1106)
           |            |-{java}(1107)
           |            |-{java}(1108)
           |            |-{java}(1109)
           |            |-{java}(1110)
           |            |-{java}(1111)
           |            |-{java}(1114)
           |            |-{java}(1115)
           |            |-{java}(1116)
           |            |-{java}(1117)
           |            |-{java}(1118)
           |            |-{java}(1119)
           |            |-{java}(1120)
           |            |-{java}(1121)
           |            |-{java}(1122)
           |            |-{java}(1123)
           |            |-{java}(1124)
           |            |-{java}(1125)
           |            |-{java}(1126)
           |            |-{java}(1127)
           |            |-{java}(1128)
           |            |-{java}(1129)
           |            |-{java}(1130)
           |            |-{java}(1131)
           |            |-{java}(1132)
           |            |-{java}(1133)
           |            |-{java}(1134)
           |            |-{java}(1135)
           |            |-{java}(1136)
           |            |-{java}(1137)
           |            `-{java}(1138)
           |-ksmtuned(659)---sleep(1727)
           |-login(1006)---bash(1052)---pstree(1728)
           |-polkitd(624)-+-{polkitd}(633)
           |              |-{polkitd}(642)
           |              |-{polkitd}(643)
           |              |-{polkitd}(645)
           |              `-{polkitd}(650)
           |-postgres(1154)-+-postgres(1155)
           |                |-postgres(1157)
           |                |-postgres(1158)
           |                |-postgres(1159)
           |                |-postgres(1160)
           |                `-postgres(1161)
           |-rsyslogd(986)-+-{rsyslogd}(997)
           |               `-{rsyslogd}(999)
           |-sendmail(1008)
           |-sendmail(1023)
           |-smbd(983)-+-cleanupd(1027)
           |           |-lpqd(1032)
           |           `-smbd-notifyd(1026)
           |-sshd(981)
           |-systemd-journal(529)
           |-systemd-logind(627)
           |-systemd-udevd(560)
           `-tuned(980)-+-{tuned}(1030)
                        |-{tuned}(1031)
                        |-{tuned}(1033)
                        `-{tuned}(1047)



After:

systemd(1)-+-NetworkManager(693)-+-dhclient(791)
           |                     |-{NetworkManager}(698)
           |                     `-{NetworkManager}(702)
           |-abrtd(630)
           |-agetty(1007)
           |-atd(653)
           |-auditd(600)---{auditd}(601)
           |-avahi-daemon(625)---avahi-daemon(631)
           |-crond(657)
           |-dbus-daemon(638)
           |-firewalld(661)---{firewalld}(788)
           |-httpd(1169)-+-httpd(1170)
           |             |-httpd(1171)
           |             |-httpd(1172)
           |             |-httpd(1173)
           |             `-httpd(1174)
           |-irqbalance(628)
           |-ksmtuned(659)---sleep(1758)
           |-login(1006)---bash(1052)---pstree(1759)
           |-polkitd(624)-+-{polkitd}(633)
           |              |-{polkitd}(642)
           |              |-{polkitd}(643)
           |              |-{polkitd}(645)
           |              `-{polkitd}(650)
           |-postgres(1154)-+-postgres(1155)
           |                |-postgres(1157)
           |                |-postgres(1158)
           |                |-postgres(1159)
           |                |-postgres(1160)
           |                `-postgres(1161)
           |-rsyslogd(986)-+-{rsyslogd}(997)
           |               `-{rsyslogd}(999)
           |-sendmail(1008)
           |-sendmail(1023)
           |-smbd(983)-+-cleanupd(1027)
           |           |-lpqd(1032)
           |           `-smbd-notifyd(1026)
           |-sshd(981)
           |-systemd-journal(529)
           |-systemd-logind(627)
           |-systemd-udevd(560)
           `-tuned(980)-+-{tuned}(1030)
                        |-{tuned}(1031)
                        |-{tuned}(1033)
                        `-{tuned}(1047)



[  222.165946] a.out invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[  222.170631] a.out cpuset=/ mems_allowed=0
[  222.172956] CPU: 4 PID: 1748 Comm: a.out Tainted: G                T 4.19.0-rc2+ #690
[  222.176517] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  222.180892] Call Trace:
[  222.182947]  dump_stack+0x85/0xcb
[  222.185240]  dump_header+0x69/0x2fe
[  222.187579]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  222.190319]  oom_kill_process+0x307/0x390
[  222.192803]  out_of_memory+0x2f3/0x5d0
[  222.195190]  __alloc_pages_slowpath+0xc01/0x1030
[  222.197844]  __alloc_pages_nodemask+0x333/0x390
[  222.200452]  alloc_pages_vma+0x77/0x1f0
[  222.202869]  __handle_mm_fault+0x81c/0xf40
[  222.205334]  handle_mm_fault+0x1b7/0x3c0
[  222.207712]  __do_page_fault+0x2a6/0x580
[  222.210036]  do_page_fault+0x32/0x270
[  222.212266]  ? page_fault+0x8/0x30
[  222.214402]  page_fault+0x1e/0x30
[  222.216463] RIP: 0033:0x4008d8
[  222.218429] Code: Bad RIP value.
[  222.220388] RSP: 002b:00007fff34061350 EFLAGS: 00010206
[  222.222931] RAX: 00007efea3c2e010 RBX: 0000000100000000 RCX: 0000000000000000
[  222.225976] RDX: 00000000b190f000 RSI: 0000000000020000 RDI: 0000000200000050
[  222.228891] RBP: 00007efea3c2e010 R08: 0000000200001000 R09: 0000000000021000
[  222.231779] R10: 0000000000000022 R11: 0000000000001000 R12: 0000000000000006
[  222.234626] R13: 00007fff34061440 R14: 0000000000000000 R15: 0000000000000000
[  222.238482] Mem-Info:
[  222.240511] active_anon:789816 inactive_anon:3457 isolated_anon:0
[  222.240511]  active_file:11 inactive_file:44 isolated_file:0
[  222.240511]  unevictable:0 dirty:6 writeback:0 unstable:0
[  222.240511]  slab_reclaimable:8052 slab_unreclaimable:24408
[  222.240511]  mapped:1898 shmem:3704 pagetables:4316 bounce:0
[  222.240511]  free:20841 free_pcp:0 free_cma:0
[  222.254349] Node 0 active_anon:3159264kB inactive_anon:13828kB active_file:44kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7592kB dirty:24kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2793472kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  222.264038] Node 0 DMA free:13812kB min:308kB low:384kB high:460kB active_anon:1876kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  222.274208] lowmem_reserve[]: 0 2674 3378 3378
[  222.276831] Node 0 DMA32 free:56068kB min:53260kB low:66572kB high:79884kB active_anon:2673292kB inactive_anon:216kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738564kB mlocked:0kB kernel_stack:96kB pagetables:3024kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  222.287961] lowmem_reserve[]: 0 0 703 703
[  222.291154] Node 0 Normal free:13672kB min:14012kB low:17512kB high:21012kB active_anon:483864kB inactive_anon:13604kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:7520kB pagetables:14272kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  222.302407] lowmem_reserve[]: 0 0 0 0
[  222.304596] Node 0 DMA: 1*4kB (M) 0*8kB 1*16kB (U) 7*32kB (UM) 4*64kB (U) 4*128kB (UM) 2*256kB (U) 0*512kB 0*1024kB 0*2048kB 3*4096kB (ME) = 13812kB
[  222.311748] Node 0 DMA32: 37*4kB (U) 29*8kB (UM) 20*16kB (UM) 30*32kB (UME) 28*64kB (UME) 11*128kB (UME) 9*256kB (UME) 8*512kB (UM) 6*1024kB (UME) 1*2048kB (E) 9*4096kB (UM) = 56316kB
[  222.318932] Node 0 Normal: 151*4kB (UM) 2*8kB (UM) 97*16kB (UM) 195*32kB (UME) 53*64kB (UME) 11*128kB (UME) 2*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13724kB
[  222.325455] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  222.329059] 3707 total pagecache pages
[  222.331315] 0 pages in swap cache
[  222.333440] Swap cache stats: add 0, delete 0, find 0/0
[  222.336097] Free swap  = 0kB
[  222.338091] Total swap = 0kB
[  222.340084] 1048422 pages RAM
[  222.342165] 0 pages HighMem/MovableOnly
[  222.344460] 179651 pages reserved
[  222.347423] 0 pages cma reserved
[  222.349793] 0 pages hwpoisoned
[  222.351784] Out of memory: Kill process 1748 (a.out) score 838 or sacrifice child
[  222.355131] Killed process 1748 (a.out) total-vm:4267252kB, anon-rss:2909224kB, file-rss:0kB, shmem-rss:0kB
[  222.359644] java invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[  222.364180] java cpuset=/ mems_allowed=0
[  222.366619] CPU: 0 PID: 1110 Comm: java Tainted: G                T 4.19.0-rc2+ #690
[  222.370088] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  222.377089] Call Trace:
[  222.380503]  dump_stack+0x85/0xcb
[  222.380509]  dump_header+0x69/0x2fe
[  222.380515]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  222.380518]  oom_kill_process+0x307/0x390
[  222.380553]  out_of_memory+0x2f3/0x5d0
[  222.396414]  __alloc_pages_slowpath+0xc01/0x1030
[  222.396423]  __alloc_pages_nodemask+0x333/0x390
[  222.396431]  filemap_fault+0x465/0x910
[  222.404090]  ? xfs_ilock+0xbf/0x2b0 [xfs]
[  222.404118]  ? __xfs_filemap_fault+0x7d/0x2c0 [xfs]
[  222.404124]  ? down_read_nested+0x66/0xa0
[  222.404148]  __xfs_filemap_fault+0x8e/0x2c0 [xfs]
[  222.404156]  __do_fault+0x11/0x133
[  222.404158]  __handle_mm_fault+0xa57/0xf40
[  222.404165]  handle_mm_fault+0x1b7/0x3c0
[  222.404171]  __do_page_fault+0x2a6/0x580
[  222.404187]  do_page_fault+0x32/0x270
[  222.404194]  ? page_fault+0x8/0x30
[  222.404196]  page_fault+0x1e/0x30
[  222.404199] RIP: 0033:0x7fedb229ed42
[  222.404205] Code: Bad RIP value.
[  222.404207] RSP: 002b:00007fed92ae9c90 EFLAGS: 00010202
[  222.404209] RAX: ffffffffffffff92 RBX: 00007fedb187c470 RCX: 00007fedb229ed42
[  222.404210] RDX: 0000000000000001 RSI: 0000000000000089 RDI: 00007fedac13c354
[  222.404211] RBP: 00007fed92ae9d50 R08: 00007fedac13c328 R09: 00000000ffffffff
[  222.404212] R10: 00007fed92ae9cf0 R11: 0000000000000202 R12: 0000000000000001
[  222.404213] R13: 00007fed92ae9cf0 R14: ffffffffffffff92 R15: 00007fedac13c300
[  222.404783] Mem-Info:
[  222.404790] active_anon:429056 inactive_anon:3457 isolated_anon:0
[  222.404790]  active_file:0 inactive_file:833 isolated_file:0
[  222.404790]  unevictable:0 dirty:0 writeback:0 unstable:0
[  222.404790]  slab_reclaimable:8052 slab_unreclaimable:24344
[  222.404790]  mapped:2375 shmem:3704 pagetables:3030 bounce:0
[  222.404790]  free:381368 free_pcp:89 free_cma:0
[  222.404793] Node 0 active_anon:1716224kB inactive_anon:13828kB active_file:0kB inactive_file:3332kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:9500kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 155648kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  222.404794] Node 0 DMA free:13812kB min:308kB low:384kB high:460kB active_anon:1876kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  222.404798] lowmem_reserve[]: 0 2674 3378 3378
[  222.404802] Node 0 DMA32 free:1362940kB min:53260kB low:66572kB high:79884kB active_anon:1366928kB inactive_anon:216kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738564kB mlocked:0kB kernel_stack:96kB pagetables:3028kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  222.404831] lowmem_reserve[]: 0 0 703 703
[  222.404837] Node 0 Normal free:149160kB min:14012kB low:17512kB high:21012kB active_anon:348116kB inactive_anon:13604kB active_file:0kB inactive_file:3080kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:7504kB pagetables:9088kB bounce:0kB free_pcp:376kB local_pcp:12kB free_cma:0kB
[  222.404841] lowmem_reserve[]: 0 0 0 0
[  222.404859] Node 0 DMA: 1*4kB (M) 0*8kB 1*16kB (U) 7*32kB (UM) 4*64kB (U) 4*128kB (UM) 2*256kB (U) 0*512kB 0*1024kB 0*2048kB 3*4096kB (ME) = 13812kB
[  222.405114] Node 0 DMA32: 37*4kB (U) 29*8kB (UM) 20*16kB (UM) 30*32kB (UME) 28*64kB (UME) 11*128kB (UME) 9*256kB (UME) 8*512kB (UM) 6*1024kB (UME) 10*2048kB (ME) 326*4096kB (UM) = 1373180kB
[  222.405423] Node 0 Normal: 512*4kB (U) 1075*8kB (UM) 1667*16kB (UM) 1226*32kB (UME) 497*64kB (UME) 209*128kB (UME) 50*256kB (UM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (M) = 152008kB
[  222.405797] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  222.405799] 4655 total pagecache pages
[  222.405801] 0 pages in swap cache
[  222.405802] Swap cache stats: add 0, delete 0, find 0/0
[  222.405803] Free swap  = 0kB
[  222.405803] Total swap = 0kB
[  222.405835] 1048422 pages RAM
[  222.405837] 0 pages HighMem/MovableOnly
[  222.405838] 179651 pages reserved
[  222.405839] 0 pages cma reserved
[  222.405840] 0 pages hwpoisoned
[  222.405843] Out of memory: Kill process 1074 (java) score 50 or sacrifice child
[  222.406136] Killed process 1074 (java) total-vm:5555688kB, anon-rss:174244kB, file-rss:0kB, shmem-rss:0kB
[  222.443446] oom_reaper: reaped process 1748 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Michal Hocko Sept. 6, 2018, 5:57 a.m. UTC | #31
On Thu 06-09-18 10:00:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > > On 2018/09/05 22:40, Michal Hocko wrote:
> > > > Changelog said 
> > > > 
> > > > "Although this is possible in principle let's wait for it to actually
> > > > happen in real life before we make the locking more complex again."
> > > > 
> > > > So what is the real life workload that hits it? The log you have pasted
> > > > below doesn't tell much.
> > > 
> > > Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.
> > 
> > I strongly suspect that your test doesn't really represent or simulate
> > any real and useful workload. Sure, it triggers a rare race and we kill
> > another oom victim. Does this warrant making the code more complex?
> > Well, I am not convinced, as I've said countless times.
> 
> Yes. Below is an example from a machine running an Apache web server, a Tomcat application server, and a PostgreSQL DB server.
> A memory eater needlessly killed Tomcat due to this race.

What prevents you from modifying your mem eater in a way that keeps
Tomcat resp. others from being the primary oom victim choice? In other
words, yeah, it is not optimal to lose the race, but if it is rare
enough then this is something to live with, because it can hardly be
considered a new DoS vector AFAICS. Remember that this is always going
to be racy land and we are not going to plug all possible races because
this is simply not viable. But I am pretty sure we have been through
all this many times already. Oh well...

> I assert that we should fix af5679fbc669f31f.

If you can come up with a reasonable patch which doesn't complicate the
code and is a clear win for both this particular workload as well as
others, then why not.
Tetsuo Handa Sept. 6, 2018, 6:22 a.m. UTC | #32
Michal Hocko wrote:
> > I assert that we should fix af5679fbc669f31f.
> 
> If you can come up with a reasonable patch which doesn't complicate the
> code and is a clear win for both this particular workload as well as
> others, then why not.

Why can't we do "at least MMF_OOM_SKIP should be set under the lock to
prevent races"?

diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b1..e096bb8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3065,7 +3065,9 @@ void exit_mmap(struct mm_struct *mm)
 		 */
 		(void)__oom_reap_task_mm(mm);
 
+		mutex_lock(&oom_lock);
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..b2a94c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -606,7 +606,9 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
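
For context, here is a sketch (an assumed simplification, not verbatim
kernel code) of the sequence the OOM killer runs while holding oom_lock,
and why the timing of set_bit(MMF_OOM_SKIP) matters:

	mutex_lock(&oom_lock);	/* via mutex_trylock() in __alloc_pages_may_oom() */
	/* "last second" attempt against the high watermark */
	page = get_page_from_freelist(..., ALLOC_WMARK_HIGH | ALLOC_CPUSET, ...);
	if (!page)
		out_of_memory(&oc);	/* scans tasks; oom_evaluate_task()
					 * treats a victim with MMF_OOM_SKIP
					 * as already handled and goes on to
					 * select a NEW victim */
	mutex_unlock(&oom_lock);

If the reaper sets MMF_OOM_SKIP between the last second attempt and the
task scan, the attempt ran too early to see the reaped memory while the
scan already ignores the old victim, so an additional task gets killed.
Setting the bit under oom_lock confines the flip to before the whole
sequence (where the attempt can see the freed memory) or after it (where
the scan still aborts on the not-yet-skipped victim).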
Tetsuo Handa Sept. 6, 2018, 7:03 a.m. UTC | #33
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I assert that we should fix af5679fbc669f31f.
> > 
> > If you can come up with reasonable patch which doesn't complicate the
> > code and it is a clear win for both this particular workload as well as
> > others then why not.
> 
> Why can't we do "at least MMF_OOM_SKIP should be set under the lock to
> prevent races"?
> 

Well, serializing the setting of MMF_OOM_SKIP with oom_lock was not
sufficient. It seems that a bit more time is needed before the last
second allocation attempt at __alloc_pages_may_oom() can succeed. We
need to migrate to the "mm, oom: fix unnecessary killing of additional
processes" thread.



[  702.895936] a.out invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[  702.900717] a.out cpuset=/ mems_allowed=0
[  702.903210] CPU: 1 PID: 3359 Comm: a.out Tainted: G                T 4.19.0-rc2+ #692
[  702.906630] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  702.910771] Call Trace:
[  702.912681]  dump_stack+0x85/0xcb
[  702.914918]  dump_header+0x69/0x2fe
[  702.917387]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  702.921267]  oom_kill_process+0x307/0x390
[  702.923809]  out_of_memory+0x2f3/0x5d0
[  702.926208]  __alloc_pages_slowpath+0xc01/0x1030
[  702.928817]  __alloc_pages_nodemask+0x333/0x390
[  702.931270]  alloc_pages_vma+0x77/0x1f0
[  702.933618]  __handle_mm_fault+0x81c/0xf40
[  702.935962]  handle_mm_fault+0x1b7/0x3c0
[  702.938215]  __do_page_fault+0x2a6/0x580
[  702.940481]  do_page_fault+0x32/0x270
[  702.942753]  ? page_fault+0x8/0x30
[  702.944860]  page_fault+0x1e/0x30
[  702.947138] RIP: 0033:0x4008d8
[  702.949722] Code: Bad RIP value.
[  702.952000] RSP: 002b:00007ffc21fd99c0 EFLAGS: 00010206
[  702.954570] RAX: 00007fb3457cc010 RBX: 0000000100000000 RCX: 0000000000000000
[  702.957631] RDX: 00000000b0b24000 RSI: 0000000000020000 RDI: 0000000200000050
[  702.960599] RBP: 00007fb3457cc010 R08: 0000000200001000 R09: 0000000000021000
[  702.963531] R10: 0000000000000022 R11: 0000000000001000 R12: 0000000000000006
[  702.966518] R13: 00007ffc21fd9ab0 R14: 0000000000000000 R15: 0000000000000000
[  702.971186] Mem-Info:
[  702.976959] active_anon:788641 inactive_anon:3457 isolated_anon:0
[  702.976959]  active_file:0 inactive_file:77 isolated_file:0
[  702.976959]  unevictable:0 dirty:0 writeback:0 unstable:0
[  702.976959]  slab_reclaimable:8152 slab_unreclaimable:24616
[  702.976959]  mapped:2806 shmem:3704 pagetables:4355 bounce:0
[  702.976959]  free:20831 free_pcp:136 free_cma:0
[  703.007374] Node 0 active_anon:3154564kB inactive_anon:13828kB active_file:304kB inactive_file:112kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:11028kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2846720kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  703.029994] Node 0 DMA free:13816kB min:308kB low:384kB high:460kB active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  703.055331] lowmem_reserve[]: 0 2717 3378 3378
[  703.062881] Node 0 DMA32 free:58152kB min:54108kB low:67632kB high:81156kB active_anon:2718364kB inactive_anon:12kB active_file:424kB inactive_file:852kB unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:728kB local_pcp:0kB free_cma:0kB
[  703.091529] lowmem_reserve[]: 0 0 661 661
[  703.096552] Node 0 Normal free:12900kB min:13164kB low:16452kB high:19740kB active_anon:434248kB inactive_anon:13816kB active_file:0kB inactive_file:48kB unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB kernel_stack:6720kB pagetables:17028kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  703.123046] lowmem_reserve[]: 0 0 0 0
[  703.127490] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB
[  703.134763] Node 0 DMA32: 85*4kB (UM) 59*8kB (UM) 61*16kB (UM) 65*32kB (UME) 47*64kB (UE) 44*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 0*2048kB 0*4096kB = 57308kB
[  703.144076] Node 0 Normal: 119*4kB (UM) 69*8kB (UM) 156*16kB (UME) 179*32kB (UME) 44*64kB (UE) 9*128kB (UE) 1*256kB (E) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13476kB
[  703.151746] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  703.156537] 3744 total pagecache pages
[  703.159017] 0 pages in swap cache
[  703.163209] Swap cache stats: add 0, delete 0, find 0/0
[  703.167254] Free swap  = 0kB
[  703.170577] Total swap = 0kB
[  703.173758] 1048422 pages RAM
[  703.176843] 0 pages HighMem/MovableOnly
[  703.180380] 179652 pages reserved
[  703.183718] 0 pages cma reserved
[  703.187160] 0 pages hwpoisoned
[  703.190058] Out of memory: Kill process 3359 (a.out) score 834 or sacrifice child
[  703.194707] Killed process 3359 (a.out) total-vm:4267252kB, anon-rss:2894908kB, file-rss:0kB, shmem-rss:0kB
[  703.203016] in:imjournal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[  703.231814] in:imjournal cpuset=/ mems_allowed=0
[  703.231824] CPU: 2 PID: 1001 Comm: in:imjournal Tainted: G                T 4.19.0-rc2+ #692
[  703.231825] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[  703.231826] Call Trace:
[  703.231834]  dump_stack+0x85/0xcb
[  703.231839]  dump_header+0x69/0x2fe
[  703.231844]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  703.252369]  oom_kill_process+0x307/0x390
[  703.252375]  out_of_memory+0x2f3/0x5d0
[  703.252379]  __alloc_pages_slowpath+0xc01/0x1030
[  703.252388]  __alloc_pages_nodemask+0x333/0x390
[  703.262608]  filemap_fault+0x465/0x910
[  703.262663]  ? xfs_ilock+0xbf/0x2b0 [xfs]
[  703.262690]  ? __xfs_filemap_fault+0x7d/0x2c0 [xfs]
[  703.262695]  ? down_read_nested+0x66/0xa0
[  703.262718]  __xfs_filemap_fault+0x8e/0x2c0 [xfs]
[  703.262727]  __do_fault+0x11/0x133
[  703.262730]  __handle_mm_fault+0xa57/0xf40
[  703.262736]  handle_mm_fault+0x1b7/0x3c0
[  703.262742]  __do_page_fault+0x2a6/0x580
[  703.262748]  do_page_fault+0x32/0x270
[  703.262754]  ? page_fault+0x8/0x30
[  703.262756]  page_fault+0x1e/0x30
[  703.262759] RIP: 0033:0x7f5f078f6e28
[  703.262765] Code: Bad RIP value.
[  703.262766] RSP: 002b:00007f5f05007c50 EFLAGS: 00010246
[  703.262768] RAX: 0000000000000300 RBX: 0000000000000009 RCX: 00007f5f05007d80
[  703.262769] RDX: 00000000000003dd RSI: 00007f5f07b1ca1a RDI: 00005596745e9bb0
[  703.262770] RBP: 00007f5f05007d70 R08: 0000000000000006 R09: 0000000000000078
[  703.262771] R10: 00005596745e9810 R11: 00007f5f082bb4a0 R12: 00007f5f05007d90
[  703.262772] R13: 00007f5f00035fc0 R14: 00007f5f0003c850 R15: 00007f5f0000d760
[  703.263679] Mem-Info:
[  703.263692] active_anon:243554 inactive_anon:3457 isolated_anon:0
[  703.263692]  active_file:53 inactive_file:4386 isolated_file:3
[  703.263692]  unevictable:0 dirty:0 writeback:0 unstable:0
[  703.263692]  slab_reclaimable:8152 slab_unreclaimable:24512
[  703.263692]  mapped:3730 shmem:3704 pagetables:3123 bounce:0
[  703.263692]  free:562331 free_pcp:502 free_cma:0
[  703.263698] Node 0 active_anon:974216kB inactive_anon:13828kB active_file:212kB inactive_file:17544kB unevictable:0kB isolated(anon):0kB isolated(file):12kB mapped:14920kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 305152kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  703.263700] Node 0 DMA free:13816kB min:308kB low:384kB high:460kB active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  703.263705] lowmem_reserve[]: 0 2717 3378 3378
[  703.263712] Node 0 DMA32 free:2077084kB min:54108kB low:67632kB high:81156kB active_anon:696988kB inactive_anon:12kB active_file:252kB inactive_file:756kB unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:1328kB local_pcp:248kB free_cma:0kB
[  703.263716] lowmem_reserve[]: 0 0 661 661
[  703.263722] Node 0 Normal free:158424kB min:13164kB low:16452kB high:19740kB active_anon:275676kB inactive_anon:13816kB active_file:0kB inactive_file:16084kB unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB kernel_stack:6720kB pagetables:12100kB bounce:0kB free_pcp:680kB local_pcp:148kB free_cma:0kB
[  703.263726] lowmem_reserve[]: 0 0 0 0
[  703.263731] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB
[  703.263752] Node 0 DMA32: 122*4kB (UM) 83*8kB (UM) 56*16kB (UM) 70*32kB (UME) 47*64kB (UE) 46*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 213*2048kB (M) 387*4096kB (M) = 2079360kB
[  703.263776] Node 0 Normal: 653*4kB (UM) 1495*8kB (UM) 1604*16kB (UME) 1069*32kB (UME) 278*64kB (UME) 147*128kB (UME) 85*256kB (ME) 6*512kB (M) 0*1024kB 7*2048kB (M) 2*4096kB (M) = 158412kB
[  703.263801] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  703.263802] 8221 total pagecache pages
[  703.263804] 0 pages in swap cache
[  703.263806] Swap cache stats: add 0, delete 0, find 0/0
[  703.263807] Free swap  = 0kB
[  703.263808] Total swap = 0kB
[  703.263810] 1048422 pages RAM
[  703.263811] 0 pages HighMem/MovableOnly
[  703.263812] 179652 pages reserved
[  703.263813] 0 pages cma reserved
[  703.263814] 0 pages hwpoisoned
[  703.263817] Out of memory: Kill process 1089 (java) score 52 or sacrifice child
[  703.264145] Killed process 1089 (java) total-vm:5555688kB, anon-rss:181976kB, file-rss:0kB, shmem-rss:0kB
[  703.300281] oom_reaper: reaped process 3359 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4..0c2c0a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3922,6 +3922,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
 	struct zone *zone;
 	struct zoneref *z;
+	bool ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3986,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			goto out;
 		}
 	}
 
-	return false;
+out:
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+	return ret;
 }
 
 static inline bool
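
In short, the patch leaves the zonelist walk itself untouched but routes
every exit, the retry case as well as the give-up case, through a single
scheduling point. A condensed sketch of the resulting tail of
should_reclaim_retry() (an assumed simplification of the code above):

	out:
		/*
		 * A PF_WQ_WORKER thread has to really sleep here: WQ
		 * concurrency control only starts another worker on this CPU
		 * when the current one sleeps, and cond_resched() does not
		 * guarantee a sleep (it is effectively a no-op on
		 * CONFIG_PREEMPT=y, the configuration used in the reports
		 * above). Doing this on every exit is the point of the patch:
		 * the old code returned false without ever sleeping, so a
		 * worker looping between reclaim and the OOM path never let
		 * the workqueue make forward progress.
		 */
		if (current->flags & PF_WQ_WORKER)
			schedule_timeout_uninterruptible(1);
		else
			cond_resched();
		return ret;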