Message ID | ca3da8b8-1bb5-c302-b190-fa6cebab58ca@I-love.SAKURA.ne.jp (mailing list archive)
---|---
State | New, archived
Series | mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
On Thu 26-07-18 20:06:24, Tetsuo Handa wrote: > Before applying "an OOM lockup mitigation patch", I want to apply this > "another OOM lockup avoidance" patch. I do not really see why. All these are borderline interesting as the system is basically dead by the time you reach this state. > Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz > (which was captured with > > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc) > { > unsigned long freed = 0; > bool delay = false; /* if set, delay next allocation attempt */ > + static unsigned long last_warned; > + if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) { > + pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm, > + current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order); > + last_warned = jiffies; > + } > > oc->constraint = CONSTRAINT_NONE; > if (oom_killer_disabled) > > in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is > calling out_of_memory() rather than by error failing to give up direct reclaim). > > [ 258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > [ 268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > [ 278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > [ 288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > [ 298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 Hmm, so there is no other memory allocation to trigger the oom or they all just back off on the oom_lock trylock? In other words what is preventing from the oom killer invocation? [...] > Since the patch shown below was suggested by Michal Hocko at > https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko. > > >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Thu, 26 Jul 2018 14:40:03 +0900 > Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry(). > > Tetsuo Handa has reported that it is possible to bypass the short sleep > for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5 > ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make > any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO > only when there are too many dirty and writeback pages") and lock up the > system if OOM. > > This is because we are implicitly counting on falling back to > schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when > schedule_timeout_uninterruptible() in should_reclaim_retry() was not > called due to __zone_watermark_ok() == false. How do we rely on that? > However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is > not called if all allocating threads but a PF_WQ_WORKER thread got stuck at > __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER > thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER > thread is doing __GFP_FS allocation. I have really hard time to parse and understand this. > Tetsuo is observing that GFP_NOIO > allocation request from disk_events_workfn() is preventing other pending > works from starting. What about any other allocation from !PF_WQ_WORKER context? Why those do not jump in? > Since should_reclaim_retry() should be a natural reschedule point, > let's do the short sleep for PF_WQ_WORKER threads unconditionally > in order to guarantee that other pending works are started. OK, this is finally makes some sense. 
But it doesn't explain why it handles the live lock. > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Signed-off-by: Michal Hocko <mhocko@suse.com> Your s-o-b is missing again. I have already told you that previously when you were posting the patch. I do not mind this change per se but I am not happy about _your_ changelog. It doesn't explain the underlying problem IMHO. Having a natural and unconditional scheduling point in should_reclaim_retry is a reasonable thing. But how the hack it relates to the livelock you are seeing. So namely the changelog should explain 1) why nobody is able to make forward progress during direct reclaim 2) why nobody is able to trigger oom killer as the last resort > Cc: Roman Gushchin <guro@fb.com> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Tejun Heo <tj@kernel.org> > --- > mm/page_alloc.c | 34 ++++++++++++++++++---------------- > 1 file changed, 18 insertions(+), 16 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index a790ef4..0c2c0a2 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3922,6 +3922,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) > { > struct zone *zone; > struct zoneref *z; > + bool ret = false; > > /* > * Costly allocations might have made a progress but this doesn't mean > @@ -3985,25 +3986,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) > } > } > > - /* > - * Memory allocation/reclaim might be called from a WQ > - * context and the current implementation of the WQ > - * concurrency control doesn't recognize that > - * a particular WQ is congested if the worker thread is > - * looping without ever sleeping. Therefore we have to > - * do a short sleep here rather than calling > - * cond_resched(). > - */ > - if (current->flags & PF_WQ_WORKER) > - schedule_timeout_uninterruptible(1); > - else > - cond_resched(); > - > - return true; > + ret = true; > + goto out; > } > } > > - return false; > +out: > + /* > + * Memory allocation/reclaim might be called from a WQ > + * context and the current implementation of the WQ > + * concurrency control doesn't recognize that > + * a particular WQ is congested if the worker thread is > + * looping without ever sleeping. Therefore we have to > + * do a short sleep here rather than calling > + * cond_resched(). > + */ > + if (current->flags & PF_WQ_WORKER) > + schedule_timeout_uninterruptible(1); > + else > + cond_resched(); > + return ret; > } > > static inline bool > -- > 1.8.3.1 >
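For readers trying to answer the "How do we rely on that?" question above: the fallback being discussed is the back-off in __alloc_pages_may_oom(), where the only sleep on that path happens when the oom_lock trylock fails. A minimal sketch of that pattern (simplified from the mm/page_alloc.c of that period, not the exact code; the helper name is made up for illustration):

static struct page *may_oom_sketch(struct oom_control *oc)
{
        struct page *page = NULL;

        if (!mutex_trylock(&oom_lock)) {
                /* Someone else owns the OOM path: back off for one jiffy. */
                schedule_timeout_uninterruptible(1);
                return NULL;
        }
        /* Last-ditch watermark allocation elided; then invoke the OOM killer. */
        out_of_memory(oc);
        mutex_unlock(&oom_lock);
        return page;
}

A task that keeps winning the trylock therefore never sleeps here, which is exactly the situation the changelog is describing.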
On 2018/07/26 20:39, Michal Hocko wrote: > On Thu 26-07-18 20:06:24, Tetsuo Handa wrote: >> Before applying "an OOM lockup mitigation patch", I want to apply this >> "another OOM lockup avoidance" patch. > > I do not really see why. All these are borderline interesting as the > system is basically dead by the time you reach this state. I don't like your "borderline interesting". We still don't have a watchdog which tells something went wrong. Thus, "borderline interesting" has to be examined and fixed. > >> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz >> (which was captured with >> >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc) >> { >> unsigned long freed = 0; >> bool delay = false; /* if set, delay next allocation attempt */ >> + static unsigned long last_warned; >> + if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) { >> + pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm, >> + current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order); >> + last_warned = jiffies; >> + } >> >> oc->constraint = CONSTRAINT_NONE; >> if (oom_killer_disabled) >> >> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is >> calling out_of_memory() rather than by error failing to give up direct reclaim). >> >> [ 258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 >> [ 268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 >> [ 278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 >> [ 288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 >> [ 298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > > Hmm, so there is no other memory allocation to trigger the oom or they > all just back off on the oom_lock trylock? In other words what is > preventing from the oom killer invocation? All __GFP_FS allocations got stuck at direct reclaim or workqueue. [ 310.265293] systemd D12240 1 0 0x00000000 [ 310.268118] Call Trace: [ 310.269894] ? __schedule+0x245/0x7f0 [ 310.271901] ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs] [ 310.274187] schedule+0x23/0x80 [ 310.275965] schedule_preempt_disabled+0x5/0x10 [ 310.278107] __mutex_lock+0x3f5/0x9b0 [ 310.280070] ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs] [ 310.282451] xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs] [ 310.284730] ? lock_acquire+0x51/0x70 [ 310.286678] ? xfs_ail_push_all+0x13/0x60 [xfs] [ 310.288844] xfs_reclaim_inodes_nr+0x2c/0x40 [xfs] [ 310.291092] super_cache_scan+0x14b/0x1a0 [ 310.293131] do_shrink_slab+0xfc/0x190 [ 310.295082] shrink_slab+0x240/0x2c0 [ 310.297028] shrink_node+0xe3/0x460 [ 310.298899] do_try_to_free_pages+0xcb/0x380 [ 310.300945] try_to_free_pages+0xbb/0xf0 [ 310.302874] __alloc_pages_slowpath+0x3c1/0xc50 [ 310.304931] ? lock_acquire+0x51/0x70 [ 310.306753] ? set_page_refcounted+0x40/0x40 [ 310.308686] __alloc_pages_nodemask+0x2a6/0x2c0 [ 310.310663] filemap_fault+0x437/0x8e0 [ 310.312446] ? lock_acquire+0x51/0x70 [ 310.314150] ? xfs_ilock+0x86/0x190 [xfs] [ 310.315915] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs] [ 310.318089] __do_fault+0x13/0x126 [ 310.319696] __handle_mm_fault+0xc8d/0x11c0 [ 310.321493] handle_mm_fault+0x17a/0x390 [ 310.323156] __do_page_fault+0x24c/0x4d0 [ 310.324782] do_page_fault+0x2a/0x70 [ 310.326289] ? page_fault+0x8/0x30 [ 310.327745] page_fault+0x1e/0x30 [ 313.714537] systemd-journal D12600 498 1 0x00000000 [ 313.716538] Call Trace: [ 313.717683] ? 
__schedule+0x245/0x7f0 [ 313.719221] schedule+0x23/0x80 [ 313.720586] schedule_timeout+0x21f/0x400 [ 313.722163] wait_for_completion+0xb2/0x130 [ 313.723750] ? wake_up_q+0x70/0x70 [ 313.725134] flush_work+0x1db/0x2b0 [ 313.726535] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 [ 313.728336] ? page_alloc_cpu_dead+0x30/0x30 [ 313.729936] drain_all_pages+0x174/0x1e0 [ 313.731446] __alloc_pages_slowpath+0x428/0xc50 [ 313.733120] __alloc_pages_nodemask+0x2a6/0x2c0 [ 313.734826] filemap_fault+0x437/0x8e0 [ 313.736296] ? lock_acquire+0x51/0x70 [ 313.737769] ? xfs_ilock+0x86/0x190 [xfs] [ 313.739309] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs] [ 313.741291] __do_fault+0x13/0x126 [ 313.742667] __handle_mm_fault+0xc8d/0x11c0 [ 313.744245] handle_mm_fault+0x17a/0x390 [ 313.745755] __do_page_fault+0x24c/0x4d0 [ 313.747290] do_page_fault+0x2a/0x70 [ 313.748728] ? page_fault+0x8/0x30 [ 313.750148] page_fault+0x1e/0x30 [ 324.987240] workqueue events_freezable_power_: flags=0x84 [ 324.989292] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 324.991482] in-flight: 5:disk_events_workfn [ 324.993371] workqueue mm_percpu_wq: flags=0x8 [ 324.995167] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 [ 324.997363] pending: vmstat_update, drain_local_pages_wq BAR(498) [ 378.984241] INFO: task systemd-journal:498 blocked for more than 120 seconds. [ 378.986657] Not tainted 4.18.0-rc6-next-20180724+ #248 [ 378.988741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 378.991343] systemd-journal D12600 498 1 0x00000000 [ 378.993434] Call Trace: [ 378.994643] ? __schedule+0x245/0x7f0 [ 378.996158] schedule+0x23/0x80 [ 378.997528] schedule_timeout+0x21f/0x400 [ 378.999112] wait_for_completion+0xb2/0x130 [ 379.000728] ? wake_up_q+0x70/0x70 [ 379.002170] flush_work+0x1db/0x2b0 [ 379.003667] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 [ 379.005507] ? page_alloc_cpu_dead+0x30/0x30 [ 379.007176] drain_all_pages+0x174/0x1e0 [ 379.008836] __alloc_pages_slowpath+0x428/0xc50 [ 379.010625] __alloc_pages_nodemask+0x2a6/0x2c0 [ 379.012364] filemap_fault+0x437/0x8e0 [ 379.013881] ? lock_acquire+0x51/0x70 [ 379.015373] ? xfs_ilock+0x86/0x190 [xfs] [ 379.016950] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs] [ 379.019019] __do_fault+0x13/0x126 [ 379.020439] __handle_mm_fault+0xc8d/0x11c0 [ 379.022055] handle_mm_fault+0x17a/0x390 [ 379.023623] __do_page_fault+0x24c/0x4d0 [ 379.025167] do_page_fault+0x2a/0x70 [ 379.026621] ? page_fault+0x8/0x30 [ 379.028022] page_fault+0x1e/0x30 [ 432.759743] systemd-journal D12600 498 1 0x00000000 [ 432.761748] Call Trace: [ 432.762892] ? __schedule+0x245/0x7f0 [ 432.764352] schedule+0x23/0x80 [ 432.765673] schedule_timeout+0x21f/0x400 [ 432.767428] wait_for_completion+0xb2/0x130 [ 432.769207] ? wake_up_q+0x70/0x70 [ 432.770764] flush_work+0x1db/0x2b0 [ 432.772223] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 [ 432.774125] ? page_alloc_cpu_dead+0x30/0x30 [ 432.775873] drain_all_pages+0x174/0x1e0 [ 432.777528] __alloc_pages_slowpath+0x428/0xc50 [ 432.779306] __alloc_pages_nodemask+0x2a6/0x2c0 [ 432.780977] filemap_fault+0x437/0x8e0 [ 432.782503] ? lock_acquire+0x51/0x70 [ 432.784012] ? xfs_ilock+0x86/0x190 [xfs] [ 432.785556] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs] [ 432.787537] __do_fault+0x13/0x126 [ 432.788930] __handle_mm_fault+0xc8d/0x11c0 [ 432.790570] handle_mm_fault+0x17a/0x390 [ 432.792162] __do_page_fault+0x24c/0x4d0 [ 432.793773] do_page_fault+0x2a/0x70 [ 432.795196] ? 
page_fault+0x8/0x30 [ 432.796572] page_fault+0x1e/0x30 [ 444.234824] workqueue events_freezable_power_: flags=0x84 [ 444.236937] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.239138] in-flight: 5:disk_events_workfn [ 444.241022] workqueue mm_percpu_wq: flags=0x8 [ 444.242829] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 [ 444.245057] pending: vmstat_update, drain_local_pages_wq BAR(498) > > [...] > >> Since the patch shown below was suggested by Michal Hocko at >> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko. >> >> >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001 >> From: Michal Hocko <mhocko@suse.com> >> Date: Thu, 26 Jul 2018 14:40:03 +0900 >> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry(). >> >> Tetsuo Handa has reported that it is possible to bypass the short sleep >> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5 >> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make >> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO >> only when there are too many dirty and writeback pages") and lock up the >> system if OOM. >> >> This is because we are implicitly counting on falling back to >> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when >> schedule_timeout_uninterruptible() in should_reclaim_retry() was not >> called due to __zone_watermark_ok() == false. > > How do we rely on that? Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible() in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS allocation (PID=498) is waiting for cannot be started. > >> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is >> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at >> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER >> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER >> thread is doing __GFP_FS allocation. > > I have really hard time to parse and understand this. You can write as you like. > >> Tetsuo is observing that GFP_NOIO >> allocation request from disk_events_workfn() is preventing other pending >> works from starting. > > What about any other allocation from !PF_WQ_WORKER context? Why those do > not jump in? All __GFP_FS allocations got stuck at direct reclaim or workqueue. > >> Since should_reclaim_retry() should be a natural reschedule point, >> let's do the short sleep for PF_WQ_WORKER threads unconditionally >> in order to guarantee that other pending works are started. > > OK, this is finally makes some sense. But it doesn't explain why it > handles the live lock. As explained above, if disk_events_workfn() (PID=5) explicitly sleeps, vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will start, and unblock PID=498 which is waiting for drain_local_pages_wq and allow PID=498 to invoke the OOM killer. > >> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> >> Signed-off-by: Michal Hocko <mhocko@suse.com> > > Your s-o-b is missing again. I have already told you that previously > when you were posting the patch. I'm waiting for you to post this change with your wording. > > I do not mind this change per se but I am not happy about _your_ changelog. > It doesn't explain the underlying problem IMHO. Having a natural and > unconditional scheduling point in should_reclaim_retry is a reasonable > thing. 
> But how the hack it relates to the livelock you are seeing. So > namely the changelog should explain > 1) why nobody is able to make forward progress during direct reclaim Because GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM workqueues from starting, for it did not call schedule_timeout_*() because mutex_trylock(&oom_lock) did not fail because nobody else could call __alloc_pages_may_oom(). > 2) why nobody is able to trigger oom killer as the last resort Because only one !__GFP_FS allocating thread which did not get stuck at direct reclaim was able to call __alloc_pages_may_oom().
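The "remains no-op" behaviour referred to in the changelog and in the answers above comes from the early bail-out in out_of_memory() for IO-less allocations. A rough sketch of that check, paraphrased from the mm/oom_kill.c of that era and wrapped in a hypothetical helper for illustration:

static bool oom_skipped_for_ioless_reclaim(const struct oom_control *oc)
{
        /*
         * The OOM killer does not compensate for IO-less reclaim, so a
         * !__GFP_FS request returns without selecting a victim.
         */
        return oc->gfp_mask && !(oc->gfp_mask & __GFP_FS);
}

So the GFP_NOIO worker can keep winning oom_lock, find nothing to do in out_of_memory(), release the lock and loop, while the __GFP_FS tasks that could actually trigger a kill never reach __alloc_pages_may_oom().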
On Sat 28-07-18 00:47:40, Tetsuo Handa wrote: > On 2018/07/26 20:39, Michal Hocko wrote: > > On Thu 26-07-18 20:06:24, Tetsuo Handa wrote: > >> Before applying "an OOM lockup mitigation patch", I want to apply this > >> "another OOM lockup avoidance" patch. > > > > I do not really see why. All these are borderline interesting as the > > system is basically dead by the time you reach this state. > > I don't like your "borderline interesting". We still don't have a watchdog > which tells something went wrong. Thus, "borderline interesting" has to be > examined and fixed. No question about that. Bugs should be fixed. But this doesn't look like something we should panic about and rush a half baked or not fully understood fixes. > >> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz > >> (which was captured with > >> > >> --- a/mm/oom_kill.c > >> +++ b/mm/oom_kill.c > >> @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc) > >> { > >> unsigned long freed = 0; > >> bool delay = false; /* if set, delay next allocation attempt */ > >> + static unsigned long last_warned; > >> + if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) { > >> + pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm, > >> + current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order); > >> + last_warned = jiffies; > >> + } > >> > >> oc->constraint = CONSTRAINT_NONE; > >> if (oom_killer_disabled) > >> > >> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is > >> calling out_of_memory() rather than by error failing to give up direct reclaim). > >> > >> [ 258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > >> [ 268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > >> [ 278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > >> [ 288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > >> [ 298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0 > > > > Hmm, so there is no other memory allocation to trigger the oom or they > > all just back off on the oom_lock trylock? In other words what is > > preventing from the oom killer invocation? > > All __GFP_FS allocations got stuck at direct reclaim or workqueue. OK, I see. This is important information which was missing in the previous examination. [...] > >> Since the patch shown below was suggested by Michal Hocko at > >> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko. > >> > >> >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001 > >> From: Michal Hocko <mhocko@suse.com> > >> Date: Thu, 26 Jul 2018 14:40:03 +0900 > >> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry(). > >> > >> Tetsuo Handa has reported that it is possible to bypass the short sleep > >> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5 > >> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make > >> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO > >> only when there are too many dirty and writeback pages") and lock up the > >> system if OOM. > >> > >> This is because we are implicitly counting on falling back to > >> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when > >> schedule_timeout_uninterruptible() in should_reclaim_retry() was not > >> called due to __zone_watermark_ok() == false. > > > > How do we rely on that? 
> > Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible() > in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS > allocation (PID=498) is waiting for cannot be started. Now you are losing me again. What is going on about drain_local_pages_wq? I can see that all __GFP_FS allocations are waiting for a completion which depends on the kworker context to run but I fail to see why drain_local_pages_wq matters. The memory on the pcp lists is not accounted to NR_FREE_PAGES IIRC but that should be a relatively small amount of memory so this cannot be a fundamental problem here. > >> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is > >> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at > >> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER > >> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER > >> thread is doing __GFP_FS allocation. > > > > I have really hard time to parse and understand this. > > You can write as you like. I can as soon as I understand what is going on. > >> Tetsuo is observing that GFP_NOIO > >> allocation request from disk_events_workfn() is preventing other pending > >> works from starting. > > > > What about any other allocation from !PF_WQ_WORKER context? Why those do > > not jump in? > > All __GFP_FS allocations got stuck at direct reclaim or workqueue. > > > > >> Since should_reclaim_retry() should be a natural reschedule point, > >> let's do the short sleep for PF_WQ_WORKER threads unconditionally > >> in order to guarantee that other pending works are started. > > > > OK, this is finally makes some sense. But it doesn't explain why it > > handles the live lock. > > As explained above, if disk_events_workfn() (PID=5) explicitly sleeps, > vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will > start, and unblock PID=498 which is waiting for drain_local_pages_wq and > allow PID=498 to invoke the OOM killer. [ 313.714537] systemd-journal D12600 498 1 0x00000000 [ 313.716538] Call Trace: [ 313.717683] ? __schedule+0x245/0x7f0 [ 313.719221] schedule+0x23/0x80 [ 313.720586] schedule_timeout+0x21f/0x400 [ 313.722163] wait_for_completion+0xb2/0x130 [ 313.723750] ? wake_up_q+0x70/0x70 [ 313.725134] flush_work+0x1db/0x2b0 [ 313.726535] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 [ 313.728336] ? page_alloc_cpu_dead+0x30/0x30 [ 313.729936] drain_all_pages+0x174/0x1e0 [ 313.731446] __alloc_pages_slowpath+0x428/0xc50 [ 313.733120] __alloc_pages_nodemask+0x2a6/0x2c0 [ 313.734826] filemap_fault+0x437/0x8e0 [ 313.736296] ? lock_acquire+0x51/0x70 [ 313.737769] ? xfs_ilock+0x86/0x190 [xfs] [ 313.739309] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs] [ 313.741291] __do_fault+0x13/0x126 [ 313.742667] __handle_mm_fault+0xc8d/0x11c0 [ 313.744245] handle_mm_fault+0x17a/0x390 [ 313.745755] __do_page_fault+0x24c/0x4d0 [ 313.747290] do_page_fault+0x2a/0x70 [ 313.748728] ? page_fault+0x8/0x30 [ 313.750148] page_fault+0x1e/0x30 This one is waiting for draining and we are in mm_percpu_wq WQ context which has its rescuer so no other activity can block us for ever. So this certainly shouldn't deadlock. It can be dead slow but well, this is what you will get when your shoot your system to death. So there must be something more going on... > >> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > >> Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > Your s-o-b is missing again. 
I have already told you that previously > > when you were posting the patch. > > I'm waiting for you to post this change with your wording. > > > > > I do not mind this change per se but I am not happy about _your_ changelog. > > It doesn't explain the underlying problem IMHO. Having a natural and > > unconditional scheduling point in should_reclaim_retry is a reasonable > > thing. But how the hack it relates to the livelock you are seeing. So > > namely the changelog should explain > > 1) why nobody is able to make forward progress during direct reclaim > > Because GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM > workqueues from starting, for it did not call schedule_timeout_*() because > mutex_trylock(&oom_lock) did not fail because nobody else could call > __alloc_pages_may_oom(). Why hasn't the rescuer helped here? > > 2) why nobody is able to trigger oom killer as the last resort > > Because only one !__GFP_FS allocating thread which did not get stuck at > direct reclaim was able to call __alloc_pages_may_oom(). All the remaining allocations got stuck waiting for completion which depends on !WQ_MEM_RECLAIM workers right?
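Regarding the drain_local_pages_wq question above: the dependency visible in the systemd-journal (PID=498) backtrace is the queue-then-flush pattern in drain_all_pages(). A simplified sketch, abridged from the mm/page_alloc.c of that period (the cpumask filtering is omitted and the *_sketch names are hypothetical):

static DEFINE_PER_CPU(struct work_struct, pcpu_drain_sketch);

static void drain_all_pages_sketch(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                struct work_struct *work = &per_cpu(pcpu_drain_sketch, cpu);

                INIT_WORK(work, drain_local_pages_wq);
                queue_work_on(cpu, mm_percpu_wq, work);
        }
        for_each_online_cpu(cpu)
                flush_work(&per_cpu(pcpu_drain_sketch, cpu));   /* PID=498 waits here */
}

Whether that flush_work() can ever complete then depends on a worker of the per-CPU pool (or its rescuer) actually getting to run drain_local_pages_wq, which is the point debated below.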
On 2018/07/30 18:32, Michal Hocko wrote: >>>> Since the patch shown below was suggested by Michal Hocko at >>>> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko. >>>> >>>> >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001 >>>> From: Michal Hocko <mhocko@suse.com> >>>> Date: Thu, 26 Jul 2018 14:40:03 +0900 >>>> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry(). >>>> >>>> Tetsuo Handa has reported that it is possible to bypass the short sleep >>>> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5 >>>> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make >>>> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO >>>> only when there are too many dirty and writeback pages") and lock up the >>>> system if OOM. >>>> >>>> This is because we are implicitly counting on falling back to >>>> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when >>>> schedule_timeout_uninterruptible() in should_reclaim_retry() was not >>>> called due to __zone_watermark_ok() == false. >>> >>> How do we rely on that? >> >> Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible() >> in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS >> allocation (PID=498) is waiting for cannot be started. > > Now you are losing me again. What is going on about > drain_local_pages_wq? I can see that all __GFP_FS allocations are > waiting for a completion which depends on the kworker context to run but > I fail to see why drain_local_pages_wq matters. The memory on the pcp > lists is not accounted to NR_FREE_PAGES IIRC but that should be a > relatively small amount of memory so this cannot be a fundamental > problem here. If you look at "busy workqueues and worker pools", you will find that not only vmstat_update and drain_local_pages_wq (which is WQ_MEM_RECLAIM) but also other works such as xlog_cil_push_work and xfs_reclaim_worker (which is also WQ_MEM_RECLAIM) remain "pending:" state. 
[ 324.960731] Showing busy workqueues and worker pools: [ 324.962577] workqueue events: flags=0x0 [ 324.964137] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256 [ 324.966231] pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, mmdrop_async_fn, check_corruption, console_callback [ 324.973425] workqueue events_freezable: flags=0x4 [ 324.975247] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 324.977393] pending: vmballoon_work [vmw_balloon] [ 324.979310] workqueue events_power_efficient: flags=0x80 [ 324.981298] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256 [ 324.983543] pending: gc_worker [nf_conntrack], fb_flashcursor, neigh_periodic_work, neigh_periodic_work, check_lifetime [ 324.987240] workqueue events_freezable_power_: flags=0x84 [ 324.989292] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 324.991482] in-flight: 5:disk_events_workfn [ 324.993371] workqueue mm_percpu_wq: flags=0x8 [ 324.995167] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 [ 324.997363] pending: vmstat_update, drain_local_pages_wq BAR(498) [ 324.999977] workqueue ipv6_addrconf: flags=0x40008 [ 325.001899] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 [ 325.004092] pending: addrconf_verify_work [ 325.005911] workqueue mpt_poll_0: flags=0x8 [ 325.007686] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 325.009914] pending: mpt_fault_reset_work [mptbase] [ 325.012044] workqueue xfs-cil/sda1: flags=0xc [ 325.013897] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 325.016190] pending: xlog_cil_push_work [xfs] BAR(2344) [ 325.018354] workqueue xfs-reclaim/sda1: flags=0xc [ 325.020293] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 325.022549] pending: xfs_reclaim_worker [xfs] [ 325.024540] workqueue xfs-sync/sda1: flags=0x4 [ 325.026425] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 325.028691] pending: xfs_log_worker [xfs] [ 325.030546] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=189s workers=4 idle: 977 65 13 [ 444.206334] Showing busy workqueues and worker pools: [ 444.208472] workqueue events: flags=0x0 [ 444.210193] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=15/256 [ 444.212389] pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, mmdrop_async_fn, check_corruption, console_callback, sysrq_reinject_alt_sysrq, moom_callback [ 444.220547] workqueue events_freezable: flags=0x4 [ 444.222562] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.224852] pending: vmballoon_work [vmw_balloon] [ 444.227022] workqueue events_power_efficient: flags=0x80 [ 444.229103] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256 [ 444.231271] pending: gc_worker [nf_conntrack], fb_flashcursor, neigh_periodic_work, neigh_periodic_work, check_lifetime [ 444.234824] workqueue events_freezable_power_: flags=0x84 [ 444.236937] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.239138] in-flight: 5:disk_events_workfn [ 444.241022] workqueue mm_percpu_wq: flags=0x8 [ 444.242829] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 [ 444.245057] pending: vmstat_update, drain_local_pages_wq BAR(498) [ 444.247646] workqueue ipv6_addrconf: flags=0x40008 [ 444.249582] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 [ 444.251784] pending: addrconf_verify_work [ 444.253620] workqueue mpt_poll_0: flags=0x8 [ 444.255427] 
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.257666] pending: mpt_fault_reset_work [mptbase] [ 444.259800] workqueue xfs-cil/sda1: flags=0xc [ 444.261646] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.263903] pending: xlog_cil_push_work [xfs] BAR(2344) [ 444.266101] workqueue xfs-reclaim/sda1: flags=0xc [ 444.268104] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.270454] pending: xfs_reclaim_worker [xfs] [ 444.272425] workqueue xfs-eofblocks/sda1: flags=0xc [ 444.274432] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.276729] pending: xfs_eofblocks_worker [xfs] [ 444.278739] workqueue xfs-sync/sda1: flags=0x4 [ 444.280641] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 444.282967] pending: xfs_log_worker [xfs] [ 444.285195] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=309s workers=3 idle: 977 65 >>>> Tetsuo is observing that GFP_NOIO >>>> allocation request from disk_events_workfn() is preventing other pending >>>> works from starting. >>> >>> What about any other allocation from !PF_WQ_WORKER context? Why those do >>> not jump in? >> >> All __GFP_FS allocations got stuck at direct reclaim or workqueue. >> >>> >>>> Since should_reclaim_retry() should be a natural reschedule point, >>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally >>>> in order to guarantee that other pending works are started. >>> >>> OK, this is finally makes some sense. But it doesn't explain why it >>> handles the live lock. >> >> As explained above, if disk_events_workfn() (PID=5) explicitly sleeps, >> vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will >> start, and unblock PID=498 which is waiting for drain_local_pages_wq and >> allow PID=498 to invoke the OOM killer. > > [ 313.714537] systemd-journal D12600 498 1 0x00000000 > [ 313.716538] Call Trace: > [ 313.717683] ? __schedule+0x245/0x7f0 > [ 313.719221] schedule+0x23/0x80 > [ 313.720586] schedule_timeout+0x21f/0x400 > [ 313.722163] wait_for_completion+0xb2/0x130 > [ 313.723750] ? wake_up_q+0x70/0x70 > [ 313.725134] flush_work+0x1db/0x2b0 > [ 313.726535] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 > [ 313.728336] ? page_alloc_cpu_dead+0x30/0x30 > [ 313.729936] drain_all_pages+0x174/0x1e0 > [ 313.731446] __alloc_pages_slowpath+0x428/0xc50 > [ 313.733120] __alloc_pages_nodemask+0x2a6/0x2c0 > [ 313.734826] filemap_fault+0x437/0x8e0 > [ 313.736296] ? lock_acquire+0x51/0x70 > [ 313.737769] ? xfs_ilock+0x86/0x190 [xfs] > [ 313.739309] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs] > [ 313.741291] __do_fault+0x13/0x126 > [ 313.742667] __handle_mm_fault+0xc8d/0x11c0 > [ 313.744245] handle_mm_fault+0x17a/0x390 > [ 313.745755] __do_page_fault+0x24c/0x4d0 > [ 313.747290] do_page_fault+0x2a/0x70 > [ 313.748728] ? page_fault+0x8/0x30 > [ 313.750148] page_fault+0x1e/0x30 > > This one is waiting for draining and we are in mm_percpu_wq WQ context > which has its rescuer so no other activity can block us for ever. So > this certainly shouldn't deadlock. It can be dead slow but well, this is > what you will get when your shoot your system to death. We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun, is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM workqueues forever. Did you count how many threads were running from SysRq-t output? If you did, you won't say "when you shoot your system to death". :-( (Tester processes stuck at D state are omitted from below excerpt. You can see that the system is almost idle.) 
[ 310.265293] systemd D12240 1 0 0x00000000 [ 310.345487] kthreadd S14024 2 0 0x80000000 [ 310.355813] rcu_gp I15176 3 2 0x80000000 [ 310.369297] rcu_par_gp I15176 4 2 0x80000000 [ 310.383005] kworker/0:0 R running task 13504 5 2 0x80000000 [ 310.385328] Workqueue: events_freezable_power_ disk_events_workfn [ 310.510953] kworker/0:0H I14648 6 2 0x80000000 [ 310.513024] Workqueue: (null) (xfs-log/sda1) [ 310.527289] kworker/u256:0 I12816 7 2 0x80000000 [ 310.529282] Workqueue: (null) (events_unbound) [ 310.543584] mm_percpu_wq I15176 8 2 0x80000000 [ 310.557527] ksoftirqd/0 S14664 9 2 0x80000000 [ 310.573041] rcu_sched I14360 10 2 0x80000000 [ 310.586582] rcu_bh I15192 11 2 0x80000000 [ 310.603372] migration/0 S14664 12 2 0x80000000 [ 310.617362] kworker/0:1 I13312 13 2 0x80000000 [ 310.619297] Workqueue: (null) (cgroup_pidlist_destroy) [ 310.633055] cpuhp/0 S13920 14 2 0x80000000 [ 310.647134] cpuhp/1 P13992 15 2 0x80000000 [ 310.662731] migration/1 P14664 16 2 0x80000000 [ 310.678559] ksoftirqd/1 P14656 17 2 0x80000000 [ 310.694278] kworker/1:0 I13808 18 2 0x80000000 [ 310.696221] Workqueue: (null) (events) [ 310.709753] kworker/1:0H I14648 19 2 0x80000000 [ 310.711701] Workqueue: (null) (kblockd) [ 310.725166] cpuhp/2 P13992 20 2 0x80000000 [ 310.740708] migration/2 P14664 21 2 0x80000000 [ 310.756255] ksoftirqd/2 P14664 22 2 0x80000000 [ 310.771832] kworker/2:0 I14648 23 2 0x80000000 [ 310.773789] Workqueue: (null) (events) [ 310.787215] kworker/2:0H I14480 24 2 0x80000000 [ 310.789156] Workqueue: (null) (xfs-log/sda1) [ 310.802735] cpuhp/3 P14056 25 2 0x80000000 [ 310.818282] migration/3 P14664 26 2 0x80000000 [ 310.833806] ksoftirqd/3 P14600 27 2 0x80000000 [ 310.849330] kworker/3:0 I14648 28 2 0x80000000 [ 310.851269] Workqueue: (null) (events) [ 310.864855] kworker/3:0H I15192 29 2 0x80000000 [ 310.878479] cpuhp/4 P13696 30 2 0x80000000 [ 310.894006] migration/4 P14664 31 2 0x80000000 [ 310.909697] ksoftirqd/4 P14664 32 2 0x80000000 [ 310.925546] kworker/4:0 I14464 33 2 0x80000000 [ 310.927482] Workqueue: (null) (events) [ 310.940917] kworker/4:0H I14808 34 2 0x80000000 [ 310.942872] Workqueue: (null) (kblockd) [ 310.956699] cpuhp/5 P13960 35 2 0x80000000 [ 310.972274] migration/5 P14664 36 2 0x80000000 [ 310.987850] ksoftirqd/5 P14664 37 2 0x80000000 [ 311.003382] kworker/5:0 I14400 38 2 0x80000000 [ 311.005322] Workqueue: (null) (events) [ 311.018758] kworker/5:0H I15184 39 2 0x80000000 [ 311.032629] cpuhp/6 P14024 40 2 0x80000000 [ 311.048609] migration/6 P14664 41 2 0x80000000 [ 311.064377] ksoftirqd/6 P14600 42 2 0x80000000 [ 311.079858] kworker/6:0 I14616 43 2 0x80000000 [ 311.081800] Workqueue: (null) (cgroup_pidlist_destroy) [ 311.095610] kworker/6:0H I14616 44 2 0x80000000 [ 311.097549] Workqueue: (null) (kblockd) [ 311.111033] cpuhp/7 P13992 45 2 0x80000000 [ 311.127639] migration/7 P14664 46 2 0x80000000 [ 311.143797] ksoftirqd/7 P14856 47 2 0x80000000 [ 311.159383] kworker/7:0 I14576 48 2 0x80000000 [ 311.161324] Workqueue: (null) (events) [ 311.174750] kworker/7:0H I14808 49 2 0x80000000 [ 311.176690] Workqueue: (null) (xfs-log/sda1) [ 311.190305] kdevtmpfs S13552 50 2 0x80000000 [ 311.205199] netns I15152 51 2 0x80000000 [ 311.218718] kworker/1:1 I14808 52 2 0x80000000 [ 311.220630] Workqueue: (null) (ata_sff) [ 311.234003] kauditd S14640 53 2 0x80000000 [ 311.251581] khungtaskd S14368 54 2 0x80000000 [ 311.268573] oom_reaper S14464 55 2 0x80000000 [ 311.283432] writeback I15152 56 2 0x80000000 [ 311.296825] kcompactd0 S14856 57 2 0x80000000 [ 311.311451] ksmd S15032 
58 2 0x80000000 [ 311.331290] khugepaged D13832 59 2 0x80000000 [ 311.402468] crypto I15192 60 2 0x80000000 [ 311.421397] kintegrityd I15152 61 2 0x80000000 [ 311.435436] kblockd I15192 62 2 0x80000000 [ 311.449503] kworker/3:1 I14264 63 2 0x80000000 [ 311.451420] Workqueue: (null) (events) [ 311.464883] kworker/5:1 I13568 64 2 0x80000000 [ 311.466846] Workqueue: (null) (events) [ 311.480784] kworker/0:2 I13104 65 2 0x80000000 [ 311.482780] Workqueue: (null) (cgroup_pidlist_destroy) [ 311.497643] md I15152 66 2 0x80000000 [ 311.513651] edac-poller I15192 67 2 0x80000000 [ 311.528223] devfreq_wq I15152 68 2 0x80000000 [ 311.542243] watchdogd S15192 69 2 0x80000000 [ 311.557627] kworker/2:1 I13848 70 2 0x80000000 [ 311.559557] Workqueue: (null) (cgroup_pidlist_destroy) [ 311.573158] kswapd0 S11528 71 2 0x80000000 [ 311.588482] kworker/4:1 I13976 82 2 0x80000000 [ 311.590418] Workqueue: (null) (events) [ 311.603683] kthrotld I14712 88 2 0x80000000 [ 311.617936] kworker/6:1 I14280 89 2 0x80000000 [ 311.619855] Workqueue: (null) (events) [ 311.633386] kworker/7:1 I13976 90 2 0x80000000 [ 311.635309] Workqueue: (null) (events_power_efficient) [ 311.649001] acpi_thermal_pm I14856 91 2 0x80000000 [ 311.662606] kworker/u256:1 I13168 92 2 0x80000000 [ 311.664551] Workqueue: (null) (events_unbound) [ 311.678115] kmpath_rdacd I15176 93 2 0x80000000 [ 311.691756] kaluad I14712 94 2 0x80000000 [ 311.705839] nvme-wq I14712 95 2 0x80000000 [ 311.720293] nvme-reset-wq I14712 96 2 0x80000000 [ 311.734252] nvme-delete-wq I15176 97 2 0x80000000 [ 311.747801] kworker/u256:2 I14088 98 2 0x80000000 [ 311.749766] Workqueue: (null) (events_unbound) [ 311.763307] kworker/u256:3 I13560 99 2 0x80000000 [ 311.765273] Workqueue: (null) (events_unbound) [ 311.778791] ipv6_addrconf I15176 100 2 0x80000000 [ 311.792413] kworker/5:2 I14728 134 2 0x80000000 [ 311.794381] Workqueue: (null) (events) [ 311.809119] kworker/2:2 I13952 156 2 0x80000000 [ 311.811554] Workqueue: (null) (events) [ 311.829185] kworker/6:2 I14264 158 2 0x80000000 [ 311.831332] Workqueue: (null) (events) [ 311.844858] kworker/4:2 I14648 200 2 0x80000000 [ 311.846876] Workqueue: (null) (cgroup_pidlist_destroy) [ 311.860651] kworker/3:2 I14728 251 2 0x80000000 [ 311.862598] Workqueue: (null) (cgroup_pidlist_destroy) [ 311.876388] kworker/7:2 I13520 282 2 0x80000000 [ 311.878336] Workqueue: (null) (events) [ 311.891701] ata_sff I15152 284 2 0x80000000 [ 311.905381] scsi_eh_0 S14504 285 2 0x80000000 [ 311.922693] scsi_tmf_0 I14952 286 2 0x80000000 [ 311.936327] scsi_eh_1 S13560 287 2 0x80000000 [ 311.953512] scsi_tmf_1 I14952 288 2 0x80000000 [ 311.967137] mpt_poll_0 I15152 289 2 0x80000000 [ 311.980810] mpt/0 I14952 290 2 0x80000000 [ 311.994421] scsi_eh_2 S13320 291 2 0x80000000 [ 312.011558] scsi_tmf_2 I15152 292 2 0x80000000 [ 312.026613] scsi_eh_3 S13656 293 2 0x80000000 [ 312.046380] scsi_tmf_3 I15152 294 2 0x80000000 [ 312.063536] scsi_eh_4 S13656 295 2 0x80000000 [ 312.082705] scsi_tmf_4 I15176 296 2 0x80000000 [ 312.097973] scsi_eh_5 S13320 297 2 0x80000000 [ 312.117232] scsi_tmf_5 I15152 298 2 0x80000000 [ 312.132573] scsi_eh_6 S13320 299 2 0x80000000 [ 312.151843] scsi_tmf_6 I15152 300 2 0x80000000 [ 312.167109] ttm_swap I15176 301 2 0x80000000 [ 312.182415] scsi_eh_7 S13320 302 2 0x80000000 [ 312.201595] scsi_tmf_7 I15064 303 2 0x80000000 [ 312.216695] scsi_eh_8 S13320 304 2 0x80000000 [ 312.236470] scsi_tmf_8 I14856 305 2 0x80000000 [ 312.251403] scsi_eh_9 S13656 306 2 0x80000000 [ 312.270443] scsi_tmf_9 I15152 307 2 0x80000000 [ 312.284642] 
irq/16-vmwgfx S14784 308 2 0x80000000 [ 312.301877] scsi_eh_10 S13320 309 2 0x80000000 [ 312.319679] scsi_tmf_10 I15152 310 2 0x80000000 [ 312.333461] scsi_eh_11 S13320 311 2 0x80000000 [ 312.350681] scsi_tmf_11 I15176 312 2 0x80000000 [ 312.364444] scsi_eh_12 S13656 313 2 0x80000000 [ 312.382093] scsi_tmf_12 I15152 314 2 0x80000000 [ 312.395952] scsi_eh_13 S13656 315 2 0x80000000 [ 312.413378] scsi_tmf_13 I15032 316 2 0x80000000 [ 312.427103] scsi_eh_14 S13624 317 2 0x80000000 [ 312.444286] scsi_tmf_14 I15176 318 2 0x80000000 [ 312.457978] scsi_eh_15 S15192 319 2 0x80000000 [ 312.477516] scsi_tmf_15 I15176 320 2 0x80000000 [ 312.491641] scsi_eh_16 S13656 321 2 0x80000000 [ 312.509443] scsi_tmf_16 I15032 322 2 0x80000000 [ 312.523573] scsi_eh_17 S13320 323 2 0x80000000 [ 312.540983] scsi_tmf_17 I15152 324 2 0x80000000 [ 312.554847] scsi_eh_18 S13320 325 2 0x80000000 [ 312.572062] scsi_tmf_18 I15152 326 2 0x80000000 [ 312.585722] scsi_eh_19 S13320 327 2 0x80000000 [ 312.603012] scsi_tmf_19 I14952 328 2 0x80000000 [ 312.616683] scsi_eh_20 S13320 329 2 0x80000000 [ 312.633915] scsi_tmf_20 I14856 330 2 0x80000000 [ 312.647594] scsi_eh_21 S13624 331 2 0x80000000 [ 312.664791] scsi_tmf_21 I15152 332 2 0x80000000 [ 312.678495] scsi_eh_22 S13320 333 2 0x80000000 [ 312.695697] scsi_tmf_22 I15176 334 2 0x80000000 [ 312.709366] scsi_eh_23 S13656 335 2 0x80000000 [ 312.738358] scsi_tmf_23 I14984 336 2 0x80000000 [ 312.752608] scsi_eh_24 S13320 337 2 0x80000000 [ 312.770582] scsi_tmf_24 I15176 338 2 0x80000000 [ 312.784878] scsi_eh_25 S13624 339 2 0x80000000 [ 312.802234] scsi_tmf_25 I14984 340 2 0x80000000 [ 312.815906] scsi_eh_26 S13624 341 2 0x80000000 [ 312.833209] scsi_tmf_26 I14984 342 2 0x80000000 [ 312.847120] scsi_eh_27 S13656 343 2 0x80000000 [ 312.864431] scsi_tmf_27 I14984 344 2 0x80000000 [ 312.878111] scsi_eh_28 S13656 345 2 0x80000000 [ 312.895298] scsi_tmf_28 I15152 346 2 0x80000000 [ 312.908979] scsi_eh_29 S13656 348 2 0x80000000 [ 312.926273] scsi_tmf_29 I15152 349 2 0x80000000 [ 312.939917] scsi_eh_30 S13656 350 2 0x80000000 [ 312.957305] scsi_tmf_30 I15152 351 2 0x80000000 [ 312.970959] scsi_eh_31 S13008 352 2 0x80000000 [ 312.988336] scsi_tmf_31 I14856 353 2 0x80000000 [ 313.002016] scsi_eh_32 S13624 354 2 0x80000000 [ 313.019231] scsi_tmf_32 I15152 355 2 0x80000000 [ 313.033116] kworker/u256:4 I14616 356 2 0x80000000 [ 313.035122] Workqueue: (null) (events_unbound) [ 313.048958] kworker/u256:5 I14408 357 2 0x80000000 [ 313.050911] Workqueue: (null) (events_unbound) [ 313.064605] kworker/u256:6 I14616 358 2 0x80000000 [ 313.066588] Workqueue: (null) (events_unbound) [ 313.080184] kworker/u256:7 I14392 359 2 0x80000000 [ 313.082139] Workqueue: (null) (events_unbound) [ 313.095693] kworker/u256:8 I14616 360 2 0x80000000 [ 313.097625] Workqueue: (null) (events_unbound) [ 313.111206] kworker/u256:9 I14088 362 2 0x80000000 [ 313.113130] Workqueue: (null) (events_unbound) [ 313.126701] kworker/u256:10 I14440 364 2 0x80000000 [ 313.128630] Workqueue: (null) (events_unbound) [ 313.142204] kworker/u256:11 I14440 365 2 0x80000000 [ 313.144133] Workqueue: (null) (events_unbound) [ 313.157836] kworker/u256:12 I14440 366 2 0x80000000 [ 313.159834] Workqueue: (null) (events_unbound) [ 313.173426] kworker/u256:13 I14440 367 2 0x80000000 [ 313.175360] Workqueue: (null) (events_unbound) [ 313.188953] kworker/u256:14 I14440 368 2 0x80000000 [ 313.190883] Workqueue: (null) (events_unbound) [ 313.204529] kworker/u256:15 I14440 369 2 0x80000000 [ 313.206524] Workqueue: (null) (events_unbound) [ 
313.220106] kworker/u256:16 I14440 370 2 0x80000000 [ 313.222040] Workqueue: (null) (events_unbound) [ 313.235845] kworker/u256:17 I14440 371 2 0x80000000 [ 313.238007] Workqueue: (null) (events_unbound) [ 313.251879] kworker/u256:18 I14440 372 2 0x80000000 [ 313.253880] Workqueue: (null) (events_unbound) [ 313.268276] kworker/u256:19 I14440 373 2 0x80000000 [ 313.270312] Workqueue: (null) (events_unbound) [ 313.284299] kworker/u256:20 I14440 374 2 0x80000000 [ 313.286254] Workqueue: (null) (events_unbound) [ 313.300548] kworker/u256:21 I14440 375 2 0x80000000 [ 313.302563] Workqueue: (null) (events_unbound) [ 313.316925] kworker/u256:22 I14440 376 2 0x80000000 [ 313.318857] Workqueue: (null) (events_unbound) [ 313.333454] kworker/u256:23 I14440 377 2 0x80000000 [ 313.335460] Workqueue: (null) (events_unbound) [ 313.349943] kworker/u256:24 I14392 378 2 0x80000000 [ 313.351872] Workqueue: (null) (events_unbound) [ 313.366316] kworker/u256:25 I14392 379 2 0x80000000 [ 313.368369] Workqueue: (null) (events_unbound) [ 313.382706] kworker/u256:26 I14440 380 2 0x80000000 [ 313.384638] Workqueue: (null) (events_unbound) [ 313.398913] kworker/u256:27 I14440 381 2 0x80000000 [ 313.400995] Workqueue: (null) (events_unbound) [ 313.415395] kworker/u256:28 I14440 382 2 0x80000000 [ 313.417325] Workqueue: (null) (events_unbound) [ 313.431863] kworker/u256:29 I14440 383 2 0x80000000 [ 313.433880] Workqueue: (null) (events_unbound) [ 313.448121] kworker/u256:30 I12872 384 2 0x80000000 [ 313.450050] Workqueue: (null) (events_unbound) [ 313.464862] kworker/u256:31 I10848 385 2 0x80000000 [ 313.467082] Workqueue: (null) (flush-8:0) [ 313.481995] kworker/u256:32 I14440 386 2 0x80000000 [ 313.484013] Workqueue: (null) (events_unbound) [ 313.498207] kworker/u256:33 I15032 387 2 0x80000000 [ 313.500272] Workqueue: (null) (events_unbound) [ 313.514300] kworker/1:2 I14728 411 2 0x80000000 [ 313.516350] Workqueue: (null) (cgroup_pidlist_destroy) [ 313.530409] kworker/6:1H I15144 414 2 0x80000000 [ 313.532422] Workqueue: (null) (kblockd) [ 313.545957] xfsalloc I15176 419 2 0x80000000 [ 313.559714] xfs_mru_cache I15176 420 2 0x80000000 [ 313.573541] xfs-buf/sda1 I15176 421 2 0x80000000 [ 313.587184] xfs-data/sda1 I15176 422 2 0x80000000 [ 313.600830] xfs-conv/sda1 I15176 423 2 0x80000000 [ 313.614464] xfs-cil/sda1 I15176 424 2 0x80000000 [ 313.628114] xfs-reclaim/sda I15152 425 2 0x80000000 [ 313.641772] xfs-log/sda1 I15176 426 2 0x80000000 [ 313.655372] xfs-eofblocks/s I15176 427 2 0x80000000 [ 313.668954] xfsaild/sda1 S13560 428 2 0x80000000 [ 313.685728] kworker/7:1H I14648 429 2 0x80000000 [ 313.687693] Workqueue: (null) (kblockd) [ 313.701018] kworker/1:1H I14648 497 2 0x80000000 [ 313.714537] systemd-journal D12600 498 1 0x00000000 [ 313.767803] systemd-udevd S12648 529 1 0x00000000 [ 313.804733] auditd D12808 546 1 0x00000000 [ 313.874023] auditd S13584 547 1 0x00000000 [ 313.910947] kworker/4:1H I15184 552 2 0x80000080 [ 313.912888] Workqueue: (null) (kblockd) [ 313.927331] polkitd S12536 593 1 0x00000080 [ 313.981730] gmain S14032 604 1 0x00000080 [ 314.041923] gdbus S13760 605 1 0x00000080 [ 314.094085] JS GC Helper S14696 611 1 0x00000080 [ 314.136686] JS Sour~ Thread S14696 616 1 0x00000080 [ 314.180020] polkitd S14200 621 1 0x00000080 [ 314.238165] irqbalance D12904 594 1 0x80000080 [ 314.307023] avahi-daemon S12632 597 1 0x00000080 [ 314.357108] dbus-daemon S12632 598 1 0x00000080 [ 314.390871] avahi-daemon S12888 603 597 0x00000080 [ 314.439229] abrtd S12488 606 1 0x00000080 [ 314.503323] 
systemd-logind D12904 608 1 0x00000080 [ 314.572804] atd S12728 612 1 0x00000080 [ 314.608670] crond D12536 614 1 0x00000080 [ 314.677690] ksmtuned S13312 619 1 0x00000080 [ 314.714501] firewalld S12456 623 1 0x00000080 [ 314.767120] gmain S14248 756 1 0x00000080 [ 314.824569] NetworkManager S11816 662 1 0x00000080 [ 314.881313] gmain D12544 669 1 0x00000080 [ 314.949419] gdbus S13432 672 1 0x00000080 [ 315.000239] kworker/0:1H I14096 742 2 0x80000080 [ 315.002178] Workqueue: (null) (kblockd) [ 315.015633] dhclient S12328 759 662 0x00000080 [ 315.082950] kworker/2:1H I14480 768 2 0x80000080 [ 315.084945] Workqueue: (null) (xfs-log/sda1) [ 315.098501] tuned S12552 949 1 0x00000080 [ 315.165944] gmain S14248 995 1 0x00000080 [ 315.224133] tuned D12328 996 1 0x80000080 [ 315.292809] tuned S14408 997 1 0x00000080 [ 315.330274] tuned S14248 1011 1 0x00000080 [ 315.389366] smbd S11288 951 1 0x00000080 [ 315.427197] sshd S12552 952 1 0x00000080 [ 315.509092] rsyslogd S12696 954 1 0x00000080 [ 315.578028] in:imjournal D12048 965 1 0x80000080 [ 315.645481] rs:main Q:Reg S12792 966 1 0x00000080 [ 315.681641] sendmail D13176 967 1 0x00000080 [ 315.751526] kworker/0:3 I13376 977 2 0x80000080 [ 315.753526] Workqueue: (null) (ata_sff) [ 315.766860] agetty S12904 989 1 0x00000080 [ 315.810543] login S12488 990 1 0x00000080 [ 315.849297] sendmail S14256 991 1 0x00000080 [ 315.877525] smbd-notifyd D12792 1012 951 0x00000080 [ 315.947463] cleanupd D13144 1013 951 0x00000080 [ 316.016683] lpqd S12888 1016 951 0x00000080 [ 316.054304] bash S12656 1019 990 0x00000080 [ 316.092765] sleep D12808 1614 619 0x00000080 [ 429.965904] systemd D12240 1 0 0x00000000 [ 430.045240] kthreadd S14024 2 0 0x80000000 [ 430.055553] rcu_gp I15176 3 2 0x80000000 [ 430.069575] rcu_par_gp I15176 4 2 0x80000000 [ 430.083584] kworker/0:0 R running task 13504 5 2 0x80000000 [ 430.085990] Workqueue: events_freezable_power_ disk_events_workfn [ 430.210920] kworker/0:0H I14648 6 2 0x80000000 [ 430.212997] Workqueue: (null) (xfs-log/sda1) [ 430.227747] mm_percpu_wq I15176 8 2 0x80000000 [ 430.242131] ksoftirqd/0 S14664 9 2 0x80000000 [ 430.257645] rcu_sched I14360 10 2 0x80000000 [ 430.271300] rcu_bh I15192 11 2 0x80000000 [ 430.288804] migration/0 S14664 12 2 0x80000000 [ 430.302736] cpuhp/0 S13920 14 2 0x80000000 [ 430.316785] cpuhp/1 P13992 15 2 0x80000000 [ 430.332349] migration/1 P14664 16 2 0x80000000 [ 430.347965] ksoftirqd/1 P14656 17 2 0x80000000 [ 430.363616] kworker/1:0 I13808 18 2 0x80000000 [ 430.365571] Workqueue: (null) (events) [ 430.379215] kworker/1:0H I14648 19 2 0x80000000 [ 430.381183] Workqueue: (null) (kblockd) [ 430.394637] cpuhp/2 P13992 20 2 0x80000000 [ 430.410289] migration/2 P14664 21 2 0x80000000 [ 430.425871] ksoftirqd/2 P14664 22 2 0x80000000 [ 430.441401] kworker/2:0H I14480 24 2 0x80000000 [ 430.443354] Workqueue: (null) (xfs-log/sda1) [ 430.456923] cpuhp/3 P14056 25 2 0x80000000 [ 430.472692] migration/3 P14664 26 2 0x80000000 [ 430.488439] ksoftirqd/3 P14600 27 2 0x80000000 [ 430.504382] kworker/3:0H I15192 29 2 0x80000000 [ 430.518385] cpuhp/4 P13696 30 2 0x80000000 [ 430.534771] migration/4 P14664 31 2 0x80000000 [ 430.550345] ksoftirqd/4 P14664 32 2 0x80000000 [ 430.565951] kworker/4:0H I14808 34 2 0x80000000 [ 430.567894] Workqueue: (null) (kblockd) [ 430.581313] cpuhp/5 P13960 35 2 0x80000000 [ 430.596904] migration/5 P14664 36 2 0x80000000 [ 430.612501] ksoftirqd/5 P14664 37 2 0x80000000 [ 430.628142] kworker/5:0H I15184 39 2 0x80000000 [ 430.641797] cpuhp/6 P14024 40 2 0x80000000 [ 430.657356] 
migration/6 P14664 41 2 0x80000000 [ 430.672944] ksoftirqd/6 P14600 42 2 0x80000000 [ 430.688512] kworker/6:0H I14616 44 2 0x80000000 [ 430.690475] Workqueue: (null) (kblockd) [ 430.704778] cpuhp/7 P13992 45 2 0x80000000 [ 430.721362] migration/7 P14664 46 2 0x80000000 [ 430.737498] ksoftirqd/7 P14856 47 2 0x80000000 [ 430.753602] kworker/7:0H I14808 49 2 0x80000000 [ 430.755618] Workqueue: (null) (xfs-log/sda1) [ 430.769558] kdevtmpfs S13552 50 2 0x80000000 [ 430.784541] netns I15152 51 2 0x80000000 [ 430.798092] kauditd S14640 53 2 0x80000000 [ 430.816061] khungtaskd S14368 54 2 0x80000000 [ 430.832568] oom_reaper S14464 55 2 0x80000000 [ 430.847090] writeback I15152 56 2 0x80000000 [ 430.860445] kcompactd0 S14856 57 2 0x80000000 [ 430.875004] ksmd S15032 58 2 0x80000000 [ 430.894691] khugepaged D13832 59 2 0x80000000 [ 430.941897] crypto I15192 60 2 0x80000000 [ 430.955462] kintegrityd I15152 61 2 0x80000000 [ 430.970103] kblockd I15192 62 2 0x80000000 [ 430.985679] kworker/3:1 I14264 63 2 0x80000000 [ 430.987830] Workqueue: (null) (events) [ 431.006463] kworker/5:1 I13568 64 2 0x80000000 [ 431.008721] Workqueue: (null) (events) [ 431.022691] kworker/0:2 I13104 65 2 0x80000000 [ 431.024683] Workqueue: (null) (cgroup_pidlist_destroy) [ 431.039039] md I15152 66 2 0x80000000 [ 431.052855] edac-poller I15192 67 2 0x80000000 [ 431.066622] devfreq_wq I15152 68 2 0x80000000 [ 431.082179] watchdogd S15192 69 2 0x80000000 [ 431.099273] kworker/2:1 I13848 70 2 0x80000000 [ 431.101428] Workqueue: (null) (cgroup_pidlist_destroy) [ 431.115249] kswapd0 S11528 71 2 0x80000000 [ 431.130196] kworker/4:1 I13976 82 2 0x80000000 [ 431.132110] Workqueue: (null) (events) [ 431.145382] kthrotld I14712 88 2 0x80000000 [ 431.159018] kworker/6:1 I14280 89 2 0x80000000 [ 431.160998] Workqueue: (null) (events) [ 431.174300] kworker/7:1 I13976 90 2 0x80000000 [ 431.176275] Workqueue: (null) (events_power_efficient) [ 431.190169] acpi_thermal_pm I14856 91 2 0x80000000 [ 431.204078] kmpath_rdacd I15176 93 2 0x80000000 [ 431.218355] kaluad I14712 94 2 0x80000000 [ 431.232512] nvme-wq I14712 95 2 0x80000000 [ 431.246066] nvme-reset-wq I14712 96 2 0x80000000 [ 431.259655] nvme-delete-wq I15176 97 2 0x80000000 [ 431.273200] ipv6_addrconf I15176 100 2 0x80000000 [ 431.287563] kworker/5:2 I14728 134 2 0x80000000 [ 431.289707] Workqueue: (null) (events) [ 431.303527] kworker/2:2 I13952 156 2 0x80000000 [ 431.305447] Workqueue: (null) (events) [ 431.318965] kworker/6:2 I14264 158 2 0x80000000 [ 431.320882] Workqueue: (null) (events) [ 431.334236] kworker/4:2 I14648 200 2 0x80000000 [ 431.336153] Workqueue: (null) (cgroup_pidlist_destroy) [ 431.349824] kworker/3:2 I14728 251 2 0x80000000 [ 431.351745] Workqueue: (null) (cgroup_pidlist_destroy) [ 431.365438] kworker/7:2 I13520 282 2 0x80000000 [ 431.367360] Workqueue: (null) (events) [ 431.383306] ata_sff I15152 284 2 0x80000000 [ 431.397949] scsi_eh_0 S14504 285 2 0x80000000 [ 431.418707] scsi_tmf_0 I14952 286 2 0x80000000 [ 431.432417] scsi_eh_1 S13560 287 2 0x80000000 [ 431.449499] scsi_tmf_1 I14952 288 2 0x80000000 [ 431.463159] mpt_poll_0 I15152 289 2 0x80000000 [ 431.477194] mpt/0 I14952 290 2 0x80000000 [ 431.491369] scsi_eh_2 S13320 291 2 0x80000000 [ 431.509148] scsi_tmf_2 I15152 292 2 0x80000000 [ 431.523419] scsi_eh_3 S13656 293 2 0x80000000 [ 431.540836] scsi_tmf_3 I15152 294 2 0x80000000 [ 431.554526] scsi_eh_4 S13656 295 2 0x80000000 [ 431.573951] scsi_tmf_4 I15176 296 2 0x80000000 [ 431.590636] scsi_eh_5 S13320 297 2 0x80000000 [ 431.610073] scsi_tmf_5 I15152 
298 2 0x80000000 [ 431.625309] scsi_eh_6 S13320 299 2 0x80000000 [ 431.645255] scsi_tmf_6 I15152 300 2 0x80000000 [ 431.660850] ttm_swap I15176 301 2 0x80000000 [ 431.676590] scsi_eh_7 S13320 302 2 0x80000000 [ 431.696192] scsi_tmf_7 I15064 303 2 0x80000000 [ 431.711927] scsi_eh_8 S13320 304 2 0x80000000 [ 431.731901] scsi_tmf_8 I14856 305 2 0x80000000 [ 431.747401] scsi_eh_9 S13656 306 2 0x80000000 [ 431.767158] scsi_tmf_9 I15152 307 2 0x80000000 [ 431.782141] irq/16-vmwgfx S14784 308 2 0x80000000 [ 431.801432] scsi_eh_10 S13320 309 2 0x80000000 [ 431.819869] scsi_tmf_10 I15152 310 2 0x80000000 [ 431.834047] scsi_eh_11 S13320 311 2 0x80000000 [ 431.851641] scsi_tmf_11 I15176 312 2 0x80000000 [ 431.865316] scsi_eh_12 S13656 313 2 0x80000000 [ 431.882579] scsi_tmf_12 I15152 314 2 0x80000000 [ 431.896275] scsi_eh_13 S13656 315 2 0x80000000 [ 431.913545] scsi_tmf_13 I15032 316 2 0x80000000 [ 431.927339] scsi_eh_14 S13624 317 2 0x80000000 [ 431.944787] scsi_tmf_14 I15176 318 2 0x80000000 [ 431.958953] scsi_eh_15 S15192 319 2 0x80000000 [ 431.977842] scsi_tmf_15 I15176 320 2 0x80000000 [ 431.991927] scsi_eh_16 S13656 321 2 0x80000000 [ 432.009353] scsi_tmf_16 I15032 322 2 0x80000000 [ 432.023012] scsi_eh_17 S13320 323 2 0x80000000 [ 432.040219] scsi_tmf_17 I15152 324 2 0x80000000 [ 432.053920] scsi_eh_18 S13320 325 2 0x80000000 [ 432.071121] scsi_tmf_18 I15152 326 2 0x80000000 [ 432.084940] scsi_eh_19 S13320 327 2 0x80000000 [ 432.102554] scsi_tmf_19 I14952 328 2 0x80000000 [ 432.116239] scsi_eh_20 S13320 329 2 0x80000000 [ 432.133434] scsi_tmf_20 I14856 330 2 0x80000000 [ 432.147092] scsi_eh_21 S13624 331 2 0x80000000 [ 432.164292] scsi_tmf_21 I15152 332 2 0x80000000 [ 432.179198] scsi_eh_22 S13320 333 2 0x80000000 [ 432.197105] scsi_tmf_22 I15176 334 2 0x80000000 [ 432.211388] scsi_eh_23 S13656 335 2 0x80000000 [ 432.228948] scsi_tmf_23 I14984 336 2 0x80000000 [ 432.242745] scsi_eh_24 S13320 337 2 0x80000000 [ 432.260114] scsi_tmf_24 I15176 338 2 0x80000000 [ 432.273909] scsi_eh_25 S13624 339 2 0x80000000 [ 432.292257] scsi_tmf_25 I14984 340 2 0x80000000 [ 432.306667] scsi_eh_26 S13624 341 2 0x80000000 [ 432.324014] scsi_tmf_26 I14984 342 2 0x80000000 [ 432.337927] scsi_eh_27 S13656 343 2 0x80000000 [ 432.355175] scsi_tmf_27 I14984 344 2 0x80000000 [ 432.368970] scsi_eh_28 S13656 345 2 0x80000000 [ 432.386277] scsi_tmf_28 I15152 346 2 0x80000000 [ 432.400001] scsi_eh_29 S13656 348 2 0x80000000 [ 432.417628] scsi_tmf_29 I15152 349 2 0x80000000 [ 432.431300] scsi_eh_30 S13656 350 2 0x80000000 [ 432.448517] scsi_tmf_30 I15152 351 2 0x80000000 [ 432.462367] scsi_eh_31 S13008 352 2 0x80000000 [ 432.480180] scsi_tmf_31 I14856 353 2 0x80000000 [ 432.494334] scsi_eh_32 S13624 354 2 0x80000000 [ 432.512183] scsi_tmf_32 I15152 355 2 0x80000000 [ 432.526539] kworker/u256:30 I12872 384 2 0x80000000 [ 432.528477] Workqueue: (null) (events_unbound) [ 432.542276] kworker/u256:31 I10848 385 2 0x80000000 [ 432.544216] Workqueue: (null) (flush-8:0) [ 432.557869] kworker/1:2 I14728 411 2 0x80000000 [ 432.559806] Workqueue: (null) (cgroup_pidlist_destroy) [ 432.573637] kworker/6:1H I15144 414 2 0x80000000 [ 432.575578] Workqueue: (null) (kblockd) [ 432.589008] xfsalloc I15176 419 2 0x80000000 [ 432.602765] xfs_mru_cache I15176 420 2 0x80000000 [ 432.616433] xfs-buf/sda1 I15176 421 2 0x80000000 [ 432.630080] xfs-data/sda1 I15176 422 2 0x80000000 [ 432.643719] xfs-conv/sda1 I15176 423 2 0x80000000 [ 432.657334] xfs-cil/sda1 I15176 424 2 0x80000000 [ 432.670935] xfs-reclaim/sda I15152 425 2 0x80000000 [ 432.684559] 
xfs-log/sda1 I15176 426 2 0x80000000 [ 432.698196] xfs-eofblocks/s I15176 427 2 0x80000000 [ 432.712330] xfsaild/sda1 S13560 428 2 0x80000000 [ 432.729342] kworker/7:1H I14648 429 2 0x80000000 [ 432.731277] Workqueue: (null) (kblockd) [ 432.745674] kworker/1:1H I14648 497 2 0x80000000 [ 432.759743] systemd-journal D12600 498 1 0x00000000 [ 432.815707] systemd-udevd S12648 529 1 0x00000000 [ 432.854484] auditd D12808 546 1 0x00000000 [ 432.928413] auditd S13584 547 1 0x00000000 [ 432.967315] kworker/4:1H I15184 552 2 0x80000080 [ 432.969349] Workqueue: (null) (kblockd) [ 432.983507] polkitd S12536 593 1 0x00000080 [ 433.040141] gmain S14032 604 1 0x00000080 [ 433.103106] gdbus S13760 605 1 0x00000080 [ 433.154604] JS GC Helper S14696 611 1 0x00000080 [ 433.196161] JS Sour~ Thread S14696 616 1 0x00000080 [ 433.240104] polkitd S14200 621 1 0x00000080 [ 433.299587] irqbalance D12904 594 1 0x80000080 [ 433.367588] avahi-daemon S12632 597 1 0x00000080 [ 433.417623] dbus-daemon S12632 598 1 0x00000080 [ 433.451459] avahi-daemon S12888 603 597 0x00000080 [ 433.499393] abrtd S12488 606 1 0x00000080 [ 433.562968] systemd-logind D12904 608 1 0x00000080 [ 433.632571] atd S12728 612 1 0x00000080 [ 433.668235] crond D12536 614 1 0x00000080 [ 433.737767] ksmtuned S13312 619 1 0x00000080 [ 433.775017] firewalld S12456 623 1 0x00000080 [ 433.828076] gmain S14248 756 1 0x00000080 [ 433.885072] NetworkManager D11816 662 1 0x00000080 [ 433.954132] gmain D12544 669 1 0x00000080 [ 434.023109] gdbus S13432 672 1 0x00000080 [ 434.073500] kworker/0:1H I14096 742 2 0x80000080 [ 434.075444] Workqueue: (null) (kblockd) [ 434.088969] dhclient S12328 759 662 0x00000080 [ 434.157035] kworker/2:1H I14480 768 2 0x80000080 [ 434.159035] Workqueue: (null) (xfs-log/sda1) [ 434.172628] tuned S12552 949 1 0x00000080 [ 434.240578] gmain S14248 995 1 0x00000080 [ 434.300681] tuned D12328 996 1 0x80000080 [ 434.369863] tuned S14408 997 1 0x00000080 [ 434.407096] tuned S14248 1011 1 0x00000080 [ 434.464385] smbd S11288 951 1 0x00000080 [ 434.502579] sshd S12552 952 1 0x00000080 [ 434.574892] rsyslogd S12696 954 1 0x00000080 [ 434.644123] in:imjournal D12048 965 1 0x80000080 [ 434.711494] rs:main Q:Reg S12792 966 1 0x00000080 [ 434.746928] sendmail D13176 967 1 0x00000080 [ 434.816872] kworker/0:3 I13376 977 2 0x80000080 [ 434.818821] Workqueue: (null) (ata_sff) [ 434.832423] agetty S12904 989 1 0x00000080 [ 434.877203] login S12488 990 1 0x00000080 [ 434.917384] sendmail S14256 991 1 0x00000080 [ 434.947588] smbd-notifyd D12792 1012 951 0x00000080 [ 435.021389] cleanupd D13144 1013 951 0x00000080 [ 435.094050] lpqd S12888 1016 951 0x00000080 [ 435.133174] bash S12656 1019 990 0x00000080 [ 435.172224] sleep D12808 1614 619 0x00000080 >>> >>> I do not mind this change per se but I am not happy about _your_ changelog. >>> It doesn't explain the underlying problem IMHO. Having a natural and >>> unconditional scheduling point in should_reclaim_retry is a reasonable >>> thing. But how the hack it relates to the livelock you are seeing. So >>> namely the changelog should explain >>> 1) why nobody is able to make forward progress during direct reclaim >> >> Because GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM >> workqueues from starting, for it did not call schedule_timeout_*() because >> mutex_trylock(&oom_lock) did not fail because nobody else could call >> __alloc_pages_may_oom(). > > Why hasn't the rescuer helped here? Rescuer does not work without schedule_timeout_*(). 
> >>> 2) why nobody is able to trigger oom killer as the last resort >> >> Because only one !__GFP_FS allocating thread which did not get stuck at >> direct reclaim was able to call __alloc_pages_may_oom(). > > All the remaining allocations got stuck waiting for completion which > depends on !WQ_MEM_RECLAIM workers right? > A !WQ_MEM_RECLAIM worker which did not call schedule_timeout_*() is preventing anyone who is waiting for WQ_MEM_RECLAIM workers from making progress.
On Mon 30-07-18 23:34:23, Tetsuo Handa wrote: > On 2018/07/30 18:32, Michal Hocko wrote: [...] > > This one is waiting for draining and we are in mm_percpu_wq WQ context > > which has its rescuer so no other activity can block us for ever. So > > this certainly shouldn't deadlock. It can be dead slow but well, this is > > what you will get when your shoot your system to death. > > We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun, > is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM > workqueues forever. Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually guarantees. If you are right then the whole thing sounds quite fragile to me TBH. Anyway we would at least have an explanation for what you are seeing.
Hello, On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote: > On Mon 30-07-18 23:34:23, Tetsuo Handa wrote: > > On 2018/07/30 18:32, Michal Hocko wrote: > [...] > > > This one is waiting for draining and we are in mm_percpu_wq WQ context > > > which has its rescuer so no other activity can block us for ever. So > > > this certainly shouldn't deadlock. It can be dead slow but well, this is > > > what you will get when your shoot your system to death. > > > > We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun, > > is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM > > workqueues forever. > > Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually > guarantees. If you are right then the whole thing sounds quite fragile > to me TBH. Workqueue doesn't think the cpu is stalled as long as one of the per-cpu kworkers is running. The assumption is that kernel threads are not supposed to be busy-looping indefinitely (and they really shouldn't). We can add timeout mechanism to workqueue so that it kicks off other kworkers if one of them is in running state for too long, but idk, if there's an indefinite busy loop condition in kernel threads, we really should get rid of them and hung task watchdog is pretty effective at finding these cases (at least with preemption disabled). Thanks.
On 2018/07/30 23:54, Tejun Heo wrote: > Hello, > > On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote: >> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote: >>> On 2018/07/30 18:32, Michal Hocko wrote: >> [...] >>>> This one is waiting for draining and we are in mm_percpu_wq WQ context >>>> which has its rescuer so no other activity can block us for ever. So >>>> this certainly shouldn't deadlock. It can be dead slow but well, this is >>>> what you will get when your shoot your system to death. >>> >>> We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake up. (Tejun, >>> is my understanding correct?) Lack of schedule_timeout_*() does block WQ_MEM_RECLAIM >>> workqueues forever. >> >> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually >> guarantees. If you are right then the whole thing sounds quite fragile >> to me TBH. > > Workqueue doesn't think the cpu is stalled as long as one of the > per-cpu kworkers is running. The assumption is that kernel threads > are not supposed to be busy-looping indefinitely (and they really > shouldn't). WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e. when schedule_timeout_*() is called). Is this correct? > We can add timeout mechanism to workqueue so that it > kicks off other kworkers if one of them is in running state for too > long, but idk, if there's an indefinite busy loop condition in kernel > threads, we really should get rid of them and hung task watchdog is > pretty effective at finding these cases (at least with preemption > disabled). Currently the page allocator has a path which can loop forever with only cond_resched(). > > Thanks. >
Hello, On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote: > WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But > WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon > as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues > and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e. > when schedule_timeout_*() is called). > > Is this correct? WQ_MEM_RECLAIM guarantees that there's always gonna exist at least one kworker running the workqueue. But all per-cpu kworkers are subject to concurrency limiting execution - ie. if there are any per-cpu kworkers actively running on a cpu, no further kworkers will be scheduled. > > We can add timeout mechanism to workqueue so that it > > kicks off other kworkers if one of them is in running state for too > > long, but idk, if there's an indefinite busy loop condition in kernel > > threads, we really should get rid of them and hung task watchdog is > > pretty effective at finding these cases (at least with preemption > > disabled). > > Currently the page allocator has a path which can loop forever with > only cond_resched(). Yeah, workqueue can choke on things like that and kthread indefinitely busy looping doesn't do anybody any good. Thanks.
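To make the concurrency-limiting behaviour described above concrete, here is a minimal out-of-tree module sketch. It is only an illustration of the mechanism discussed in this thread, not code from it; the module name (wq_starve_demo), the function names (busy_loop_fn, victim_fn), the choice of CPU 0 and the 30 second spin are all made up for this example.

	/*
	 * wq_starve_demo.c - illustrative sketch only.
	 *
	 * busy_loop_fn() spins on CPU 0 with nothing but cond_resched(), so the
	 * per-cpu pool still counts one RUNNING worker and never wakes another
	 * one.  victim_fn(), although queued on a WQ_MEM_RECLAIM workqueue bound
	 * to the same CPU, therefore stays pending the whole time; the rescuer is
	 * never summoned because the pool never asks for a new worker.  Replacing
	 * cond_resched() with schedule_timeout_uninterruptible(1) lets victim_fn()
	 * run almost immediately.
	 */
	#include <linux/module.h>
	#include <linux/workqueue.h>
	#include <linux/jiffies.h>
	#include <linux/sched.h>

	static struct workqueue_struct *reclaim_wq;

	static void victim_fn(struct work_struct *work)
	{
		pr_info("wq_starve_demo: WQ_MEM_RECLAIM work item finally ran\n");
	}
	static DECLARE_WORK(victim_work, victim_fn);

	static void busy_loop_fn(struct work_struct *work)
	{
		unsigned long stop = jiffies + 30 * HZ;

		while (time_before(jiffies, stop))
			cond_resched();	/* never sleeps, pool never looks idle */
	}
	static DECLARE_WORK(busy_work, busy_loop_fn);

	static int __init demo_init(void)
	{
		reclaim_wq = alloc_workqueue("wq_starve_demo", WQ_MEM_RECLAIM, 0);
		if (!reclaim_wq)
			return -ENOMEM;

		queue_work_on(0, system_wq, &busy_work);	/* hog CPU 0's pool */
		queue_work_on(0, reclaim_wq, &victim_work);	/* waits behind it */
		return 0;
	}

	static void __exit demo_exit(void)
	{
		flush_work(&busy_work);
		flush_work(&victim_work);
		destroy_workqueue(reclaim_wq);
	}

	module_init(demo_init);
	module_exit(demo_exit);
	MODULE_LICENSE("GPL");

On an otherwise idle system the pr_info line should only appear once the busy item finishes, which mirrors the pattern in the lockup reports later in this thread (in-flight disk_events_workfn with drain_local_pages_wq stuck as pending on mm_percpu_wq).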
On Mon 30-07-18 08:44:24, Tejun Heo wrote: > Hello, > > On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote: > > WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But > > WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon > > as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues > > and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e. > > when schedule_timeout_*() is called). > > > > Is this correct? > > WQ_MEM_RECLAIM guarantees that there's always gonna exist at least one > kworker running the workqueue. But all per-cpu kworkers are subject > to concurrency limiting execution - ie. if there are any per-cpu > kworkers actively running on a cpu, no further kworkers will be scheduled. Well, in the ideal world we would _use_ that pre-allocated kworker if there are no others available because they are doing something that takes a long time to accomplish. The page allocator can spend a lot of time if we are struggling to death to get some memory. > > > We can add timeout mechanism to workqueue so that it > > > kicks off other kworkers if one of them is in running state for too > > > long, but idk, if there's an indefinite busy loop condition in kernel > > > threads, we really should get rid of them and hung task watchdog is > > > pretty effective at finding these cases (at least with preemption > > > disabled). > > > > Currently the page allocator has a path which can loop forever with > > only cond_resched(). > > Yeah, workqueue can choke on things like that and kthread indefinitely > busy looping doesn't do anybody any good. Yeah, I do agree. But this is much easier said than done ;) Sure we have that hack that does sleep rather than cond_resched in the page allocator. We can and will "fix" it to be unconditional in the should_reclaim_retry [1] but this whole thing is really subtle. It just takes one misbehaving worker and something which is really important to run will get stuck. That being said I will post the patch with updated changelog recording this. [1] http://lkml.kernel.org/r/ca3da8b8-1bb5-c302-b190-fa6cebab58ca@I-love.SAKURA.ne.jp
This change has been posted several times with some concerns about the changelog. Originally I thought it was more of a "nice to have" thing rather than a bug fix; later Tetsuo took it over, but the changelog was not really comprehensible so I reworded it. Let's see if this is better.

From 9bbea6516bb99615aff5ba5699865aa2d48333cc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 Jul 2018 14:40:03 +0900
Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and lock up the system if OOM.

The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to run
even when they have a rescuer available. Those workers might be essential
for reclaim to make forward progress, however. If we are too unlucky, all
the allocation requests can get stuck waiting for a WQ_MEM_RECLAIM work
item and the system is essentially stuck in an OOM condition without much
hope to move on. Tetsuo has seen the reclaim stuck on drain_local_pages_wq
or xlog_cil_push_work (xfs). There might be others.

Since should_reclaim_retry() should be a natural reschedule point, let's
do the short sleep for PF_WQ_WORKER threads unconditionally in order to
guarantee that other pending work items are started. This will workaround
this problem and it is less fragile than hunting down when the sleep is
missed. E.g. we used to have a sleeping point in the oom path but this has
been removed recently because it caused other issues. Having a single
sleeping point is more robust.

Reported-and-debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/page_alloc.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..f56cc0958d09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3922,6 +3922,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	bool ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3986,26 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			goto out;
 		}
 	}
 
-	return false;
+out:
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+	return ret;
 }
 
 static inline bool
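As a reading aid for the diff above, the resulting control flow can be condensed into the following sketch. This is not the actual mm/page_alloc.c function: should_reclaim_retry_sketch and watermarks_ok are made-up names, and the whole per-zone watermark scan is folded into that one flag. The point is that both the "retry" and the "give up" outcomes now funnel through the same exit, so a PF_WQ_WORKER reaches the short sleep whether or not the watermark check succeeded:

	/* Sketch only; not the real mm/page_alloc.c code. */
	static bool should_reclaim_retry_sketch(bool watermarks_ok)
	{
		bool ret = false;

		if (watermarks_ok) {	/* stands in for the per-zone checks */
			ret = true;
			goto out;
		}	/* the real function keeps iterating over zones here */
	out:
		/*
		 * Unconditional reschedule point: without it, a PF_WQ_WORKER that
		 * keeps failing the watermark check never sleeps, its pool never
		 * looks idle, and other pending work items are never started.
		 */
		if (current->flags & PF_WQ_WORKER)
			schedule_timeout_uninterruptible(1);
		else
			cond_resched();
		return ret;
	}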
Hello, Michal. On Mon, Jul 30, 2018 at 08:51:10PM +0200, Michal Hocko wrote: > > Yeah, workqueue can choke on things like that and kthread indefinitely > > busy looping doesn't do anybody any good. > > Yeah, I do agree. But this is much easier said than done ;) Sure > we have that hack that does sleep rather than cond_resched in the > page allocator. We can and will "fix" it to be unconditional in the > should_reclaim_retry [1] but this whole thing is really subtle. It just > takes one misbehaving worker and something which is really important to > run will get stuck. Oh yeah, I'm not saying the current behavior is ideal or anything, but since the behavior was put in many years ago, it has only become a problem a couple of times and all cases were rather easy and obvious fixes on the wq user side. It shouldn't be difficult to add a timer mechanism on top. We might be able to simply extend the hang detection mechanism to kick off all pending rescuers after detecting a wq stall. I'm wary about making it a part of normal operation (ie. silent timeout). per-cpu kworkers really shouldn't busy loop for an extended period of time. Thanks.
On 2018/07/31 4:10, Michal Hocko wrote: > Since should_reclaim_retry() should be a natural reschedule point, > let's do the short sleep for PF_WQ_WORKER threads unconditionally in > order to guarantee that other pending work items are started. This will > workaround this problem and it is less fragile than hunting down when > the sleep is missed. E.g. we used to have a sleeping point in the oom > path but this has been removed recently because it caused other issues. > Having a single sleeping point is more robust. linux.git has not removed the sleeping point in the OOM path yet. Since removing the sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it should be easy to rebase.)
On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: > On 2018/07/31 4:10, Michal Hocko wrote: > > Since should_reclaim_retry() should be a natural reschedule point, > > let's do the short sleep for PF_WQ_WORKER threads unconditionally in > > order to guarantee that other pending work items are started. This will > > workaround this problem and it is less fragile than hunting down when > > the sleep is missed. E.g. we used to have a sleeping point in the oom > > path but this has been removed recently because it caused other issues. > > Having a single sleeping point is more robust. > > linux.git has not removed the sleeping point in the OOM path yet. Since removing the > sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. is this an {Acked,Reviewed,Tested}-by? I will send the patch to Andrew if the patch is ok. > (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it > should be easy to rebase.) That is still a WIP so I would lose sleep over it.
On 2018/07/31 14:09, Michal Hocko wrote: > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: >> On 2018/07/31 4:10, Michal Hocko wrote: >>> Since should_reclaim_retry() should be a natural reschedule point, >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in >>> order to guarantee that other pending work items are started. This will >>> workaround this problem and it is less fragile than hunting down when >>> the sleep is missed. E.g. we used to have a sleeping point in the oom >>> path but this has been removed recently because it caused other issues. >>> Having a single sleeping point is more robust. >> >> linux.git has not removed the sleeping point in the OOM path yet. Since removing the >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. > > is this an {Acked,Reviewed,Tested}-by? I'm saying that "we used to have a sleeping point in the oom path but this has been removed recently" is not true. You need to send that patch to linux.git first if you want to refer that patch in this patch. > > I will send the patch to Andrew if the patch is ok. Andrew, can we send the "we used to have a sleeping point in the oom path but this has been removed recently" patch to linux.git ? > >> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it >> should be easy to rebase.) > > That is still a WIP so I would lose sleep over it. >
On Tue 31-07-18 19:47:45, Tetsuo Handa wrote: > On 2018/07/31 14:09, Michal Hocko wrote: > > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: > >> On 2018/07/31 4:10, Michal Hocko wrote: > >>> Since should_reclaim_retry() should be a natural reschedule point, > >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in > >>> order to guarantee that other pending work items are started. This will > >>> workaround this problem and it is less fragile than hunting down when > >>> the sleep is missed. E.g. we used to have a sleeping point in the oom > >>> path but this has been removed recently because it caused other issues. > >>> Having a single sleeping point is more robust. > >> > >> linux.git has not removed the sleeping point in the OOM path yet. Since removing the > >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. > > > > is this an {Acked,Reviewed,Tested}-by? > > I'm saying that "we used to have a sleeping point in the oom path but this has been > removed recently" is not true. You need to send that patch to linux.git first if you > want to refer that patch in this patch. That patch is already sitting in mmotm tree and this one will go on top. I do not really see any reason to rush it to Linus tree. A dubious CVE doesn't really raise the priority if you ask me. On the other hand, having a confirmation, either of the above tags would help to raise the credibility of the change. > > I will send the patch to Andrew if the patch is ok. > > Andrew, can we send the "we used to have a sleeping point in the oom path but this has > been removed recently" patch to linux.git ? This can really wait for the next merge window IMHO.
On 2018/07/31 20:15, Michal Hocko wrote: >>> I will send the patch to Andrew if the patch is ok. >> >> Andrew, can we send the "we used to have a sleeping point in the oom path but this has >> been removed recently" patch to linux.git ? > > This can really wait for the next merge window IMHO. > "mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that sleeping point. Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in the next merge window? I'm confused...
On Tue 31-07-18 20:30:08, Tetsuo Handa wrote: > On 2018/07/31 20:15, Michal Hocko wrote: > >>> I will send the patch to Andrew if the patch is ok. > >> > >> Andrew, can we send the "we used to have a sleeping point in the oom path but this has > >> been removed recently" patch to linux.git ? > > > > This can really wait for the next merge window IMHO. > > > > "mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that sleeping point. > Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in the next > merge window? I'm confused... This has nothing to do with cgroup-aware OOM killer.
On 2018/07/31 14:09, Michal Hocko wrote: > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: >> On 2018/07/31 4:10, Michal Hocko wrote: >>> Since should_reclaim_retry() should be a natural reschedule point, >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in >>> order to guarantee that other pending work items are started. This will >>> workaround this problem and it is less fragile than hunting down when >>> the sleep is missed. E.g. we used to have a sleeping point in the oom >>> path but this has been removed recently because it caused other issues. >>> Having a single sleeping point is more robust. >> >> linux.git has not removed the sleeping point in the OOM path yet. Since removing the >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. > > is this an {Acked,Reviewed,Tested}-by? > > I will send the patch to Andrew if the patch is ok. > >> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it >> should be easy to rebase.) > > That is still a WIP so I would lose sleep over it. > Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git , linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git .
On Fri 03-08-18 07:05:54, Tetsuo Handa wrote: > On 2018/07/31 14:09, Michal Hocko wrote: > > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: > >> On 2018/07/31 4:10, Michal Hocko wrote: > >>> Since should_reclaim_retry() should be a natural reschedule point, > >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in > >>> order to guarantee that other pending work items are started. This will > >>> workaround this problem and it is less fragile than hunting down when > >>> the sleep is missed. E.g. we used to have a sleeping point in the oom > >>> path but this has been removed recently because it caused other issues. > >>> Having a single sleeping point is more robust. > >> > >> linux.git has not removed the sleeping point in the OOM path yet. Since removing the > >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. > > > > is this an {Acked,Reviewed,Tested}-by? > > > > I will send the patch to Andrew if the patch is ok. > > > >> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it > >> should be easy to rebase.) > > > > That is still a WIP so I would lose sleep over it. > > > > Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git , > linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git . I still haven't heard any explicit confirmation that the patch works for your workload. Should I beg for it? Or you simply do not want to have your stamp on the patch? If yes, I can live with that but this playing hide and catch is not really a lot of fun.
On 2018/08/03 15:16, Michal Hocko wrote: > On Fri 03-08-18 07:05:54, Tetsuo Handa wrote: >> On 2018/07/31 14:09, Michal Hocko wrote: >>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: >>>> On 2018/07/31 4:10, Michal Hocko wrote: >>>>> Since should_reclaim_retry() should be a natural reschedule point, >>>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in >>>>> order to guarantee that other pending work items are started. This will >>>>> workaround this problem and it is less fragile than hunting down when >>>>> the sleep is missed. E.g. we used to have a sleeping point in the oom >>>>> path but this has been removed recently because it caused other issues. >>>>> Having a single sleeping point is more robust. >>>> >>>> linux.git has not removed the sleeping point in the OOM path yet. Since removing the >>>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. >>> >>> is this an {Acked,Reviewed,Tested}-by? >>> >>> I will send the patch to Andrew if the patch is ok. >>> >>>> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it >>>> should be easy to rebase.) >>> >>> That is still a WIP so I would lose sleep over it. >>> >> >> Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git , >> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git . > > I still haven't heard any explicit confirmation that the patch works for > your workload. Should I beg for it? Or you simply do not want to have > your stamp on the patch? If yes, I can live with that but this playing > hide and catch is not really a lot of fun. > I noticed that the patch has not been sent to linux-next.git yet. Please send to linux-next.git without my stamp on the patch. [ 44.863590] Out of memory: Kill process 1071 (a.out) score 865 or sacrifice child [ 44.867666] Killed process 1817 (a.out) total-vm:5244kB, anon-rss:1040kB, file-rss:0kB, shmem-rss:0kB [ 44.872176] oom_reaper: reaped process 1817 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 91.698761] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 48s! 
[ 91.702313] Showing busy workqueues and worker pools: [ 91.705011] workqueue events: flags=0x0 [ 91.707482] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256 [ 91.710524] pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, mmdrop_async_fn, check_corruption [ 91.717439] workqueue events_freezable: flags=0x4 [ 91.720161] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 91.723304] pending: vmballoon_work [vmw_balloon] [ 91.726167] workqueue events_power_efficient: flags=0x80 [ 91.729139] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 [ 91.732253] pending: fb_flashcursor, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work [ 91.736471] workqueue events_freezable_power_: flags=0x84 [ 91.739546] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 91.742696] in-flight: 2097:disk_events_workfn [ 91.745517] workqueue mm_percpu_wq: flags=0x8 [ 91.748069] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 [ 91.751182] pending: drain_local_pages_wq BAR(1830), vmstat_update [ 91.754661] workqueue mpt_poll_0: flags=0x8 [ 91.757161] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 91.759958] pending: mpt_fault_reset_work [mptbase] [ 91.762696] workqueue xfs-data/sda1: flags=0xc [ 91.765353] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256 [ 91.768248] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 91.771589] workqueue xfs-cil/sda1: flags=0xc [ 91.774009] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 91.776800] pending: xlog_cil_push_work [xfs] BAR(703) [ 91.779464] workqueue xfs-reclaim/sda1: flags=0xc [ 91.782017] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 91.784599] pending: xfs_reclaim_worker [xfs] [ 91.786930] workqueue xfs-sync/sda1: flags=0x4 [ 91.789289] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 91.792075] pending: xfs_log_worker [xfs] [ 91.794213] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=48s workers=4 idle: 52 13 5 [ 121.906640] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 78s! 
[ 121.909572] Showing busy workqueues and worker pools: [ 121.911703] workqueue events: flags=0x0 [ 121.913531] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256 [ 121.915873] pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, mmdrop_async_fn, check_corruption [ 121.921962] workqueue events_freezable: flags=0x4 [ 121.924336] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 121.926941] pending: vmballoon_work [vmw_balloon] [ 121.929226] workqueue events_power_efficient: flags=0x80 [ 121.931554] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256 [ 121.933983] pending: fb_flashcursor, gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work [ 121.937415] workqueue events_freezable_power_: flags=0x84 [ 121.939719] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 121.942166] in-flight: 2097:disk_events_workfn [ 121.944385] workqueue mm_percpu_wq: flags=0x8 [ 121.946468] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 [ 121.948975] pending: drain_local_pages_wq BAR(1830), vmstat_update [ 121.951808] workqueue mpt_poll_0: flags=0x8 [ 121.953864] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 121.956245] pending: mpt_fault_reset_work [mptbase] [ 121.958505] workqueue xfs-data/sda1: flags=0xc [ 121.960514] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256 [ 121.962896] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 121.965682] workqueue xfs-cil/sda1: flags=0xc [ 121.967639] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 121.970004] pending: xlog_cil_push_work [xfs] BAR(703) [ 121.972285] workqueue xfs-reclaim/sda1: flags=0xc [ 121.974339] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 121.976779] pending: xfs_reclaim_worker [xfs] [ 121.978926] workqueue xfs-sync/sda1: flags=0x4 [ 121.980997] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 121.983364] pending: xfs_log_worker [xfs] [ 121.985326] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=78s workers=4 idle: 52 13 5 [ 147.872620] sysrq: SysRq : Terminate All Tasks
On Wed 22-08-18 06:07:40, Tetsuo Handa wrote: > On 2018/08/03 15:16, Michal Hocko wrote: [...] > >> Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git , > >> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git . > > > > I still haven't heard any explicit confirmation that the patch works for > > your workload. Should I beg for it? Or you simply do not want to have > > your stamp on the patch? If yes, I can live with that but this playing > > hide and catch is not really a lot of fun. > > > > I noticed that the patch has not been sent to linux-next.git yet. > Please send to linux-next.git without my stamp on the patch. I plan to do so after merge window closes.
On Wed, 22 Aug 2018, Tetsuo Handa wrote: > On 2018/08/03 15:16, Michal Hocko wrote: > > On Fri 03-08-18 07:05:54, Tetsuo Handa wrote: > >> On 2018/07/31 14:09, Michal Hocko wrote: > >>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote: > >>>> On 2018/07/31 4:10, Michal Hocko wrote: > >>>>> Since should_reclaim_retry() should be a natural reschedule point, > >>>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in > >>>>> order to guarantee that other pending work items are started. This will > >>>>> workaround this problem and it is less fragile than hunting down when > >>>>> the sleep is missed. E.g. we used to have a sleeping point in the oom > >>>>> path but this has been removed recently because it caused other issues. > >>>>> Having a single sleeping point is more robust. > >>>> > >>>> linux.git has not removed the sleeping point in the OOM path yet. Since removing the > >>>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so immediately. > >>> > >>> is this an {Acked,Reviewed,Tested}-by? > >>> > >>> I will send the patch to Andrew if the patch is ok. > >>> > >>>> (And that change will conflict with Roman's cgroup aware OOM killer patchset. But it > >>>> should be easy to rebase.) > >>> > >>> That is still a WIP so I would lose sleep over it. > >>> > >> > >> Now that Roman's cgroup aware OOM killer patchset will be dropped from linux-next.git , > >> linux-next.git will get the sleeping point removed. Please send this patch to linux-next.git . > > > > I still haven't heard any explicit confirmation that the patch works for > > your workload. Should I beg for it? Or you simply do not want to have > > your stamp on the patch? If yes, I can live with that but this playing > > hide and catch is not really a lot of fun. > > > > I noticed that the patch has not been sent to linux-next.git yet. > Please send to linux-next.git without my stamp on the patch. > For those of us who are tracking CVE-2016-10723 which has peristently been labeled as "disputed" and with no clear indication of what patches address it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") and this patch are the intended mitigations? A list of SHA1s for merged fixed and links to proposed patches to address this issue would be appreciated.
On 2018/08/24 5:06, David Rientjes wrote: > For those of us who are tracking CVE-2016-10723 which has peristently been > labeled as "disputed" and with no clear indication of what patches address > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from > under oom_lock") and this patch are the intended mitigations? > > A list of SHA1s for merged fixed and links to proposed patches to address > this issue would be appreciated. > Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a mitigation for CVE-2016-10723. "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry()." is independent from CVE-2016-10723. We haven't made sure that the OOM reaper / exit_mmap() will get enough CPU resources. For example, under a cluster of concurrently allocating realtime scheduling priority threads, the OOM reaper takes about 1800 milliseconds whereas direct OOM reaping takes only a few milliseconds. Regards.
On Fri, 24 Aug 2018, Tetsuo Handa wrote: > > For those of us who are tracking CVE-2016-10723 which has peristently been > > labeled as "disputed" and with no clear indication of what patches address > > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from > > under oom_lock") and this patch are the intended mitigations? > > > > A list of SHA1s for merged fixed and links to proposed patches to address > > this issue would be appreciated. > > > > Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a > mitigation for CVE-2016-10723. > > "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at > should_reclaim_retry()." is independent from CVE-2016-10723. > Thanks, Tetsuo. Should commit af5679fbc669 ("mm, oom: remove oom_lock from oom_reaper") also be added to the list for CVE-2016-10723?
David Rientjes wrote: > On Fri, 24 Aug 2018, Tetsuo Handa wrote: > > > > For those of us who are tracking CVE-2016-10723 which has peristently been > > > labeled as "disputed" and with no clear indication of what patches address > > > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from > > > under oom_lock") and this patch are the intended mitigations? > > > > > > A list of SHA1s for merged fixed and links to proposed patches to address > > > this issue would be appreciated. > > > > > > > Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a > > mitigation for CVE-2016-10723. > > > > "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at > > should_reclaim_retry()." is independent from CVE-2016-10723. > > > > Thanks, Tetsuo. Should commit af5679fbc669 ("mm, oom: remove oom_lock > from oom_reaper") also be added to the list for CVE-2016-10723? >

Commit af5679fbc669f31f ("mm, oom: remove oom_lock from oom_reaper") negated one of the two rationales for commit 9bfe5ded054b8e28 ("mm, oom: remove sleep from under oom_lock"). If we didn't apply af5679fbc669f31f, we could make sure that CPU resource is given to the owner of oom_lock by replacing mutex_trylock() in __alloc_pages_may_oom() with mutex_lock(). But now that af5679fbc669f31f was already applied, we don't know how to give CPU resource to the OOM reaper / exit_mmap(). We might arrive at direct OOM reaping but we haven't reached there...

For now, I don't think we need to add af5679fbc669f31f to the list for CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim selection (especially with CONFIG_PREEMPT=y kernels) due to

  __alloc_pages_may_oom():                     oom_reap_task():

    mutex_trylock(&oom_lock) succeeds.
    get_page_from_freelist() fails.
    Preempted to other process.
                                                 oom_reap_task_mm() succeeds.
                                                 Sets MMF_OOM_SKIP.
    Returned from preemption.
    Finds that MMF_OOM_SKIP was already set.
    Selects next OOM victim and kills it.
    mutex_unlock(&oom_lock) is called.

race window like described as

  Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
  to prevent from races when the page allocator didn't manage to get the
  freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
  on and move on to another victim. Although this is possible in principle
  let's wait for it to actually happen in real life before we make the
  locking more complex again.

in that commit.
On 2018/08/24 9:31, Tetsuo Handa wrote: > For now, I don't think we need to add af5679fbc669f31f to the list for > CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim > selection (especially with CONFIG_PREEMPT=y kernels) due to > > __alloc_pages_may_oom(): oom_reap_task(): > > mutex_trylock(&oom_lock) succeeds. > get_page_from_freelist() fails. > Preempted to other process. > oom_reap_task_mm() succeeds. > Sets MMF_OOM_SKIP. > Returned from preemption. > Finds that MMF_OOM_SKIP was already set. > Selects next OOM victim and kills it. > mutex_unlock(&oom_lock) is called. > > race window like described as > > Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock > to prevent from races when the page allocator didn't manage to get the > freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later > on and move on to another victim. Although this is possible in principle > let's wait for it to actually happen in real life before we make the > locking more complex again. > > in that commit. > Yes, that race window is real. We can needlessly select next OOM victim. I think that af5679fbc669f31f was too optimistic. [ 278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice child [ 278.148927] Killed process 9943 (a.out) total-vm:4267252kB, anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB [ 278.151586] vmtoolsd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 [ 278.156642] vmtoolsd cpuset=/ mems_allowed=0 [ 278.158884] CPU: 2 PID: 8916 Comm: vmtoolsd Kdump: loaded Not tainted 4.19.0-rc2+ #465 [ 278.162252] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 278.165499] Call Trace: [ 278.166693] dump_stack+0x99/0xdc [ 278.167922] dump_header+0x70/0x2fa [ 278.169414] oom_reaper: reaped process 9943 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 278.169629] ? _raw_spin_unlock_irqrestore+0x6a/0x8c [ 278.169633] oom_kill_process+0x2ee/0x380 [ 278.169635] out_of_memory+0x136/0x540 [ 278.169636] ? out_of_memory+0x1fc/0x540 [ 278.169640] __alloc_pages_slowpath+0x986/0xce4 [ 278.169641] ? get_page_from_freelist+0x16b/0x1600 [ 278.169646] __alloc_pages_nodemask+0x398/0x3d0 [ 278.180594] alloc_pages_current+0x65/0xb0 [ 278.182173] __page_cache_alloc+0x154/0x190 [ 278.184200] ? pagecache_get_page+0x27/0x250 [ 278.185410] filemap_fault+0x4df/0x880 [ 278.186282] ? filemap_fault+0x31b/0x880 [ 278.187395] ? xfs_ilock+0x1bd/0x220 [ 278.188264] ? __xfs_filemap_fault+0x76/0x270 [ 278.189268] ? down_read_nested+0x48/0x80 [ 278.190229] ? xfs_ilock+0x1bd/0x220 [ 278.191061] __xfs_filemap_fault+0x89/0x270 [ 278.192059] xfs_filemap_fault+0x27/0x30 [ 278.192967] __do_fault+0x1f/0x70 [ 278.193777] __handle_mm_fault+0xfbd/0x1470 [ 278.194743] handle_mm_fault+0x1f2/0x400 [ 278.195679] ? handle_mm_fault+0x47/0x400 [ 278.196618] __do_page_fault+0x217/0x4b0 [ 278.197504] do_page_fault+0x3c/0x21e [ 278.198303] ? page_fault+0x8/0x30 [ 278.199092] page_fault+0x1e/0x30 [ 278.199821] RIP: 0033:0x7f2322ebbfb0 [ 278.200605] Code: Bad RIP value. 
[ 278.201370] RSP: 002b:00007ffda96e7648 EFLAGS: 00010246 [ 278.202518] RAX: 0000000000000000 RBX: 00007f23220f26b0 RCX: 0000000000000010 [ 278.204280] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f2321ecfb5b [ 278.205838] RBP: 0000000002504b70 R08: 00007f2321ecfb60 R09: 000000000250bd20 [ 278.207426] R10: 383836312d646c69 R11: 0000000000000000 R12: 00007ffda96e76b0 [ 278.208982] R13: 00007f2322ea8540 R14: 000000000250ba90 R15: 00007f2323173920 [ 278.210840] Mem-Info: [ 278.211462] active_anon:18629 inactive_anon:2390 isolated_anon:0 [ 278.211462] active_file:19 inactive_file:1565 isolated_file:0 [ 278.211462] unevictable:0 dirty:0 writeback:0 unstable:0 [ 278.211462] slab_reclaimable:5820 slab_unreclaimable:8964 [ 278.211462] mapped:2128 shmem:2493 pagetables:1826 bounce:0 [ 278.211462] free:878043 free_pcp:909 free_cma:0 [ 278.218830] Node 0 active_anon:74516kB inactive_anon:9560kB active_file:76kB inactive_file:6260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:8512kB dirty:0kB writeback:0kB shmem:9972kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 43008kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [ 278.224997] Node 0 DMA free:15888kB min:288kB low:360kB high:432kB active_anon:32kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 278.230602] lowmem_reserve[]: 0 2663 3610 3610 [ 278.231887] Node 0 DMA32 free:2746332kB min:49636kB low:62044kB high:74452kB active_anon:2536kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2749920kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:1500kB local_pcp:0kB free_cma:0kB [ 278.238291] lowmem_reserve[]: 0 0 947 947 [ 278.239270] Node 0 Normal free:749952kB min:17652kB low:22064kB high:26476kB active_anon:72816kB inactive_anon:9560kB active_file:264kB inactive_file:5556kB unevictable:0kB writepending:4kB present:1048576kB managed:969932kB mlocked:0kB kernel_stack:5328kB pagetables:7092kB bounce:0kB free_pcp:2132kB local_pcp:64kB free_cma:0kB [ 278.245895] lowmem_reserve[]: 0 0 0 0 [ 278.246820] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB [ 278.249659] Node 0 DMA32: 7*4kB (U) 8*8kB (UM) 8*16kB (U) 8*32kB (U) 8*64kB (U) 6*128kB (U) 7*256kB (UM) 7*512kB (UM) 3*1024kB (UM) 2*2048kB (M) 667*4096kB (UM) = 2746332kB [ 278.253054] Node 0 Normal: 4727*4kB (UME) 3423*8kB (UME) 1679*16kB (UME) 704*32kB (UME) 253*64kB (UME) 107*128kB (UME) 38*256kB (M) 16*512kB (M) 10*1024kB (M) 9*2048kB (M) 141*4096kB (M) = 749700kB [ 278.257158] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 278.259018] 4125 total pagecache pages [ 278.259896] 0 pages in swap cache [ 278.260745] Swap cache stats: add 0, delete 0, find 0/0 [ 278.261934] Free swap = 0kB [ 278.262750] Total swap = 0kB [ 278.263483] 1048445 pages RAM [ 278.264216] 0 pages HighMem/MovableOnly [ 278.265077] 114506 pages reserved [ 278.265971] Tasks state (memory values in pages): [ 278.267118] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ 278.269090] [ 2846] 0 2846 267 171 32768 0 0 none [ 278.271221] [ 4891] 0 4891 9772 856 110592 0 0 systemd-journal [ 278.273253] [ 6610] 0 6610 10811 261 114688 0 -1000 systemd-udevd [ 278.275197] [ 6709] 0 6709 13880 112 131072 0 -1000 auditd [ 
278.277067] [ 7388] 0 7388 3307 52 69632 0 0 rngd [ 278.278880] [ 7393] 0 7393 24917 403 184320 0 0 VGAuthService [ 278.280864] [ 7510] 70 7510 15043 102 155648 0 0 avahi-daemon [ 278.282898] [ 7555] 0 7555 5420 372 81920 0 0 irqbalance [ 278.284836] [ 7563] 0 7563 6597 83 102400 0 0 systemd-logind [ 278.286959] [ 7565] 81 7565 14553 157 167936 0 -900 dbus-daemon [ 278.288985] [ 8286] 70 8286 15010 98 151552 0 0 avahi-daemon [ 278.290958] [ 8731] 0 8731 74697 999 270336 0 0 vmtoolsd [ 278.293008] [ 8732] 999 8732 134787 1730 274432 0 0 polkitd [ 278.294906] [ 8733] 0 8733 55931 467 274432 0 0 abrtd [ 278.296774] [ 8734] 0 8734 55311 354 266240 0 0 abrt-watch-log [ 278.298839] [ 8774] 0 8774 31573 155 106496 0 0 crond [ 278.300810] [ 8790] 0 8790 89503 5482 421888 0 0 firewalld [ 278.302727] [ 8916] 0 8916 45262 211 204800 0 0 vmtoolsd [ 278.304841] [ 9230] 0 9230 26877 507 229376 0 0 dhclient [ 278.306733] [ 9333] 0 9333 87236 451 528384 0 0 nmbd [ 278.308554] [ 9334] 0 9334 28206 257 253952 0 -1000 sshd [ 278.310431] [ 9335] 0 9335 143457 3260 430080 0 0 tuned [ 278.312278] [ 9337] 0 9337 55682 2442 200704 0 0 rsyslogd [ 278.314188] [ 9497] 0 9497 24276 170 233472 0 0 login [ 278.316038] [ 9498] 0 9498 27525 33 73728 0 0 agetty [ 278.317918] [ 9539] 0 9539 104864 581 659456 0 0 smbd [ 278.319738] [ 9590] 0 9590 103799 555 610304 0 0 smbd-notifyd [ 278.321918] [ 9591] 0 9591 103797 555 602112 0 0 cleanupd [ 278.323935] [ 9592] 0 9592 104864 580 610304 0 0 lpqd [ 278.325835] [ 9596] 0 9596 28894 129 90112 0 0 bash [ 278.327663] [ 9639] 0 9639 28833 474 249856 0 0 sendmail [ 278.329550] [ 9773] 51 9773 26644 411 229376 0 0 sendmail [ 278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or sacrifice child [ 278.333267] Killed process 8790 (firewalld) total-vm:358012kB, anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB [ 278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
On Wed 05-09-18 22:20:58, Tetsuo Handa wrote: > On 2018/08/24 9:31, Tetsuo Handa wrote: > > For now, I don't think we need to add af5679fbc669f31f to the list for > > CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim > > selection (especially with CONFIG_PREEMPT=y kernels) due to > > > > __alloc_pages_may_oom(): oom_reap_task(): > > > > mutex_trylock(&oom_lock) succeeds. > > get_page_from_freelist() fails. > > Preempted to other process. > > oom_reap_task_mm() succeeds. > > Sets MMF_OOM_SKIP. > > Returned from preemption. > > Finds that MMF_OOM_SKIP was already set. > > Selects next OOM victim and kills it. > > mutex_unlock(&oom_lock) is called. > > > > race window like described as > > > > Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock > > to prevent from races when the page allocator didn't manage to get the > > freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later > > on and move on to another victim. Although this is possible in principle > > let's wait for it to actually happen in real life before we make the > > locking more complex again. > > > > in that commit. > > > > Yes, that race window is real. We can needlessly select next OOM victim. > I think that af5679fbc669f31f was too optimistic. Changelog said "Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again." So what is the real life workload that hits it? The log you have pasted below doesn't tell much. > [ 278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice child > [ 278.148927] Killed process 9943 (a.out) total-vm:4267252kB, anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB > [ 278.151586] vmtoolsd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 [...] > [ 278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or sacrifice child > [ 278.333267] Killed process 8790 (firewalld) total-vm:358012kB, anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB > [ 278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
On 2018/09/05 22:40, Michal Hocko wrote:
> Changelog said
>
> "Although this is possible in principle let's wait for it to actually
> happen in real life before we make the locking more complex again."
>
> So what is the real life workload that hits it? The log you have pasted
> below doesn't tell much.

Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel. The OOM reaper succeeded to reclaim all memory but the OOM killer still triggered just because somebody was inside that race window.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/mman.h>

static int memory_eater(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	static cpu_set_t set = { { 1 } };
	sched_setaffinity(0, sizeof(set), &set);
	srand(getpid());
	while (1) {
		int size = rand() % 1048576;
		void *ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, size);
	}
	return 0;
}

static int child(void *unused)
{
	static char *stack[256] = { };
	char buf[32] = { };
	int i;
	sleep(1);
	snprintf(buf, sizeof(buf), "tgid=%u", getpid());
	prctl(PR_SET_NAME, (unsigned long) buf, 0, 0, 0);
	for (i = 0; i < 256; i++)
		stack[i] = malloc(4096 * 2);
	for (i = 1; i < 128; i++)
		if (clone(memory_eater, stack[i] + 8192, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
			_exit(1);
	for (; i < 254; i++)
		if (clone(memory_eater, stack[i] + 8192, CLONE_VM, NULL) == -1)
			_exit(1);
	return 0;
}

int main(int argc, char *argv[])
{
	char *stack = malloc(1048576);
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	if (clone(child, stack + 1048576, CLONE_VM, NULL) == -1)
		_exit(1);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	while (1)
		pause();
	return 0;
}

>
>> [ 278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice child
>> [ 278.148927] Killed process 9943 (a.out) total-vm:4267252kB, anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
>> [ 278.151586] vmtoolsd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
> [...]
>> [ 278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or sacrifice child
>> [ 278.333267] Killed process 8790 (firewalld) total-vm:358012kB, anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
>> [ 278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>

Another example from a different machine.

[ 765.523676] a.out invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[ 765.528172] a.out cpuset=/ mems_allowed=0
[ 765.530603] CPU: 5 PID: 4540 Comm: a.out Tainted: G T 4.19.0-rc2+ #689
[ 765.534307] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 765.538975] Call Trace:
[ 765.540920] dump_stack+0x85/0xcb
[ 765.543088] dump_header+0x69/0x2fe
[ 765.545375] ?
_raw_spin_unlock_irqrestore+0x41/0x70 [ 765.547983] oom_kill_process+0x307/0x390 [ 765.550394] out_of_memory+0x2f5/0x5b0 [ 765.552818] __alloc_pages_slowpath+0xc01/0x1030 [ 765.555442] __alloc_pages_nodemask+0x333/0x390 [ 765.557966] alloc_pages_vma+0x77/0x1f0 [ 765.560292] __handle_mm_fault+0x81c/0xf40 [ 765.562736] handle_mm_fault+0x1b7/0x3c0 [ 765.565038] __do_page_fault+0x2a6/0x580 [ 765.567420] do_page_fault+0x32/0x270 [ 765.569670] ? page_fault+0x8/0x30 [ 765.571833] page_fault+0x1e/0x30 [ 765.573934] RIP: 0033:0x4008d8 [ 765.575924] Code: Bad RIP value. [ 765.577842] RSP: 002b:00007ffec6f7d420 EFLAGS: 00010206 [ 765.580221] RAX: 00007f6a201f4010 RBX: 0000000100000000 RCX: 0000000000000000 [ 765.583253] RDX: 00000000bde23000 RSI: 0000000000020000 RDI: 0000000200000050 [ 765.586207] RBP: 00007f6a201f4010 R08: 0000000200001000 R09: 0000000000021000 [ 765.589047] R10: 0000000000000022 R11: 0000000000001000 R12: 0000000000000006 [ 765.591996] R13: 00007ffec6f7d510 R14: 0000000000000000 R15: 0000000000000000 [ 765.595732] Mem-Info: [ 765.597580] active_anon:794622 inactive_anon:2126 isolated_anon:0 [ 765.597580] active_file:7 inactive_file:11 isolated_file:0 [ 765.597580] unevictable:0 dirty:0 writeback:0 unstable:0 [ 765.597580] slab_reclaimable:7612 slab_unreclaimable:22840 [ 765.597580] mapped:1432 shmem:2280 pagetables:3230 bounce:0 [ 765.597580] free:20847 free_pcp:0 free_cma:0 [ 765.611408] Node 0 active_anon:3178488kB inactive_anon:8504kB active_file:28kB inactive_file:44kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5728kB dirty:0kB writeback:0kB shmem:9120kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3004416kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [ 765.620811] Node 0 DMA free:13792kB min:308kB low:384kB high:460kB active_anon:2016kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 765.630835] lowmem_reserve[]: 0 2674 3378 3378 [ 765.633372] Node 0 DMA32 free:55732kB min:53260kB low:66572kB high:79884kB active_anon:2680196kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738556kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 765.643494] lowmem_reserve[]: 0 0 703 703 [ 765.645712] Node 0 Normal free:13912kB min:14012kB low:17512kB high:21012kB active_anon:495920kB inactive_anon:8504kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:6144kB pagetables:12924kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 765.656802] lowmem_reserve[]: 0 0 0 0 [ 765.658884] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13792kB [ 765.663518] Node 0 DMA32: 22*4kB (U) 26*8kB (UM) 50*16kB (U) 55*32kB (U) 46*64kB (U) 34*128kB (UM) 33*256kB (UM) 21*512kB (UM) 12*1024kB (UM) 3*2048kB (UM) 2*4096kB (M) = 55976kB [ 765.671041] Node 0 Normal: 74*4kB (UM) 22*8kB (UM) 22*16kB (UM) 147*32kB (UE) 31*64kB (UME) 8*128kB (UME) 3*256kB (UME) 1*512kB (M) 4*1024kB (UM) 0*2048kB 0*4096kB = 13912kB [ 765.677850] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 765.681352] 2295 total pagecache pages [ 765.683550] 0 pages in swap cache [ 765.685731] Swap cache stats: add 0, delete 0, find 0/0 [ 765.688430] Free swap = 0kB 
[ 765.690419] Total swap = 0kB [ 765.692375] 1048422 pages RAM [ 765.694388] 0 pages HighMem/MovableOnly [ 765.696637] 179653 pages reserved [ 765.698618] 0 pages cma reserved [ 765.700601] 0 pages hwpoisoned [ 765.702651] Out of memory: Kill process 4540 (a.out) score 897 or sacrifice child [ 765.706876] Killed process 4540 (a.out) total-vm:4267252kB, anon-rss:3111084kB, file-rss:0kB, shmem-rss:0kB [ 765.711735] in:imjournal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 [ 765.719957] in:imjournal cpuset=/ mems_allowed=0 [ 765.723302] CPU: 6 PID: 1012 Comm: in:imjournal Tainted: G T 4.19.0-rc2+ #689 [ 765.731014] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 765.747405] Call Trace: [ 765.749579] dump_stack+0x85/0xcb [ 765.752179] dump_header+0x69/0x2fe [ 765.754475] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 765.757307] oom_kill_process+0x307/0x390 [ 765.759873] out_of_memory+0x2f5/0x5b0 [ 765.762464] __alloc_pages_slowpath+0xc01/0x1030 [ 765.765176] __alloc_pages_nodemask+0x333/0x390 [ 765.767832] filemap_fault+0x465/0x910 [ 765.770399] ? xfs_ilock+0xbf/0x2b0 [xfs] [ 765.770514] oom_reaper: reaped process 4540 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 765.773047] ? __xfs_filemap_fault+0x7d/0x2c0 [xfs] [ 765.773054] ? down_read_nested+0x66/0xa0 [ 765.783474] __xfs_filemap_fault+0x8e/0x2c0 [xfs] [ 765.786436] __do_fault+0x11/0x133 [ 765.788649] __handle_mm_fault+0xa57/0xf40 [ 765.791022] handle_mm_fault+0x1b7/0x3c0 [ 765.793307] __do_page_fault+0x2a6/0x580 [ 765.795566] do_page_fault+0x32/0x270 [ 765.797708] ? page_fault+0x8/0x30 [ 765.799735] page_fault+0x1e/0x30 [ 765.801707] RIP: 0033:0x7f9a91e00fcf [ 765.803724] Code: Bad RIP value. [ 765.805569] RSP: 002b:00007f9a8ebe6c60 EFLAGS: 00010293 [ 765.808021] RAX: 0000000000000000 RBX: 0000562b41445bb0 RCX: 00007f9a91e00fcf [ 765.810969] RDX: 00007f9a8ebe6c80 RSI: 0000000000000001 RDI: 00007f9a8ebe6ca0 [ 765.813911] RBP: 00000000000dbba0 R08: 0000000000000008 R09: 0000000000000000 [ 765.816861] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f9a8ebe6d90 [ 765.819762] R13: 00007f9a80029af0 R14: 00007f9a8002fdf0 R15: 00007f9a8001ea60 [ 765.823619] Mem-Info: [ 765.825340] active_anon:16863 inactive_anon:2126 isolated_anon:0 [ 765.825340] active_file:35 inactive_file:3002 isolated_file:0 [ 765.825340] unevictable:0 dirty:0 writeback:0 unstable:0 [ 765.825340] slab_reclaimable:7612 slab_unreclaimable:22764 [ 765.825340] mapped:2597 shmem:2280 pagetables:1686 bounce:0 [ 765.825340] free:796988 free_pcp:423 free_cma:0 [ 765.839674] Node 0 active_anon:67452kB inactive_anon:8504kB active_file:140kB inactive_file:12008kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:10388kB dirty:0kB writeback:0kB shmem:9120kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 24576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no [ 765.848867] Node 0 DMA free:15840kB min:308kB low:384kB high:460kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 765.859189] lowmem_reserve[]: 0 2674 3378 3378 [ 765.861638] Node 0 DMA32 free:2735512kB min:53260kB low:66572kB high:79884kB active_anon:12kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738556kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:1276kB local_pcp:0kB free_cma:0kB [ 765.871163] lowmem_reserve[]: 0 0 703 703 [ 765.873345] Node 0 Normal free:436320kB min:14012kB low:17512kB high:21012kB active_anon:67460kB inactive_anon:8504kB active_file:140kB inactive_file:12112kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:6128kB pagetables:6844kB bounce:0kB free_pcp:416kB local_pcp:0kB free_cma:0kB [ 765.884108] lowmem_reserve[]: 0 0 0 0 [ 765.886080] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15844kB [ 765.891129] Node 0 DMA32: 28*4kB (UM) 33*8kB (UM) 58*16kB (UM) 60*32kB (UM) 48*64kB (UM) 34*128kB (UM) 34*256kB (UM) 21*512kB (UM) 12*1024kB (UM) 9*2048kB (UM) 653*4096kB (M) = 2735512kB [ 765.898614] Node 0 Normal: 3528*4kB (UM) 1178*8kB (UM) 1371*16kB (UM) 898*32kB (UME) 362*64kB (UME) 108*128kB (UME) 46*256kB (UME) 22*512kB (M) 7*1024kB (UM) 10*2048kB (M) 67*4096kB (M) = 436320kB [ 765.906220] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 765.909852] 5357 total pagecache pages [ 765.912036] 0 pages in swap cache [ 765.914092] Swap cache stats: add 0, delete 0, find 0/0 [ 765.916680] Free swap = 0kB [ 765.918560] Total swap = 0kB [ 765.920583] 1048422 pages RAM [ 765.922676] 0 pages HighMem/MovableOnly [ 765.924973] 179653 pages reserved [ 765.927004] 0 pages cma reserved [ 765.928919] 0 pages hwpoisoned [ 765.930821] Unreclaimable slab info: [ 765.932934] Name Used Total [ 765.936489] nf_conntrack 0KB 31KB [ 765.939582] xfs_buf 22KB 127KB [ 765.942322] xfs_ili 20KB 127KB [ 765.944942] xfs_efi_item 0KB 31KB [ 765.947427] xfs_efd_item 0KB 31KB [ 765.949957] xfs_buf_item 0KB 62KB [ 765.952490] xfs_trans 0KB 63KB [ 765.955268] xfs_ifork 27KB 158KB [ 765.957913] xfs_da_state 0KB 31KB [ 765.960483] xfs_btree_cur 0KB 63KB [ 765.962943] xfs_bmap_free_item 0KB 15KB [ 765.965415] xfs_log_ticket 0KB 63KB [ 765.967880] bio-2 24KB 63KB [ 765.970409] sd_ext_cdb 0KB 15KB [ 765.973778] scsi_sense_cache 7KB 128KB [ 765.976603] fib6_nodes 11KB 32KB [ 765.979072] ip6_dst_cache 4KB 31KB [ 765.981504] RAWv6 266KB 384KB [ 765.983897] UDPv6 0KB 31KB [ 765.986243] TCPv6 10KB 30KB [ 765.988850] sgpool-128 8KB 62KB [ 765.991320] sgpool-64 4KB 31KB [ 765.993554] sgpool-32 2KB 31KB [ 765.995733] sgpool-16 1KB 31KB [ 765.997852] sgpool-8 1KB 63KB [ 766.000043] mqueue_inode_cache 1KB 31KB [ 766.002313] bio-1 2KB 31KB [ 766.004549] fasync_cache 0KB 15KB [ 766.006757] posix_timers_cache 0KB 31KB [ 766.009787] UNIX 187KB 240KB [ 766.012175] tcp_bind_bucket 1KB 32KB [ 766.014163] ip_fib_trie 2KB 15KB [ 766.016080] ip_fib_alias 3KB 15KB [ 766.017988] ip_dst_cache 6KB 31KB [ 766.019808] RAW 464KB 585KB [ 766.021751] UDP 8KB 30KB [ 766.023730] request_sock_TCP 0KB 31KB [ 766.025684] TCP 12KB 31KB [ 766.027458] hugetlbfs_inode_cache 2KB 30KB [ 766.029314] eventpoll_pwq 37KB 63KB 
[ 766.031131] eventpoll_epi 53KB 94KB [ 766.032938] inotify_inode_mark 27KB 31KB [ 766.034774] request_queue 16KB 56KB [ 766.036598] blkdev_ioc 31KB 126KB [ 766.038508] bio-0 6KB 127KB [ 766.040390] biovec-max 405KB 506KB [ 766.042199] biovec-128 0KB 31KB [ 766.044857] biovec-64 0KB 94KB [ 766.047009] biovec-16 0KB 63KB [ 766.048856] bio_integrity_payload 1KB 31KB [ 766.050745] khugepaged_mm_slot 1KB 15KB [ 766.052602] uid_cache 2KB 31KB [ 766.054537] dmaengine-unmap-2 0KB 15KB [ 766.056523] audit_buffer 0KB 15KB [ 766.058506] skbuff_fclone_cache 0KB 31KB [ 766.060521] skbuff_head_cache 0KB 127KB [ 766.062391] configfs_dir_cache 1KB 15KB [ 766.064226] file_lock_cache 13KB 31KB [ 766.066061] file_lock_ctx 15KB 31KB [ 766.067883] fsnotify_mark_connector 18KB 31KB [ 766.069874] net_namespace 0KB 32KB [ 766.071801] shmem_inode_cache 1546KB 1595KB [ 766.073838] task_delay_info 177KB 318KB [ 766.075820] taskstats 2KB 63KB [ 766.077664] proc_dir_entry 498KB 540KB [ 766.080418] pde_opener 0KB 31KB [ 766.082620] seq_file 1KB 127KB [ 766.084526] sigqueue 0KB 63KB [ 766.086374] kernfs_node_cache 29487KB 29536KB [ 766.088403] mnt_cache 90KB 126KB [ 766.090442] filp 1195KB 1466KB [ 766.092412] names_cache 0KB 155KB [ 766.094412] key_jar 4KB 31KB [ 766.096428] nsproxy 1KB 15KB [ 766.098448] vm_area_struct 3312KB 3480KB [ 766.100378] mm_struct 68KB 128KB [ 766.102292] fs_cache 90KB 189KB [ 766.104179] files_cache 200KB 281KB [ 766.106192] signal_cache 690KB 892KB [ 766.108158] sighand_cache 943KB 1137KB [ 766.110116] task_struct 3590KB 3825KB [ 766.112005] cred_jar 249KB 378KB [ 766.113912] anon_vma_chain 1634KB 1742KB [ 766.116711] anon_vma 1214KB 1309KB [ 766.119046] pid 129KB 256KB [ 766.121108] Acpi-Operand 3713KB 3761KB [ 766.123251] Acpi-ParseExt 0KB 15KB [ 766.125261] Acpi-Parse 0KB 31KB [ 766.127259] Acpi-State 0KB 47KB [ 766.129256] Acpi-Namespace 2913KB 2936KB [ 766.131118] numa_policy 74KB 94KB [ 766.132954] trace_event_file 455KB 456KB [ 766.134815] ftrace_event_field 777KB 787KB [ 766.136762] pool_workqueue 72KB 94KB [ 766.138859] task_group 35KB 63KB [ 766.140890] page->ptl 725KB 828KB [ 766.142857] dma-kmalloc-512 0KB 31KB [ 766.144751] kmalloc-8192 441KB 499KB [ 766.146793] kmalloc-4096 965KB 1090KB [ 766.148619] kmalloc-2048 4570KB 4691KB [ 766.150479] kmalloc-1024 881KB 988KB [ 766.153265] kmalloc-512 3343KB 3430KB [ 766.155491] kmalloc-256 3368KB 3434KB [ 766.157383] kmalloc-192 276KB 447KB [ 766.159287] kmalloc-128 981KB 1078KB [ 766.161152] kmalloc-96 2288KB 2294KB [ 766.163011] kmalloc-64 1675KB 1921KB [ 766.164862] kmalloc-32 1395KB 1533KB [ 766.166635] kmalloc-16 3359KB 3795KB [ 766.168413] kmalloc-8 2344KB 2384KB [ 766.170249] kmem_cache_node 102KB 126KB [ 766.172232] kmem_cache 147KB 158KB [ 766.174204] Out of memory: Kill process 679 (firewalld) score 6 or sacrifice child [ 766.176757] Killed process 679 (firewalld) total-vm:357864kB, anon-rss:21920kB, file-rss:0kB, shmem-rss:0kB [ 766.183011] oom_reaper: reaped process 679 (firewalld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> On 2018/09/05 22:40, Michal Hocko wrote:
> > Changelog said
> >
> > "Although this is possible in principle let's wait for it to actually
> > happen in real life before we make the locking more complex again."
> >
> > So what is the real life workload that hits it? The log you have pasted
> > below doesn't tell much.
>
> Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.

I strongly suspect that your test doesn't really represent or simulate
any real and useful workload. Sure it triggers a rare race and we kill
another oom victim. Does this warrant making the code more complex?
Well, I am not convinced, as I've said countless times.
Michal Hocko wrote: > On Wed 05-09-18 22:53:33, Tetsuo Handa wrote: > > On 2018/09/05 22:40, Michal Hocko wrote: > > > Changelog said > > > > > > "Although this is possible in principle let's wait for it to actually > > > happen in real life before we make the locking more complex again." > > > > > > So what is the real life workload that hits it? The log you have pasted > > > below doesn't tell much. > > > > Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel. > > I strongly suspec that your test doesn't really represent or simulate > any real and useful workload. Sure it triggers a rare race and we kill > another oom victim. Does this warrant to make the code more complex? > Well, I am not convinced, as I've said countless times. Yes. Below is an example from a machine running Apache Web server/Tomcat AP server/PostgreSQL DB server. An memory eater needlessly killed Tomcat due to this race. I assert that we should fix af5679fbc669f31f. Before: systemd(1)-+-NetworkManager(693)-+-dhclient(791) | |-{NetworkManager}(698) | `-{NetworkManager}(702) |-abrtd(630) |-agetty(1007) |-atd(653) |-auditd(600)---{auditd}(601) |-avahi-daemon(625)---avahi-daemon(631) |-crond(657) |-dbus-daemon(638) |-firewalld(661)---{firewalld}(788) |-httpd(1169)-+-httpd(1170) | |-httpd(1171) | |-httpd(1172) | |-httpd(1173) | `-httpd(1174) |-irqbalance(628) |-java(1074)-+-{java}(1092) | |-{java}(1093) | |-{java}(1094) | |-{java}(1095) | |-{java}(1096) | |-{java}(1097) | |-{java}(1098) | |-{java}(1099) | |-{java}(1100) | |-{java}(1101) | |-{java}(1102) | |-{java}(1103) | |-{java}(1104) | |-{java}(1105) | |-{java}(1106) | |-{java}(1107) | |-{java}(1108) | |-{java}(1109) | |-{java}(1110) | |-{java}(1111) | |-{java}(1114) | |-{java}(1115) | |-{java}(1116) | |-{java}(1117) | |-{java}(1118) | |-{java}(1119) | |-{java}(1120) | |-{java}(1121) | |-{java}(1122) | |-{java}(1123) | |-{java}(1124) | |-{java}(1125) | |-{java}(1126) | |-{java}(1127) | |-{java}(1128) | |-{java}(1129) | |-{java}(1130) | |-{java}(1131) | |-{java}(1132) | |-{java}(1133) | |-{java}(1134) | |-{java}(1135) | |-{java}(1136) | |-{java}(1137) | `-{java}(1138) |-ksmtuned(659)---sleep(1727) |-login(1006)---bash(1052)---pstree(1728) |-polkitd(624)-+-{polkitd}(633) | |-{polkitd}(642) | |-{polkitd}(643) | |-{polkitd}(645) | `-{polkitd}(650) |-postgres(1154)-+-postgres(1155) | |-postgres(1157) | |-postgres(1158) | |-postgres(1159) | |-postgres(1160) | `-postgres(1161) |-rsyslogd(986)-+-{rsyslogd}(997) | `-{rsyslogd}(999) |-sendmail(1008) |-sendmail(1023) |-smbd(983)-+-cleanupd(1027) | |-lpqd(1032) | `-smbd-notifyd(1026) |-sshd(981) |-systemd-journal(529) |-systemd-logind(627) |-systemd-udevd(560) `-tuned(980)-+-{tuned}(1030) |-{tuned}(1031) |-{tuned}(1033) `-{tuned}(1047) After: systemd(1)-+-NetworkManager(693)-+-dhclient(791) | |-{NetworkManager}(698) | `-{NetworkManager}(702) |-abrtd(630) |-agetty(1007) |-atd(653) |-auditd(600)---{auditd}(601) |-avahi-daemon(625)---avahi-daemon(631) |-crond(657) |-dbus-daemon(638) |-firewalld(661)---{firewalld}(788) |-httpd(1169)-+-httpd(1170) | |-httpd(1171) | |-httpd(1172) | |-httpd(1173) | `-httpd(1174) |-irqbalance(628) |-ksmtuned(659)---sleep(1758) |-login(1006)---bash(1052)---pstree(1759) |-polkitd(624)-+-{polkitd}(633) | |-{polkitd}(642) | |-{polkitd}(643) | |-{polkitd}(645) | `-{polkitd}(650) |-postgres(1154)-+-postgres(1155) | |-postgres(1157) | |-postgres(1158) | |-postgres(1159) | |-postgres(1160) | `-postgres(1161) |-rsyslogd(986)-+-{rsyslogd}(997) | `-{rsyslogd}(999) |-sendmail(1008) 
|-sendmail(1023) |-smbd(983)-+-cleanupd(1027) | |-lpqd(1032) | `-smbd-notifyd(1026) |-sshd(981) |-systemd-journal(529) |-systemd-logind(627) |-systemd-udevd(560) `-tuned(980)-+-{tuned}(1030) |-{tuned}(1031) |-{tuned}(1033) `-{tuned}(1047) [ 222.165946] a.out invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 [ 222.170631] a.out cpuset=/ mems_allowed=0 [ 222.172956] CPU: 4 PID: 1748 Comm: a.out Tainted: G T 4.19.0-rc2+ #690 [ 222.176517] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 222.180892] Call Trace: [ 222.182947] dump_stack+0x85/0xcb [ 222.185240] dump_header+0x69/0x2fe [ 222.187579] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 222.190319] oom_kill_process+0x307/0x390 [ 222.192803] out_of_memory+0x2f3/0x5d0 [ 222.195190] __alloc_pages_slowpath+0xc01/0x1030 [ 222.197844] __alloc_pages_nodemask+0x333/0x390 [ 222.200452] alloc_pages_vma+0x77/0x1f0 [ 222.202869] __handle_mm_fault+0x81c/0xf40 [ 222.205334] handle_mm_fault+0x1b7/0x3c0 [ 222.207712] __do_page_fault+0x2a6/0x580 [ 222.210036] do_page_fault+0x32/0x270 [ 222.212266] ? page_fault+0x8/0x30 [ 222.214402] page_fault+0x1e/0x30 [ 222.216463] RIP: 0033:0x4008d8 [ 222.218429] Code: Bad RIP value. [ 222.220388] RSP: 002b:00007fff34061350 EFLAGS: 00010206 [ 222.222931] RAX: 00007efea3c2e010 RBX: 0000000100000000 RCX: 0000000000000000 [ 222.225976] RDX: 00000000b190f000 RSI: 0000000000020000 RDI: 0000000200000050 [ 222.228891] RBP: 00007efea3c2e010 R08: 0000000200001000 R09: 0000000000021000 [ 222.231779] R10: 0000000000000022 R11: 0000000000001000 R12: 0000000000000006 [ 222.234626] R13: 00007fff34061440 R14: 0000000000000000 R15: 0000000000000000 [ 222.238482] Mem-Info: [ 222.240511] active_anon:789816 inactive_anon:3457 isolated_anon:0 [ 222.240511] active_file:11 inactive_file:44 isolated_file:0 [ 222.240511] unevictable:0 dirty:6 writeback:0 unstable:0 [ 222.240511] slab_reclaimable:8052 slab_unreclaimable:24408 [ 222.240511] mapped:1898 shmem:3704 pagetables:4316 bounce:0 [ 222.240511] free:20841 free_pcp:0 free_cma:0 [ 222.254349] Node 0 active_anon:3159264kB inactive_anon:13828kB active_file:44kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7592kB dirty:24kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2793472kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
yes [ 222.264038] Node 0 DMA free:13812kB min:308kB low:384kB high:460kB active_anon:1876kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 222.274208] lowmem_reserve[]: 0 2674 3378 3378 [ 222.276831] Node 0 DMA32 free:56068kB min:53260kB low:66572kB high:79884kB active_anon:2673292kB inactive_anon:216kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738564kB mlocked:0kB kernel_stack:96kB pagetables:3024kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 222.287961] lowmem_reserve[]: 0 0 703 703 [ 222.291154] Node 0 Normal free:13672kB min:14012kB low:17512kB high:21012kB active_anon:483864kB inactive_anon:13604kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:7520kB pagetables:14272kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 222.302407] lowmem_reserve[]: 0 0 0 0 [ 222.304596] Node 0 DMA: 1*4kB (M) 0*8kB 1*16kB (U) 7*32kB (UM) 4*64kB (U) 4*128kB (UM) 2*256kB (U) 0*512kB 0*1024kB 0*2048kB 3*4096kB (ME) = 13812kB [ 222.311748] Node 0 DMA32: 37*4kB (U) 29*8kB (UM) 20*16kB (UM) 30*32kB (UME) 28*64kB (UME) 11*128kB (UME) 9*256kB (UME) 8*512kB (UM) 6*1024kB (UME) 1*2048kB (E) 9*4096kB (UM) = 56316kB [ 222.318932] Node 0 Normal: 151*4kB (UM) 2*8kB (UM) 97*16kB (UM) 195*32kB (UME) 53*64kB (UME) 11*128kB (UME) 2*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13724kB [ 222.325455] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 222.329059] 3707 total pagecache pages [ 222.331315] 0 pages in swap cache [ 222.333440] Swap cache stats: add 0, delete 0, find 0/0 [ 222.336097] Free swap = 0kB [ 222.338091] Total swap = 0kB [ 222.340084] 1048422 pages RAM [ 222.342165] 0 pages HighMem/MovableOnly [ 222.344460] 179651 pages reserved [ 222.347423] 0 pages cma reserved [ 222.349793] 0 pages hwpoisoned [ 222.351784] Out of memory: Kill process 1748 (a.out) score 838 or sacrifice child [ 222.355131] Killed process 1748 (a.out) total-vm:4267252kB, anon-rss:2909224kB, file-rss:0kB, shmem-rss:0kB [ 222.359644] java invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 [ 222.364180] java cpuset=/ mems_allowed=0 [ 222.366619] CPU: 0 PID: 1110 Comm: java Tainted: G T 4.19.0-rc2+ #690 [ 222.370088] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 222.377089] Call Trace: [ 222.380503] dump_stack+0x85/0xcb [ 222.380509] dump_header+0x69/0x2fe [ 222.380515] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 222.380518] oom_kill_process+0x307/0x390 [ 222.380553] out_of_memory+0x2f3/0x5d0 [ 222.396414] __alloc_pages_slowpath+0xc01/0x1030 [ 222.396423] __alloc_pages_nodemask+0x333/0x390 [ 222.396431] filemap_fault+0x465/0x910 [ 222.404090] ? xfs_ilock+0xbf/0x2b0 [xfs] [ 222.404118] ? __xfs_filemap_fault+0x7d/0x2c0 [xfs] [ 222.404124] ? down_read_nested+0x66/0xa0 [ 222.404148] __xfs_filemap_fault+0x8e/0x2c0 [xfs] [ 222.404156] __do_fault+0x11/0x133 [ 222.404158] __handle_mm_fault+0xa57/0xf40 [ 222.404165] handle_mm_fault+0x1b7/0x3c0 [ 222.404171] __do_page_fault+0x2a6/0x580 [ 222.404187] do_page_fault+0x32/0x270 [ 222.404194] ? page_fault+0x8/0x30 [ 222.404196] page_fault+0x1e/0x30 [ 222.404199] RIP: 0033:0x7fedb229ed42 [ 222.404205] Code: Bad RIP value. 
[ 222.404207] RSP: 002b:00007fed92ae9c90 EFLAGS: 00010202 [ 222.404209] RAX: ffffffffffffff92 RBX: 00007fedb187c470 RCX: 00007fedb229ed42 [ 222.404210] RDX: 0000000000000001 RSI: 0000000000000089 RDI: 00007fedac13c354 [ 222.404211] RBP: 00007fed92ae9d50 R08: 00007fedac13c328 R09: 00000000ffffffff [ 222.404212] R10: 00007fed92ae9cf0 R11: 0000000000000202 R12: 0000000000000001 [ 222.404213] R13: 00007fed92ae9cf0 R14: ffffffffffffff92 R15: 00007fedac13c300 [ 222.404783] Mem-Info: [ 222.404790] active_anon:429056 inactive_anon:3457 isolated_anon:0 [ 222.404790] active_file:0 inactive_file:833 isolated_file:0 [ 222.404790] unevictable:0 dirty:0 writeback:0 unstable:0 [ 222.404790] slab_reclaimable:8052 slab_unreclaimable:24344 [ 222.404790] mapped:2375 shmem:3704 pagetables:3030 bounce:0 [ 222.404790] free:381368 free_pcp:89 free_cma:0 [ 222.404793] Node 0 active_anon:1716224kB inactive_anon:13828kB active_file:0kB inactive_file:3332kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:9500kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 155648kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no [ 222.404794] Node 0 DMA free:13812kB min:308kB low:384kB high:460kB active_anon:1876kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 222.404798] lowmem_reserve[]: 0 2674 3378 3378 [ 222.404802] Node 0 DMA32 free:1362940kB min:53260kB low:66572kB high:79884kB active_anon:1366928kB inactive_anon:216kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2738564kB mlocked:0kB kernel_stack:96kB pagetables:3028kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 222.404831] lowmem_reserve[]: 0 0 703 703 [ 222.404837] Node 0 Normal free:149160kB min:14012kB low:17512kB high:21012kB active_anon:348116kB inactive_anon:13604kB active_file:0kB inactive_file:3080kB unevictable:0kB writepending:0kB present:1048576kB managed:720644kB mlocked:0kB kernel_stack:7504kB pagetables:9088kB bounce:0kB free_pcp:376kB local_pcp:12kB free_cma:0kB [ 222.404841] lowmem_reserve[]: 0 0 0 0 [ 222.404859] Node 0 DMA: 1*4kB (M) 0*8kB 1*16kB (U) 7*32kB (UM) 4*64kB (U) 4*128kB (UM) 2*256kB (U) 0*512kB 0*1024kB 0*2048kB 3*4096kB (ME) = 13812kB [ 222.405114] Node 0 DMA32: 37*4kB (U) 29*8kB (UM) 20*16kB (UM) 30*32kB (UME) 28*64kB (UME) 11*128kB (UME) 9*256kB (UME) 8*512kB (UM) 6*1024kB (UME) 10*2048kB (ME) 326*4096kB (UM) = 1373180kB [ 222.405423] Node 0 Normal: 512*4kB (U) 1075*8kB (UM) 1667*16kB (UM) 1226*32kB (UME) 497*64kB (UME) 209*128kB (UME) 50*256kB (UM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (M) = 152008kB [ 222.405797] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 222.405799] 4655 total pagecache pages [ 222.405801] 0 pages in swap cache [ 222.405802] Swap cache stats: add 0, delete 0, find 0/0 [ 222.405803] Free swap = 0kB [ 222.405803] Total swap = 0kB [ 222.405835] 1048422 pages RAM [ 222.405837] 0 pages HighMem/MovableOnly [ 222.405838] 179651 pages reserved [ 222.405839] 0 pages cma reserved [ 222.405840] 0 pages hwpoisoned [ 222.405843] Out of memory: Kill process 1074 (java) score 50 or sacrifice child [ 222.406136] Killed process 1074 (java) total-vm:5555688kB, anon-rss:174244kB, file-rss:0kB, shmem-rss:0kB [ 222.443446] oom_reaper: reaped process 1748 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
On Thu 06-09-18 10:00:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > > On 2018/09/05 22:40, Michal Hocko wrote:
> > > > Changelog said
> > > >
> > > > "Although this is possible in principle let's wait for it to actually
> > > > happen in real life before we make the locking more complex again."
> > > >
> > > > So what is the real life workload that hits it? The log you have pasted
> > > > below doesn't tell much.
> > >
> > > Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.
> >
> > I strongly suspect that your test doesn't really represent or simulate
> > any real and useful workload. Sure it triggers a rare race and we kill
> > another oom victim. Does this warrant making the code more complex?
> > Well, I am not convinced, as I've said countless times.
>
> Yes. Below is an example from a machine running Apache Web server/Tomcat AP server/PostgreSQL DB server.
> A memory eater needlessly killed Tomcat due to this race.

What prevents you from modifying your memory eater in a way that keeps Tomcat resp. others from being the primary oom victim choice? In other words, yeah, it is not optimal to lose the race, but if it is rare enough then this is something to live with, because it can hardly be considered a new DoS vector AFAICS. Remember that this is always going to be racy land and we are not going to plumb all possible races because this is simply not viable. But I am pretty sure we have been through all this many times already. Oh well...

> I assert that we should fix af5679fbc669f31f.

If you can come up with a reasonable patch which doesn't complicate the code and is a clear win for both this particular workload as well as others, then why not.
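For completeness: the usual knob for making the test program the preferred victim, as suggested above, is /proc/<pid>/oom_score_adj (range -1000..1000). A minimal sketch, not part of any patch in this thread, in which the memory eater simply raises its own badness before allocating:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* 1000 = always the preferred victim, -1000 = never killed */
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f) {
		perror("fopen");
		return EXIT_FAILURE;
	}
	fprintf(f, "1000\n");
	fclose(f);

	/* ... allocate and touch memory here, as the memory eater does ... */
	return EXIT_SUCCESS;
}

Writing -1000 instead exempts a task from OOM selection entirely, which is how critical services are commonly protected.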
Michal Hocko wrote:
> > I assert that we should fix af5679fbc669f31f.
>
> If you can come up with a reasonable patch which doesn't complicate the
> code and is a clear win for both this particular workload as well as
> others, then why not.

Why can't we do "at least MMF_OOM_SKIP should be set under the lock to prevent from races"?

diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b1..e096bb8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3065,7 +3065,9 @@ void exit_mmap(struct mm_struct *mm)
 		 */
 		(void)__oom_reap_task_mm(mm);
 
+		mutex_lock(&oom_lock);
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..b2a94c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -606,7 +606,9 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
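The locking idea in the hunks above can be illustrated with a self-contained userspace analogy (not kernel code; victim_skip and select_victim() are made-up stand-ins for MMF_OOM_SKIP and the victim-selection step): the flag is published under the same lock the selector holds across its check-and-act sequence, so a concurrent selection either sees the flag or completes before it is set.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static bool victim_skip;	/* stand-in for MMF_OOM_SKIP */

/* the OOM killer side: the whole decide-and-kill step runs under oom_lock */
static void select_victim(void)
{
	pthread_mutex_lock(&oom_lock);
	if (!victim_skip)
		printf("killing the victim\n");
	else
		printf("victim is already exiting, skipping it\n");
	pthread_mutex_unlock(&oom_lock);
}

/*
 * the exit path: publishing the flag under the same lock means a concurrent
 * select_victim() either observes it or finishes before it becomes visible
 */
static void *exit_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&oom_lock);
	victim_skip = true;
	pthread_mutex_unlock(&oom_lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, exit_path, NULL);
	select_victim();
	pthread_join(t, NULL);
	return 0;
}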
Tetsuo Handa wrote: > Michal Hocko wrote: > > > I assert that we should fix af5679fbc669f31f. > > > > If you can come up with reasonable patch which doesn't complicate the > > code and it is a clear win for both this particular workload as well as > > others then why not. > > Why can't we do "at least MMF_OOM_SKIP should be set under the lock to > prevent from races" ? > Well, serializing setting of MMF_OOM_SKIP using oom_lock was not sufficient. It seems that some more moment is needed for allowing last second allocation attempt at __alloc_pages_may_oom() to succeed. We need to migrate to "mm, oom: fix unnecessary killing of additional processes" thread. [ 702.895936] a.out invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 [ 702.900717] a.out cpuset=/ mems_allowed=0 [ 702.903210] CPU: 1 PID: 3359 Comm: a.out Tainted: G T 4.19.0-rc2+ #692 [ 702.906630] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 702.910771] Call Trace: [ 702.912681] dump_stack+0x85/0xcb [ 702.914918] dump_header+0x69/0x2fe [ 702.917387] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 702.921267] oom_kill_process+0x307/0x390 [ 702.923809] out_of_memory+0x2f3/0x5d0 [ 702.926208] __alloc_pages_slowpath+0xc01/0x1030 [ 702.928817] __alloc_pages_nodemask+0x333/0x390 [ 702.931270] alloc_pages_vma+0x77/0x1f0 [ 702.933618] __handle_mm_fault+0x81c/0xf40 [ 702.935962] handle_mm_fault+0x1b7/0x3c0 [ 702.938215] __do_page_fault+0x2a6/0x580 [ 702.940481] do_page_fault+0x32/0x270 [ 702.942753] ? page_fault+0x8/0x30 [ 702.944860] page_fault+0x1e/0x30 [ 702.947138] RIP: 0033:0x4008d8 [ 702.949722] Code: Bad RIP value. [ 702.952000] RSP: 002b:00007ffc21fd99c0 EFLAGS: 00010206 [ 702.954570] RAX: 00007fb3457cc010 RBX: 0000000100000000 RCX: 0000000000000000 [ 702.957631] RDX: 00000000b0b24000 RSI: 0000000000020000 RDI: 0000000200000050 [ 702.960599] RBP: 00007fb3457cc010 R08: 0000000200001000 R09: 0000000000021000 [ 702.963531] R10: 0000000000000022 R11: 0000000000001000 R12: 0000000000000006 [ 702.966518] R13: 00007ffc21fd9ab0 R14: 0000000000000000 R15: 0000000000000000 [ 702.971186] Mem-Info: [ 702.976959] active_anon:788641 inactive_anon:3457 isolated_anon:0 [ 702.976959] active_file:0 inactive_file:77 isolated_file:0 [ 702.976959] unevictable:0 dirty:0 writeback:0 unstable:0 [ 702.976959] slab_reclaimable:8152 slab_unreclaimable:24616 [ 702.976959] mapped:2806 shmem:3704 pagetables:4355 bounce:0 [ 702.976959] free:20831 free_pcp:136 free_cma:0 [ 703.007374] Node 0 active_anon:3154564kB inactive_anon:13828kB active_file:304kB inactive_file:112kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:11028kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2846720kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no [ 703.029994] Node 0 DMA free:13816kB min:308kB low:384kB high:460kB active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 703.055331] lowmem_reserve[]: 0 2717 3378 3378 [ 703.062881] Node 0 DMA32 free:58152kB min:54108kB low:67632kB high:81156kB active_anon:2718364kB inactive_anon:12kB active_file:424kB inactive_file:852kB unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:728kB local_pcp:0kB free_cma:0kB [ 703.091529] lowmem_reserve[]: 0 0 661 661 [ 703.096552] Node 0 Normal free:12900kB min:13164kB low:16452kB high:19740kB active_anon:434248kB inactive_anon:13816kB active_file:0kB inactive_file:48kB unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB kernel_stack:6720kB pagetables:17028kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 703.123046] lowmem_reserve[]: 0 0 0 0 [ 703.127490] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB [ 703.134763] Node 0 DMA32: 85*4kB (UM) 59*8kB (UM) 61*16kB (UM) 65*32kB (UME) 47*64kB (UE) 44*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 0*2048kB 0*4096kB = 57308kB [ 703.144076] Node 0 Normal: 119*4kB (UM) 69*8kB (UM) 156*16kB (UME) 179*32kB (UME) 44*64kB (UE) 9*128kB (UE) 1*256kB (E) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13476kB [ 703.151746] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 703.156537] 3744 total pagecache pages [ 703.159017] 0 pages in swap cache [ 703.163209] Swap cache stats: add 0, delete 0, find 0/0 [ 703.167254] Free swap = 0kB [ 703.170577] Total swap = 0kB [ 703.173758] 1048422 pages RAM [ 703.176843] 0 pages HighMem/MovableOnly [ 703.180380] 179652 pages reserved [ 703.183718] 0 pages cma reserved [ 703.187160] 0 pages hwpoisoned [ 703.190058] Out of memory: Kill process 3359 (a.out) score 834 or sacrifice child [ 703.194707] Killed process 3359 (a.out) total-vm:4267252kB, anon-rss:2894908kB, file-rss:0kB, shmem-rss:0kB [ 703.203016] in:imjournal invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 [ 703.231814] in:imjournal cpuset=/ mems_allowed=0 [ 703.231824] CPU: 2 PID: 1001 Comm: in:imjournal Tainted: G T 4.19.0-rc2+ #692 [ 703.231825] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 703.231826] Call Trace: [ 703.231834] dump_stack+0x85/0xcb [ 703.231839] dump_header+0x69/0x2fe [ 703.231844] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 703.252369] oom_kill_process+0x307/0x390 [ 703.252375] out_of_memory+0x2f3/0x5d0 [ 703.252379] __alloc_pages_slowpath+0xc01/0x1030 [ 703.252388] __alloc_pages_nodemask+0x333/0x390 [ 703.262608] filemap_fault+0x465/0x910 [ 703.262663] ? xfs_ilock+0xbf/0x2b0 [xfs] [ 703.262690] ? __xfs_filemap_fault+0x7d/0x2c0 [xfs] [ 703.262695] ? down_read_nested+0x66/0xa0 [ 703.262718] __xfs_filemap_fault+0x8e/0x2c0 [xfs] [ 703.262727] __do_fault+0x11/0x133 [ 703.262730] __handle_mm_fault+0xa57/0xf40 [ 703.262736] handle_mm_fault+0x1b7/0x3c0 [ 703.262742] __do_page_fault+0x2a6/0x580 [ 703.262748] do_page_fault+0x32/0x270 [ 703.262754] ? page_fault+0x8/0x30 [ 703.262756] page_fault+0x1e/0x30 [ 703.262759] RIP: 0033:0x7f5f078f6e28 [ 703.262765] Code: Bad RIP value. 
[ 703.262766] RSP: 002b:00007f5f05007c50 EFLAGS: 00010246 [ 703.262768] RAX: 0000000000000300 RBX: 0000000000000009 RCX: 00007f5f05007d80 [ 703.262769] RDX: 00000000000003dd RSI: 00007f5f07b1ca1a RDI: 00005596745e9bb0 [ 703.262770] RBP: 00007f5f05007d70 R08: 0000000000000006 R09: 0000000000000078 [ 703.262771] R10: 00005596745e9810 R11: 00007f5f082bb4a0 R12: 00007f5f05007d90 [ 703.262772] R13: 00007f5f00035fc0 R14: 00007f5f0003c850 R15: 00007f5f0000d760 [ 703.263679] Mem-Info: [ 703.263692] active_anon:243554 inactive_anon:3457 isolated_anon:0 [ 703.263692] active_file:53 inactive_file:4386 isolated_file:3 [ 703.263692] unevictable:0 dirty:0 writeback:0 unstable:0 [ 703.263692] slab_reclaimable:8152 slab_unreclaimable:24512 [ 703.263692] mapped:3730 shmem:3704 pagetables:3123 bounce:0 [ 703.263692] free:562331 free_pcp:502 free_cma:0 [ 703.263698] Node 0 active_anon:974216kB inactive_anon:13828kB active_file:212kB inactive_file:17544kB unevictable:0kB isolated(anon):0kB isolated(file):12kB mapped:14920kB dirty:0kB writeback:0kB shmem:14816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 305152kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no [ 703.263700] Node 0 DMA free:13816kB min:308kB low:384kB high:460kB active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 703.263705] lowmem_reserve[]: 0 2717 3378 3378 [ 703.263712] Node 0 DMA32 free:2077084kB min:54108kB low:67632kB high:81156kB active_anon:696988kB inactive_anon:12kB active_file:252kB inactive_file:756kB unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:1328kB local_pcp:248kB free_cma:0kB [ 703.263716] lowmem_reserve[]: 0 0 661 661 [ 703.263722] Node 0 Normal free:158424kB min:13164kB low:16452kB high:19740kB active_anon:275676kB inactive_anon:13816kB active_file:0kB inactive_file:16084kB unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB kernel_stack:6720kB pagetables:12100kB bounce:0kB free_pcp:680kB local_pcp:148kB free_cma:0kB [ 703.263726] lowmem_reserve[]: 0 0 0 0 [ 703.263731] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB [ 703.263752] Node 0 DMA32: 122*4kB (UM) 83*8kB (UM) 56*16kB (UM) 70*32kB (UME) 47*64kB (UE) 46*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 213*2048kB (M) 387*4096kB (M) = 2079360kB [ 703.263776] Node 0 Normal: 653*4kB (UM) 1495*8kB (UM) 1604*16kB (UME) 1069*32kB (UME) 278*64kB (UME) 147*128kB (UME) 85*256kB (ME) 6*512kB (M) 0*1024kB 7*2048kB (M) 2*4096kB (M) = 158412kB [ 703.263801] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 703.263802] 8221 total pagecache pages [ 703.263804] 0 pages in swap cache [ 703.263806] Swap cache stats: add 0, delete 0, find 0/0 [ 703.263807] Free swap = 0kB [ 703.263808] Total swap = 0kB [ 703.263810] 1048422 pages RAM [ 703.263811] 0 pages HighMem/MovableOnly [ 703.263812] 179652 pages reserved [ 703.263813] 0 pages cma reserved [ 703.263814] 0 pages hwpoisoned [ 703.263817] Out of memory: Kill process 1089 (java) score 52 or sacrifice child [ 703.264145] Killed process 1089 (java) total-vm:5555688kB, anon-rss:181976kB, file-rss:0kB, shmem-rss:0kB [ 703.300281] oom_reaper: reaped process 3359 (a.out), now anon-rss:0kB, file-rss:0kB, 
shmem-rss:0kB
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4..0c2c0a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3922,6 +3922,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
 	struct zone *zone;
 	struct zoneref *z;
+	bool ret = false;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3986,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 				}
 			}
 
-			/*
-			 * Memory allocation/reclaim might be called from a WQ
-			 * context and the current implementation of the WQ
-			 * concurrency control doesn't recognize that
-			 * a particular WQ is congested if the worker thread is
-			 * looping without ever sleeping. Therefore we have to
-			 * do a short sleep here rather than calling
-			 * cond_resched().
-			 */
-			if (current->flags & PF_WQ_WORKER)
-				schedule_timeout_uninterruptible(1);
-			else
-				cond_resched();
-
-			return true;
+			ret = true;
+			goto out;
 		}
 	}
 
-	return false;
+out:
+	/*
+	 * Memory allocation/reclaim might be called from a WQ
+	 * context and the current implementation of the WQ
+	 * concurrency control doesn't recognize that
+	 * a particular WQ is congested if the worker thread is
+	 * looping without ever sleeping. Therefore we have to
+	 * do a short sleep here rather than calling
+	 * cond_resched().
+	 */
+	if (current->flags & PF_WQ_WORKER)
+		schedule_timeout_uninterruptible(1);
+	else
+		cond_resched();
+	return ret;
 }
 
 static inline bool
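A self-contained sketch of the control-flow change in the hunk above (stand-in names, not the kernel function): both the "keep retrying" and the "give up" paths now funnel through a single exit point, so a PF_WQ_WORKER-style caller always sleeps briefly before returning and other pending work items get a chance to run.

#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdbool.h>
#include <time.h>

/* watermark_ok and is_wq_worker model __zone_watermark_ok() and PF_WQ_WORKER */
static bool should_retry_sketch(bool watermark_ok, bool is_wq_worker)
{
	bool ret = false;

	if (watermark_ok) {
		ret = true;
		goto out;
	}
out:
	/* the yield now happens on every path, not only when retrying */
	if (is_wq_worker) {
		/* models schedule_timeout_uninterruptible(1) */
		struct timespec ts = { 0, 1000000 };
		nanosleep(&ts, NULL);
	} else {
		/* models cond_resched() */
		sched_yield();
	}
	return ret;
}

int main(void)
{
	/* even a worker that cannot make progress sleeps before returning */
	return should_retry_sketch(false, true) ? 1 : 0;
}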