Message ID | 20240829102543.189453-1-jingxiangzeng.cas@gmail.com |
---|---|
State | New |
Series | mm/vmscan: wake up flushers conditionally to avoid cgroup OOM |
On Thu, 29 Aug 2024 18:25:43 +0800 Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:

> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed
>
> ...
>
> The flusher wake-up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is
> about to OOM due to dirty caches.

Thanks, I'll queue this for testing and review. Could people please
consider whether we should backport this into -stable kernels.

> MGLRU still suffers OOM issues on the latest mm tree, so the test is
> done with another fix merged [1].
>
> Link: https://lore.kernel.org/linux-mm/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ [1]

This one is already queued for -stable.
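For context, the wakeup that commit 14aa8b2d5c2e removed sat at the end of
the MGLRU aging step. A minimal sketch of the pre-commit behaviour, assuming
my recollection of the removed hunk is right (the exact location in
mm/vmscan.c may differ):

	/* Sketch of the behaviour removed by commit 14aa8b2d5c2e (from
	 * memory, not the verbatim diff): finishing an aging cycle woke
	 * every flusher thread, which synced the disk once per cycle and
	 * wore out SSDs under sustained reclaim. */
	static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
				       struct scan_control *sc, bool can_swap)
	{
		/* ... walk page tables and increment max_seq ... */

		wakeup_flusher_threads(WB_REASON_VMSCAN);	/* the removed call */

		return true;
	}

The patch below restores this wakeup, but only on the near-OOM path rather
than once per aging cycle.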
On Sat, Aug 31, 2024 at 8:38 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 29 Aug 2024 18:25:43 +0800 Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in cgroup:
> > Killed
> >
> > ...
> >
> > The flusher wake-up was removed to decrease SSD wearing, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > could lead to thrashing easily. So wake it up when a mem cgroup is
> > about to OOM due to dirty caches.
>
> Thanks, I'll queue this for testing and review. Could people please
> consider whether we should backport this into -stable kernels.

Hi Andrew,

Thanks for picking this up.

> > MGLRU still suffers OOM issues on the latest mm tree, so the test is
> > done with another fix merged [1].
> >
> > Link: https://lore.kernel.org/linux-mm/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ [1]
>
> This one is already queued for -stable.

I didn't see this in -unstable or -stable though, is there any other
repo or branch I missed? Jingxiang is referring to this fix from Yu:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cfa839284b92..778bf5b7ef97 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4320,7 +4320,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}

 	/* ineligible */
-	if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
+	if (!folio_test_lru(folio) || zone > sc->reclaim_idx || skip_cma(folio, sc)) {
 		gen = folio_inc_gen(lruvec, folio, false);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
On Mon, 2 Sep 2024 04:39:24 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> > > MGLRU still suffers OOM issues on the latest mm tree, so the test is
> > > done with another fix merged [1].
> > >
> > > Link: https://lore.kernel.org/linux-mm/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ [1]
> >
> > This one is already queued for -stable.
>
> I didn't see this in -unstable or -stable though, is there any other
> repo or branch I missed? Jingxiang is referring to this fix from Yu:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cfa839284b92..778bf5b7ef97 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4320,7 +4320,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  	}
>
>  	/* ineligible */
> -	if (zone > sc->reclaim_idx || skip_cma(folio, sc)) {
> +	if (!folio_test_lru(folio) || zone > sc->reclaim_idx || skip_cma(folio, sc)) {
>  		gen = folio_inc_gen(lruvec, folio, false);
>  		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
>  		return true;

I was mistaken. I don't believe we ever received a formal/usable
version of the above, and the mm-hotfixes-unstable commits

	Revert "mm: skip CMA pages when they are not available"
and
	revert-mm-skip-cma-pages-when-they-are-not-available-update

change this code significantly.
On Thu, 29 Aug 2024 18:25:43 +0800 Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> @@ -4919,6 +4920,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
>  	if (try_to_shrink_lruvec(lruvec, sc))
>  		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
>
> +	/*
> +	 * If too many pages failed to evict due to page being dirty,
> +	 * memory pressure has pushed dirty pages to the oldest gen,
> +	 * so wake up the flusher.
> +	 */
> +	if (sc->nr.unqueued_dirty >= sc->nr.taken)
> +		wakeup_flusher_threads(WB_REASON_VMSCAN);
> +

Because a) the right domain to process dirty pages is writeback, and
b) the flusher runs independent of the page reclaimer and has nothing
to do with WB_REASON_SYNC, feel free to erase WB_REASON_VMSCAN instead
of adding it once more.
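For readers following along, the reason codes at issue: a trimmed sketch of
enum wb_reason, abridged from include/linux/backing-dev-defs.h as I recall
it (entries vary by kernel version), plus the declaration from
include/linux/writeback.h. The reason only feeds tracepoints and writeback
statistics; it does not change what the flusher writes back:

	enum wb_reason {
		WB_REASON_BACKGROUND,
		WB_REASON_VMSCAN,
		WB_REASON_SYNC,
		WB_REASON_PERIODIC,
		WB_REASON_LAPTOP_TIMER,
		WB_REASON_FS_FREE_SPACE,
		WB_REASON_FORKER_THREAD,

		WB_REASON_MAX,
	};

	/* Wake all flusher threads on all backing devices to start
	 * writing back dirty pages; the reason is recorded for tracing. */
	void wakeup_flusher_threads(enum wb_reason reason);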
On Thu, Aug 29, 2024 at 3:25 AM Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
>
> From: Zeng Jingxiang <linuszeng@tencent.com>
>
> Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> removed the opportunity to wake up flushers during the MGLRU page
> reclamation process, which can lead to an increased likelihood of
> triggering OOM when encountering many dirty pages during reclamation
> on MGLRU.
>
> This leads to premature OOM if there are too many dirty pages in cgroup:
> Killed

Thanks for the patch. We have encountered a similar problem.

>
> dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> order=0, oom_score_adj=0
>
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x5f/0x80
>  dump_stack+0x14/0x20
>  dump_header+0x46/0x1b0
>  oom_kill_process+0x104/0x220
>  out_of_memory+0x112/0x5a0
>  mem_cgroup_out_of_memory+0x13b/0x150
>  try_charge_memcg+0x44f/0x5c0
>  charge_memcg+0x34/0x50
>  __mem_cgroup_charge+0x31/0x90
>  filemap_add_folio+0x4b/0xf0
>  __filemap_get_folio+0x1a4/0x5b0
>  ? srso_return_thunk+0x5/0x5f
>  ? __block_commit_write+0x82/0xb0
>  ext4_da_write_begin+0xe5/0x270
>  generic_perform_write+0x134/0x2b0
>  ext4_buffered_write_iter+0x57/0xd0
>  ext4_file_write_iter+0x76/0x7d0
>  ? selinux_file_permission+0x119/0x150
>  ? srso_return_thunk+0x5/0x5f
>  ? srso_return_thunk+0x5/0x5f
>  vfs_write+0x30c/0x440
>  ksys_write+0x65/0xe0
>  __x64_sys_write+0x1e/0x30
>  x64_sys_call+0x11c2/0x1d50
>  do_syscall_64+0x47/0x110
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> memory: usage 308224kB, limit 308224kB, failcnt 2589
> swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>
> ...
> file_dirty 303247360
> file_writeback 0
> ...
>
> oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> oom_score_adj:0
>
> The flusher wake-up was removed to decrease SSD wearing, but if we are
> seeing all dirty folios at the tail of an LRU, not waking up the flusher
> could lead to thrashing easily. So wake it up when a mem cgroup is
> about to OOM due to dirty caches.
>
> MGLRU still suffers OOM issues on the latest mm tree, so the test is
> done with another fix merged [1].
>
> Link: https://lore.kernel.org/linux-mm/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ [1]
>
> Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f27792e77a0f..9cd8c42f67cb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4447,6 +4447,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>  				scanned, skipped, isolated,
>  				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> +	sc->nr.taken += isolated;
>  	/*
>  	 * There might not be eligible folios due to reclaim_idx. Check the
>  	 * remaining to prevent livelock if it's not making progress.
> @@ -4919,6 +4920,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
>  	if (try_to_shrink_lruvec(lruvec, sc))
>  		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
>
> +	/*
> +	 * If too many pages failed to evict due to page being dirty,
> +	 * memory pressure has pushed dirty pages to the oldest gen,
> +	 * so wake up the flusher.
> +	 */
> +	if (sc->nr.unqueued_dirty >= sc->nr.taken)

Any reason not to use a strict == check as in shrink_inactive_list()?

Also, this check allows the wakeup of the flusher threads when both
sc->nr.unqueued_dirty and sc->nr.taken are 0, which is undesirable.

If we skip the wakeup for the cases where both counters are 0, then I
think we need to handle the situation where only dirty file pages are
left for reclaim in the oldest gen. This means that
sc->nr.unqueued_dirty needs to be updated in sort_folios() (in
addition to shrink_folio_list()) as well because sort_folios() doesn't
send dirty file pages to shrink_folio_list() for eviction.

> +		wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
>  	clear_mm_walk();
>
>  	blk_finish_plug(&plug);
> --
> 2.43.5
>
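A minimal sketch of the check this review is asking for, hypothetical and
not the actual V2: keep the strict equality that shrink_inactive_list()
uses and guard against the 0 == 0 case:

	/* Hypothetical tightened condition per the review above: wake the
	 * flushers only if folios were actually isolated and every one of
	 * them failed eviction because it was dirty with no writeback
	 * queued. This skips the spurious wakeup when both counters are 0. */
	if (sc->nr.taken && sc->nr.unqueued_dirty == sc->nr.taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);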
On Fri, 6 Sept 2024 at 08:01, Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Aug 29, 2024 at 3:25 AM Jingxiang Zeng <jingxiangzeng.cas@gmail.com> wrote:
> >
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > Commit 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > removed the opportunity to wake up flushers during the MGLRU page
> > reclamation process, which can lead to an increased likelihood of
> > triggering OOM when encountering many dirty pages during reclamation
> > on MGLRU.
> >
> > This leads to premature OOM if there are too many dirty pages in cgroup:
> > Killed
>
> Thanks for the patch. We have encountered a similar problem.
>
> >
> > dd invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE),
> > order=0, oom_score_adj=0
> >
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x5f/0x80
> >  dump_stack+0x14/0x20
> >  dump_header+0x46/0x1b0
> >  oom_kill_process+0x104/0x220
> >  out_of_memory+0x112/0x5a0
> >  mem_cgroup_out_of_memory+0x13b/0x150
> >  try_charge_memcg+0x44f/0x5c0
> >  charge_memcg+0x34/0x50
> >  __mem_cgroup_charge+0x31/0x90
> >  filemap_add_folio+0x4b/0xf0
> >  __filemap_get_folio+0x1a4/0x5b0
> >  ? srso_return_thunk+0x5/0x5f
> >  ? __block_commit_write+0x82/0xb0
> >  ext4_da_write_begin+0xe5/0x270
> >  generic_perform_write+0x134/0x2b0
> >  ext4_buffered_write_iter+0x57/0xd0
> >  ext4_file_write_iter+0x76/0x7d0
> >  ? selinux_file_permission+0x119/0x150
> >  ? srso_return_thunk+0x5/0x5f
> >  ? srso_return_thunk+0x5/0x5f
> >  vfs_write+0x30c/0x440
> >  ksys_write+0x65/0xe0
> >  __x64_sys_write+0x1e/0x30
> >  x64_sys_call+0x11c2/0x1d50
> >  do_syscall_64+0x47/0x110
> >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > memory: usage 308224kB, limit 308224kB, failcnt 2589
> > swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >
> > ...
> > file_dirty 303247360
> > file_writeback 0
> > ...
> >
> > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=test,
> > mems_allowed=0,oom_memcg=/test,task_memcg=/test,task=dd,pid=4404,uid=0
> > Memory cgroup out of memory: Killed process 4404 (dd) total-vm:10512kB,
> > anon-rss:1152kB, file-rss:1824kB, shmem-rss:0kB, UID:0 pgtables:76kB
> > oom_score_adj:0
> >
> > The flusher wake-up was removed to decrease SSD wearing, but if we are
> > seeing all dirty folios at the tail of an LRU, not waking up the flusher
> > could lead to thrashing easily. So wake it up when a mem cgroup is
> > about to OOM due to dirty caches.
> >
> > MGLRU still suffers OOM issues on the latest mm tree, so the test is
> > done with another fix merged [1].
> >
> > Link: https://lore.kernel.org/linux-mm/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ [1]
> >
> > Fixes: 14aa8b2d5c2e ("mm/mglru: don't sync disk for each aging cycle")
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f27792e77a0f..9cd8c42f67cb 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4447,6 +4447,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> >  				scanned, skipped, isolated,
> >  				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >
> > +	sc->nr.taken += isolated;
> >  	/*
> >  	 * There might not be eligible folios due to reclaim_idx. Check the
> >  	 * remaining to prevent livelock if it's not making progress.
> > @@ -4919,6 +4920,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
> >  	if (try_to_shrink_lruvec(lruvec, sc))
> >  		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
> >
> > +	/*
> > +	 * If too many pages failed to evict due to page being dirty,
> > +	 * memory pressure has pushed dirty pages to the oldest gen,
> > +	 * so wake up the flusher.
> > +	 */
> > +	if (sc->nr.unqueued_dirty >= sc->nr.taken)
>
> Any reason not to use a strict == check as in shrink_inactive_list()?
>
> Also, this check allows the wakeup of the flusher threads when both
> sc->nr.unqueued_dirty and sc->nr.taken are 0, which is undesirable.
>
> If we skip the wakeup for the cases where both counters are 0, then I
> think we need to handle the situation where only dirty file pages are
> left for reclaim in the oldest gen. This means that
> sc->nr.unqueued_dirty needs to be updated in sort_folios() (in
> addition to shrink_folio_list()) as well because sort_folios() doesn't
> send dirty file pages to shrink_folio_list() for eviction.
>

Your suggestion is correct. I will modify it and release the V2 version.

> > +		wakeup_flusher_threads(WB_REASON_VMSCAN);
> > +
> >  	clear_mm_walk();
> >
> >  	blk_finish_plug(&plug);
> > --
> > 2.43.5
> >
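What the sort_folios() side of that suggestion could look like, as a hedged
sketch (hypothetical placement and condition; the actual V2 may account for
this differently): dirty file folios that sort_folio() rotates back to a
newer generation, instead of handing to shrink_folio_list(), would be
counted as unqueued dirty so the wakeup check can see them:

	/* Hypothetical accounting inside sort_folio(): a dirty file folio
	 * with no writeback in flight is being deferred, not evicted, so
	 * record it; otherwise lru_gen_shrink_lruvec() never learns that
	 * only dirty pages remain in the oldest generation. */
	if (type == LRU_GEN_FILE && folio_test_dirty(folio) &&
	    !folio_test_writeback(folio))
		sc->nr.unqueued_dirty += folio_nr_pages(folio);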
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f27792e77a0f..9cd8c42f67cb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4447,6 +4447,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);

+	sc->nr.taken += isolated;
 	/*
 	 * There might not be eligible folios due to reclaim_idx. Check the
 	 * remaining to prevent livelock if it's not making progress.
@@ -4919,6 +4920,14 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	if (try_to_shrink_lruvec(lruvec, sc))
 		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);

+	/*
+	 * If too many pages failed to evict due to page being dirty,
+	 * memory pressure has pushed dirty pages to the oldest gen,
+	 * so wake up the flusher.
+	 */
+	if (sc->nr.unqueued_dirty >= sc->nr.taken)
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+
 	clear_mm_walk();

 	blk_finish_plug(&plug);