Message ID | 1572275317-63910-1-git-send-email-zhongjiang@huawei.com
State      | New, archived
Series     | mm: fix unevictable page reclaim when calling madvise_pageout
On 28.10.19 16:08, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
>
> kernel BUG at mm/vmscan.c:1521!
> invalid opcode: 0000 [#1] SMP KASAN PTI
> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> PKRU: 55555554
> Call Trace:
>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>  walk_pmd_range mm/pagewalk.c:53 [inline]
>  walk_pud_range mm/pagewalk.c:112 [inline]
>  walk_p4d_range mm/pagewalk.c:139 [inline]
>  walk_pgd_range mm/pagewalk.c:166 [inline]
>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>  madvise_vma mm/madvise.c:931 [inline]
>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> madvise_pageout access the specified range of the vma and isolate
> them, then run shrink_page_list to reclaim the memory. But It also
> isolate the unevictable page to reclaim. Hence, we can catch the
> cases in shrink_page_list.
>
> We can fix it by preventing unevictable page from isolating.
> Another way to fix the issue by removing the condition of
> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
> is better to use the latter. Because We has taken the unevictable
> page and skip it into account in shrink_page_list.

I really don't understand the last sentence. Looks like
something got messed up :)

> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f7d1301..1c6e959 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		unlock_page(page);
>  keep:
>  		list_add(&page->lru, &ret_pages);
> -		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
> +		VM_BUG_ON_PAGE(PageLRU(page), page);

So, this comes from

commit b291f000393f5a0b679012b39d79fbc85c018233
Author: Nick Piggin <npiggin@suse.de>
Date:   Sat Oct 18 20:26:44 2008 -0700

    mlock: mlocked pages are unevictable

    Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.

That patch is fairly old. How come we can suddenly trigger this?
Which commit is responsible for that? Was it always broken?

I can see that

commit ad6b67041a45497261617d7a28b15159b202cb5a
Author: Minchan Kim <minchan@kernel.org>
Date:   Wed May 3 14:54:13 2017 -0700

    mm: remove SWAP_MLOCK in ttu

Performed some changes in that area. But also some time ago.
On 2019/10/28 23:27, David Hildenbrand wrote:
> On 28.10.19 16:08, zhong jiang wrote:
>> Recently, I hit the following issue when running in the upstream.
>>
>> kernel BUG at mm/vmscan.c:1521!
>> [... oops and call trace snipped, quoted in full above ...]
>>
>> madvise_pageout access the specified range of the vma and isolate
>> them, then run shrink_page_list to reclaim the memory. But It also
>> isolate the unevictable page to reclaim. Hence, we can catch the
>> cases in shrink_page_list.
>>
>> We can fix it by preventing unevictable page from isolating.
>> Another way to fix the issue by removing the condition of
>> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
>> is better to use the latter. Because We has taken the unevictable
>> page and skip it into account in shrink_page_list.
> I really don't understand the last sentence. Looks like
> something got messed up :)
I mean that we will check page_evictable(page) in shrink_page_list;
if it is an unevictable page, we will put the page back to the correct lru.

Based on that condition, I made the choice. It seems simpler. :-)

Thanks,
zhong jiang

>> [... Signed-off-by and diff snipped, quoted in full above ...]
> So, this comes from
>
> commit b291f000393f5a0b679012b39d79fbc85c018233
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Sat Oct 18 20:26:44 2008 -0700
>
>     mlock: mlocked pages are unevictable
>
>     Make sure that mlocked pages also live on the unevictable LRU, so kswapd
>     will not scan them over and over again.
>
> That patch is fairly old. How come we can suddenly trigger this?
> Which commit is responsible for that? Was it always broken?
>
> I can see that
>
> commit ad6b67041a45497261617d7a28b15159b202cb5a
> Author: Minchan Kim <minchan@kernel.org>
> Date:   Wed May 3 14:54:13 2017 -0700
>
>     mm: remove SWAP_MLOCK in ttu
>
> Performed some changes in that area. But also some time ago.
I think the following patch introduced the issue.

commit 1a4e58cce84ee88129d5d49c064bd2852b481357
Author: Minchan Kim <minchan@kernel.org>
Date:   Wed Sep 25 16:49:15 2019 -0700

    mm: introduce MADV_PAGEOUT

    When a process expects no accesses to a certain memory range for a long

Thanks,
zhong jiang
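[Editor's note] zhong jiang's argument — that shrink_page_list() already routes unevictable pages back to the correct LRU, so only the assertion at the `keep:` label needs relaxing — can be modelled with a small user-space C sketch. This is not kernel code: `struct page`, `enum outcome` and `shrink_one_page()` below are hypothetical stand-ins for the real page flags and control flow, written only to contrast the old assertion with the proposed one.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical stand-ins for PG_lru / PG_unevictable, not the kernel structs */
struct page {
	bool lru;           /* models PageLRU(): still on an LRU list        */
	bool unevictable;   /* models PageUnevictable(): came off the
	                       unevictable LRU (e.g. mlocked memory)         */
	bool reclaimed;
};

enum outcome { RECLAIMED, KEPT, WOULD_BUG };

/*
 * strict == true models the old assertion at the "keep:" label:
 *     VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 * strict == false models the proposed relaxation:
 *     VM_BUG_ON_PAGE(PageLRU(page), page);
 */
static enum outcome shrink_one_page(struct page *page, bool strict)
{
	if (page->unevictable) {
		/* kernel: page_evictable() fails, the page falls through to
		 * the "keep" list and is later put back to the correct LRU */
		if (strict || page->lru)
			return WOULD_BUG;   /* models the reported crash */
		return KEPT;                /* putback, never reclaimed  */
	}
	page->reclaimed = true;
	return RECLAIMED;
}
```

With `strict == true` an isolated-but-unevictable page models the reported crash even though it would have been handled correctly; with `strict == false` it is merely kept for putback, which is the behaviour the patch argues for.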
On 28.10.19 16:45, zhong jiang wrote:
> On 2019/10/28 23:27, David Hildenbrand wrote:
>> On 28.10.19 16:08, zhong jiang wrote:
>>> Recently, I hit the following issue when running in the upstream.
>>>
>>> [... oops report, patch description and diff snipped, quoted in full above ...]
>>
>> That patch is fairly old. How come we can suddenly trigger this?
>> Which commit is responsible for that? Was it always broken?
>>
>> [...]
> I think the following patch introduced the issue.
>
> commit 1a4e58cce84ee88129d5d49c064bd2852b481357
> Author: Minchan Kim <minchan@kernel.org>
> Date:   Wed Sep 25 16:49:15 2019 -0700
>
>     mm: introduce MADV_PAGEOUT
>
>     When a process expects no accesses to a certain memory range for a long

CCing Minchan Kim then.

If this is indeed the introducing patch, you probably reference that
patch in your cover mail somehow. (Fixes: does not apply until upstream)

I am absolutely no expert on vmscan.c, so I'm afraid I can't really
comment on the details.
On 2019/10/29 0:07, David Hildenbrand wrote:
> On 28.10.19 16:45, zhong jiang wrote:
>> [... earlier discussion snipped, quoted in full above ...]
>
> CCing Minchan Kim then.
>
> If this is indeed the introducing patch, you probably reference that
> patch in your cover mail somehow. (Fixes: does not apply until upstream)
>
> I am absolutely no expert on vmscan.c, so I'm afraid I can't really
> comment on the details.
>
Yep, but still thanks for your concerns and reply.

Sincerely,
zhong jiang
On 28.10.19 17:07, David Hildenbrand wrote:
> On 28.10.19 16:45, zhong jiang wrote:
>> [... earlier discussion snipped, quoted in full above ...]
>
> CCing Minchan Kim then.
>
> If this is indeed the introducing patch, you probably reference that
> patch in your cover mail somehow. (Fixes: does not apply until upstream)
>
> I am absolutely no expert on vmscan.c, so I'm afraid I can't really
> comment on the details.

Oh, and just wondering, is this the same BUG as in

https://lkml.org/lkml/2019/8/2/1506

where a fix has been proposed? The fix does not seem to be in next/master yet.

(I just realized that it is already upstream, so Fixes: 1a4e58cce84e
("mm: introduce MADV_PAGEOUT") applies.)
On 2019/10/29 0:15, David Hildenbrand wrote:
> On 28.10.19 17:07, David Hildenbrand wrote:
>> [... earlier discussion snipped, quoted in full above ...]
>
> Oh, and just wondering, is this the same BUG as in
>
> https://lkml.org/lkml/2019/8/2/1506
>
> where a fix has been proposed? The fix does not seem to be in next/master yet.
>
> (I just realized that it is already upstream, so Fixes: 1a4e58cce84e
> ("mm: introduce MADV_PAGEOUT") applies.)
>
I think that is not the same issue. Thanks.
[Cc Minchan]

On Mon 28-10-19 23:08:37, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
>
> kernel BUG at mm/vmscan.c:1521!
> [... oops and call trace snipped, quoted in full above ...]
>
> madvise_pageout access the specified range of the vma and isolate
> them, then run shrink_page_list to reclaim the memory. But It also
> isolate the unevictable page to reclaim. Hence, we can catch the
> cases in shrink_page_list.
>
> We can fix it by preventing unevictable page from isolating.
> Another way to fix the issue by removing the condition of
> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
> is better to use the latter. Because We has taken the unevictable
> page and skip it into account in shrink_page_list.

The justification is indeed not clear. This is essentially the same kind
of bug as a58f2cef26e1 ("mm/vmscan.c: fix trying to reclaim unevictable
LRU page") which has been fixed by checking PageUnevictable before
adding it to the list of pages to reclaim. Removing a long existing
BUG_ON begs for a much better explanation. shrink_page_list is not a
trivial piece of code but I _suspect_ that removing it should be ok for
mapped pages at least (try_to_unmap) but I am not so sure how unmapped
unevictable pages are handled from top of my head.

Please also add Fixes: $sha

> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
> [... diff snipped, quoted in full above ...]
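[Editor's note] The alternative fix Michal alludes to here — the approach of a58f2cef26e1, checking PageUnevictable before adding a page to the private reclaim list — can be sketched the same way. Again a user-space model under stated assumptions: `isolate_for_pageout()` and the flag fields are hypothetical illustrations, not the real madvise_cold_or_pageout_pte_range() code. The point is simply that filtering at isolation time means shrink_page_list() only ever sees evictable pages, so the long-standing BUG_ON can stay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* hypothetical stand-ins; not the real kernel structures */
struct page {
	bool lru;           /* models PageLRU()          */
	bool unevictable;   /* models PageUnevictable()  */
	bool isolated;
};

/*
 * Model of the madvise pageout walk collecting pages onto a private
 * list for reclaim_pages(): skip anything not on an LRU, and -- the
 * fix -- skip unevictable pages instead of isolating them.
 * Returns the number of pages skipped; *nr is the list length.
 */
static size_t isolate_for_pageout(struct page *pages, size_t n,
				  struct page **page_list, size_t *nr)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		struct page *page = &pages[i];

		if (!page->lru || page->unevictable) {
			skipped++;          /* leave it on its LRU untouched */
			continue;
		}
		page->lru = false;          /* isolate_lru_page() equivalent */
		page->isolated = true;
		page_list[(*nr)++] = page;
	}
	return skipped;
}
```

In this model the mlocked page never enters the reclaim list, so the strict `VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page)` in the reclaim path is never reachable — the trade-off between the two candidate fixes in this thread.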
On 2019/10/29 16:11, Michal Hocko wrote: > [Cc Minchan] > > On Mon 28-10-19 23:08:37, zhong jiang wrote: >> Recently, I hit the following issue when running in the upstream. >> >> kernel BUG at mm/vmscan.c:1521! >> invalid opcode: 0000 [#1] SMP KASAN PTI >> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 >> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521 >> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74 >> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286 >> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36 >> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc >> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0 >> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0 >> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000 >> FS: 00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0 >> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 >> PKRU: 55555554 >> Call Trace: >> reclaim_pages+0x499/0x800 mm/vmscan.c:2188 >> madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453 >> walk_pmd_range mm/pagewalk.c:53 [inline] >> walk_pud_range mm/pagewalk.c:112 [inline] >> walk_p4d_range mm/pagewalk.c:139 [inline] >> walk_pgd_range mm/pagewalk.c:166 [inline] >> __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261 >> walk_page_range+0x179/0x310 mm/pagewalk.c:349 >> madvise_pageout_page_range mm/madvise.c:506 [inline] >> madvise_pageout+0x1f0/0x330 mm/madvise.c:542 >> madvise_vma mm/madvise.c:931 [inline] >> 
__do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>> do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> 
>> madvise_pageout access the specified range of the vma and isolate
>> them, then run shrink_page_list to reclaim the memory. But It also
>> isolate the unevictable page to reclaim. Hence, we can catch the
>> cases in shrink_page_list.
>> 
>> We can fix it by preventing unevictable page from isolating.
>> Another way to fix the issue by removing the condition of
>> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
>> is better to use the latter. Because We has taken the unevictable
>> page and skip it into account in shrink_page_list.
> The justification is indeed not clear. This is essentially the same kind
> of bug as a58f2cef26e1 ("mm/vmscan.c: fix trying to reclaim unevictable
> LRU page") which has been fixed by checking PageUnevictable before
> adding it to the list of pages to reclaim.

We could follow the above bugfix from Minchan and keep the long-existing
BUG_ON. Maybe it is better to do that.

> Removing a long existing BUG_ON begs for a much better explanation.
> shrink_page_list is not a trivial piece of code but I _suspect_ that
> removing it should be ok for mapped pages at least (try_to_unmap) but I
> am not so sure how unmapped unevictable pages are handled from top of my
> head.

As to the unmapped unevictable pages, shrink_page_list has taken that
into account:

shrink_page_list
  page_evictable --> will filter the unevictable pages and put them back
to their LRU.

That should not be a problem after removing the BUG_ON.

> Please also ad Fixes: $sha

Will add in v2.
Thanks >> Signed-off-by: zhong jiang <zhongjiang@huawei.com> >> --- >> mm/vmscan.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index f7d1301..1c6e959 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, >> unlock_page(page); >> keep: >> list_add(&page->lru, &ret_pages); >> - VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page); >> + VM_BUG_ON_PAGE(PageLRU(page), page); >> } >> >> pgactivate = stat->nr_activate[0] + stat->nr_activate[1]; >> -- >> 1.7.12.4 >>
On Tue 29-10-19 17:30:57, zhong jiang wrote:
> On 2019/10/29 16:11, Michal Hocko wrote:
> > [Cc Minchan]
[...]
> > Removing a long existing BUG_ON begs for a much better explanation.
> > shrink_page_list is not a trivial piece of code but I _suspect_ that
> > removing it should be ok for mapped pages at least (try_to_unmap) but I
> > am not so sure how unmapped unevictable pages are handled from top of my
> > head.
> As to the unmapped unevictable pages. shrink_page_list has taken that into account.
> 
> shinkr_page_list
>   page_evictable --> will filter the unevictable pages to putback its lru.

Ohh, it is right there at the top. Missed it. The check has been added
by Nick along with the BUG_ON. So it sounds more like a "this shouldn't
happen" BUG_ON. I wouldn't mind removing it with that justification.
On 2019/10/29 17:40, Michal Hocko wrote:
> On Tue 29-10-19 17:30:57, zhong jiang wrote:
>> On 2019/10/29 16:11, Michal Hocko wrote:
>>> [Cc Minchan]
> [...]
>>> Removing a long existing BUG_ON begs for a much better explanation.
>>> shrink_page_list is not a trivial piece of code but I _suspect_ that
>>> removing it should be ok for mapped pages at least (try_to_unmap) but I
>>> am not so sure how unmapped unevictable pages are handled from top of my
>>> head.
>> As to the unmapped unevictable pages. shrink_page_list has taken that into account.
>>
>> shinkr_page_list
>> page_evictable --> will filter the unevictable pages to putback its lru.
> Ohh, it is right there at the top. Missed it. The check has been added
> by Nick along with the BUG_ON. So it is sounds more like a "this
> shouldn't happen" bugon. I wouldn't mind to remove it with that
> justification.

As you said, Minchan fixed the same kind of bug by checking
PageUnevictable (I did not notice that before). Let's wait for Minchan
to see whether he has a better reason.

Thanks,
Sincerely,
zhong jiang
On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote:
> On 2019/10/29 17:40, Michal Hocko wrote:
> > On Tue 29-10-19 17:30:57, zhong jiang wrote:
> >> On 2019/10/29 16:11, Michal Hocko wrote:
> >>> [Cc Minchan]
> > [...]
> >>> Removing a long existing BUG_ON begs for a much better explanation.
> >>> shrink_page_list is not a trivial piece of code but I _suspect_ that
> >>> removing it should be ok for mapped pages at least (try_to_unmap) but I
> >>> am not so sure how unmapped unevictable pages are handled from top of my
> >>> head.
> >> As to the unmapped unevictable pages. shrink_page_list has taken that into account.
> >>
> >> shinkr_page_list
> >> page_evictable --> will filter the unevictable pages to putback its lru.
> > Ohh, it is right there at the top. Missed it. The check has been added
> > by Nick along with the BUG_ON. So it is sounds more like a "this
> > shouldn't happen" bugon. I wouldn't mind to remove it with that
> > justification.
> As you has said, Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before)
> Wait for Minchan to see whether he has better reason. thanks,

madvise_pageout could work with a shared page, and one of the vmas among
processes could do mlock, so it could pass unevictable LRU pages into
shrink_page_list. It's pointless to try to reclaim unevictable pages
from the beginning, so I want to fix madvise_pageout by introducing an
only_evictable flag into the API so that madvise_pageout uses it as
"true".

If we want to remove the PageUnevictable VM_BUG_ON_PAGE in
shrink_page_list, I want to see a stronger reason why it happens and why
the caller couldn't filter them out from the beginning.
diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..d1ad1c3ec596 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 				drain_allow = false;
 			}
 
-			if (!isolate_lru_page(head)) {
+			if (!isolate_lru_page(head, false)) {
 				list_add_tail(&head->lru, &cma_page_list);
 				mod_node_page_state(page_pgdat(head),
 						    NR_ISOLATED_ANON +
diff --git a/mm/internal.h b/mm/internal.h
index 0d5f720c75ab..13319612bef0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -85,7 +85,7 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
-extern int isolate_lru_page(struct page *page);
+extern int isolate_lru_page(struct page *page, bool only_evictable);
 extern void putback_lru_page(struct page *page);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0a1b4b484ac5..095560f7f8ec 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -609,7 +609,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		 * Isolate the page to avoid collapsing an hugepage
 		 * currently in use by the VM.
 		 */
-		if (isolate_lru_page(page)) {
+		if (isolate_lru_page(page, false)) {
 			unlock_page(page);
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
@@ -1642,7 +1642,7 @@ static void collapse_file(struct mm_struct *mm,
 			goto out_unlock;
 		}
 
-		if (isolate_lru_page(page)) {
+		if (isolate_lru_page(page, false)) {
 			result = SCAN_DEL_PAGE_LRU;
 			goto out_unlock;
 		}
diff --git a/mm/madvise.c b/mm/madvise.c
index 2be9f3fdb05e..2639de560a0b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -363,7 +363,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
+			if (!isolate_lru_page(page, true))
 				list_add(&page->lru, &page_list);
 		} else
 			deactivate_page(page);
@@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			ClearPageReferenced(page);
 			test_and_clear_page_young(page);
 			if (pageout) {
-				if (!isolate_lru_page(page))
+				if (!isolate_lru_page(page, true))
 					list_add(&page->lru, &page_list);
 			} else
 				deactivate_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 363106578876..6d913215b074 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5847,7 +5847,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
-			if (!isolate_lru_page(page)) {
+			if (!isolate_lru_page(page, false)) {
 				if (!mem_cgroup_move_account(page, true,
 							     mc.from, mc.to)) {
 					mc.precharge -= HPAGE_PMD_NR;
@@ -5895,7 +5895,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (!device && isolate_lru_page(page))
+			if (!device && isolate_lru_page(page, false))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3151c87dff73..ef37c67a7bab 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -567,7 +567,7 @@ static const char * const action_page_types[] = {
  */
 static int delete_from_lru_cache(struct page *p)
 {
-	if (!isolate_lru_page(p)) {
+	if (!isolate_lru_page(p, false)) {
 		/*
 		 * Clear sensible page flags, so that the buddy system won't
 		 * complain when the page is unpoison-and-freed.
@@ -1782,7 +1782,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	 * handles a large number of cases for us.
 	 */
 	if (PageLRU(page))
-		ret = isolate_lru_page(page);
+		ret = isolate_lru_page(page, false);
 	else
 		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
 	/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df570e5c71cc..8ba483d3d8cd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1314,7 +1314,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 */
 		if (PageHWPoison(page)) {
 			if (WARN_ON(PageLRU(page)))
-				isolate_lru_page(page);
+				isolate_lru_page(page, false);
 			if (page_mapped(page))
 				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
 			continue;
@@ -1327,7 +1327,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * LRU and non-lru movable pages.
 		 */
 		if (PageLRU(page))
-			ret = isolate_lru_page(page);
+			ret = isolate_lru_page(page, false);
 		else
 			ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
 		if (!ret) { /* Success */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..585e5845f071 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -974,7 +974,7 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
 	 * Avoid migrating a page that is shared with others.
 	 */
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
-		if (!isolate_lru_page(head)) {
+		if (!isolate_lru_page(head, false)) {
 			list_add_tail(&head->lru, pagelist);
 			mod_node_page_state(page_pgdat(head),
 				NR_ISOLATED_ANON + page_is_file_cache(head),
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fe45d1428c8..710e00317a8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1563,7 +1563,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		struct page *head;
 
 		head = compound_head(page);
-		err = isolate_lru_page(head);
+		err = isolate_lru_page(head, false);
 		if (err)
 			goto out_putpage;
 
@@ -1895,7 +1895,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
 		return 0;
 
-	if (isolate_lru_page(page))
+	if (isolate_lru_page(page, false))
 		return 0;
 
 	/*
@@ -2450,7 +2450,7 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 					allow_drain = false;
 				}
 
-				if (isolate_lru_page(page)) {
+				if (isolate_lru_page(page, false)) {
 					if (remap) {
 						migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 						migrate->cpages--;
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..307e340fe2e0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -70,7 +70,7 @@ void clear_page_mlock(struct page *page)
 		 *
 		 * See __pagevec_lru_add_fn for more explanation.
 		 */
-		if (!isolate_lru_page(page)) {
+		if (!isolate_lru_page(page, false)) {
 			putback_lru_page(page);
 		} else {
 			/*
@@ -97,7 +97,7 @@ void mlock_vma_page(struct page *page)
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
 				    hpage_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
-		if (!isolate_lru_page(page))
+		if (!isolate_lru_page(page, false))
 			putback_lru_page(page);
 	}
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee4eecc7e1c2..c44fb52c745f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1793,7 +1793,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
  * (2) the lru_lock must not be held.
  * (3) interrupts must be enabled.
  */
-int isolate_lru_page(struct page *page)
+int isolate_lru_page(struct page *page, bool only_evictable)
 {
 	int ret = -EBUSY;
 
@@ -1805,6 +1805,8 @@ int isolate_lru_page(struct page *page)
 		struct lruvec *lruvec;
 
 		spin_lock_irq(&pgdat->lru_lock);
+		if (only_evictable && PageUnevictable(page))
+			goto out;
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
@@ -1813,6 +1815,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
+out:
 		spin_unlock_irq(&pgdat->lru_lock);
 	}
 	return ret;
On Wed, Oct 30, 2019 at 09:52:39AM -0700, Minchan Kim wrote:
> @@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
> 				drain_allow = false;
> 			}
> 
> -			if (!isolate_lru_page(head)) {
> +			if (!isolate_lru_page(head, false)) {
> 				list_add_tail(&head->lru, &cma_page_list);
> 				mod_node_page_state(page_pgdat(head),
> 						    NR_ISOLATED_ANON +

It's not clear what that argument means at the callsite, and every
caller needs to pass it to support one niche usecase. Let's not do that.

I think there are better options. Personally, I think it's a good idea
to keep the sanity check in shrink_page_list() because the mlock LRU
handling is quite tricky. madvise() is really the odd one out here
because it isolates LRU pages through page tables and then sends them
through the regular reclaim path, so IMO it should be madvise proper
that handles the exceptional situation.

Why not just this? Maybe with a comment that points out that we're
coming from the page tables instead of a specific LRU list, and so
need to filter out the unevictable lru pages from our end.

diff --git a/mm/madvise.c b/mm/madvise.c
index 99dd06fecfa9..63e130800570 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
-				list_add(&page->lru, &page_list);
+			if (!isolate_lru_page(page)) {
+				if (PageUnevictable(page))
+					putback_lru_page(page);
+				else
+					list_add(&page->lru, &page_list);
+			}
 		} else
 			deactivate_page(page);
 huge_unlock:
@@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			ClearPageReferenced(page);
 			test_and_clear_page_young(page);
 			if (pageout) {
-				if (!isolate_lru_page(page))
-					list_add(&page->lru, &page_list);
+				if (!isolate_lru_page(page)) {
+					if (PageUnevictable(page))
+						putback_lru_page(page);
+					else
+						list_add(&page->lru, &page_list);
+				}
 			} else
 				deactivate_page(page);
 		}
On Wed 30-10-19 09:52:39, Minchan Kim wrote: > On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote: > > On 2019/10/29 17:40, Michal Hocko wrote: > > > On Tue 29-10-19 17:30:57, zhong jiang wrote: > > >> On 2019/10/29 16:11, Michal Hocko wrote: > > >>> [Cc Minchan] > > > [...] > > >>> Removing a long existing BUG_ON begs for a much better explanation. > > >>> shrink_page_list is not a trivial piece of code but I _suspect_ that > > >>> removing it should be ok for mapped pages at least (try_to_unmap) but I > > >>> am not so sure how unmapped unevictable pages are handled from top of my > > >>> head. > > >> As to the unmapped unevictable pages. shrink_page_list has taken that into account. > > >> > > >> shinkr_page_list > > >> page_evictable --> will filter the unevictable pages to putback its lru. > > > Ohh, it is right there at the top. Missed it. The check has been added > > > by Nick along with the BUG_ON. So it is sounds more like a "this > > > shouldn't happen" bugon. I wouldn't mind to remove it with that > > > justification. > > As you has said, Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before) > > Wait for Minchan to see whether he has better reason. thanks, > > madvise_pageout could work with a shared page and one of the vmas among processes > could do mlock so it could pass Unevictable LRU pages into shrink_page_list. > It's pointless to try reclaim unevictable pages from the beginning so I want to fix > madvise_pageout via introducing only_evictable flag into the API so that > madvise_pageout uses it as "true". > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, > I want to see more strong reason why it happens and why caller couldn't > filter them out from the beginning. Why is this preferable over removing the VM_BUG_ON condition? In other words why should we keep PageUnevictable check there?
On Wed, Oct 30, 2019 at 01:22:41PM -0400, Johannes Weiner wrote: > On Wed, Oct 30, 2019 at 09:52:39AM -0700, Minchan Kim wrote: > > @@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, > > drain_allow = false; > > } > > > > - if (!isolate_lru_page(head)) { > > + if (!isolate_lru_page(head, false)) { > > list_add_tail(&head->lru, &cma_page_list); > > mod_node_page_state(page_pgdat(head), > > NR_ISOLATED_ANON + > > It's not clear what that argument means at the callsite, and every > caller needs to pass it to support one niche usecase. Let's not do > that. > > I think there are better options. Personally, I think it's a good idea > to keep the sanity check in shrink_page_list() because the mlock LRU > handling is quite tricky. madvise() is really the odd one out here > because it isolates LRU pages through page tables and then sends them > through the regular reclaim path, so IMO it should be the madvise > proper that handles the exceptional situation. > > Why not just this? Maybe with a comment that points out that we're > coming from the page tables instead of a specific LRU list, and so > need to filter out the unevictable lru pages from our end. Totally, agree. 
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 99dd06fecfa9..63e130800570 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  huge_unlock:
> @@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			ClearPageReferenced(page);
>  			test_and_clear_page_young(page);
>  			if (pageout) {
> -				if (!isolate_lru_page(page))
> -					list_add(&page->lru, &page_list);
> +				if (!isolate_lru_page(page)) {
> +					if (PageUnevictable(page))
> +						putback_lru_page(page);
> +					else
> +						list_add(&page->lru, &page_list);
> +				}
>  			} else
>  				deactivate_page(page);
>  		}
> 
On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: > On Wed 30-10-19 09:52:39, Minchan Kim wrote: > > On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote: > > > On 2019/10/29 17:40, Michal Hocko wrote: > > > > On Tue 29-10-19 17:30:57, zhong jiang wrote: > > > >> On 2019/10/29 16:11, Michal Hocko wrote: > > > >>> [Cc Minchan] > > > > [...] > > > >>> Removing a long existing BUG_ON begs for a much better explanation. > > > >>> shrink_page_list is not a trivial piece of code but I _suspect_ that > > > >>> removing it should be ok for mapped pages at least (try_to_unmap) but I > > > >>> am not so sure how unmapped unevictable pages are handled from top of my > > > >>> head. > > > >> As to the unmapped unevictable pages. shrink_page_list has taken that into account. > > > >> > > > >> shinkr_page_list > > > >> page_evictable --> will filter the unevictable pages to putback its lru. > > > > Ohh, it is right there at the top. Missed it. The check has been added > > > > by Nick along with the BUG_ON. So it is sounds more like a "this > > > > shouldn't happen" bugon. I wouldn't mind to remove it with that > > > > justification. > > > As you has said, Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before) > > > Wait for Minchan to see whether he has better reason. thanks, > > > > madvise_pageout could work with a shared page and one of the vmas among processes > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list. > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix > > madvise_pageout via introducing only_evictable flag into the API so that > > madvise_pageout uses it as "true". > > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, > > I want to see more strong reason why it happens and why caller couldn't > > filter them out from the beginning. > > Why is this preferable over removing the VM_BUG_ON condition? 
> In other words why should we keep PageUnevictable check there?

I don't think it's reasonable to pass unevictable LRU pages into
shrink_page_list, so I wanted to know what race is at play here and
what we are missing before removing the BUG_ON, since mlock is heavily
complicated.

> 
> -- 
> Michal Hocko
> SUSE Labs
On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: > On Wed 30-10-19 09:52:39, Minchan Kim wrote: > > On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote: > > > On 2019/10/29 17:40, Michal Hocko wrote: > > > > On Tue 29-10-19 17:30:57, zhong jiang wrote: > > > >> On 2019/10/29 16:11, Michal Hocko wrote: > > > >>> [Cc Minchan] > > > > [...] > > > >>> Removing a long existing BUG_ON begs for a much better explanation. > > > >>> shrink_page_list is not a trivial piece of code but I _suspect_ that > > > >>> removing it should be ok for mapped pages at least (try_to_unmap) but I > > > >>> am not so sure how unmapped unevictable pages are handled from top of my > > > >>> head. > > > >> As to the unmapped unevictable pages. shrink_page_list has taken that into account. > > > >> > > > >> shinkr_page_list > > > >> page_evictable --> will filter the unevictable pages to putback its lru. > > > > Ohh, it is right there at the top. Missed it. The check has been added > > > > by Nick along with the BUG_ON. So it is sounds more like a "this > > > > shouldn't happen" bugon. I wouldn't mind to remove it with that > > > > justification. > > > As you has said, Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before) > > > Wait for Minchan to see whether he has better reason. thanks, > > > > madvise_pageout could work with a shared page and one of the vmas among processes > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list. > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix > > madvise_pageout via introducing only_evictable flag into the API so that > > madvise_pageout uses it as "true". > > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, > > I want to see more strong reason why it happens and why caller couldn't > > filter them out from the beginning. > > Why is this preferable over removing the VM_BUG_ON condition? 
> In other words why should we keep PageUnevictable check there?

The mlock LRU shuffling is a bit tricky and can race with page reclaim
or others isolating the page from the LRU list. If another isolator
wins, it has to move the page during putback on behalf of mlock.

See the implementation and comments in __pagevec_lru_add_fn().

That's why page reclaim can see !page_evictable(), but it must not see
pages that have the PageUnevictable lru bit already set. Because that
would mean the isolation/putback machinery messed up somewhere and the
page LRU state is corrupt.

As that machinery is non-trivial, it's useful to have that sanity
check in page reclaim.
On Wed 30-10-19 15:33:07, Johannes Weiner wrote: > On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: > > On Wed 30-10-19 09:52:39, Minchan Kim wrote: [...] > > > madvise_pageout could work with a shared page and one of the vmas among processes > > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list. > > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix > > > madvise_pageout via introducing only_evictable flag into the API so that > > > madvise_pageout uses it as "true". > > > > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, > > > I want to see more strong reason why it happens and why caller couldn't > > > filter them out from the beginning. > > > > Why is this preferable over removing the VM_BUG_ON condition? In other > > words why should we keep PageUnevictable check there? > > The mlock LRU shuffling is a bit tricky and can race with page reclaim > or others isolating the page from the LRU list. If another isolator > wins, it has to move the page during putback on behalf of mlock. > > See the implementation and comments in __pagevec_lru_add_fn(). > > That's why page reclaim can see !page_evictable(), but it must not see > pages that have the PageUnevictable lru bit already set. Because that > would mean the isolation/putback machinery messed up somewhere and the > page LRU state is corrupt. > > As that machinery is non-trivial, it's useful to have that sanity > check in page reclaim. Thanks for the clarification! This sounds reasonable (as much as the mlock juggling does) to me. This is probably worth a comment right above the bug_on. I have to confess that I am still not clear on all the details here, though. E.g. migrate_misplaced_transhuge_page sets the flag without lru_lock and relies only on page lock IIUC and the bug on is done right after the lock is released. 
Maybe I am just confused or maybe the race window is too small to matter but isn't this race possible at least theoretically?
On 2019/10/31 3:33, Johannes Weiner wrote: > On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: >> On Wed 30-10-19 09:52:39, Minchan Kim wrote: >>> On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote: >>>> On 2019/10/29 17:40, Michal Hocko wrote: >>>>> On Tue 29-10-19 17:30:57, zhong jiang wrote: >>>>>> On 2019/10/29 16:11, Michal Hocko wrote: >>>>>>> [Cc Minchan] >>>>> [...] >>>>>>> Removing a long existing BUG_ON begs for a much better explanation. >>>>>>> shrink_page_list is not a trivial piece of code but I _suspect_ that >>>>>>> removing it should be ok for mapped pages at least (try_to_unmap) but I >>>>>>> am not so sure how unmapped unevictable pages are handled from top of my >>>>>>> head. >>>>>> As to the unmapped unevictable pages. shrink_page_list has taken that into account. >>>>>> >>>>>> shinkr_page_list >>>>>> page_evictable --> will filter the unevictable pages to putback its lru. >>>>> Ohh, it is right there at the top. Missed it. The check has been added >>>>> by Nick along with the BUG_ON. So it is sounds more like a "this >>>>> shouldn't happen" bugon. I wouldn't mind to remove it with that >>>>> justification. >>>> As you has said, Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before) >>>> Wait for Minchan to see whether he has better reason. thanks, >>> madvise_pageout could work with a shared page and one of the vmas among processes >>> could do mlock so it could pass Unevictable LRU pages into shrink_page_list. >>> It's pointless to try reclaim unevictable pages from the beginning so I want to fix >>> madvise_pageout via introducing only_evictable flag into the API so that >>> madvise_pageout uses it as "true". >>> >>> If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, >>> I want to see more strong reason why it happens and why caller couldn't >>> filter them out from the beginning. >> Why is this preferable over removing the VM_BUG_ON condition? 
>> In other words why should we keep PageUnevictable check there?
> The mlock LRU shuffling is a bit tricky and can race with page reclaim
> or others isolating the page from the LRU list. If another isolator
> wins, it has to move the page during putback on behalf of mlock.
>
> See the implementation and comments in __pagevec_lru_add_fn().

I see the comments in __pagevec_lru_add_fn, but I still have some
confusion. If I understand it correctly, the reordering could leave an
evictable page stranded on an unevictable LRU without PageMlocked set.
vmscan can see !page_evictable(), which should mean PageMlocked was set
while the page was still on an evictable list. Is there any race window?

Thanks,
zhong jiang

> That's why page reclaim can see !page_evictable(), but it must not see
> pages that have the PageUnevictable lru bit already set. Because that
> would mean the isolation/putback machinery messed up somewhere and the
> page LRU state is corrupt.
>
> As that machinery is non-trivial, it's useful to have that sanity
> check in page reclaim.
On Thu, Oct 31, 2019 at 10:16:01AM +0100, Michal Hocko wrote: > On Wed 30-10-19 15:33:07, Johannes Weiner wrote: > > On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: > > > On Wed 30-10-19 09:52:39, Minchan Kim wrote: > [...] > > > > madvise_pageout could work with a shared page and one of the vmas among processes > > > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list. > > > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix > > > > madvise_pageout via introducing only_evictable flag into the API so that > > > > madvise_pageout uses it as "true". > > > > > > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, > > > > I want to see more strong reason why it happens and why caller couldn't > > > > filter them out from the beginning. > > > > > > Why is this preferable over removing the VM_BUG_ON condition? In other > > > words why should we keep PageUnevictable check there? > > > > The mlock LRU shuffling is a bit tricky and can race with page reclaim > > or others isolating the page from the LRU list. If another isolator > > wins, it has to move the page during putback on behalf of mlock. > > > > See the implementation and comments in __pagevec_lru_add_fn(). > > > > That's why page reclaim can see !page_evictable(), but it must not see > > pages that have the PageUnevictable lru bit already set. Because that > > would mean the isolation/putback machinery messed up somewhere and the > > page LRU state is corrupt. > > > > As that machinery is non-trivial, it's useful to have that sanity > > check in page reclaim. > > Thanks for the clarification! This sounds reasonable (as much as the > mlock juggling does) to me. This is probably worth a comment right above > the bug_on. > > I have to confess that I am still not clear on all the details here, > though. E.g. 
migrate_misplaced_transhuge_page sets the flag without > lru_lock and relies only on page lock IIUC and the bug on is done right > after the lock is released. Maybe I am just confused or maybe the race > window is too small to matter but isn't this race possible at least > theoretically? IIUC, reclaim couldn't see the page from LRU list because it was isolated by numamigrate_isolate_page. Thanks.
On Thu 31-10-19 07:48:11, Minchan Kim wrote: > On Thu, Oct 31, 2019 at 10:16:01AM +0100, Michal Hocko wrote: > > On Wed 30-10-19 15:33:07, Johannes Weiner wrote: > > > On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: > > > > On Wed 30-10-19 09:52:39, Minchan Kim wrote: > > [...] > > > > > madvise_pageout could work with a shared page and one of the vmas among processes > > > > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list. > > > > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix > > > > > madvise_pageout via introducing only_evictable flag into the API so that > > > > > madvise_pageout uses it as "true". > > > > > > > > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, > > > > > I want to see more strong reason why it happens and why caller couldn't > > > > > filter them out from the beginning. > > > > > > > > Why is this preferable over removing the VM_BUG_ON condition? In other > > > > words why should we keep PageUnevictable check there? > > > > > > The mlock LRU shuffling is a bit tricky and can race with page reclaim > > > or others isolating the page from the LRU list. If another isolator > > > wins, it has to move the page during putback on behalf of mlock. > > > > > > See the implementation and comments in __pagevec_lru_add_fn(). > > > > > > That's why page reclaim can see !page_evictable(), but it must not see > > > pages that have the PageUnevictable lru bit already set. Because that > > > would mean the isolation/putback machinery messed up somewhere and the > > > page LRU state is corrupt. > > > > > > As that machinery is non-trivial, it's useful to have that sanity > > > check in page reclaim. > > > > Thanks for the clarification! This sounds reasonable (as much as the > > mlock juggling does) to me. This is probably worth a comment right above > > the bug_on. 
> > > > I have to confess that I am still not clear on all the details here, > > though. E.g. migrate_misplaced_transhuge_page sets the flag without > > lru_lock and relies only on page lock IIUC and the bug on is done right > > after the lock is released. Maybe I am just confused or maybe the race > > window is too small to matter but isn't this race possible at least > > theoretically? > > IIUC, reclaim couldn't see the page from LRU list because it was isolated by > numamigrate_isolate_page. Right you are. I have missed numamigrate_isolate_page. Thanks!
On 2019/10/31 1:22, Johannes Weiner wrote:
> On Wed, Oct 30, 2019 at 09:52:39AM -0700, Minchan Kim wrote:
>> @@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
>> 				drain_allow = false;
>> 			}
>>
>> -			if (!isolate_lru_page(head)) {
>> +			if (!isolate_lru_page(head, false)) {
>> 				list_add_tail(&head->lru, &cma_page_list);
>> 				mod_node_page_state(page_pgdat(head),
>> 						    NR_ISOLATED_ANON +
> It's not clear what that argument means at the callsite, and every
> caller needs to pass it to support one niche usecase. Let's not do
> that.
>
> I think there are better options. Personally, I think it's a good idea
> to keep the sanity check in shrink_page_list() because the mlock LRU
> handling is quite tricky. madvise() is really the odd one out here
> because it isolates LRU pages through page tables and then sends them
> through the regular reclaim path, so IMO it should be the madvise
> proper that handles the exceptional situation.
>
> Why not just this? Maybe with a comment that points out that we're
> coming from the page tables instead of a specific LRU list, and so
> need to filter out the unevictable lru pages from our end.
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 99dd06fecfa9..63e130800570 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> 		ClearPageReferenced(page);
> 		test_and_clear_page_young(page);
> 		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
> 		} else
> 			deactivate_page(page);
> huge_unlock:
> @@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> 		ClearPageReferenced(page);
> 		test_and_clear_page_young(page);
> 		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
> 		} else
> 			deactivate_page(page);
> 	}
>
> .
>
Filtering out the unevictable pages does look like the correct fix, even though I am still not entirely clear how such a small race window leads to a PageMlocked page showing up in vmscan. I will repost the patch as v2 with the change you proposed. Thanks Sincerely, zhong jiang
On 2019/10/31 17:16, Michal Hocko wrote: > On Wed 30-10-19 15:33:07, Johannes Weiner wrote: >> On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote: >>> On Wed 30-10-19 09:52:39, Minchan Kim wrote: > [...] >>>> madvise_pageout could work with a shared page and one of the vmas among processes >>>> could do mlock so it could pass Unevictable LRU pages into shrink_page_list. >>>> It's pointless to try reclaim unevictable pages from the beginning so I want to fix >>>> madvise_pageout via introducing only_evictable flag into the API so that >>>> madvise_pageout uses it as "true". >>>> >>>> If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list, >>>> I want to see more strong reason why it happens and why caller couldn't >>>> filter them out from the beginning. >>> Why is this preferable over removing the VM_BUG_ON condition? In other >>> words why should we keep PageUnevictable check there? >> The mlock LRU shuffling is a bit tricky and can race with page reclaim >> or others isolating the page from the LRU list. If another isolator >> wins, it has to move the page during putback on behalf of mlock. >> >> See the implementation and comments in __pagevec_lru_add_fn(). >> >> That's why page reclaim can see !page_evictable(), but it must not see >> pages that have the PageUnevictable lru bit already set. Because that >> would mean the isolation/putback machinery messed up somewhere and the >> page LRU state is corrupt. >> >> As that machinery is non-trivial, it's useful to have that sanity >> check in page reclaim. > Thanks for the clarification! This sounds reasonable (as much as the > mlock juggling does) to me. This is probably worth a comment right above > the bug_on. Could you write a comment down on VM_BUG_ON() ? :-) Thanks, zhong jiang > I have to confess that I am still not clear on all the details here, > though. E.g. 
migrate_misplaced_transhuge_page sets the flag without > lru_lock and relies only on page lock IIUC and the bug on is done right > after the lock is released. Maybe I am just confused or maybe the race > window is too small to matter but isn't this race possible at least > theoretically? >
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d1301..1c6e959 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
-		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
 	}
 
 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
Recently, I hit the following issue when running in the upstream.

kernel BUG at mm/vmscan.c:1521!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 reclaim_pages+0x499/0x800 mm/vmscan.c:2188
 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
 walk_pmd_range mm/pagewalk.c:53 [inline]
 walk_pud_range mm/pagewalk.c:112 [inline]
 walk_p4d_range mm/pagewalk.c:139 [inline]
 walk_pgd_range mm/pagewalk.c:166 [inline]
 __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
 walk_page_range+0x179/0x310 mm/pagewalk.c:349
 madvise_pageout_page_range mm/madvise.c:506 [inline]
 madvise_pageout+0x1f0/0x330 mm/madvise.c:542
 madvise_vma mm/madvise.c:931 [inline]
 __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout accesses the specified range of the vma, isolates the pages, and then runs
shrink_page_list to reclaim the memory. But it can also isolate unevictable pages and pass them to reclaim, which is how we trip the VM_BUG_ON in shrink_page_list. One way to fix the issue is to prevent unevictable pages from being isolated in the first place. Another is to remove the PageUnevictable condition from the VM_BUG_ON_PAGE in shrink_page_list. I think the latter is better, because shrink_page_list already detects unevictable pages and skips them, taking them into account.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)