
[v3] mm: fix trying to reclaim unevictable lru page when calling madvise_pageout

Message ID 1572616245-18946-1-git-send-email-zhongjiang@huawei.com (mailing list archive)
State New, archived
Series [v3] mm: fix trying to reclaim unevictable lru page when calling madvise_pageout

Commit Message

zhong jiang Nov. 1, 2019, 1:50 p.m. UTC
Recently, I hit the following issue when running an upstream kernel.

kernel BUG at mm/vmscan.c:1521!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 reclaim_pages+0x499/0x800 mm/vmscan.c:2188
 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
 walk_pmd_range mm/pagewalk.c:53 [inline]
 walk_pud_range mm/pagewalk.c:112 [inline]
 walk_p4d_range mm/pagewalk.c:139 [inline]
 walk_pgd_range mm/pagewalk.c:166 [inline]
 __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
 walk_page_range+0x179/0x310 mm/pagewalk.c:349
 madvise_pageout_page_range mm/madvise.c:506 [inline]
 madvise_pageout+0x1f0/0x330 mm/madvise.c:542
 madvise_vma mm/madvise.c:931 [inline]
 __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout walks the specified range of the vma, isolates the pages
it finds, and then runs shrink_page_list to reclaim them. But it also
isolates unevictable pages and hands them to reclaim, so the check in
shrink_page_list catches them.

The root cause is that we scan the page tables rather than a specific
LRU list, so we have to filter out the unevictable LRU pages on our
end.
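
For context, the assertion the trace points at is the sanity check at the
end of the shrink_page_list() loop; the sketch below is quoted from memory
of the 5.4 sources, so the exact placement is an assumption:

	/* sketch of mm/vmscan.c around line 1521 (5.4) */
keep_locked:
		unlock_page(page);
keep:
		list_add(&page->lru, &ret_pages);
		/* reclaim must never be handed a page with the LRU or Unevictable bit set */
		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);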

Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Cc: stable@vger.kernel.org
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
v2 -> v3:
    Modified the mail address to hannes@cmpxchg.org

v1 -> v2:
    Removing the BUG_ON is not a good solution, because shrink_page_list
    can indeed see !page_evictable() due to a race.

 mm/madvise.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Johannes Weiner Nov. 1, 2019, 2:46 p.m. UTC | #1
On Fri, Nov 01, 2019 at 09:50:45PM +0800, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
> 
> kernel BUG at mm/vmscan.c:1521!
> invalid opcode: 0000 [#1] SMP KASAN PTI
> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> PKRU: 55555554
> Call Trace:
>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>  walk_pmd_range mm/pagewalk.c:53 [inline]
>  walk_pud_range mm/pagewalk.c:112 [inline]
>  walk_p4d_range mm/pagewalk.c:139 [inline]
>  walk_pgd_range mm/pagewalk.c:166 [inline]
>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>  madvise_vma mm/madvise.c:931 [inline]
>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> madvise_pageout access the specified range of the vma and isolate
> them, then run shrink_page_list to reclaim its memory. But It also
> isolate the unevictable page to reclaim. Hence, we can catch the
> cases in shrink_page_list.
> 
> The root cause is that we scan the page tables instead of specific
> LRU list. and so we need to filter out the unevictable lru pages
> from our end.
> 
> Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
> Cc: stable@vger.kernel.org
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Minchan Kim Nov. 1, 2019, 3:08 p.m. UTC | #2
On Fri, Nov 01, 2019 at 09:50:45PM +0800, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
> 
> kernel BUG at mm/vmscan.c:1521!
> invalid opcode: 0000 [#1] SMP KASAN PTI
> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> PKRU: 55555554
> Call Trace:
>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>  walk_pmd_range mm/pagewalk.c:53 [inline]
>  walk_pud_range mm/pagewalk.c:112 [inline]
>  walk_p4d_range mm/pagewalk.c:139 [inline]
>  walk_pgd_range mm/pagewalk.c:166 [inline]
>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>  madvise_vma mm/madvise.c:931 [inline]
>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> madvise_pageout access the specified range of the vma and isolate
> them, then run shrink_page_list to reclaim its memory. But It also
> isolate the unevictable page to reclaim. Hence, we can catch the
> cases in shrink_page_list.
> 
> The root cause is that we scan the page tables instead of specific
> LRU list. and so we need to filter out the unevictable lru pages
> from our end.
> 
> Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
> Cc: stable@vger.kernel.org
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks for the fix.
Michal Hocko Nov. 1, 2019, 6:32 p.m. UTC | #3
On Fri 01-11-19 21:50:45, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
> 
> kernel BUG at mm/vmscan.c:1521!
> invalid opcode: 0000 [#1] SMP KASAN PTI
> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> PKRU: 55555554
> Call Trace:
>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>  walk_pmd_range mm/pagewalk.c:53 [inline]
>  walk_pud_range mm/pagewalk.c:112 [inline]
>  walk_p4d_range mm/pagewalk.c:139 [inline]
>  walk_pgd_range mm/pagewalk.c:166 [inline]
>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>  madvise_vma mm/madvise.c:931 [inline]
>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> madvise_pageout access the specified range of the vma and isolate
> them, then run shrink_page_list to reclaim its memory. But It also
> isolate the unevictable page to reclaim. Hence, we can catch the
> cases in shrink_page_list.
> 
> The root cause is that we scan the page tables instead of specific
> LRU list. and so we need to filter out the unevictable lru pages
> from our end.
> 
> Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
> Cc: stable@vger.kernel.org
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

But I would really appreciate a comment for the BUG_ON explaining why
we care about PageUnevictable so much when there is an explicit
page_evictable check in the reclaim path. In other words, a short
summary of what Johannes explained in
http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
separate patch. Care to send one, or should I send it?

> ---
> v2 -> v3:
>     Modified the mail address to hannes@cmpxchg.org
> 
> v1 -> v2:
>     Remove BUG_ON is not an good solutions. Becuase it indeed can
>     see !page_evictable() in shrink_page_list due to some race.  
> 
>  mm/madvise.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 99dd06f..63e1308 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  huge_unlock:
> @@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  	}
> -- 
> 1.7.12.4
zhong jiang Nov. 2, 2019, 7:36 a.m. UTC | #4
On 2019/11/2 2:32, Michal Hocko wrote:
> On Fri 01-11-19 21:50:45, zhong jiang wrote:
>> Recently, I hit the following issue when running in the upstream.
>>
>> kernel BUG at mm/vmscan.c:1521!
>> invalid opcode: 0000 [#1] SMP KASAN PTI
>> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
>> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
>> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
>> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
>> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
>> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
>> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
>> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
>> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
>> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
>> PKRU: 55555554
>> Call Trace:
>>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>>  walk_pmd_range mm/pagewalk.c:53 [inline]
>>  walk_pud_range mm/pagewalk.c:112 [inline]
>>  walk_p4d_range mm/pagewalk.c:139 [inline]
>>  walk_pgd_range mm/pagewalk.c:166 [inline]
>>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>>  madvise_vma mm/madvise.c:931 [inline]
>>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> madvise_pageout access the specified range of the vma and isolate
>> them, then run shrink_page_list to reclaim its memory. But It also
>> isolate the unevictable page to reclaim. Hence, we can catch the
>> cases in shrink_page_list.
>>
>> The root cause is that we scan the page tables instead of specific
>> LRU list. and so we need to filter out the unevictable lru pages
>> from our end.
>>
>> Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
>> Cc: stable@vger.kernel.org
>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> But I would really appreciate to add a comment for the BUG_ON and
> explain why do we care about PageUnevictable so much when there is an
> explicit page_evictable check in the reclaim path. In other words a
> short summary of what Johannes explained in
> http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
> separate patch. Care to send one or should I send it?
Hi, Michal

Actually, I am not very clear about what Johannes said. How does the race
trigger such that a PageMlocked page can become visible in shrink_page_list?

Could you elaborate on the race in more detail?

Thanks,
zhong jiang
>> ---
>> v2 -> v3:
>>     Modified the mail address to hannes@cmpxchg.org
>>
>> v1 -> v2:
>>     Remove BUG_ON is not an good solutions. Becuase it indeed can
>>     see !page_evictable() in shrink_page_list due to some race.  
>>
>>  mm/madvise.c | 16 ++++++++++++----
>>  1 file changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 99dd06f..63e1308 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>  		ClearPageReferenced(page);
>>  		test_and_clear_page_young(page);
>>  		if (pageout) {
>> -			if (!isolate_lru_page(page))
>> -				list_add(&page->lru, &page_list);
>> +			if (!isolate_lru_page(page)) {
>> +				if (PageUnevictable(page))
>> +					putback_lru_page(page);
>> +				else
>> +					list_add(&page->lru, &page_list);
>> +			}
>>  		} else
>>  			deactivate_page(page);
>>  huge_unlock:
>> @@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>  		ClearPageReferenced(page);
>>  		test_and_clear_page_young(page);
>>  		if (pageout) {
>> -			if (!isolate_lru_page(page))
>> -				list_add(&page->lru, &page_list);
>> +			if (!isolate_lru_page(page)) {
>> +				if (PageUnevictable(page))
>> +					putback_lru_page(page);
>> +				else
>> +					list_add(&page->lru, &page_list);
>> +			}
>>  		} else
>>  			deactivate_page(page);
>>  	}
>> -- 
>> 1.7.12.4
Michal Hocko Nov. 5, 2019, 6:33 a.m. UTC | #5
On Sat 02-11-19 15:36:55, zhong jiang wrote:
> On 2019/11/2 2:32, Michal Hocko wrote:
[...]
> > But I would really appreciate to add a comment for the BUG_ON and
> > explain why do we care about PageUnevictable so much when there is an
> > explicit page_evictable check in the reclaim path. In other words a
> > short summary of what Johannes explained in
> > http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
> > separate patch. Care to send one or should I send it?
> Hi,  Michal
> 
> Actually,  I am not very clear about the words Johannes had said.   How the race to
> tirgger, it will result in an PgeMlocked page can be visible in shrink_page_list.
> 
> Could you elaborate the race in detail further ?

I would go with the following comment

	/*
	 * Page reclaim can see !page_evictable(), but it must not see pages that
	 * have the PageUnevictable lru bit already set. See  __pagevec_lru_add_fn()
	 * for more details.
	 */
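
For illustration, such a comment might sit directly above the
shrink_page_list() check that fired; a sketch only, not the actual
follow-up patch:

	/* sketch only: placement in shrink_page_list() is an assumption */
	/*
	 * Page reclaim can see !page_evictable(), but it must not see pages that
	 * have the PageUnevictable lru bit already set. See __pagevec_lru_add_fn()
	 * for more details.
	 */
	VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);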
zhong jiang Nov. 5, 2019, 12:28 p.m. UTC | #6
On 2019/11/5 14:33, Michal Hocko wrote:
> On Sat 02-11-19 15:36:55, zhong jiang wrote:
>> On 2019/11/2 2:32, Michal Hocko wrote:
> [...]
>>> But I would really appreciate to add a comment for the BUG_ON and
>>> explain why do we care about PageUnevictable so much when there is an
>>> explicit page_evictable check in the reclaim path. In other words a
>>> short summary of what Johannes explained in
>>> http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
>>> separate patch. Care to send one or should I send it?
>> Hi,  Michal
>>
>> Actually,  I am not very clear about the words Johannes had said.   How the race to
>> tirgger, it will result in an PgeMlocked page can be visible in shrink_page_list.
>>
>> Could you elaborate the race in detail further ?
> I would go with the following comment
>
> 	/*
> 	 * Page reclaim can see !page_evictable(), but it must not see pages that
> 	 * have the PageUnevictable lru bit already set. See  __pagevec_lru_add_fn()
> 	 * for more details.
> 	 */
But the details in __pagevec_lru_add_fn() still confuse me: how can vmscan end up seeing a PageMlocked page? :-\
Michal Hocko Nov. 5, 2019, 12:45 p.m. UTC | #7
On Tue 05-11-19 20:28:58, zhong jiang wrote:
> On 2019/11/5 14:33, Michal Hocko wrote:
> > On Sat 02-11-19 15:36:55, zhong jiang wrote:
> >> On 2019/11/2 2:32, Michal Hocko wrote:
> > [...]
> >>> But I would really appreciate to add a comment for the BUG_ON and
> >>> explain why do we care about PageUnevictable so much when there is an
> >>> explicit page_evictable check in the reclaim path. In other words a
> >>> short summary of what Johannes explained in
> >>> http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
> >>> separate patch. Care to send one or should I send it?
> >> Hi,  Michal
> >>
> >> Actually,  I am not very clear about the words Johannes had said.   How the race to
> >> tirgger, it will result in an PgeMlocked page can be visible in shrink_page_list.
> >>
> >> Could you elaborate the race in detail further ?
> > I would go with the following comment
> >
> > 	/*
> > 	 * Page reclaim can see !page_evictable(), but it must not see pages that
> > 	 * have the PageUnevictable lru bit already set. See  __pagevec_lru_add_fn()
> > 	 * for more details.
> > 	 */
> But  the detail still confuses me in __pagevec_lru_add_fn()  to  see PageMlocked  in vmscan :-\ .

Which part does confuse you exactly?
zhong jiang Nov. 5, 2019, 2:10 p.m. UTC | #8
On 2019/11/5 20:45, Michal Hocko wrote:
> On Tue 05-11-19 20:28:58, zhong jiang wrote:
>> On 2019/11/5 14:33, Michal Hocko wrote:
>>> On Sat 02-11-19 15:36:55, zhong jiang wrote:
>>>> On 2019/11/2 2:32, Michal Hocko wrote:
>>> [...]
>>>>> But I would really appreciate to add a comment for the BUG_ON and
>>>>> explain why do we care about PageUnevictable so much when there is an
>>>>> explicit page_evictable check in the reclaim path. In other words a
>>>>> short summary of what Johannes explained in
>>>>> http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
>>>>> separate patch. Care to send one or should I send it?
>>>> Hi,  Michal
>>>>
>>>> Actually,  I am not very clear about the words Johannes had said.   How the race to
>>>> tirgger, it will result in an PgeMlocked page can be visible in shrink_page_list.
>>>>
>>>> Could you elaborate the race in detail further ?
>>> I would go with the following comment
>>>
>>> 	/*
>>> 	 * Page reclaim can see !page_evictable(), but it must not see pages that
>>> 	 * have the PageUnevictable lru bit already set. See  __pagevec_lru_add_fn()
>>> 	 * for more details.
>>> 	 */
>> But  the detail still confuses me in __pagevec_lru_add_fn()  to  see PageMlocked  in vmscan :-\ .
> Which part does confuse you exactly?
That page reclaim can see !page_evictable() means some race still exists in the kernel. What is the race window?

Wouldn't PageLocked prevent these cases, as in shmem?
Johannes Weiner Nov. 5, 2019, 6:48 p.m. UTC | #9
On Tue, Nov 05, 2019 at 10:10:37PM +0800, zhong jiang wrote:
> On 2019/11/5 20:45, Michal Hocko wrote:
> > On Tue 05-11-19 20:28:58, zhong jiang wrote:
> >> On 2019/11/5 14:33, Michal Hocko wrote:
> >>> On Sat 02-11-19 15:36:55, zhong jiang wrote:
> >>>> On 2019/11/2 2:32, Michal Hocko wrote:
> >>> [...]
> >>>>> But I would really appreciate to add a comment for the BUG_ON and
> >>>>> explain why do we care about PageUnevictable so much when there is an
> >>>>> explicit page_evictable check in the reclaim path. In other words a
> >>>>> short summary of what Johannes explained in
> >>>>> http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
> >>>>> separate patch. Care to send one or should I send it?
> >>>> Hi,  Michal
> >>>>
> >>>> Actually,  I am not very clear about the words Johannes had said.   How the race to
> >>>> tirgger, it will result in an PgeMlocked page can be visible in shrink_page_list.
> >>>>
> >>>> Could you elaborate the race in detail further ?
> >>> I would go with the following comment
> >>>
> >>> 	/*
> >>> 	 * Page reclaim can see !page_evictable(), but it must not see pages that
> >>> 	 * have the PageUnevictable lru bit already set. See  __pagevec_lru_add_fn()
> >>> 	 * for more details.
> >>> 	 */
> >> But  the detail still confuses me in __pagevec_lru_add_fn()  to  see PageMlocked  in vmscan :-\ .
> > Which part does confuse you exactly?
> page reclaim can see !page_evictable()  means some race still exist in the kernel. Is there any race window .

Yes. mlock does this:

lock_page()
SetPageMlocked()
if (isolate_lru_page())
  putback_lru_page() // move to unevictable list
unlock_page()

and vmscan does this:

isolate_lru_pages()
for_each_page()
  if (!try_lock_page())
    continue
  if (!page_evictable())
    continue
putback_lru_pages()

It's possible that mlock locks the page and sets PG_mlocked, but
vmscan has already isolated the page, so mlock cannot move it to the
unevictable list itself. In that case, vmscan will either fail to lock
the page or see !page_evictable() and move the page to the
unevictable list on behalf of mlock.
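
Laid out as a timeline, the window looks roughly like this (an
illustrative sketch of the interleaving above, not verbatim kernel code):

  // illustrative interleaving (sketch)
  vmscan                                mlock
  ------                                -----
  isolate_lru_pages()
    // page taken off the LRU
                                        lock_page()
                                        SetPageMlocked()
                                        isolate_lru_page() fails
                                          // vmscan already holds it
                                        unlock_page()
  for each isolated page:
    try_lock_page() succeeds
    page_evictable() == false           // PG_mlocked is set
    putback_lru_pages()
      // moved to the unevictable list
      // on behalf of mlock

So seeing !page_evictable() in shrink_page_list is expected and handled;
what must never happen is being handed a page whose PageUnevictable lru
bit is already set, which is what the unfiltered madvise_pageout
isolation allowed.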
zhong jiang Nov. 7, 2019, 1:31 p.m. UTC | #10
On 2019/11/6 2:48, Johannes Weiner wrote:
> On Tue, Nov 05, 2019 at 10:10:37PM +0800, zhong jiang wrote:
>> On 2019/11/5 20:45, Michal Hocko wrote:
>>> On Tue 05-11-19 20:28:58, zhong jiang wrote:
>>>> On 2019/11/5 14:33, Michal Hocko wrote:
>>>>> On Sat 02-11-19 15:36:55, zhong jiang wrote:
>>>>>> On 2019/11/2 2:32, Michal Hocko wrote:
>>>>> [...]
>>>>>>> But I would really appreciate to add a comment for the BUG_ON and
>>>>>>> explain why do we care about PageUnevictable so much when there is an
>>>>>>> explicit page_evictable check in the reclaim path. In other words a
>>>>>>> short summary of what Johannes explained in
>>>>>>> http://lkml.kernel.org/r/20191030193307.GA48128@cmpxchg.org. Maybe in a
>>>>>>> separate patch. Care to send one or should I send it?
>>>>>> Hi,  Michal
>>>>>>
>>>>>> Actually,  I am not very clear about the words Johannes had said.   How the race to
>>>>>> tirgger, it will result in an PgeMlocked page can be visible in shrink_page_list.
>>>>>>
>>>>>> Could you elaborate the race in detail further ?
>>>>> I would go with the following comment
>>>>>
>>>>> 	/*
>>>>> 	 * Page reclaim can see !page_evictable(), but it must not see pages that
>>>>> 	 * have the PageUnevictable lru bit already set. See  __pagevec_lru_add_fn()
>>>>> 	 * for more details.
>>>>> 	 */
>>>> But  the detail still confuses me in __pagevec_lru_add_fn()  to  see PageMlocked  in vmscan :-\ .
>>> Which part does confuse you exactly?
>> page reclaim can see !page_evictable()  means some race still exist in the kernel. Is there any race window .
> Yes. mlock does this:
>
> lock_page()
> SetPageMlocked()
> if (isolate_lru_page())
>   putback_lru_page() // move to unevictable list
> unlock_page()
>
> and vmscan does this:
>
> isolate_lru_pages()
> for_each_page()
>   if (!try_lock_page())
>     continue
>   if (!page_evictable())
>     continue
> putback_lru_pages()
>
> It's possible that mlock locks the page and sets PG_mlocked, but
> vmscan has the page already isolated and mlock cannot move it to the
> unevictable list itself. In that case, vmscan will either fail to lock
> the page or see !page_unevictable() and move the page on the
> unevictable list on behalf of mlock.
Thanks for your kind clarification.

Sincerely,
zhong jiang
> .
>

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 99dd06f..63e1308 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -363,8 +363,12 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
-				list_add(&page->lru, &page_list);
+			if (!isolate_lru_page(page)) {
+				if (PageUnevictable(page))
+					putback_lru_page(page);
+				else
+					list_add(&page->lru, &page_list);
+			}
 		} else
 			deactivate_page(page);
 huge_unlock:
@@ -441,8 +445,12 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
-				list_add(&page->lru, &page_list);
+			if (!isolate_lru_page(page)) {
+				if (PageUnevictable(page))
+					putback_lru_page(page);
+				else
+					list_add(&page->lru, &page_list);
+			}
 		} else
 			deactivate_page(page);
 	}