diff mbox series

[v2] mm/memory-failure.c: skip huge_zero_page in memory_failure()

Message ID 497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com (mailing list archive)
State New
Headers show
Series [v2] mm/memory-failure.c: skip huge_zero_page in memory_failure() | expand

Commit Message

Xu Yu April 12, 2022, 3:11 p.m. UTC
Kernel panic when injecting memory_failure for the global huge_zero_page,
when CONFIG_DEBUG_VM is enabled, as follows.

[    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
[    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
[    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
[    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
[    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
[    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
[    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
[    5.589398] ------------[ cut here ]------------
[    5.589952] kernel BUG at mm/huge_memory.c:2499!
[    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
[    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
[    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
[    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
[    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
[    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
[    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
[    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
[    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
[    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
[    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
[    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
[    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    5.604806] Call Trace:
[    5.605101]  <TASK>
[    5.605357]  ? __irq_work_queue_local+0x39/0x70
[    5.605904]  try_to_split_thp_page+0x3a/0x130
[    5.606430]  memory_failure+0x128/0x800
[    5.606888]  madvise_inject_error.cold+0x8b/0xa1
[    5.607444]  __x64_sys_madvise+0x54/0x60
[    5.607915]  do_syscall_64+0x35/0x80
[    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    5.608949] RIP: 0033:0x7fc3754f8bf9
[    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
[    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
[    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
[    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
[    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
[    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
[    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
[    5.616626]  </TASK>

This makes huge_zero_page bail out explicitly before split in
memory_failure(), thus the panic above won't happen again.

Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
---
 mm/memory-failure.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Andrew Morton April 12, 2022, 8:01 p.m. UTC | #1
On Tue, 12 Apr 2022 23:11:01 +0800 Xu Yu <xuyu@linux.alibaba.com> wrote:

> Kernel panic when injecting memory_failure for the global huge_zero_page,
> when CONFIG_DEBUG_VM is enabled, as follows.
> 
> ...
>
> 
> Reported-by: Abaci <abaci@linux.alibaba.com>
> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Thanks, I added a cc:stable to this.

If anyone could suggest a suitable Fixes:, that would help the -stable
people.
HORIGUCHI NAOYA(堀口 直也) April 15, 2022, 6:16 a.m. UTC | #2
On Tue, Apr 12, 2022 at 01:01:02PM -0700, Andrew Morton wrote:
> On Tue, 12 Apr 2022 23:11:01 +0800 Xu Yu <xuyu@linux.alibaba.com> wrote:
> 
> > Kernel panic when injecting memory_failure for the global huge_zero_page,
> > when CONFIG_DEBUG_VM is enabled, as follows.
> > 
> > ...
> >
> > 
> > Reported-by: Abaci <abaci@linux.alibaba.com>
> > Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: Miaohe Lin <linmiaohe@huawei.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

Thank you for the patch.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

> 
> Thanks, I added a cc:stable to this.
> 
> If anyone could suggest a suitable Fixes:, that would help the -stable
> people.

The problem was existing from the origin of memory_failure, so the following
Fixes: tag looks fine to me.

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")

Thanks,
Naoya Horiguchi
Miaohe Lin April 16, 2022, 1:47 a.m. UTC | #3
On 2022/4/15 14:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Apr 12, 2022 at 01:01:02PM -0700, Andrew Morton wrote:
>> On Tue, 12 Apr 2022 23:11:01 +0800 Xu Yu <xuyu@linux.alibaba.com> wrote:
>>
>>> Kernel panic when injecting memory_failure for the global huge_zero_page,
>>> when CONFIG_DEBUG_VM is enabled, as follows.
>>>
>>> ...
>>>
>>>
>>> Reported-by: Abaci <abaci@linux.alibaba.com>
>>> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>> Cc: Miaohe Lin <linmiaohe@huawei.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>

LGTM. Thanks.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> 
> Thank you for the patch.
> 
> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
>>
>> Thanks, I added a cc:stable to this.
>>
>> If anyone could suggest a suitable Fixes:, that would help the -stable
>> people.
> 
> The problem was existing from the origin of memory_failure, so the following
> Fixes: tag looks fine to me.
> 
> Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")

Thanks for your Fixes tag. But huge zero page is not even introduced in that commit.
So maybe Fixes: 4a6c1297268c ("thp: huge zero page: basic preparation") which introduced
huge zero page will be more suitable?

Thanks!

> 
> Thanks,
> Naoya Horiguchi
>
Yang Shi April 20, 2022, 11:38 p.m. UTC | #4
On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
>
> Kernel panic when injecting memory_failure for the global huge_zero_page,
> when CONFIG_DEBUG_VM is enabled, as follows.
>
> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> [    5.589398] ------------[ cut here ]------------
> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    5.604806] Call Trace:
> [    5.605101]  <TASK>
> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> [    5.606430]  memory_failure+0x128/0x800
> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> [    5.607444]  __x64_sys_madvise+0x54/0x60
> [    5.607915]  do_syscall_64+0x35/0x80
> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> [    5.616626]  </TASK>
>
> This makes huge_zero_page bail out explicitly before split in
> memory_failure(), thus the panic above won't happen again.

Skipping huge_zero_page in error injection is ok to me, but I'm
actually wondering whether raising BUG is overkilling for splitting
huge_zero_page or not. Returning -EBUSY should be totally fine.

>
> Reported-by: Abaci <abaci@linux.alibaba.com>
> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
>  mm/memory-failure.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index dcb6bb9cf731..8788ea49bd28 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1799,6 +1799,19 @@ int memory_failure(unsigned long pfn, int flags)
>         }
>
>         if (PageTransHuge(hpage)) {
> +               /*
> +                * Bail out before SetPageHasHWPoisoned() if hpage is
> +                * huge_zero_page, although PG_has_hwpoisoned is not
> +                * checked in set_huge_zero_page().
> +                *
> +                * TODO: Handle memory failure of huge_zero_page thoroughly.
> +                */
> +               if (is_huge_zero_page(hpage)) {
> +                       action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> +                       res = -EBUSY;
> +                       goto unlock_mutex;
> +               }
> +
>                 /*
>                  * The flag must be set after the refcount is bumped
>                  * otherwise it may race with THP split.
> --
> 2.20.1.2432.ga663e714
>
>
David Hildenbrand April 21, 2022, 9:03 a.m. UTC | #5
On 21.04.22 01:38, Yang Shi wrote:
> On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
>>
>> Kernel panic when injecting memory_failure for the global huge_zero_page,
>> when CONFIG_DEBUG_VM is enabled, as follows.
>>
>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
>> [    5.589398] ------------[ cut here ]------------
>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [    5.604806] Call Trace:
>> [    5.605101]  <TASK>
>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
>> [    5.606430]  memory_failure+0x128/0x800
>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
>> [    5.607444]  __x64_sys_madvise+0x54/0x60
>> [    5.607915]  do_syscall_64+0x35/0x80
>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
>> [    5.616626]  </TASK>
>>
>> This makes huge_zero_page bail out explicitly before split in
>> memory_failure(), thus the panic above won't happen again.
> 
> Skipping huge_zero_page in error injection is ok to me, but I'm
> actually wondering whether raising BUG is overkilling for splitting
> huge_zero_page or not. Returning -EBUSY should be totally fine.

I tend to agree. Just failing with -EBUSY might be cleaner. Most
probably we want to catch any bogus code here that does something we
don't really expect -- splitting the huge zeropage makes 0 sense.
Anshuman Khandual April 21, 2022, 11:15 a.m. UTC | #6
On 4/12/22 20:41, Xu Yu wrote:
> Kernel panic when injecting memory_failure for the global huge_zero_page,
> when CONFIG_DEBUG_VM is enabled, as follows.
> 
> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> [    5.589398] ------------[ cut here ]------------
> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    5.604806] Call Trace:
> [    5.605101]  <TASK>
> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> [    5.606430]  memory_failure+0x128/0x800
> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> [    5.607444]  __x64_sys_madvise+0x54/0x60
> [    5.607915]  do_syscall_64+0x35/0x80
> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> [    5.616626]  </TASK>
> 
> This makes huge_zero_page bail out explicitly before split in
> memory_failure(), thus the panic above won't happen again.
> 
> Reported-by: Abaci <abaci@linux.alibaba.com>
> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
>  mm/memory-failure.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index dcb6bb9cf731..8788ea49bd28 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1799,6 +1799,19 @@ int memory_failure(unsigned long pfn, int flags)
>  	}
>  
>  	if (PageTransHuge(hpage)) {
> +		/*
> +		 * Bail out before SetPageHasHWPoisoned() if hpage is
> +		 * huge_zero_page, although PG_has_hwpoisoned is not
> +		 * checked in set_huge_zero_page().
> +		 *
> +		 * TODO: Handle memory failure of huge_zero_page thoroughly.
> +		 */
> +		if (is_huge_zero_page(hpage)) {
> +			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);

Probably better to add a new and specific MF_MSG_ZERO_HUGE description instead ?

> +			res = -EBUSY;
> +			goto unlock_mutex;
> +		}
> +
>  		/*
>  		 * The flag must be set after the refcount is bumped
>  		 * otherwise it may race with THP split.
Anshuman Khandual April 21, 2022, 11:18 a.m. UTC | #7
On 4/21/22 14:33, David Hildenbrand wrote:
> On 21.04.22 01:38, Yang Shi wrote:
>> On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
>>>
>>> Kernel panic when injecting memory_failure for the global huge_zero_page,
>>> when CONFIG_DEBUG_VM is enabled, as follows.
>>>
>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
>>> [    5.589398] ------------[ cut here ]------------
>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    5.604806] Call Trace:
>>> [    5.605101]  <TASK>
>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
>>> [    5.606430]  memory_failure+0x128/0x800
>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
>>> [    5.607915]  do_syscall_64+0x35/0x80
>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
>>> [    5.616626]  </TASK>
>>>
>>> This makes huge_zero_page bail out explicitly before split in
>>> memory_failure(), thus the panic above won't happen again.
>>
>> Skipping huge_zero_page in error injection is ok to me, but I'm
>> actually wondering whether raising BUG is overkilling for splitting
>> huge_zero_page or not. Returning -EBUSY should be totally fine.
> 
> I tend to agree. Just failing with -EBUSY might be cleaner. Most
> probably we want to catch any bogus code here that does something we
> don't really expect -- splitting the huge zeropage makes 0 sense.

Agreed, and hence MF_MSG_UNSPLIT_THP as a reason code might not be optimal.
Yang Shi April 21, 2022, 5:53 p.m. UTC | #8
On Thu, Apr 21, 2022 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.04.22 01:38, Yang Shi wrote:
> > On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
> >>
> >> Kernel panic when injecting memory_failure for the global huge_zero_page,
> >> when CONFIG_DEBUG_VM is enabled, as follows.
> >>
> >> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> >> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> >> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> >> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> >> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> >> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> >> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> >> [    5.589398] ------------[ cut here ]------------
> >> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> >> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> >> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> >> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> >> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> >> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> >> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> >> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> >> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> >> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> >> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> >> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> >> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> >> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> [    5.604806] Call Trace:
> >> [    5.605101]  <TASK>
> >> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> >> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> >> [    5.606430]  memory_failure+0x128/0x800
> >> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> >> [    5.607444]  __x64_sys_madvise+0x54/0x60
> >> [    5.607915]  do_syscall_64+0x35/0x80
> >> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> >> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> >> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> >> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> >> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> >> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> >> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> >> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> >> [    5.616626]  </TASK>
> >>
> >> This makes huge_zero_page bail out explicitly before split in
> >> memory_failure(), thus the panic above won't happen again.
> >
> > Skipping huge_zero_page in error injection is ok to me, but I'm
> > actually wondering whether raising BUG is overkilling for splitting
> > huge_zero_page or not. Returning -EBUSY should be totally fine.
>
> I tend to agree. Just failing with -EBUSY might be cleaner. Most
> probably we want to catch any bogus code here that does something we
> don't really expect -- splitting the huge zeropage makes 0 sense.

Yeah, the huge zero page can't be met from normal paths other than
memory failure, but memory failure is a valid caller. So I tend to
replace the BUG to WARN + returning -EBUSY. If we don't care about the
reason code in memory failure, we don't have to touch memory failure.

>
> --
> Thanks,
>
> David / dhildenb
>
>
HORIGUCHI NAOYA(堀口 直也) April 22, 2022, 1 a.m. UTC | #9
On Thu, Apr 21, 2022 at 10:53:43AM -0700, Yang Shi wrote:
> On Thu, Apr 21, 2022 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 21.04.22 01:38, Yang Shi wrote:
> > > On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
> > >>
> > >> Kernel panic when injecting memory_failure for the global huge_zero_page,
> > >> when CONFIG_DEBUG_VM is enabled, as follows.
> > >>
> > >> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> > >> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> > >> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> > >> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> > >> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> > >> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> > >> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> > >> [    5.589398] ------------[ cut here ]------------
> > >> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> > >> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > >> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> > >> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> > >> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> > >> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> > >> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> > >> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> > >> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> > >> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> > >> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> > >> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> > >> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> > >> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> > >> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >> [    5.604806] Call Trace:
> > >> [    5.605101]  <TASK>
> > >> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> > >> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> > >> [    5.606430]  memory_failure+0x128/0x800
> > >> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> > >> [    5.607444]  __x64_sys_madvise+0x54/0x60
> > >> [    5.607915]  do_syscall_64+0x35/0x80
> > >> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> > >> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> > >> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> > >> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> > >> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> > >> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> > >> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> > >> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> > >> [    5.616626]  </TASK>
> > >>
> > >> This makes huge_zero_page bail out explicitly before split in
> > >> memory_failure(), thus the panic above won't happen again.
> > >
> > > Skipping huge_zero_page in error injection is ok to me, but I'm
> > > actually wondering whether raising BUG is overkilling for splitting
> > > huge_zero_page or not. Returning -EBUSY should be totally fine.
> >
> > I tend to agree. Just failing with -EBUSY might be cleaner. Most
> > probably we want to catch any bogus code here that does something we
> > don't really expect -- splitting the huge zeropage makes 0 sense.
> 
> Yeah, the huge zero page can't be met from normal paths other than
> memory failure, but memory failure is a valid caller. So I tend to
> replace the BUG to WARN + returning -EBUSY. If we don't care about the
> reason code in memory failure, we don't have to touch memory failure.

I agree with returning -EBUSY when splitting huge zero page, too.
Thank you for the suggestion.

- Naoya Horiguchi
Xu Yu April 22, 2022, 3:27 a.m. UTC | #10
On 4/22/22 1:53 AM, Yang Shi wrote:
> On Thu, Apr 21, 2022 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 21.04.22 01:38, Yang Shi wrote:
>>> On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
>>>>
>>>> Kernel panic when injecting memory_failure for the global huge_zero_page,
>>>> when CONFIG_DEBUG_VM is enabled, as follows.
>>>>
>>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
>>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
>>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
>>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
>>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
>>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
>>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
>>>> [    5.589398] ------------[ cut here ]------------
>>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
>>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
>>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
>>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
>>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
>>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
>>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
>>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
>>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
>>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
>>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
>>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
>>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
>>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [    5.604806] Call Trace:
>>>> [    5.605101]  <TASK>
>>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
>>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
>>>> [    5.606430]  memory_failure+0x128/0x800
>>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
>>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
>>>> [    5.607915]  do_syscall_64+0x35/0x80
>>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
>>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
>>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
>>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
>>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
>>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
>>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
>>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
>>>> [    5.616626]  </TASK>
>>>>
>>>> This makes huge_zero_page bail out explicitly before split in
>>>> memory_failure(), thus the panic above won't happen again.
>>>
>>> Skipping huge_zero_page in error injection is ok to me, but I'm
>>> actually wondering whether raising BUG is overkilling for splitting
>>> huge_zero_page or not. Returning -EBUSY should be totally fine.
>>
>> I tend to agree. Just failing with -EBUSY might be cleaner. Most
>> probably we want to catch any bogus code here that does something we
>> don't really expect -- splitting the huge zeropage makes 0 sense.
> 
> Yeah, the huge zero page can't be met from normal paths other than
> memory failure, but memory failure is a valid caller. So I tend to
> replace the BUG to WARN + returning -EBUSY. If we don't care about the
> reason code in memory failure, we don't have to touch memory failure.

To make sure I understand this correctly, do you mean the following?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..3bb464509518 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2496,10 +2496,12 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
         int extra_pins, ret;
         pgoff_t end;

-       VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
         VM_BUG_ON_PAGE(!PageLocked(head), head);
         VM_BUG_ON_PAGE(!PageCompound(head), head);

+       if (VM_WARN_ON_ONCE_PAGE(is_huge_zero_page(head), head))
+               return -EBUSY;
+
         if (PageWriteback(head))
                 return -EBUSY;


> 
>>
>> --
>> Thanks,
>>
>> David / dhildenb
>>
>>
Yang Shi April 22, 2022, 4:03 p.m. UTC | #11
On Thu, Apr 21, 2022 at 8:27 PM Yu Xu <xuyu@linux.alibaba.com> wrote:
>
> On 4/22/22 1:53 AM, Yang Shi wrote:
> > On Thu, Apr 21, 2022 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 21.04.22 01:38, Yang Shi wrote:
> >>> On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
> >>>>
> >>>> Kernel panic when injecting memory_failure for the global huge_zero_page,
> >>>> when CONFIG_DEBUG_VM is enabled, as follows.
> >>>>
> >>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> >>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> >>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> >>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> >>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> >>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> >>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> >>>> [    5.589398] ------------[ cut here ]------------
> >>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> >>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> >>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> >>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> >>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> >>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> >>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> >>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> >>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> >>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> >>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> >>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> >>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> >>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>> [    5.604806] Call Trace:
> >>>> [    5.605101]  <TASK>
> >>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> >>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> >>>> [    5.606430]  memory_failure+0x128/0x800
> >>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> >>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
> >>>> [    5.607915]  do_syscall_64+0x35/0x80
> >>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> >>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> >>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> >>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> >>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> >>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> >>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> >>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> >>>> [    5.616626]  </TASK>
> >>>>
> >>>> This makes huge_zero_page bail out explicitly before split in
> >>>> memory_failure(), thus the panic above won't happen again.
> >>>
> >>> Skipping huge_zero_page in error injection is ok to me, but I'm
> >>> actually wondering whether raising BUG is overkilling for splitting
> >>> huge_zero_page or not. Returning -EBUSY should be totally fine.
> >>
> >> I tend to agree. Just failing with -EBUSY might be cleaner. Most
> >> probably we want to catch any bogus code here that does something we
> >> don't really expect -- splitting the huge zeropage makes 0 sense.
> >
> > Yeah, the huge zero page can't be met from normal paths other than
> > memory failure, but memory failure is a valid caller. So I tend to
> > replace the BUG to WARN + returning -EBUSY. If we don't care about the
> > reason code in memory failure, we don't have to touch memory failure.
>
> To make sure I understand this correctly, do you mean the following?

Yes. And you should be able to drop your old patch as well.

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c468fee595ff..3bb464509518 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2496,10 +2496,12 @@ int split_huge_page_to_list(struct page *page, struct
> list_head *list)
>          int extra_pins, ret;
>          pgoff_t end;
>
> -       VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
>          VM_BUG_ON_PAGE(!PageLocked(head), head);
>          VM_BUG_ON_PAGE(!PageCompound(head), head);
>
> +       if (VM_WARN_ON_ONCE_PAGE(is_huge_zero_page(head), head))
> +               return -EBUSY;
> +
>          if (PageWriteback(head))
>                  return -EBUSY;
>
>
> >
> >>
> >> --
> >> Thanks,
> >>
> >> David / dhildenb
> >>
> >>
>
> --
> Thanks,
> Yu
Xu Yu April 22, 2022, 5:37 p.m. UTC | #12
On 4/23/22 12:03 AM, Yang Shi wrote:
> On Thu, Apr 21, 2022 at 8:27 PM Yu Xu <xuyu@linux.alibaba.com> wrote:
>>
>> On 4/22/22 1:53 AM, Yang Shi wrote:
>>> On Thu, Apr 21, 2022 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 21.04.22 01:38, Yang Shi wrote:
>>>>> On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> Kernel panic when injecting memory_failure for the global huge_zero_page,
>>>>>> when CONFIG_DEBUG_VM is enabled, as follows.
>>>>>>
>>>>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
>>>>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
>>>>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
>>>>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
>>>>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
>>>>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
>>>>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
>>>>>> [    5.589398] ------------[ cut here ]------------
>>>>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
>>>>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
>>>>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
>>>>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
>>>>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
>>>>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
>>>>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
>>>>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
>>>>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
>>>>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
>>>>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
>>>>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
>>>>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
>>>>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>> [    5.604806] Call Trace:
>>>>>> [    5.605101]  <TASK>
>>>>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
>>>>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
>>>>>> [    5.606430]  memory_failure+0x128/0x800
>>>>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
>>>>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
>>>>>> [    5.607915]  do_syscall_64+0x35/0x80
>>>>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
>>>>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
>>>>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
>>>>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
>>>>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
>>>>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
>>>>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
>>>>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
>>>>>> [    5.616626]  </TASK>
>>>>>>
>>>>>> This makes huge_zero_page bail out explicitly before split in
>>>>>> memory_failure(), thus the panic above won't happen again.
>>>>>
>>>>> Skipping huge_zero_page in error injection is ok to me, but I'm
>>>>> actually wondering whether raising BUG is overkilling for splitting
>>>>> huge_zero_page or not. Returning -EBUSY should be totally fine.
>>>>
>>>> I tend to agree. Just failing with -EBUSY might be cleaner. Most
>>>> probably we want to catch any bogus code here that does something we
>>>> don't really expect -- splitting the huge zeropage makes 0 sense.
>>>
>>> Yeah, the huge zero page can't be met from normal paths other than
>>> memory failure, but memory failure is a valid caller. So I tend to
>>> replace the BUG to WARN + returning -EBUSY. If we don't care about the
>>> reason code in memory failure, we don't have to touch memory failure.
>>
>> To make sure I understand this correctly, do you mean the following?
> 
> Yes. And you should be able to drop your old patch as well.

Thanks!

And I have one more question, if doing so, the huge_zero_page will be set
PG_has_hwpoisoned in memory_failure(), although PG_has_hwpoisoned is not
checked in set_huge_zero_page(). It works, but isn't it a bit strange?

To this point, as well as you mentioned that "the huge zero page can't be
met from normal paths other than memory failure", I prefer to skip
huge_zero_page in memory_failure().

How do you think?

> 
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c468fee595ff..3bb464509518 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2496,10 +2496,12 @@ int split_huge_page_to_list(struct page *page, struct
>> list_head *list)
>>           int extra_pins, ret;
>>           pgoff_t end;
>>
>> -       VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
>>           VM_BUG_ON_PAGE(!PageLocked(head), head);
>>           VM_BUG_ON_PAGE(!PageCompound(head), head);
>>
>> +       if (VM_WARN_ON_ONCE_PAGE(is_huge_zero_page(head), head))
>> +               return -EBUSY;
>> +
>>           if (PageWriteback(head))
>>                   return -EBUSY;
>>
>>
>>>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> David / dhildenb
>>>>
>>>>
>>
>> --
>> Thanks,
>> Yu
Yang Shi April 22, 2022, 5:58 p.m. UTC | #13
On Fri, Apr 22, 2022 at 10:37 AM Yu Xu <xuyu@linux.alibaba.com> wrote:
>
> On 4/23/22 12:03 AM, Yang Shi wrote:
> > On Thu, Apr 21, 2022 at 8:27 PM Yu Xu <xuyu@linux.alibaba.com> wrote:
> >>
> >> On 4/22/22 1:53 AM, Yang Shi wrote:
> >>> On Thu, Apr 21, 2022 at 2:03 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 21.04.22 01:38, Yang Shi wrote:
> >>>>> On Tue, Apr 12, 2022 at 8:11 AM Xu Yu <xuyu@linux.alibaba.com> wrote:
> >>>>>>
> >>>>>> Kernel panic when injecting memory_failure for the global huge_zero_page,
> >>>>>> when CONFIG_DEBUG_VM is enabled, as follows.
> >>>>>>
> >>>>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> >>>>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> >>>>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> >>>>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> >>>>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> >>>>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> >>>>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> >>>>>> [    5.589398] ------------[ cut here ]------------
> >>>>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> >>>>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >>>>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> >>>>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> >>>>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> >>>>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> >>>>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> >>>>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> >>>>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> >>>>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> >>>>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> >>>>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> >>>>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> >>>>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> >>>>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>>>>> [    5.604806] Call Trace:
> >>>>>> [    5.605101]  <TASK>
> >>>>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> >>>>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> >>>>>> [    5.606430]  memory_failure+0x128/0x800
> >>>>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> >>>>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
> >>>>>> [    5.607915]  do_syscall_64+0x35/0x80
> >>>>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> >>>>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> >>>>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> >>>>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> >>>>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> >>>>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> >>>>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> >>>>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> >>>>>> [    5.616626]  </TASK>
> >>>>>>
> >>>>>> This makes huge_zero_page bail out explicitly before split in
> >>>>>> memory_failure(), thus the panic above won't happen again.
> >>>>>
> >>>>> Skipping huge_zero_page in error injection is ok to me, but I'm
> >>>>> actually wondering whether raising BUG is overkilling for splitting
> >>>>> huge_zero_page or not. Returning -EBUSY should be totally fine.
> >>>>
> >>>> I tend to agree. Just failing with -EBUSY might be cleaner. Most
> >>>> probably we want to catch any bogus code here that does something we
> >>>> don't really expect -- splitting the huge zeropage makes 0 sense.
> >>>
> >>> Yeah, the huge zero page can't be met from normal paths other than
> >>> memory failure, but memory failure is a valid caller. So I tend to
> >>> replace the BUG to WARN + returning -EBUSY. If we don't care about the
> >>> reason code in memory failure, we don't have to touch memory failure.
> >>
> >> To make sure I understand this correctly, do you mean the following?
> >
> > Yes. And you should be able to drop your old patch as well.
>
> Thanks!
>
> And I have one more question, if doing so, the huge_zero_page will be set
> PG_has_hwpoisoned in memory_failure(), although PG_has_hwpoisoned is not
> checked in set_huge_zero_page(). It works, but isn't it a bit strange?

Thanks for noticing this. IMHO it seems like an existing bug. The
anonymous page fault doesn't check if the page is poisoned or not
since it typically gets a fresh allocated page and assumes the
poisoned page (isolated successfully) can't be reallocated again. But
huge zero page and base zero page are reused every time. So no matter
what fix we pick, the issue is always there.

>
> To this point, as well as you mentioned that "the huge zero page can't be
> met from normal paths other than memory failure", I prefer to skip
> huge_zero_page in memory_failure().
>
> How do you think?
>
> >
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index c468fee595ff..3bb464509518 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2496,10 +2496,12 @@ int split_huge_page_to_list(struct page *page, struct
> >> list_head *list)
> >>           int extra_pins, ret;
> >>           pgoff_t end;
> >>
> >> -       VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
> >>           VM_BUG_ON_PAGE(!PageLocked(head), head);
> >>           VM_BUG_ON_PAGE(!PageCompound(head), head);
> >>
> >> +       if (VM_WARN_ON_ONCE_PAGE(is_huge_zero_page(head), head))
> >> +               return -EBUSY;
> >> +
> >>           if (PageWriteback(head))
> >>                   return -EBUSY;
> >>
> >>
> >>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>>
> >>>> David / dhildenb
> >>>>
> >>>>
> >>
> >> --
> >> Thanks,
> >> Yu
>
> --
> Thanks,
> Yu
Xu Yu April 23, 2022, 3:16 a.m. UTC | #14
For this patch v2,

Yang, David, Anshuman and Naoya all agree to fix the bug in
split_huge_page_to_list(), i.e., to replace the BUG to
WARN + returning -EBUSY for huge_zero_page.

However, this patch v2 has just been merged into linux mainline.
Sorry for that I didn't hurry up.

So now, should we continue sending v3, or should we end this
thread, and start a successive patch for optimization?


On 4/12/22 11:11 PM, Xu Yu wrote:
> Kernel panic when injecting memory_failure for the global huge_zero_page,
> when CONFIG_DEBUG_VM is enabled, as follows.
> 
> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
> [    5.589398] ------------[ cut here ]------------
> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    5.604806] Call Trace:
> [    5.605101]  <TASK>
> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
> [    5.605904]  try_to_split_thp_page+0x3a/0x130
> [    5.606430]  memory_failure+0x128/0x800
> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
> [    5.607444]  __x64_sys_madvise+0x54/0x60
> [    5.607915]  do_syscall_64+0x35/0x80
> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [    5.608949] RIP: 0033:0x7fc3754f8bf9
> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
> [    5.616626]  </TASK>
> 
> This makes huge_zero_page bail out explicitly before split in
> memory_failure(), thus the panic above won't happen again.
> 
> Reported-by: Abaci <abaci@linux.alibaba.com>
> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> ---
>   mm/memory-failure.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index dcb6bb9cf731..8788ea49bd28 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1799,6 +1799,19 @@ int memory_failure(unsigned long pfn, int flags)
>   	}
>   
>   	if (PageTransHuge(hpage)) {
> +		/*
> +		 * Bail out before SetPageHasHWPoisoned() if hpage is
> +		 * huge_zero_page, although PG_has_hwpoisoned is not
> +		 * checked in set_huge_zero_page().
> +		 *
> +		 * TODO: Handle memory failure of huge_zero_page thoroughly.
> +		 */
> +		if (is_huge_zero_page(hpage)) {
> +			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> +			res = -EBUSY;
> +			goto unlock_mutex;
> +		}
> +
>   		/*
>   		 * The flag must be set after the refcount is bumped
>   		 * otherwise it may race with THP split.
HORIGUCHI NAOYA(堀口 直也) April 25, 2022, 2:16 p.m. UTC | #15
On Sat, Apr 23, 2022 at 11:16:20AM +0800, Yu Xu wrote:
> For this patch v2,
> 
> Yang, David, Anshuman and Naoya all agree to fix the bug in
> split_huge_page_to_list(), i.e., to replace the BUG to
> WARN + returning -EBUSY for huge_zero_page.
> 
> However, this patch v2 has just been merged into linux mainline.
> Sorry for that I didn't hurry up.
> 
> So now, should we continue sending v3, or should we end this
> thread, and start a successive patch for optimization?

No problem, you can start a new thread for additional patches.
Maybe it's better to have separate patches (starting with one
reverting v2 and then one doing v3).

Thanks,
Naoya Horiguchi
Anshuman Khandual April 26, 2022, 6:48 a.m. UTC | #16
On 4/25/22 19:46, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sat, Apr 23, 2022 at 11:16:20AM +0800, Yu Xu wrote:
>> For this patch v2,
>>
>> Yang, David, Anshuman and Naoya all agree to fix the bug in
>> split_huge_page_to_list(), i.e., to replace the BUG to
>> WARN + returning -EBUSY for huge_zero_page.
>>
>> However, this patch v2 has just been merged into linux mainline.
>> Sorry for that I didn't hurry up.
>>
>> So now, should we continue sending v3, or should we end this
>> thread, and start a successive patch for optimization?
> 
> No problem, you can start a new thread for additional patches.
> Maybe it's better to have separate patches (starting with one
> reverting v2 and then one doing v3).

May be also pull in the following patch which adds MF_MSG_HUGE_ZERO
in the same series ?

https://lore.kernel.org/linux-mm/20220425080306.1771480-1-anshuman.khandual@arm.com/
HORIGUCHI NAOYA(堀口 直也) April 26, 2022, 6:58 a.m. UTC | #17
On Tue, Apr 26, 2022 at 12:18:39PM +0530, Anshuman Khandual wrote:
> 
> 
> On 4/25/22 19:46, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Sat, Apr 23, 2022 at 11:16:20AM +0800, Yu Xu wrote:
> >> For this patch v2,
> >>
> >> Yang, David, Anshuman and Naoya all agree to fix the bug in
> >> split_huge_page_to_list(), i.e., to replace the BUG to
> >> WARN + returning -EBUSY for huge_zero_page.
> >>
> >> However, this patch v2 has just been merged into linux mainline.
> >> Sorry for that I didn't hurry up.
> >>
> >> So now, should we continue sending v3, or should we end this
> >> thread, and start a successive patch for optimization?
> > 
> > No problem, you can start a new thread for additional patches.
> > Maybe it's better to have separate patches (starting with one
> > reverting v2 and then one doing v3).
> 
> May be also pull in the following patch which adds MF_MSG_HUGE_ZERO
> in the same series ?
> 
> https://lore.kernel.org/linux-mm/20220425080306.1771480-1-anshuman.khandual@arm.com/

I don't think so.  I think that Yu's v3 would solely solve the reported
problem, so any check for huge_zero_page in memory_failire() should not
be needed.  (But correct me if I miss something ...)

Thanks,
Naoya Horiguchi
Anshuman Khandual May 2, 2022, 5:59 a.m. UTC | #18
On 4/26/22 12:28, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Apr 26, 2022 at 12:18:39PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 4/25/22 19:46, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Sat, Apr 23, 2022 at 11:16:20AM +0800, Yu Xu wrote:
>>>> For this patch v2,
>>>>
>>>> Yang, David, Anshuman and Naoya all agree to fix the bug in
>>>> split_huge_page_to_list(), i.e., to replace the BUG to
>>>> WARN + returning -EBUSY for huge_zero_page.
>>>>
>>>> However, this patch v2 has just been merged into linux mainline.
>>>> Sorry for that I didn't hurry up.
>>>>
>>>> So now, should we continue sending v3, or should we end this
>>>> thread, and start a successive patch for optimization?
>>>
>>> No problem, you can start a new thread for additional patches.
>>> Maybe it's better to have separate patches (starting with one
>>> reverting v2 and then one doing v3).
>>
>> May be also pull in the following patch which adds MF_MSG_HUGE_ZERO
>> in the same series ?
>>
>> https://lore.kernel.org/linux-mm/20220425080306.1771480-1-anshuman.khandual@arm.com/
> 
> I don't think so.  I think that Yu's v3 would solely solve the reported
> problem, so any check for huge_zero_page in memory_failire() should not
> be needed.  (But correct me if I miss something ...)

Never mind, you are right.
diff mbox series

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dcb6bb9cf731..8788ea49bd28 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1799,6 +1799,19 @@  int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
+		/*
+		 * Bail out before SetPageHasHWPoisoned() if hpage is
+		 * huge_zero_page, although PG_has_hwpoisoned is not
+		 * checked in set_huge_zero_page().
+		 *
+		 * TODO: Handle memory failure of huge_zero_page thoroughly.
+		 */
+		if (is_huge_zero_page(hpage)) {
+			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			res = -EBUSY;
+			goto unlock_mutex;
+		}
+
 		/*
 		 * The flag must be set after the refcount is bumped
 		 * otherwise it may race with THP split.