diff mbox series

mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()

Message ID 20220907121157.GA2954283@ik1-406-35019.vs.sakura.ne.jp (mailing list archive)
State New
Headers show
Series mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() | expand

Commit Message

Naoya Horiguchi Sept. 7, 2022, 12:11 p.m. UTC
On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
> On 07.09.22 12:23, David Hildenbrand wrote:
> > On 07.09.22 12:08, Naoya Horiguchi wrote:
> > > Hi MM folks,
> > 
> > Hi,
> > 
> > > 
> > > When I'm testing memory hotremove with various settings, I found the following
> > > NULL-pointer dereference.  It reproduces easily with the folloing steps:
> > > 
> > >     $ echo offline > /sys/devices/system/memory/memoryN/state
> > >     $ echo 1 > /sys/kernel/debug/split_huge_pages
> > > 
> > 
> > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> > 
> > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> > 
> > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > [526045.840498] page dumped because: unmovable page
> > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > [526056.374837] ------------[ cut here ]------------
> > [526056.379544] kernel BUG at include/linux/mm.h:1248!
> > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> > [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> > [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [526056.507396] PKRU: 55555554
> > [526056.510196] Call Trace:
> > [526056.512734]  <TASK>
> > [526056.514929]  ? simple_setattr+0x40/0x60
> > [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
> > [526056.522529]  ? path_openat+0xb2e/0x1360
> > [526056.526456]  ? do_filp_open+0xa1/0x130
> > [526056.530296]  full_proxy_write+0x50/0x80
> > [526056.534229]  vfs_write+0xd7/0x3e0
> > [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
> > [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.552808]  ksys_write+0x53/0xd0
> > [526056.556215]  do_syscall_64+0x58/0x80
> > [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.569726]  ? do_syscall_64+0x67/0x80
> > [526056.573563]  ? do_syscall_64+0x67/0x80
> > [526056.577404]  ? do_syscall_64+0x67/0x80
> > [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.586120]  ? do_syscall_64+0x67/0x80
> > [526056.589961]  ? do_syscall_64+0x67/0x80
> > [526056.593801]  ? do_syscall_64+0x67/0x80
> > [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [526056.602779] RIP: 0033:0x7fe35ab01c17
> > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> > [526056.669028]  </TASK>
> > 
> > 
> > Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> > 
> > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> > 
> 
> And indeed, something like the following might do the trick:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9414ee57c5b..f42bb51e023a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>                 max_zone_pfn = zone_end_pfn(zone);
>                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>                         int nr_pages;
> -                       if (!pfn_valid(pfn))
> -                               continue;
> -                       page = pfn_to_page(pfn);
> -                       if (!get_page_unless_zero(page))
> +                       page = pfn_to_online_page(pfn);
> +                       if (!page || !get_page_unless_zero(page))
>                                 continue;
>                         if (zone != page_zone(page))
> 

Thank you very much, I confirmed the fix.  So I try to write a patch like below.
I borrowed your debug message because it has more helpful info than mine.
If you have any comment or suggestion in the description, please let me know.

Thanks,
Naoya Horiguchi
---
From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Wed, 7 Sep 2022 20:58:44 +0900
Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
 split_huge_pages_all()

NULL pointer dereference is triggered when calling thp split via debugfs
on the system with offlined memory blocks.  With debug option enabled,
the following kernel messages are printed out:

  page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
  flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
  raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: unmovable page
  page:000000007d7ab72e is uninitialized and poisoned
  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
  ------------[ cut here ]------------
  kernel BUG at include/linux/mm.h:1248!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
  ...
  RIP: 0010:split_huge_pages_write+0xcf4/0xe30

This shows that page_to_nid() in page_zone() is unexpectedly called for an
offlined memmap.

Use pfn_to_online_page() to get struct page in PFN walker.

Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
---
 mm/huge_memory.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comments

David Hildenbrand Sept. 7, 2022, 12:39 p.m. UTC | #1
On 07.09.22 14:11, Naoya Horiguchi wrote:
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
>> On 07.09.22 12:23, David Hildenbrand wrote:
>>> On 07.09.22 12:08, Naoya Horiguchi wrote:
>>>> Hi MM folks,
>>>
>>> Hi,
>>>
>>>>
>>>> When I'm testing memory hotremove with various settings, I found the following
>>>> NULL-pointer dereference.  It reproduces easily with the folloing steps:
>>>>
>>>>      $ echo offline > /sys/devices/system/memory/memoryN/state
>>>>      $ echo 1 > /sys/kernel/debug/split_huge_pages
>>>>
>>>
>>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>>>
>>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>>>
>>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> [526045.840498] page dumped because: unmovable page
>>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
>>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> [526056.374837] ------------[ cut here ]------------
>>> [526056.379544] kernel BUG at include/linux/mm.h:1248!
>>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
>>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
>>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
>>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
>>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
>>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
>>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
>>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
>>> [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
>>> [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
>>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [526056.507396] PKRU: 55555554
>>> [526056.510196] Call Trace:
>>> [526056.512734]  <TASK>
>>> [526056.514929]  ? simple_setattr+0x40/0x60
>>> [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
>>> [526056.522529]  ? path_openat+0xb2e/0x1360
>>> [526056.526456]  ? do_filp_open+0xa1/0x130
>>> [526056.530296]  full_proxy_write+0x50/0x80
>>> [526056.534229]  vfs_write+0xd7/0x3e0
>>> [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
>>> [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.552808]  ksys_write+0x53/0xd0
>>> [526056.556215]  do_syscall_64+0x58/0x80
>>> [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.569726]  ? do_syscall_64+0x67/0x80
>>> [526056.573563]  ? do_syscall_64+0x67/0x80
>>> [526056.577404]  ? do_syscall_64+0x67/0x80
>>> [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.586120]  ? do_syscall_64+0x67/0x80
>>> [526056.589961]  ? do_syscall_64+0x67/0x80
>>> [526056.593801]  ? do_syscall_64+0x67/0x80
>>> [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [526056.602779] RIP: 0033:0x7fe35ab01c17
>>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
>>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
>>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
>>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
>>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
>>> [526056.669028]  </TASK>
>>>
>>>
>>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>>>
>>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>>>
>>
>> And indeed, something like the following might do the trick:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..f42bb51e023a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>>                  max_zone_pfn = zone_end_pfn(zone);
>>                  for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>>                          int nr_pages;
>> -                       if (!pfn_valid(pfn))
>> -                               continue;
>> -                       page = pfn_to_page(pfn);
>> -                       if (!get_page_unless_zero(page))
>> +                       page = pfn_to_online_page(pfn);
>> +                       if (!page || !get_page_unless_zero(page))
>>                                  continue;
>>                          if (zone != page_zone(page))
>>
> 
> Thank you very much, I confirmed the fix.  So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
> 
> Thanks,
> Naoya Horiguchi
> ---
>  From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>   split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>    page dumped because: unmovable page
>    page:000000007d7ab72e is uninitialized and poisoned
>    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>    ------------[ cut here ]------------
>    kernel BUG at include/linux/mm.h:1248!
>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>    ...
>    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>

Feel free to use

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

instead

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
> ---
>   mm/huge_memory.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>   		max_zone_pfn = zone_end_pfn(zone);
>   		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>   			int nr_pages;

Add an empty line here please, while at it.

> -			if (!pfn_valid(pfn))
> -				continue;
> -
> -			page = pfn_to_page(pfn);
> -			if (!get_page_unless_zero(page))
> +			page = pfn_to_online_page(pfn);
> +			if (!page || !get_page_unless_zero(page))
>   				continue;
>   
>   			if (zone != page_zone(page))

Can you send the patch "officially" as well? Thanks
Yang Shi Sept. 7, 2022, 5:10 p.m. UTC | #2
On Wed, Sep 7, 2022 at 5:12 AM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
> > On 07.09.22 12:23, David Hildenbrand wrote:
> > > On 07.09.22 12:08, Naoya Horiguchi wrote:
> > > > Hi MM folks,
> > >
> > > Hi,
> > >
> > > >
> > > > When I'm testing memory hotremove with various settings, I found the following
> > > > NULL-pointer dereference.  It reproduces easily with the folloing steps:
> > > >
> > > >     $ echo offline > /sys/devices/system/memory/memoryN/state
> > > >     $ echo 1 > /sys/kernel/debug/split_huge_pages
> > > >
> > >
> > > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> > >
> > > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> > >
> > > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > [526045.840498] page dumped because: unmovable page
> > > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> > > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > [526056.374837] ------------[ cut here ]------------
> > > [526056.379544] kernel BUG at include/linux/mm.h:1248!
> > > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> > > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> > > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> > > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> > > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> > > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> > > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> > > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> > > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> > > [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> > > [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> > > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [526056.507396] PKRU: 55555554
> > > [526056.510196] Call Trace:
> > > [526056.512734]  <TASK>
> > > [526056.514929]  ? simple_setattr+0x40/0x60
> > > [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
> > > [526056.522529]  ? path_openat+0xb2e/0x1360
> > > [526056.526456]  ? do_filp_open+0xa1/0x130
> > > [526056.530296]  full_proxy_write+0x50/0x80
> > > [526056.534229]  vfs_write+0xd7/0x3e0
> > > [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
> > > [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > > [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.552808]  ksys_write+0x53/0xd0
> > > [526056.556215]  do_syscall_64+0x58/0x80
> > > [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > > [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.569726]  ? do_syscall_64+0x67/0x80
> > > [526056.573563]  ? do_syscall_64+0x67/0x80
> > > [526056.577404]  ? do_syscall_64+0x67/0x80
> > > [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.586120]  ? do_syscall_64+0x67/0x80
> > > [526056.589961]  ? do_syscall_64+0x67/0x80
> > > [526056.593801]  ? do_syscall_64+0x67/0x80
> > > [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [526056.602779] RIP: 0033:0x7fe35ab01c17
> > > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> > > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> > > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> > > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> > > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> > > [526056.669028]  </TASK>
> > >
> > >
> > > Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> > >
> > > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> > >
> >
> > And indeed, something like the following might do the trick:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e9414ee57c5b..f42bb51e023a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
> >                 max_zone_pfn = zone_end_pfn(zone);
> >                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> >                         int nr_pages;
> > -                       if (!pfn_valid(pfn))
> > -                               continue;
> > -                       page = pfn_to_page(pfn);
> > -                       if (!get_page_unless_zero(page))
> > +                       page = pfn_to_online_page(pfn);
> > +                       if (!page || !get_page_unless_zero(page))
> >                                 continue;
> >                         if (zone != page_zone(page))
> >
>
> Thank you very much, I confirmed the fix.  So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
>
> Thanks,
> Naoya Horiguchi
> ---
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
>
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>

Thanks, Naoya and David. Looks good to me. Reviewed-by: Yang Shi
<shy828301@gmail.com>

> ---
>  mm/huge_memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>                 max_zone_pfn = zone_end_pfn(zone);
>                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>                         int nr_pages;
> -                       if (!pfn_valid(pfn))
> -                               continue;
> -
> -                       page = pfn_to_page(pfn);
> -                       if (!get_page_unless_zero(page))
> +                       page = pfn_to_online_page(pfn);
> +                       if (!page || !get_page_unless_zero(page))
>                                 continue;
>
>                         if (zone != page_zone(page))
> --
> 2.26.1.8815.g6a475b71f8
>
>
Michal Hocko Sept. 7, 2022, 5:32 p.m. UTC | #3
On Wed 07-09-22 21:11:57, Naoya Horiguchi wrote:
[...]
> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>

With changes proposed by David
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/huge_memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>  		max_zone_pfn = zone_end_pfn(zone);
>  		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>  			int nr_pages;
> -			if (!pfn_valid(pfn))
> -				continue;
> -
> -			page = pfn_to_page(pfn);
> -			if (!get_page_unless_zero(page))
> +			page = pfn_to_online_page(pfn);
> +			if (!page || !get_page_unless_zero(page))
>  				continue;
>  
>  			if (zone != page_zone(page))
> -- 
> 2.26.1.8815.g6a475b71f8
>
Andrew Morton Sept. 7, 2022, 8:57 p.m. UTC | #4
On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:

> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")

January of 2016!  Or was this a long-term bug which was recently
exposed by some other change?
Miaohe Lin Sept. 8, 2022, 2:19 a.m. UTC | #5
On 2022/9/7 20:11, Naoya Horiguchi wrote:
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
>> On 07.09.22 12:23, David Hildenbrand wrote:
>>> On 07.09.22 12:08, Naoya Horiguchi wrote:
>>>> Hi MM folks,
>>>
>>> Hi,
>>>
>>>>
>>>> When I'm testing memory hotremove with various settings, I found the following
>>>> NULL-pointer dereference.  It reproduces easily with the folloing steps:
>>>>
>>>>     $ echo offline > /sys/devices/system/memory/memoryN/state
>>>>     $ echo 1 > /sys/kernel/debug/split_huge_pages
>>>>
>>>
>>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>>>
>>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>>>
>>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> [526045.840498] page dumped because: unmovable page
>>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
>>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> [526056.374837] ------------[ cut here ]------------
>>> [526056.379544] kernel BUG at include/linux/mm.h:1248!
>>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
>>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
>>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
>>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
>>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
>>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
>>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
>>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
>>> [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
>>> [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
>>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [526056.507396] PKRU: 55555554
>>> [526056.510196] Call Trace:
>>> [526056.512734]  <TASK>
>>> [526056.514929]  ? simple_setattr+0x40/0x60
>>> [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
>>> [526056.522529]  ? path_openat+0xb2e/0x1360
>>> [526056.526456]  ? do_filp_open+0xa1/0x130
>>> [526056.530296]  full_proxy_write+0x50/0x80
>>> [526056.534229]  vfs_write+0xd7/0x3e0
>>> [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
>>> [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.552808]  ksys_write+0x53/0xd0
>>> [526056.556215]  do_syscall_64+0x58/0x80
>>> [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.569726]  ? do_syscall_64+0x67/0x80
>>> [526056.573563]  ? do_syscall_64+0x67/0x80
>>> [526056.577404]  ? do_syscall_64+0x67/0x80
>>> [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.586120]  ? do_syscall_64+0x67/0x80
>>> [526056.589961]  ? do_syscall_64+0x67/0x80
>>> [526056.593801]  ? do_syscall_64+0x67/0x80
>>> [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [526056.602779] RIP: 0033:0x7fe35ab01c17
>>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
>>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
>>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
>>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
>>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
>>> [526056.669028]  </TASK>
>>>
>>>
>>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>>>
>>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>>>
>>
>> And indeed, something like the following might do the trick:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..f42bb51e023a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>>                 max_zone_pfn = zone_end_pfn(zone);
>>                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>>                         int nr_pages;
>> -                       if (!pfn_valid(pfn))
>> -                               continue;
>> -                       page = pfn_to_page(pfn);
>> -                       if (!get_page_unless_zero(page))
>> +                       page = pfn_to_online_page(pfn);
>> +                       if (!page || !get_page_unless_zero(page))
>>                                 continue;
>>                         if (zone != page_zone(page))
>>
> 
> Thank you very much, I confirmed the fix.  So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
> 
> Thanks,
> Naoya Horiguchi
> ---
>>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.

With changes proposed by David, this patch looks good to me.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
*pfn is operated directly* while it's not protected against memory hotremove.

Thanks,
Miaohe Lin


> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/huge_memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>  		max_zone_pfn = zone_end_pfn(zone);
>  		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>  			int nr_pages;
> -			if (!pfn_valid(pfn))
> -				continue;
> -
> -			page = pfn_to_page(pfn);
> -			if (!get_page_unless_zero(page))
> +			page = pfn_to_online_page(pfn);
> +			if (!page || !get_page_unless_zero(page))
>  				continue;
>  
>  			if (zone != page_zone(page))
>
HORIGUCHI NAOYA(堀口 直也) Sept. 8, 2022, 2:39 a.m. UTC | #6
On Wed, Sep 07, 2022 at 02:39:02PM +0200, David Hildenbrand wrote:
> On 07.09.22 14:11, Naoya Horiguchi wrote:
...
> > ---
> >  From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 7 Sep 2022 20:58:44 +0900
> > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >   split_huge_pages_all()
> > 
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks.  With debug option enabled,
> > the following kernel messages are printed out:
> > 
> >    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >    page dumped because: unmovable page
> >    page:000000007d7ab72e is uninitialized and poisoned
> >    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >    ------------[ cut here ]------------
> >    kernel BUG at include/linux/mm.h:1248!
> >    invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >    ...
> >    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > 
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> > 
> > Use pfn_to_online_page() to get struct page in PFN walker.
> > 
> > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> > Suggested-by: David Hildenbrand <david@redhat.com>
> 
> Feel free to use
> 
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Sure, I'll add these.

> instead
> 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   mm/huge_memory.c | 7 ++-----
> >   1 file changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5fa2ba86dae4..03149a8f46f9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> >   		max_zone_pfn = zone_end_pfn(zone);
> >   		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> >   			int nr_pages;
> 
> Add an empty line here please, while at it.

Added it.

> > -			if (!pfn_valid(pfn))
> > -				continue;
> > -
> > -			page = pfn_to_page(pfn);
> > -			if (!get_page_unless_zero(page))
> > +			page = pfn_to_online_page(pfn);
> > +			if (!page || !get_page_unless_zero(page))
> >   				continue;
> >   			if (zone != page_zone(page))
> 
> Can you send the patch "officially" as well? Thanks

Yes, I'll send this later in the normal patch submission manner.

Thanks,
Naoya Horiguchi
HORIGUCHI NAOYA(堀口 直也) Sept. 8, 2022, 2:47 a.m. UTC | #7
On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
> 
> > 
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks.  With debug option enabled,
> > the following kernel messages are printed out:
> > 
> >   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >   page dumped because: unmovable page
> >   page:000000007d7ab72e is uninitialized and poisoned
> >   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >   ------------[ cut here ]------------
> >   kernel BUG at include/linux/mm.h:1248!
> >   invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >   ...
> >   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > 
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> > 
> > Use pfn_to_online_page() to get struct page in PFN walker.
> > 
> > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> 
> January of 2016!  Or was this a long-term bug which was recently
> exposed by some other change?

Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
kernel 5.4.  So I think that marking "5.10+" for stable is fine.

Thanks,
Naoya Horiguchi
HORIGUCHI NAOYA(堀口 直也) Sept. 8, 2022, 3:06 a.m. UTC | #8
On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
> On 2022/9/7 20:11, Naoya Horiguchi wrote:
...
> >>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 7 Sep 2022 20:58:44 +0900
> > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >  split_huge_pages_all()
> > 
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks.  With debug option enabled,
> > the following kernel messages are printed out:
> > 
> >   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >   page dumped because: unmovable page
> >   page:000000007d7ab72e is uninitialized and poisoned
> >   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >   ------------[ cut here ]------------
> >   kernel BUG at include/linux/mm.h:1248!
> >   invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >   ...
> >   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > 
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> > 
> > Use pfn_to_online_page() to get struct page in PFN walker.
> 
> With changes proposed by David, this patch looks good to me.
> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thank you.

> 
> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
> *pfn is operated directly* while it's not protected against memory hotremove.

I had the similar concern, but there seems many place doing PFN walk,
so checking them one-by-one that offlined memory can be walked over
requires much effort.

Thanks,
Naoya Horiguchi
Miaohe Lin Sept. 8, 2022, 3:25 a.m. UTC | #9
On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
> ...
>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>  split_huge_pages_all()
>>>
>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>> on the system with offlined memory blocks.  With debug option enabled,
>>> the following kernel messages are printed out:
>>>
>>>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>   page dumped because: unmovable page
>>>   page:000000007d7ab72e is uninitialized and poisoned
>>>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>   ------------[ cut here ]------------
>>>   kernel BUG at include/linux/mm.h:1248!
>>>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>   ...
>>>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>
>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>> offlined memmap.
>>>
>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>
>> With changes proposed by David, this patch looks good to me.
>>
>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> Thank you.
> 
>>
>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>> *pfn is operated directly* while it's not protected against memory hotremove.
> 
> I had the similar concern, but there seems many place doing PFN walk,
> so checking them one-by-one that offlined memory can be walked over
> requires much effort.

Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)

Thanks,
Miaohe Lin


> 
> Thanks,
> Naoya Horiguchi
>
Oscar Salvador Sept. 8, 2022, 3:28 a.m. UTC | #10
On Wed, Sep 07, 2022 at 09:11:57PM +0900, Naoya Horiguchi wrote:
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>

Reviewed-by: Oscar Salvador <osalvador@suse.de>
David Hildenbrand Sept. 8, 2022, 6:14 a.m. UTC | #11
On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
>> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
>>
>>>
>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>> on the system with offlined memory blocks.  With debug option enabled,
>>> the following kernel messages are printed out:
>>>
>>>    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>    page dumped because: unmovable page
>>>    page:000000007d7ab72e is uninitialized and poisoned
>>>    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>    ------------[ cut here ]------------
>>>    kernel BUG at include/linux/mm.h:1248!
>>>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>    ...
>>>    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>
>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>> offlined memmap.
>>>
>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>
>>> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
>>
>> January of 2016!  Or was this a long-term bug which was recently
>> exposed by some other change?
> 
> Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
> on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
> kernel 5.4.  So I think that marking "5.10+" for stable is fine.


I fixed plenty of such issues in the past. For example:

aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c")

This one just wasn't found so far because it's triggered via
a debug interface.

The correct fixes tag would be:

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")      [visible after d0dc12e86b319]

Exposed "recently" -- 2018 ;)
HORIGUCHI NAOYA(堀口 直也) Sept. 8, 2022, 6:31 a.m. UTC | #12
On Thu, Sep 08, 2022 at 08:14:57AM +0200, David Hildenbrand wrote:
> On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
> > > On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
> > > 
> > > > 
> > > > NULL pointer dereference is triggered when calling thp split via debugfs
> > > > on the system with offlined memory blocks.  With debug option enabled,
> > > > the following kernel messages are printed out:
> > > > 
> > > >    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > > >    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > > >    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > > >    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > >    page dumped because: unmovable page
> > > >    page:000000007d7ab72e is uninitialized and poisoned
> > > >    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > >    ------------[ cut here ]------------
> > > >    kernel BUG at include/linux/mm.h:1248!
> > > >    invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > >    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> > > >    ...
> > > >    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > > > 
> > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > > > offlined memmap.
> > > > 
> > > > Use pfn_to_online_page() to get struct page in PFN walker.
> > > > 
> > > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> > > 
> > > January of 2016!  Or was this a long-term bug which was recently
> > > exposed by some other change?
> > 
> > Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
> > on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
> > kernel 5.4.  So I think that marking "5.10+" for stable is fine.
> 
> 
> I fixed plenty of such issues in the past. For example:
> 
> aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c")
> 
> This one just wasn't found so far because it's triggered via
> a debug interface.
> 
> The correct fixes tag would be:
> 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")      [visible after d0dc12e86b319]
> 
> Exposed "recently" -- 2018 ;)

Great, thank you for pinpointing the commit.

- Naoya Horiguchi
Michal Hocko Sept. 8, 2022, 7:07 a.m. UTC | #13
On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
> >> On 2022/9/7 20:11, Naoya Horiguchi wrote:
> > ...
> >>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>> Date: Wed, 7 Sep 2022 20:58:44 +0900
> >>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >>>  split_huge_pages_all()
> >>>
> >>> NULL pointer dereference is triggered when calling thp split via debugfs
> >>> on the system with offlined memory blocks.  With debug option enabled,
> >>> the following kernel messages are printed out:
> >>>
> >>>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >>>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >>>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >>>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >>>   page dumped because: unmovable page
> >>>   page:000000007d7ab72e is uninitialized and poisoned
> >>>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >>>   ------------[ cut here ]------------
> >>>   kernel BUG at include/linux/mm.h:1248!
> >>>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >>>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >>>   ...
> >>>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> >>>
> >>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> >>> offlined memmap.
> >>>
> >>> Use pfn_to_online_page() to get struct page in PFN walker.
> >>
> >> With changes proposed by David, this patch looks good to me.
> >>
> >> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> > 
> > Thank you.
> > 
> >>
> >> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
> >> *pfn is operated directly* while it's not protected against memory hotremove.
> > 
> > I had the similar concern, but there seems many place doing PFN walk,
> > so checking them one-by-one that offlined memory can be walked over
> > requires much effort.
> 
> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)

Most of those whic are directly triggerable should be taken care of. It
would be still good to go through `git grep -w pfn_to_page' and evaluate 
all callers. Still more than 400 callsites so not a trivial task.
Miaohe Lin Sept. 9, 2022, 12:27 a.m. UTC | #14
On 2022/9/8 15:07, Michal Hocko wrote:
> On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
>>> ...
>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>>>  split_huge_pages_all()
>>>>>
>>>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>>>> on the system with offlined memory blocks.  With debug option enabled,
>>>>> the following kernel messages are printed out:
>>>>>
>>>>>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>>>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>>>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>>>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>   page dumped because: unmovable page
>>>>>   page:000000007d7ab72e is uninitialized and poisoned
>>>>>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>>   ------------[ cut here ]------------
>>>>>   kernel BUG at include/linux/mm.h:1248!
>>>>>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>>>   ...
>>>>>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>>>
>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>>>> offlined memmap.
>>>>>
>>>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>>
>>>> With changes proposed by David, this patch looks good to me.
>>>>
>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>
>>> Thank you.
>>>
>>>>
>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>>>> *pfn is operated directly* while it's not protected against memory hotremove.
>>>
>>> I had the similar concern, but there seems many place doing PFN walk,
>>> so checking them one-by-one that offlined memory can be walked over
>>> requires much effort.
>>
>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
> 
> Most of those whic are directly triggerable should be taken care of. It
> would be still good to go through `git grep -w pfn_to_page' and evaluate 
> all callers. Still more than 400 callsites so not a trivial task.

Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if
there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also
apply to pfn_to_online_page().

Thanks,
Miaohe Lin
David Hildenbrand Sept. 9, 2022, 9:03 a.m. UTC | #15
On 09.09.22 02:27, Miaohe Lin wrote:
> On 2022/9/8 15:07, Michal Hocko wrote:
>> On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
>>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
>>>> ...
>>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>>>>   split_huge_pages_all()
>>>>>>
>>>>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>>>>> on the system with offlined memory blocks.  With debug option enabled,
>>>>>> the following kernel messages are printed out:
>>>>>>
>>>>>>    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>>>>    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>>>>    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>>>>    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>>    page dumped because: unmovable page
>>>>>>    page:000000007d7ab72e is uninitialized and poisoned
>>>>>>    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>>>    ------------[ cut here ]------------
>>>>>>    kernel BUG at include/linux/mm.h:1248!
>>>>>>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>>>    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>>>>    ...
>>>>>>    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>>>>
>>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>>>>> offlined memmap.
>>>>>>
>>>>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>>>
>>>>> With changes proposed by David, this patch looks good to me.
>>>>>
>>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>
>>>> Thank you.
>>>>
>>>>>
>>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>>>>> *pfn is operated directly* while it's not protected against memory hotremove.
>>>>
>>>> I had the similar concern, but there seems many place doing PFN walk,
>>>> so checking them one-by-one that offlined memory can be walked over
>>>> requires much effort.
>>>
>>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
>>
>> Most of those whic are directly triggerable should be taken care of. It
>> would be still good to go through `git grep -w pfn_to_page' and evaluate
>> all callers. Still more than 400 callsites so not a trivial task.
> 
> Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if
> there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also
> apply to pfn_to_online_page().

Racing with offlining+unplug a long known problem and so far we decided 
to ignore it, because it never appeared in practice.

There once were ideas of using RCU as protection.
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa2ba86dae4..03149a8f46f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2894,11 +2894,8 @@  static void split_huge_pages_all(void)
 		max_zone_pfn = zone_end_pfn(zone);
 		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
 			int nr_pages;
-			if (!pfn_valid(pfn))
-				continue;
-
-			page = pfn_to_page(pfn);
-			if (!get_page_unless_zero(page))
+			page = pfn_to_online_page(pfn);
+			if (!page || !get_page_unless_zero(page))
 				continue;
 
 			if (zone != page_zone(page))