Message ID | 20220907121157.GA2954283@ik1-406-35019.vs.sakura.ne.jp (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() | expand |
On 07.09.22 14:11, Naoya Horiguchi wrote: > On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote: >> On 07.09.22 12:23, David Hildenbrand wrote: >>> On 07.09.22 12:08, Naoya Horiguchi wrote: >>>> Hi MM folks, >>> >>> Hi, >>> >>>> >>>> When I'm testing memory hotremove with various settings, I found the following >>>> NULL-pointer dereference. It reproduces easily with the folloing steps: >>>> >>>> $ echo offline > /sys/devices/system/memory/memoryN/state >>>> $ echo 1 > /sys/kernel/debug/split_huge_pages >>>> >>> >>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write(). >>> >>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled: >>> >>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 >>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) >>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 >>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 >>> [526045.840498] page dumped because: unmovable page >>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned >>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >>> [526056.374837] ------------[ cut here ]------------ >>> [526056.379544] kernel BUG at include/linux/mm.h:1248! >>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 >>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021 >>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30 >>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff >>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292 >>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000 >>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff >>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08 >>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000 >>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000 >>> [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000 >>> [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0 >>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> [526056.507396] PKRU: 55555554 >>> [526056.510196] Call Trace: >>> [526056.512734] <TASK> >>> [526056.514929] ? simple_setattr+0x40/0x60 >>> [526056.518859] ? vfs_mkobj+0x1b0/0x1c0 >>> [526056.522529] ? path_openat+0xb2e/0x1360 >>> [526056.526456] ? do_filp_open+0xa1/0x130 >>> [526056.530296] full_proxy_write+0x50/0x80 >>> [526056.534229] vfs_write+0xd7/0x3e0 >>> [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50 >>> [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0 >>> [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40 >>> [526056.552808] ksys_write+0x53/0xd0 >>> [526056.556215] do_syscall_64+0x58/0x80 >>> [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0 >>> [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40 >>> [526056.569726] ? do_syscall_64+0x67/0x80 >>> [526056.573563] ? do_syscall_64+0x67/0x80 >>> [526056.577404] ? do_syscall_64+0x67/0x80 >>> [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40 >>> [526056.586120] ? do_syscall_64+0x67/0x80 >>> [526056.589961] ? do_syscall_64+0x67/0x80 >>> [526056.593801] ? do_syscall_64+0x67/0x80 >>> [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd >>> [526056.602779] RIP: 0033:0x7fe35ab01c17 >>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 >>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 >>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17 >>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001 >>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073 >>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 >>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0 >>> [526056.669028] </TASK> >>> >>> >>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong. >>> >>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker. >>> >> >> And indeed, something like the following might do the trick: >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index e9414ee57c5b..f42bb51e023a 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void) >> max_zone_pfn = zone_end_pfn(zone); >> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { >> int nr_pages; >> - if (!pfn_valid(pfn)) >> - continue; >> - page = pfn_to_page(pfn); >> - if (!get_page_unless_zero(page)) >> + page = pfn_to_online_page(pfn); >> + if (!page || !get_page_unless_zero(page)) >> continue; >> if (zone != page_zone(page)) >> > > Thank you very much, I confirmed the fix. So I try to write a patch like below. > I borrowed your debug message because it has more helpful info than mine. > If you have any comment or suggestion in the description, please let me know. > > Thanks, > Naoya Horiguchi > --- > From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > Date: Wed, 7 Sep 2022 20:58:44 +0900 > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > split_huge_pages_all() > > NULL pointer dereference is triggered when calling thp split via debugfs > on the system with offlined memory blocks. With debug option enabled, > the following kernel messages are printed out: > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: unmovable page > page:000000007d7ab72e is uninitialized and poisoned > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > ------------[ cut here ]------------ > kernel BUG at include/linux/mm.h:1248! > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > ... > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > offlined memmap. > > Use pfn_to_online_page() to get struct page in PFN walker. > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > Suggested-by: David Hildenbrand <david@redhat.com> Feel free to use Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> instead > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: <stable@vger.kernel.org> > --- > mm/huge_memory.c | 7 ++----- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 5fa2ba86dae4..03149a8f46f9 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void) > max_zone_pfn = zone_end_pfn(zone); > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { > int nr_pages; Add an empty line here please, while at it. > - if (!pfn_valid(pfn)) > - continue; > - > - page = pfn_to_page(pfn); > - if (!get_page_unless_zero(page)) > + page = pfn_to_online_page(pfn); > + if (!page || !get_page_unless_zero(page)) > continue; > > if (zone != page_zone(page)) Can you send the patch "officially" as well? Thanks
On Wed, Sep 7, 2022 at 5:12 AM Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: > > On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote: > > On 07.09.22 12:23, David Hildenbrand wrote: > > > On 07.09.22 12:08, Naoya Horiguchi wrote: > > > > Hi MM folks, > > > > > > Hi, > > > > > > > > > > > When I'm testing memory hotremove with various settings, I found the following > > > > NULL-pointer dereference. It reproduces easily with the folloing steps: > > > > > > > > $ echo offline > /sys/devices/system/memory/memoryN/state > > > > $ echo 1 > /sys/kernel/debug/split_huge_pages > > > > > > > > > > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write(). > > > > > > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled: > > > > > > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > > > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > > > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > > > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > > > [526045.840498] page dumped because: unmovable page > > > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned > > > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > > [526056.374837] ------------[ cut here ]------------ > > > [526056.379544] kernel BUG at include/linux/mm.h:1248! > > > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI > > > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > > > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021 > > > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff > > > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292 > > > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000 > > > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff > > > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08 > > > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000 > > > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000 > > > [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000 > > > [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0 > > > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > [526056.507396] PKRU: 55555554 > > > [526056.510196] Call Trace: > > > [526056.512734] <TASK> > > > [526056.514929] ? simple_setattr+0x40/0x60 > > > [526056.518859] ? vfs_mkobj+0x1b0/0x1c0 > > > [526056.522529] ? path_openat+0xb2e/0x1360 > > > [526056.526456] ? do_filp_open+0xa1/0x130 > > > [526056.530296] full_proxy_write+0x50/0x80 > > > [526056.534229] vfs_write+0xd7/0x3e0 > > > [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50 > > > [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0 > > > [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40 > > > [526056.552808] ksys_write+0x53/0xd0 > > > [526056.556215] do_syscall_64+0x58/0x80 > > > [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0 > > > [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40 > > > [526056.569726] ? do_syscall_64+0x67/0x80 > > > [526056.573563] ? do_syscall_64+0x67/0x80 > > > [526056.577404] ? do_syscall_64+0x67/0x80 > > > [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40 > > > [526056.586120] ? do_syscall_64+0x67/0x80 > > > [526056.589961] ? do_syscall_64+0x67/0x80 > > > [526056.593801] ? do_syscall_64+0x67/0x80 > > > [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd > > > [526056.602779] RIP: 0033:0x7fe35ab01c17 > > > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > > > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > > > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17 > > > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001 > > > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073 > > > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 > > > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0 > > > [526056.669028] </TASK> > > > > > > > > > Looks like there is a page_to_nid() done in an offline memmap, which is wrong. > > > > > > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker. > > > > > > > And indeed, something like the following might do the trick: > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index e9414ee57c5b..f42bb51e023a 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void) > > max_zone_pfn = zone_end_pfn(zone); > > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { > > int nr_pages; > > - if (!pfn_valid(pfn)) > > - continue; > > - page = pfn_to_page(pfn); > > - if (!get_page_unless_zero(page)) > > + page = pfn_to_online_page(pfn); > > + if (!page || !get_page_unless_zero(page)) > > continue; > > if (zone != page_zone(page)) > > > > Thank you very much, I confirmed the fix. So I try to write a patch like below. > I borrowed your debug message because it has more helpful info than mine. > If you have any comment or suggestion in the description, please let me know. > > Thanks, > Naoya Horiguchi > --- > From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > Date: Wed, 7 Sep 2022 20:58:44 +0900 > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > split_huge_pages_all() > > NULL pointer dereference is triggered when calling thp split via debugfs > on the system with offlined memory blocks. With debug option enabled, > the following kernel messages are printed out: > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: unmovable page > page:000000007d7ab72e is uninitialized and poisoned > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > ------------[ cut here ]------------ > kernel BUG at include/linux/mm.h:1248! > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > ... > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > offlined memmap. > > Use pfn_to_online_page() to get struct page in PFN walker. > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > Suggested-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: <stable@vger.kernel.org> Thanks, Naoya and David. Looks good to me. Reviewed-by: Yang Shi <shy828301@gmail.com> > --- > mm/huge_memory.c | 7 ++----- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 5fa2ba86dae4..03149a8f46f9 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void) > max_zone_pfn = zone_end_pfn(zone); > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { > int nr_pages; > - if (!pfn_valid(pfn)) > - continue; > - > - page = pfn_to_page(pfn); > - if (!get_page_unless_zero(page)) > + page = pfn_to_online_page(pfn); > + if (!page || !get_page_unless_zero(page)) > continue; > > if (zone != page_zone(page)) > -- > 2.26.1.8815.g6a475b71f8 > >
On Wed 07-09-22 21:11:57, Naoya Horiguchi wrote: [...] > >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > Date: Wed, 7 Sep 2022 20:58:44 +0900 > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > split_huge_pages_all() > > NULL pointer dereference is triggered when calling thp split via debugfs > on the system with offlined memory blocks. With debug option enabled, > the following kernel messages are printed out: > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: unmovable page > page:000000007d7ab72e is uninitialized and poisoned > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > ------------[ cut here ]------------ > kernel BUG at include/linux/mm.h:1248! > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > ... > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > offlined memmap. > > Use pfn_to_online_page() to get struct page in PFN walker. > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > Suggested-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: <stable@vger.kernel.org> With changes proposed by David Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > --- > mm/huge_memory.c | 7 ++----- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 5fa2ba86dae4..03149a8f46f9 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void) > max_zone_pfn = zone_end_pfn(zone); > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { > int nr_pages; > - if (!pfn_valid(pfn)) > - continue; > - > - page = pfn_to_page(pfn); > - if (!get_page_unless_zero(page)) > + page = pfn_to_online_page(pfn); > + if (!page || !get_page_unless_zero(page)) > continue; > > if (zone != page_zone(page)) > -- > 2.26.1.8815.g6a475b71f8 >
On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: > > NULL pointer dereference is triggered when calling thp split via debugfs > on the system with offlined memory blocks. With debug option enabled, > the following kernel messages are printed out: > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: unmovable page > page:000000007d7ab72e is uninitialized and poisoned > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > ------------[ cut here ]------------ > kernel BUG at include/linux/mm.h:1248! > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > ... > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > offlined memmap. > > Use pfn_to_online_page() to get struct page in PFN walker. > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") January of 2016! Or was this a long-term bug which was recently exposed by some other change?
On 2022/9/7 20:11, Naoya Horiguchi wrote: > On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote: >> On 07.09.22 12:23, David Hildenbrand wrote: >>> On 07.09.22 12:08, Naoya Horiguchi wrote: >>>> Hi MM folks, >>> >>> Hi, >>> >>>> >>>> When I'm testing memory hotremove with various settings, I found the following >>>> NULL-pointer dereference. It reproduces easily with the folloing steps: >>>> >>>> $ echo offline > /sys/devices/system/memory/memoryN/state >>>> $ echo 1 > /sys/kernel/debug/split_huge_pages >>>> >>> >>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write(). >>> >>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled: >>> >>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 >>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) >>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 >>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 >>> [526045.840498] page dumped because: unmovable page >>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned >>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >>> [526056.374837] ------------[ cut here ]------------ >>> [526056.379544] kernel BUG at include/linux/mm.h:1248! >>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 >>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021 >>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30 >>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff >>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292 >>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000 >>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff >>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08 >>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000 >>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000 >>> [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000 >>> [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0 >>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> [526056.507396] PKRU: 55555554 >>> [526056.510196] Call Trace: >>> [526056.512734] <TASK> >>> [526056.514929] ? simple_setattr+0x40/0x60 >>> [526056.518859] ? vfs_mkobj+0x1b0/0x1c0 >>> [526056.522529] ? path_openat+0xb2e/0x1360 >>> [526056.526456] ? do_filp_open+0xa1/0x130 >>> [526056.530296] full_proxy_write+0x50/0x80 >>> [526056.534229] vfs_write+0xd7/0x3e0 >>> [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50 >>> [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0 >>> [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40 >>> [526056.552808] ksys_write+0x53/0xd0 >>> [526056.556215] do_syscall_64+0x58/0x80 >>> [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0 >>> [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40 >>> [526056.569726] ? do_syscall_64+0x67/0x80 >>> [526056.573563] ? do_syscall_64+0x67/0x80 >>> [526056.577404] ? do_syscall_64+0x67/0x80 >>> [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40 >>> [526056.586120] ? do_syscall_64+0x67/0x80 >>> [526056.589961] ? do_syscall_64+0x67/0x80 >>> [526056.593801] ? do_syscall_64+0x67/0x80 >>> [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd >>> [526056.602779] RIP: 0033:0x7fe35ab01c17 >>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 >>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 >>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17 >>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001 >>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073 >>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 >>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0 >>> [526056.669028] </TASK> >>> >>> >>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong. >>> >>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker. >>> >> >> And indeed, something like the following might do the trick: >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index e9414ee57c5b..f42bb51e023a 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void) >> max_zone_pfn = zone_end_pfn(zone); >> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { >> int nr_pages; >> - if (!pfn_valid(pfn)) >> - continue; >> - page = pfn_to_page(pfn); >> - if (!get_page_unless_zero(page)) >> + page = pfn_to_online_page(pfn); >> + if (!page || !get_page_unless_zero(page)) >> continue; >> if (zone != page_zone(page)) >> > > Thank you very much, I confirmed the fix. So I try to write a patch like below. > I borrowed your debug message because it has more helpful info than mine. > If you have any comment or suggestion in the description, please let me know. > > Thanks, > Naoya Horiguchi > --- >>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > Date: Wed, 7 Sep 2022 20:58:44 +0900 > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > split_huge_pages_all() > > NULL pointer dereference is triggered when calling thp split via debugfs > on the system with offlined memory blocks. With debug option enabled, > the following kernel messages are printed out: > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: unmovable page > page:000000007d7ab72e is uninitialized and poisoned > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > ------------[ cut here ]------------ > kernel BUG at include/linux/mm.h:1248! > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > ... > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > offlined memmap. > > Use pfn_to_online_page() to get struct page in PFN walker. With changes proposed by David, this patch looks good to me. Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where *pfn is operated directly* while it's not protected against memory hotremove. Thanks, Miaohe Lin > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > Suggested-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: <stable@vger.kernel.org> > --- > mm/huge_memory.c | 7 ++----- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 5fa2ba86dae4..03149a8f46f9 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void) > max_zone_pfn = zone_end_pfn(zone); > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { > int nr_pages; > - if (!pfn_valid(pfn)) > - continue; > - > - page = pfn_to_page(pfn); > - if (!get_page_unless_zero(page)) > + page = pfn_to_online_page(pfn); > + if (!page || !get_page_unless_zero(page)) > continue; > > if (zone != page_zone(page)) >
On Wed, Sep 07, 2022 at 02:39:02PM +0200, David Hildenbrand wrote: > On 07.09.22 14:11, Naoya Horiguchi wrote: ... > > --- > > From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > Date: Wed, 7 Sep 2022 20:58:44 +0900 > > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > > split_huge_pages_all() > > > > NULL pointer dereference is triggered when calling thp split via debugfs > > on the system with offlined memory blocks. With debug option enabled, > > the following kernel messages are printed out: > > > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > > page dumped because: unmovable page > > page:000000007d7ab72e is uninitialized and poisoned > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > ------------[ cut here ]------------ > > kernel BUG at include/linux/mm.h:1248! > > invalid opcode: 0000 [#1] PREEMPT SMP PTI > > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > > ... > > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > > offlined memmap. > > > > Use pfn_to_online_page() to get struct page in PFN walker. > > > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > > Suggested-by: David Hildenbrand <david@redhat.com> > > Feel free to use > > Co-developed-by: David Hildenbrand <david@redhat.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Sure, I'll add these. > instead > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > > Cc: <stable@vger.kernel.org> > > --- > > mm/huge_memory.c | 7 ++----- > > 1 file changed, 2 insertions(+), 5 deletions(-) > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index 5fa2ba86dae4..03149a8f46f9 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void) > > max_zone_pfn = zone_end_pfn(zone); > > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { > > int nr_pages; > > Add an empty line here please, while at it. Added it. > > - if (!pfn_valid(pfn)) > > - continue; > > - > > - page = pfn_to_page(pfn); > > - if (!get_page_unless_zero(page)) > > + page = pfn_to_online_page(pfn); > > + if (!page || !get_page_unless_zero(page)) > > continue; > > if (zone != page_zone(page)) > > Can you send the patch "officially" as well? Thanks Yes, I'll send this later in the normal patch submission manner. Thanks, Naoya Horiguchi
On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote: > On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: > > > > > NULL pointer dereference is triggered when calling thp split via debugfs > > on the system with offlined memory blocks. With debug option enabled, > > the following kernel messages are printed out: > > > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > > page dumped because: unmovable page > > page:000000007d7ab72e is uninitialized and poisoned > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > ------------[ cut here ]------------ > > kernel BUG at include/linux/mm.h:1248! > > invalid opcode: 0000 [#1] PREEMPT SMP PTI > > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > > ... > > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > > offlined memmap. > > > > Use pfn_to_online_page() to get struct page in PFN walker. > > > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > > January of 2016! Or was this a long-term bug which was recently > exposed by some other change? Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released on Dec 13, 2020, which might not be so "recent"), but does not reproduce in kernel 5.4. So I think that marking "5.10+" for stable is fine. Thanks, Naoya Horiguchi
On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote: > On 2022/9/7 20:11, Naoya Horiguchi wrote: ... > >>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > Date: Wed, 7 Sep 2022 20:58:44 +0900 > > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > > split_huge_pages_all() > > > > NULL pointer dereference is triggered when calling thp split via debugfs > > on the system with offlined memory blocks. With debug option enabled, > > the following kernel messages are printed out: > > > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > > page dumped because: unmovable page > > page:000000007d7ab72e is uninitialized and poisoned > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > ------------[ cut here ]------------ > > kernel BUG at include/linux/mm.h:1248! > > invalid opcode: 0000 [#1] PREEMPT SMP PTI > > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > > ... > > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > > offlined memmap. > > > > Use pfn_to_online_page() to get struct page in PFN walker. > > With changes proposed by David, this patch looks good to me. > > Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Thank you. > > BTW: IMHO, there might be many similar code places need to take care of memory hotremove where > *pfn is operated directly* while it's not protected against memory hotremove. I had the similar concern, but there seems many place doing PFN walk, so checking them one-by-one that offlined memory can be walked over requires much effort. Thanks, Naoya Horiguchi
On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote: > On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote: >> On 2022/9/7 20:11, Naoya Horiguchi wrote: > ... >>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com> >>> Date: Wed, 7 Sep 2022 20:58:44 +0900 >>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in >>> split_huge_pages_all() >>> >>> NULL pointer dereference is triggered when calling thp split via debugfs >>> on the system with offlined memory blocks. With debug option enabled, >>> the following kernel messages are printed out: >>> >>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 >>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) >>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 >>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 >>> page dumped because: unmovable page >>> page:000000007d7ab72e is uninitialized and poisoned >>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >>> ------------[ cut here ]------------ >>> kernel BUG at include/linux/mm.h:1248! >>> invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 >>> ... >>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30 >>> >>> This shows that page_to_nid() in page_zone() is unexpectedly called for an >>> offlined memmap. >>> >>> Use pfn_to_online_page() to get struct page in PFN walker. >> >> With changes proposed by David, this patch looks good to me. >> >> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> > > Thank you. > >> >> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where >> *pfn is operated directly* while it's not protected against memory hotremove. > > I had the similar concern, but there seems many place doing PFN walk, > so checking them one-by-one that offlined memory can be walked over > requires much effort. Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;) Thanks, Miaohe Lin > > Thanks, > Naoya Horiguchi >
On Wed, Sep 07, 2022 at 09:11:57PM +0900, Naoya Horiguchi wrote: > From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > Date: Wed, 7 Sep 2022 20:58:44 +0900 > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > split_huge_pages_all() > > NULL pointer dereference is triggered when calling thp split via debugfs > on the system with offlined memory blocks. With debug option enabled, > the following kernel messages are printed out: > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: unmovable page > page:000000007d7ab72e is uninitialized and poisoned > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > ------------[ cut here ]------------ > kernel BUG at include/linux/mm.h:1248! > invalid opcode: 0000 [#1] PREEMPT SMP PTI > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > ... > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > offlined memmap. > > Use pfn_to_online_page() to get struct page in PFN walker. > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > Suggested-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > Cc: <stable@vger.kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de>
On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote: > On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote: >> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: >> >>> >>> NULL pointer dereference is triggered when calling thp split via debugfs >>> on the system with offlined memory blocks. With debug option enabled, >>> the following kernel messages are printed out: >>> >>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 >>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) >>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 >>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 >>> page dumped because: unmovable page >>> page:000000007d7ab72e is uninitialized and poisoned >>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >>> ------------[ cut here ]------------ >>> kernel BUG at include/linux/mm.h:1248! >>> invalid opcode: 0000 [#1] PREEMPT SMP PTI >>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 >>> ... >>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30 >>> >>> This shows that page_to_nid() in page_zone() is unexpectedly called for an >>> offlined memmap. >>> >>> Use pfn_to_online_page() to get struct page in PFN walker. >>> >>> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") >> >> January of 2016! Or was this a long-term bug which was recently >> exposed by some other change? > > Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released > on Dec 13, 2020, which might not be so "recent"), but does not reproduce in > kernel 5.4. So I think that marking "5.10+" for stable is fine. I fixed plenty of such issues in the past. For example: aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c") This one just wasn't found so far because it's triggered via a debug interface. The correct fixes tag would be: Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Exposed "recently" -- 2018 ;)
On Thu, Sep 08, 2022 at 08:14:57AM +0200, David Hildenbrand wrote: > On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote: > > > On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote: > > > > > > > > > > > NULL pointer dereference is triggered when calling thp split via debugfs > > > > on the system with offlined memory blocks. With debug option enabled, > > > > the following kernel messages are printed out: > > > > > > > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > > > > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > > > > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > > > > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > > > > page dumped because: unmovable page > > > > page:000000007d7ab72e is uninitialized and poisoned > > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > > > > ------------[ cut here ]------------ > > > > kernel BUG at include/linux/mm.h:1248! > > > > invalid opcode: 0000 [#1] PREEMPT SMP PTI > > > > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > > > > ... > > > > RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > > > > > > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an > > > > offlined memmap. > > > > > > > > Use pfn_to_online_page() to get struct page in PFN walker. > > > > > > > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages") > > > > > > January of 2016! Or was this a long-term bug which was recently > > > exposed by some other change? > > > > Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released > > on Dec 13, 2020, which might not be so "recent"), but does not reproduce in > > kernel 5.4. So I think that marking "5.10+" for stable is fine. > > > I fixed plenty of such issues in the past. For example: > > aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c") > > This one just wasn't found so far because it's triggered via > a debug interface. > > The correct fixes tag would be: > > Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] > > Exposed "recently" -- 2018 ;) Great, thank you for pinpointing the commit. - Naoya Horiguchi
On Thu 08-09-22 11:25:54, Miaohe Lin wrote: > On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote: > >> On 2022/9/7 20:11, Naoya Horiguchi wrote: > > ... > >>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 > >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com> > >>> Date: Wed, 7 Sep 2022 20:58:44 +0900 > >>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in > >>> split_huge_pages_all() > >>> > >>> NULL pointer dereference is triggered when calling thp split via debugfs > >>> on the system with offlined memory blocks. With debug option enabled, > >>> the following kernel messages are printed out: > >>> > >>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 > >>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) > >>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 > >>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > >>> page dumped because: unmovable page > >>> page:000000007d7ab72e is uninitialized and poisoned > >>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > >>> ------------[ cut here ]------------ > >>> kernel BUG at include/linux/mm.h:1248! > >>> invalid opcode: 0000 [#1] PREEMPT SMP PTI > >>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 > >>> ... > >>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30 > >>> > >>> This shows that page_to_nid() in page_zone() is unexpectedly called for an > >>> offlined memmap. > >>> > >>> Use pfn_to_online_page() to get struct page in PFN walker. > >> > >> With changes proposed by David, this patch looks good to me. > >> > >> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> > > > > Thank you. > > > >> > >> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where > >> *pfn is operated directly* while it's not protected against memory hotremove. > > > > I had the similar concern, but there seems many place doing PFN walk, > > so checking them one-by-one that offlined memory can be walked over > > requires much effort. > > Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;) Most of those whic are directly triggerable should be taken care of. It would be still good to go through `git grep -w pfn_to_page' and evaluate all callers. Still more than 400 callsites so not a trivial task.
On 2022/9/8 15:07, Michal Hocko wrote: > On Thu 08-09-22 11:25:54, Miaohe Lin wrote: >> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote: >>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote: >>>> On 2022/9/7 20:11, Naoya Horiguchi wrote: >>> ... >>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 >>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com> >>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900 >>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in >>>>> split_huge_pages_all() >>>>> >>>>> NULL pointer dereference is triggered when calling thp split via debugfs >>>>> on the system with offlined memory blocks. With debug option enabled, >>>>> the following kernel messages are printed out: >>>>> >>>>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 >>>>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) >>>>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 >>>>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 >>>>> page dumped because: unmovable page >>>>> page:000000007d7ab72e is uninitialized and poisoned >>>>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >>>>> ------------[ cut here ]------------ >>>>> kernel BUG at include/linux/mm.h:1248! >>>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI >>>>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 >>>>> ... >>>>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30 >>>>> >>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an >>>>> offlined memmap. >>>>> >>>>> Use pfn_to_online_page() to get struct page in PFN walker. >>>> >>>> With changes proposed by David, this patch looks good to me. >>>> >>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> >>> >>> Thank you. >>> >>>> >>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where >>>> *pfn is operated directly* while it's not protected against memory hotremove. >>> >>> I had the similar concern, but there seems many place doing PFN walk, >>> so checking them one-by-one that offlined memory can be walked over >>> requires much effort. >> >> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;) > > Most of those whic are directly triggerable should be taken care of. It > would be still good to go through `git grep -w pfn_to_page' and evaluate > all callers. Still more than 400 callsites so not a trivial task. Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also apply to pfn_to_online_page(). Thanks, Miaohe Lin
On 09.09.22 02:27, Miaohe Lin wrote: > On 2022/9/8 15:07, Michal Hocko wrote: >> On Thu 08-09-22 11:25:54, Miaohe Lin wrote: >>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote: >>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote: >>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote: >>>> ... >>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001 >>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com> >>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900 >>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in >>>>>> split_huge_pages_all() >>>>>> >>>>>> NULL pointer dereference is triggered when calling thp split via debugfs >>>>>> on the system with offlined memory blocks. With debug option enabled, >>>>>> the following kernel messages are printed out: >>>>>> >>>>>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000 >>>>>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff) >>>>>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000 >>>>>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 >>>>>> page dumped because: unmovable page >>>>>> page:000000007d7ab72e is uninitialized and poisoned >>>>>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >>>>>> ------------[ cut here ]------------ >>>>>> kernel BUG at include/linux/mm.h:1248! >>>>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI >>>>>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41 >>>>>> ... >>>>>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30 >>>>>> >>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an >>>>>> offlined memmap. >>>>>> >>>>>> Use pfn_to_online_page() to get struct page in PFN walker. >>>>> >>>>> With changes proposed by David, this patch looks good to me. >>>>> >>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> >>>> >>>> Thank you. >>>> >>>>> >>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where >>>>> *pfn is operated directly* while it's not protected against memory hotremove. >>>> >>>> I had the similar concern, but there seems many place doing PFN walk, >>>> so checking them one-by-one that offlined memory can be walked over >>>> requires much effort. >>> >>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;) >> >> Most of those whic are directly triggerable should be taken care of. It >> would be still good to go through `git grep -w pfn_to_page' and evaluate >> all callers. Still more than 400 callsites so not a trivial task. > > Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if > there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also > apply to pfn_to_online_page(). Racing with offlining+unplug a long known problem and so far we decided to ignore it, because it never appeared in practice. There once were ideas of using RCU as protection.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5fa2ba86dae4..03149a8f46f9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void) max_zone_pfn = zone_end_pfn(zone); for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) { int nr_pages; - if (!pfn_valid(pfn)) - continue; - - page = pfn_to_page(pfn); - if (!get_page_unless_zero(page)) + page = pfn_to_online_page(pfn); + if (!page || !get_page_unless_zero(page)) continue; if (zone != page_zone(page))