Message ID | 20181220203156.43441-1-cai@lca.pw (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] mm/page_owner: fix for deferred struct page init | expand |
> On Dec 20, 2018, at 1:31 PM, Qian Cai <cai@lca.pw> wrote: > > diff --git a/mm/page_ext.c b/mm/page_ext.c > index ae44f7adbe07..d76fd51e312a 100644 > --- a/mm/page_ext.c > +++ b/mm/page_ext.c > @@ -399,9 +399,8 @@ void __init page_ext_init(void) > * -------------pfn--------------> > * N0 | N1 | N2 | N0 | N1 | N2|.... > * > - * Take into account DEFERRED_STRUCT_PAGE_INIT. > */ > - if (early_pfn_to_nid(pfn) != nid) > + if (pfn_to_nid(pfn) != nid) > continue; > if (init_section_page_ext(pfn, nid)) > goto oom; > -- > 2.17.2 (Apple Git-113) > Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the same issue seen in v2?
On Thu, 2018-12-20 at 14:00 -0700, William Kucharski wrote: > > On Dec 20, 2018, at 1:31 PM, Qian Cai <cai@lca.pw> wrote: > > > > diff --git a/mm/page_ext.c b/mm/page_ext.c > > index ae44f7adbe07..d76fd51e312a 100644 > > --- a/mm/page_ext.c > > +++ b/mm/page_ext.c > > @@ -399,9 +399,8 @@ void __init page_ext_init(void) > > * -------------pfn--------------> > > * N0 | N1 | N2 | N0 | N1 | N2|.... > > * > > - * Take into account DEFERRED_STRUCT_PAGE_INIT. > > */ > > - if (early_pfn_to_nid(pfn) != nid) > > + if (pfn_to_nid(pfn) != nid) > > continue; > > if (init_section_page_ext(pfn, nid)) > > goto oom; > > -- > > 2.17.2 (Apple Git-113) > > > > Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h > (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the > same issue seen in v2? > No. If CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, page_ext_init() is called after page_alloc_init_late() where all the memory has already been initialized, so page_to_nid() will work then.
On Thu 20-12-18 15:31:56, Qian Cai wrote: > When booting a system with "page_owner=on", > > start_kernel > page_ext_init > invoke_init_callbacks > init_section_page_ext > init_page_owner > init_early_allocated_pages > init_zones_in_node > init_pages_in_zone > lookup_page_ext > page_to_nid > > The issue here is that page_to_nid() will not work since some page > flags have no node information until later in page_alloc_init_late() due > to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds > access with an invalid nid. > > [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50 > [ 8.672603] index 7 is out of range for type 'zone [5]' > > Also, kernel will panic since flags were poisoned earlier with, > > CONFIG_DEBUG_VM_PGFLAGS=y > CONFIG_NODE_NOT_IN_PAGE_FLAGS=n > > start_kernel > setup_arch > pagetable_init > paging_init > sparse_init > sparse_init_nid > memblock_alloc_try_nid_raw > > Although later it tries to set page flags for pages in reserved bootmem > regions, > > mm_init > mem_init > memblock_free_all > free_low_memory_core_early > reserve_bootmem_region > > there could still have some freed pages from the page allocator but yet > to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been > dealt with a bit in page_ext_init(). > > * Take into account DEFERRED_STRUCT_PAGE_INIT. > */ > if (early_pfn_to_nid(pfn) != nid) > continue; > > However, it did not handle it well in init_pages_in_zone() which end up > calling page_to_nid(). > > [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned > [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff > ffffffffffffffff > [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff > ffffffffffffffff > [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > [ 11.926498] page_owner info is not active (free page?) > [ 12.329560] kernel BUG at include/linux/mm.h:990! > [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520 > > Since there is no other routines depend on page_ext_init() in > start_kernel(), just move it after page_alloc_init_late() to ensure that > there is no deferred pages need to de dealt with. If deselected > DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init() > earlier, so page owner could catch more early page allocation call > sites. This gives us a good compromise between catching good and bad > call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT. > > [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ > > Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init) > Suggested-by: Michal Hocko <mhocko@kernel.org> > Signed-off-by: Qian Cai <cai@lca.pw> > --- > > v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n. > > v2: postpone page_ext_init() to after page_alloc_init_late(). > > init/main.c | 5 +++++ > mm/page_ext.c | 3 +-- > 2 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/init/main.c b/init/main.c > index 2b7b7fe173c9..5d9904370f76 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void) > initrd_start = 0; > } > #endif > +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT > page_ext_init(); > +#endif > kmemleak_init(); > setup_per_cpu_pageset(); > numa_policy_init(); > @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void) > sched_init_smp(); > > page_alloc_init_late(); > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > + page_ext_init(); > +#endif > > do_basic_setup(); Is this really necessary? Why cannot we simply postpone page_ext_init unconditioanally? > > diff --git a/mm/page_ext.c b/mm/page_ext.c > index ae44f7adbe07..d76fd51e312a 100644 > --- a/mm/page_ext.c > +++ b/mm/page_ext.c > @@ -399,9 +399,8 @@ void __init page_ext_init(void) > * -------------pfn--------------> > * N0 | N1 | N2 | N0 | N1 | N2|.... > * > - * Take into account DEFERRED_STRUCT_PAGE_INIT. > */ > - if (early_pfn_to_nid(pfn) != nid) > + if (pfn_to_nid(pfn) != nid) > continue; > if (init_section_page_ext(pfn, nid)) > goto oom; Also this doesn't seem to be related, right?
On 1/3/19 6:51 AM, Michal Hocko wrote: > On Thu 20-12-18 15:31:56, Qian Cai wrote: >> When booting a system with "page_owner=on", >> >> start_kernel >> page_ext_init >> invoke_init_callbacks >> init_section_page_ext >> init_page_owner >> init_early_allocated_pages >> init_zones_in_node >> init_pages_in_zone >> lookup_page_ext >> page_to_nid >> >> The issue here is that page_to_nid() will not work since some page >> flags have no node information until later in page_alloc_init_late() due >> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds >> access with an invalid nid. >> >> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50 >> [ 8.672603] index 7 is out of range for type 'zone [5]' >> >> Also, kernel will panic since flags were poisoned earlier with, >> >> CONFIG_DEBUG_VM_PGFLAGS=y >> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n >> >> start_kernel >> setup_arch >> pagetable_init >> paging_init >> sparse_init >> sparse_init_nid >> memblock_alloc_try_nid_raw >> >> Although later it tries to set page flags for pages in reserved bootmem >> regions, >> >> mm_init >> mem_init >> memblock_free_all >> free_low_memory_core_early >> reserve_bootmem_region >> >> there could still have some freed pages from the page allocator but yet >> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been >> dealt with a bit in page_ext_init(). >> >> * Take into account DEFERRED_STRUCT_PAGE_INIT. >> */ >> if (early_pfn_to_nid(pfn) != nid) >> continue; >> >> However, it did not handle it well in init_pages_in_zone() which end up >> calling page_to_nid(). >> >> [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned >> [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff >> ffffffffffffffff >> [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff >> ffffffffffffffff >> [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) >> [ 11.926498] page_owner info is not active (free page?) >> [ 12.329560] kernel BUG at include/linux/mm.h:990! >> [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520 >> >> Since there is no other routines depend on page_ext_init() in >> start_kernel(), just move it after page_alloc_init_late() to ensure that >> there is no deferred pages need to de dealt with. If deselected >> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init() >> earlier, so page owner could catch more early page allocation call >> sites. This gives us a good compromise between catching good and bad >> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT. >> >> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ >> >> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init) >> Suggested-by: Michal Hocko <mhocko@kernel.org> >> Signed-off-by: Qian Cai <cai@lca.pw> >> --- >> >> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n. >> >> v2: postpone page_ext_init() to after page_alloc_init_late(). >> >> init/main.c | 5 +++++ >> mm/page_ext.c | 3 +-- >> 2 files changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/init/main.c b/init/main.c >> index 2b7b7fe173c9..5d9904370f76 100644 >> --- a/init/main.c >> +++ b/init/main.c >> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void) >> initrd_start = 0; >> } >> #endif >> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT >> page_ext_init(); >> +#endif >> kmemleak_init(); >> setup_per_cpu_pageset(); >> numa_policy_init(); >> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void) >> sched_init_smp(); >> >> page_alloc_init_late(); >> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT >> + page_ext_init(); >> +#endif >> >> do_basic_setup(); > > Is this really necessary? Why cannot we simply postpone page_ext_init > unconditioanally? As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init() earlier, so page owner could catch more early page allocation call sites." > >> >> diff --git a/mm/page_ext.c b/mm/page_ext.c >> index ae44f7adbe07..d76fd51e312a 100644 >> --- a/mm/page_ext.c >> +++ b/mm/page_ext.c >> @@ -399,9 +399,8 @@ void __init page_ext_init(void) >> * -------------pfn--------------> >> * N0 | N1 | N2 | N0 | N1 | N2|.... >> * >> - * Take into account DEFERRED_STRUCT_PAGE_INIT. >> */ >> - if (early_pfn_to_nid(pfn) != nid) >> + if (pfn_to_nid(pfn) != nid) >> continue; >> if (init_section_page_ext(pfn, nid)) >> goto oom; > > Also this doesn't seem to be related, right? No, it is related. Because of this patch, page_ext_init() is called after all the memory has already been initialized, so no longer necessary to call early_pfn_to_nid().
On Thu 03-01-19 11:38:31, Qian Cai wrote: > On 1/3/19 6:51 AM, Michal Hocko wrote: > > On Thu 20-12-18 15:31:56, Qian Cai wrote: > >> When booting a system with "page_owner=on", > >> > >> start_kernel > >> page_ext_init > >> invoke_init_callbacks > >> init_section_page_ext > >> init_page_owner > >> init_early_allocated_pages > >> init_zones_in_node > >> init_pages_in_zone > >> lookup_page_ext > >> page_to_nid > >> > >> The issue here is that page_to_nid() will not work since some page > >> flags have no node information until later in page_alloc_init_late() due > >> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds > >> access with an invalid nid. > >> > >> [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50 > >> [ 8.672603] index 7 is out of range for type 'zone [5]' > >> > >> Also, kernel will panic since flags were poisoned earlier with, > >> > >> CONFIG_DEBUG_VM_PGFLAGS=y > >> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n > >> > >> start_kernel > >> setup_arch > >> pagetable_init > >> paging_init > >> sparse_init > >> sparse_init_nid > >> memblock_alloc_try_nid_raw > >> > >> Although later it tries to set page flags for pages in reserved bootmem > >> regions, > >> > >> mm_init > >> mem_init > >> memblock_free_all > >> free_low_memory_core_early > >> reserve_bootmem_region > >> > >> there could still have some freed pages from the page allocator but yet > >> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been > >> dealt with a bit in page_ext_init(). > >> > >> * Take into account DEFERRED_STRUCT_PAGE_INIT. > >> */ > >> if (early_pfn_to_nid(pfn) != nid) > >> continue; > >> > >> However, it did not handle it well in init_pages_in_zone() which end up > >> calling page_to_nid(). > >> > >> [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned > >> [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff > >> ffffffffffffffff > >> [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff > >> ffffffffffffffff > >> [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) > >> [ 11.926498] page_owner info is not active (free page?) > >> [ 12.329560] kernel BUG at include/linux/mm.h:990! > >> [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520 > >> > >> Since there is no other routines depend on page_ext_init() in > >> start_kernel(), just move it after page_alloc_init_late() to ensure that > >> there is no deferred pages need to de dealt with. If deselected > >> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init() > >> earlier, so page owner could catch more early page allocation call > >> sites. This gives us a good compromise between catching good and bad > >> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT. > >> > >> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ > >> > >> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init) > >> Suggested-by: Michal Hocko <mhocko@kernel.org> > >> Signed-off-by: Qian Cai <cai@lca.pw> > >> --- > >> > >> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n. > >> > >> v2: postpone page_ext_init() to after page_alloc_init_late(). > >> > >> init/main.c | 5 +++++ > >> mm/page_ext.c | 3 +-- > >> 2 files changed, 6 insertions(+), 2 deletions(-) > >> > >> diff --git a/init/main.c b/init/main.c > >> index 2b7b7fe173c9..5d9904370f76 100644 > >> --- a/init/main.c > >> +++ b/init/main.c > >> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void) > >> initrd_start = 0; > >> } > >> #endif > >> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT > >> page_ext_init(); > >> +#endif > >> kmemleak_init(); > >> setup_per_cpu_pageset(); > >> numa_policy_init(); > >> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void) > >> sched_init_smp(); > >> > >> page_alloc_init_late(); > >> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > >> + page_ext_init(); > >> +#endif > >> > >> do_basic_setup(); > > > > Is this really necessary? Why cannot we simply postpone page_ext_init > > unconditioanally? > > As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better > to call page_ext_init() earlier, so page owner could catch more early page > allocation call sites." Do you have any numbers to show how many allocation are we losing that way? In other words, do we care enough to create an ugly code? > >> diff --git a/mm/page_ext.c b/mm/page_ext.c > >> index ae44f7adbe07..d76fd51e312a 100644 > >> --- a/mm/page_ext.c > >> +++ b/mm/page_ext.c > >> @@ -399,9 +399,8 @@ void __init page_ext_init(void) > >> * -------------pfn--------------> > >> * N0 | N1 | N2 | N0 | N1 | N2|.... > >> * > >> - * Take into account DEFERRED_STRUCT_PAGE_INIT. > >> */ > >> - if (early_pfn_to_nid(pfn) != nid) > >> + if (pfn_to_nid(pfn) != nid) > >> continue; > >> if (init_section_page_ext(pfn, nid)) > >> goto oom; > > > > Also this doesn't seem to be related, right? > > No, it is related. Because of this patch, page_ext_init() is called after all > the memory has already been initialized, > so no longer necessary to call early_pfn_to_nid(). Yes, but it looks like a follow up cleanup/optimization to me.
On 1/3/19 11:59 AM, Michal Hocko wrote: >> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better >> to call page_ext_init() earlier, so page owner could catch more early page >> allocation call sites." > > Do you have any numbers to show how many allocation are we losing that > way? In other words, do we care enough to create an ugly code? Well, I don't have any numbers, but I read that Joonsoo did not really like to defer page_ext_init() unconditionally. "because deferring page_ext_init() would make page owner which uses page_ext miss some early page allocation callsites. Although it already miss some early page allocation callsites, we don't need to miss more." https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/ >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c >>>> index ae44f7adbe07..d76fd51e312a 100644 >>>> --- a/mm/page_ext.c >>>> +++ b/mm/page_ext.c >>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void) >>>> * -------------pfn--------------> >>>> * N0 | N1 | N2 | N0 | N1 | N2|.... >>>> * >>>> - * Take into account DEFERRED_STRUCT_PAGE_INIT. >>>> */ >>>> - if (early_pfn_to_nid(pfn) != nid) >>>> + if (pfn_to_nid(pfn) != nid) >>>> continue; >>>> if (init_section_page_ext(pfn, nid)) >>>> goto oom; >>> >>> Also this doesn't seem to be related, right? >> >> No, it is related. Because of this patch, page_ext_init() is called after all >> the memory has already been initialized, >> so no longer necessary to call early_pfn_to_nid(). > > Yes, but it looks like a follow up cleanup/optimization to me. That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init) which also messed up the order of page_ext_init() in start_kernel(), so this patch basically revert that commit.
On Thu 03-01-19 12:38:59, Qian Cai wrote: > On 1/3/19 11:59 AM, Michal Hocko wrote: > >> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better > >> to call page_ext_init() earlier, so page owner could catch more early page > >> allocation call sites." > > > > Do you have any numbers to show how many allocation are we losing that > > way? In other words, do we care enough to create an ugly code? > > Well, I don't have any numbers, but I read that Joonsoo did not really like to > defer page_ext_init() unconditionally. > > "because deferring page_ext_init() would make page owner which uses page_ext > miss some early page allocation callsites. Although it already miss some early > page allocation callsites, we don't need to miss more." This is quite unspecific. > https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/ > > >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c > >>>> index ae44f7adbe07..d76fd51e312a 100644 > >>>> --- a/mm/page_ext.c > >>>> +++ b/mm/page_ext.c > >>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void) > >>>> * -------------pfn--------------> > >>>> * N0 | N1 | N2 | N0 | N1 | N2|.... > >>>> * > >>>> - * Take into account DEFERRED_STRUCT_PAGE_INIT. > >>>> */ > >>>> - if (early_pfn_to_nid(pfn) != nid) > >>>> + if (pfn_to_nid(pfn) != nid) > >>>> continue; > >>>> if (init_section_page_ext(pfn, nid)) > >>>> goto oom; > >>> > >>> Also this doesn't seem to be related, right? > >> > >> No, it is related. Because of this patch, page_ext_init() is called after all > >> the memory has already been initialized, > >> so no longer necessary to call early_pfn_to_nid(). > > > > Yes, but it looks like a follow up cleanup/optimization to me. > > That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid > in page_ext_init) which also messed up the order of page_ext_init() in > start_kernel(), so this patch basically revert that commit. So can we make the revert with an explanation that the patch was wrong? If we want to make hacks to catch more objects to be tracked then it would be great to have some numbers in hands.
On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an explanation that the patch was wrong? > If we want to make hacks to catch more objects to be tracked then it > would be great to have some numbers in hands. Well, those numbers are subject to change depends on future start_kernel() order. Right now, there are many functions could be caught earlier by page owner. kmemleak_init(); debug_objects_mem_init(); setup_per_cpu_pageset(); numa_policy_init(); acpi_early_init(); if (late_time_init) late_time_init(); sched_clock_init(); calibrate_delay(); pid_idr_init(); anon_vma_init(); #ifdef CONFIG_X86 if (efi_enabled(EFI_RUNTIME_SERVICES)) efi_enter_virtual_mode(); #endif thread_stack_cache_init(); cred_init(); fork_init(); proc_caches_init(); uts_ns_init(); buffer_init(); key_init(); security_init(); dbg_late_init(); vfs_caches_init(); pagecache_init(); signals_init(); seq_file_init(); proc_root_init(); nsfs_init(); cpuset_init(); cgroup_init(); taskstats_init_early(); delayacct_init(); check_bugs(); acpi_subsystem_init(); arch_post_acpi_subsys_init(); sfi_init_late(); if (efi_enabled(EFI_RUNTIME_SERVICES)) { efi_free_boot_services(); rcu_scheduler_starting(); /* * Wait until kthreadd is all set-up. */ wait_for_completion(&kthreadd_done); /* Now the scheduler is fully set up and can do blocking allocations */ gfp_allowed_mask = __GFP_BITS_MASK; /* * init can allocate pages on any node */ set_mems_allowed(node_states[N_MEMORY]); cad_pid = task_pid(current); smp_prepare_cpus(setup_max_cpus); workqueue_init(); init_mm_internals(); do_pre_smp_initcalls(); lockup_detector_init(); smp_init(); sched_init_smp();
On Thu 03-01-19 14:53:47, Qian Cai wrote: > On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an > explanation that the patch was wrong? > > If we want to make hacks to catch more objects to be tracked then it > > would be great to have some numbers in hands. > > Well, those numbers are subject to change depends on future start_kernel() > order. Right now, there are many functions could be caught earlier by page owner. > > kmemleak_init(); [...] > sched_init_smp(); The kernel source dump will not tell us much of course. A ball park number whether we are talking about dozen, hundreds or thousands of allocations would tell us something at least, doesn't it. Handwaving that it might help us some is not particurarly useful. We are already losing some allocations already. Does it matter? Well, that depends, sometimes we do want to catch an owner of particular page and it is sad to find nothing. But how many times have you or somebody else encountered that in practice. That is exactly a useful information to judge an ugly ifdefery in the code. See my point?
On 1/3/19 3:22 PM, Michal Hocko wrote: > On Thu 03-01-19 14:53:47, Qian Cai wrote: >> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an >> explanation that the patch was wrong? >>> If we want to make hacks to catch more objects to be tracked then it >>> would be great to have some numbers in hands. >> >> Well, those numbers are subject to change depends on future start_kernel() >> order. Right now, there are many functions could be caught earlier by page owner. >> >> kmemleak_init(); > [...] >> sched_init_smp(); > > The kernel source dump will not tell us much of course. A ball park > number whether we are talking about dozen, hundreds or thousands of > allocations would tell us something at least, doesn't it. > > Handwaving that it might help us some is not particurarly useful. We are > already losing some allocations already. Does it matter? Well, that > depends, sometimes we do want to catch an owner of particular page and > it is sad to find nothing. But how many times have you or somebody else > encountered that in practice. That is exactly a useful information to > judge an ugly ifdefery in the code. See my point? Here is the number without DEFERRED_STRUCT_PAGE_INIT. == page_ext_init() after page_alloc_init_late() == Node 0, zone DMA: page owner found early allocated 0 pages Node 0, zone DMA32: page owner found early allocated 7009 pages Node 0, zone Normal: page owner found early allocated 85827 pages Node 4, zone Normal: page owner found early allocated 75063 pages == page_ext_init() before kmemleak_init() == Node 0, zone DMA: page owner found early allocated 0 pages Node 0, zone DMA32: page owner found early allocated 6654 pages Node 0, zone Normal: page owner found early allocated 41907 pages Node 4, zone Normal: page owner found early allocated 41356 pages So, it told us that it will miss tens of thousands of early page allocation call sites.
On Thu 03-01-19 17:22:29, Qian Cai wrote: > On 1/3/19 3:22 PM, Michal Hocko wrote: > > On Thu 03-01-19 14:53:47, Qian Cai wrote: > >> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an > >> explanation that the patch was wrong? > >>> If we want to make hacks to catch more objects to be tracked then it > >>> would be great to have some numbers in hands. > >> > >> Well, those numbers are subject to change depends on future start_kernel() > >> order. Right now, there are many functions could be caught earlier by page owner. > >> > >> kmemleak_init(); > > [...] > >> sched_init_smp(); > > > > The kernel source dump will not tell us much of course. A ball park > > number whether we are talking about dozen, hundreds or thousands of > > allocations would tell us something at least, doesn't it. > > > > Handwaving that it might help us some is not particurarly useful. We are > > already losing some allocations already. Does it matter? Well, that > > depends, sometimes we do want to catch an owner of particular page and > > it is sad to find nothing. But how many times have you or somebody else > > encountered that in practice. That is exactly a useful information to > > judge an ugly ifdefery in the code. See my point? > > Here is the number without DEFERRED_STRUCT_PAGE_INIT. > > == page_ext_init() after page_alloc_init_late() == > Node 0, zone DMA: page owner found early allocated 0 pages > Node 0, zone DMA32: page owner found early allocated 7009 pages > Node 0, zone Normal: page owner found early allocated 85827 pages > Node 4, zone Normal: page owner found early allocated 75063 pages > > == page_ext_init() before kmemleak_init() == > Node 0, zone DMA: page owner found early allocated 0 pages > Node 0, zone DMA32: page owner found early allocated 6654 pages > Node 0, zone Normal: page owner found early allocated 41907 pages > Node 4, zone Normal: page owner found early allocated 41356 pages > > So, it told us that it will miss tens of thousands of early page allocation call > sites. This is an answer for the first part of the question (how much). The second is _do_we_care_?
On 1/4/19 8:09 AM, Michal Hocko wrote: >> Here is the number without DEFERRED_STRUCT_PAGE_INIT. >> >> == page_ext_init() after page_alloc_init_late() == >> Node 0, zone DMA: page owner found early allocated 0 pages >> Node 0, zone DMA32: page owner found early allocated 7009 pages >> Node 0, zone Normal: page owner found early allocated 85827 pages >> Node 4, zone Normal: page owner found early allocated 75063 pages >> >> == page_ext_init() before kmemleak_init() == >> Node 0, zone DMA: page owner found early allocated 0 pages >> Node 0, zone DMA32: page owner found early allocated 6654 pages >> Node 0, zone Normal: page owner found early allocated 41907 pages >> Node 4, zone Normal: page owner found early allocated 41356 pages >> >> So, it told us that it will miss tens of thousands of early page allocation call >> sites. > > This is an answer for the first part of the question (how much). The > second is _do_we_care_? Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would start to miss tens of thousands early page allocation call sites. The other option I can think of to not hurt your eyes is to rewrite the whole page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without. However, I have a hard-time to convince myself it is a sensible thing to do.
On Fri 04-01-19 10:01:40, Qian Cai wrote: > On 1/4/19 8:09 AM, Michal Hocko wrote: > >> Here is the number without DEFERRED_STRUCT_PAGE_INIT. > >> > >> == page_ext_init() after page_alloc_init_late() == > >> Node 0, zone DMA: page owner found early allocated 0 pages > >> Node 0, zone DMA32: page owner found early allocated 7009 pages > >> Node 0, zone Normal: page owner found early allocated 85827 pages > >> Node 4, zone Normal: page owner found early allocated 75063 pages > >> > >> == page_ext_init() before kmemleak_init() == > >> Node 0, zone DMA: page owner found early allocated 0 pages > >> Node 0, zone DMA32: page owner found early allocated 6654 pages > >> Node 0, zone Normal: page owner found early allocated 41907 pages > >> Node 4, zone Normal: page owner found early allocated 41356 pages > >> > >> So, it told us that it will miss tens of thousands of early page allocation call > >> sites. > > > > This is an answer for the first part of the question (how much). The > > second is _do_we_care_? > > Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the > existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would > start to miss tens of thousands early page allocation call sites. I am pretty sure we will hear about that when that happens. And act accordingly. > The other option I can think of to not hurt your eyes is to rewrite the whole > page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early > functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without. > However, I have a hard-time to convince myself it is a sensible thing to do. Or simply make the page_owner initialization only touch the already initialized memory. Have you explored that option as well? Look, I am trying to push for a clean solution. Hacks I have seen so far are not convincing. You have identified a regression and as such I would consider the most straightforward to revert the buggy commit. If you want to improve the situation then I would suggest to think about something that is more robust than ifdefed hacks. It might be more work but it also might be a better thing long term. If you think I am asking too much then you are free to ignore my opinion. I am not a maintainer of the page_owner code.
On 1/4/19 10:17 AM, Michal Hocko wrote: > On Fri 04-01-19 10:01:40, Qian Cai wrote: >> On 1/4/19 8:09 AM, Michal Hocko wrote: >>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT. >>>> >>>> == page_ext_init() after page_alloc_init_late() == >>>> Node 0, zone DMA: page owner found early allocated 0 pages >>>> Node 0, zone DMA32: page owner found early allocated 7009 pages >>>> Node 0, zone Normal: page owner found early allocated 85827 pages >>>> Node 4, zone Normal: page owner found early allocated 75063 pages >>>> >>>> == page_ext_init() before kmemleak_init() == >>>> Node 0, zone DMA: page owner found early allocated 0 pages >>>> Node 0, zone DMA32: page owner found early allocated 6654 pages >>>> Node 0, zone Normal: page owner found early allocated 41907 pages >>>> Node 4, zone Normal: page owner found early allocated 41356 pages >>>> >>>> So, it told us that it will miss tens of thousands of early page allocation call >>>> sites. >>> >>> This is an answer for the first part of the question (how much). The >>> second is _do_we_care_? >> >> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the >> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would >> start to miss tens of thousands early page allocation call sites. > > I am pretty sure we will hear about that when that happens. And act > accordingly. > >> The other option I can think of to not hurt your eyes is to rewrite the whole >> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early >> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without. >> However, I have a hard-time to convince myself it is a sensible thing to do. > > Or simply make the page_owner initialization only touch the already > initialized memory. Have you explored that option as well? Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to dealing with all the low-level details, https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
On Fri 04-01-19 10:25:12, Qian Cai wrote: > On 1/4/19 10:17 AM, Michal Hocko wrote: > > On Fri 04-01-19 10:01:40, Qian Cai wrote: > >> On 1/4/19 8:09 AM, Michal Hocko wrote: > >>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT. > >>>> > >>>> == page_ext_init() after page_alloc_init_late() == > >>>> Node 0, zone DMA: page owner found early allocated 0 pages > >>>> Node 0, zone DMA32: page owner found early allocated 7009 pages > >>>> Node 0, zone Normal: page owner found early allocated 85827 pages > >>>> Node 4, zone Normal: page owner found early allocated 75063 pages > >>>> > >>>> == page_ext_init() before kmemleak_init() == > >>>> Node 0, zone DMA: page owner found early allocated 0 pages > >>>> Node 0, zone DMA32: page owner found early allocated 6654 pages > >>>> Node 0, zone Normal: page owner found early allocated 41907 pages > >>>> Node 4, zone Normal: page owner found early allocated 41356 pages > >>>> > >>>> So, it told us that it will miss tens of thousands of early page allocation call > >>>> sites. > >>> > >>> This is an answer for the first part of the question (how much). The > >>> second is _do_we_care_? > >> > >> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the > >> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would > >> start to miss tens of thousands early page allocation call sites. > > > > I am pretty sure we will hear about that when that happens. And act > > accordingly. > > > >> The other option I can think of to not hurt your eyes is to rewrite the whole > >> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early > >> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without. > >> However, I have a hard-time to convince myself it is a sensible thing to do. > > > > Or simply make the page_owner initialization only touch the already > > initialized memory. Have you explored that option as well? > > Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to > dealing with all the low-level details, > > https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ That is obviously not what I've had in mind. We have __init_single_page which initializes a single struct page. Is there any way to hook page_ext initialization there?
On 1/4/19 10:32 AM, Michal Hocko wrote: > On Fri 04-01-19 10:25:12, Qian Cai wrote: >> On 1/4/19 10:17 AM, Michal Hocko wrote: >>> On Fri 04-01-19 10:01:40, Qian Cai wrote: >>>> On 1/4/19 8:09 AM, Michal Hocko wrote: >>>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT. >>>>>> >>>>>> == page_ext_init() after page_alloc_init_late() == >>>>>> Node 0, zone DMA: page owner found early allocated 0 pages >>>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages >>>>>> Node 0, zone Normal: page owner found early allocated 85827 pages >>>>>> Node 4, zone Normal: page owner found early allocated 75063 pages >>>>>> >>>>>> == page_ext_init() before kmemleak_init() == >>>>>> Node 0, zone DMA: page owner found early allocated 0 pages >>>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages >>>>>> Node 0, zone Normal: page owner found early allocated 41907 pages >>>>>> Node 4, zone Normal: page owner found early allocated 41356 pages >>>>>> >>>>>> So, it told us that it will miss tens of thousands of early page allocation call >>>>>> sites. >>>>> >>>>> This is an answer for the first part of the question (how much). The >>>>> second is _do_we_care_? >>>> >>>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the >>>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would >>>> start to miss tens of thousands early page allocation call sites. >>> >>> I am pretty sure we will hear about that when that happens. And act >>> accordingly. >>> >>>> The other option I can think of to not hurt your eyes is to rewrite the whole >>>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early >>>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without. >>>> However, I have a hard-time to convince myself it is a sensible thing to do. >>> >>> Or simply make the page_owner initialization only touch the already >>> initialized memory. Have you explored that option as well? >> >> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to >> dealing with all the low-level details, >> >> https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ > > That is obviously not what I've had in mind. We have __init_single_page > which initializes a single struct page. Is there any way to hook > page_ext initialization there? Well, the current design is, (1) allocate page_ext physical-contiguous pages for all nodes. (2) page owner gathers all early allocation pages. (3) page owner is fully operational. It may be possible to move (1) as early as possible just after vmalloc_init() in start_kernel() because it depends on vmalloc(), but it still needs to call (2) as there have had many early allocation pages already. Node 0, zone DMA: page owner found early allocated 0 pages Node 0, zone DMA32: page owner found early allocated 6654 pages Node 0, zone Normal: page owner found early allocated 40972 pages Node 4, zone Normal: page owner found early allocated 40968 pages But, (2) still depends on DEFERRED_STRUCT_PAGE_INIT. To get ride of this dependency, it may be possible to add some checking in __init_single_page(): - if not done (1), save those pages to an array and defer page_owner early_handle initialization until done (1). Though, I can't see any really benefit of this approach apart from "beautify" the code.
On Fri 04-01-19 15:18:08, Qian Cai wrote:
[...]
> Though, I can't see any really benefit of this approach apart from "beautify"
This is not about beautifying! This is about making the code long term
maintainable. As you can see it is just too easy to break it with the
current scheme. And that is bad especially when the code is broken
because of an optimization.
On 1/7/19 1:43 PM, Michal Hocko wrote: > On Fri 04-01-19 15:18:08, Qian Cai wrote: > [...] >> Though, I can't see any really benefit of this approach apart from "beautify" > > This is not about beautifying! This is about making the code long term > maintainable. As you can see it is just too easy to break it with the > current scheme. And that is bad especially when the code is broken > because of an optimization. > Understood, but the code is now fixed. If there is something fundamentally broken in the future, it may be a good time then to create a looks like hundred-line cleanup patch for long-term maintenance at the same time to fix real bugs.
On Mon 07-01-19 20:53:08, Qian Cai wrote: > > > On 1/7/19 1:43 PM, Michal Hocko wrote: > > On Fri 04-01-19 15:18:08, Qian Cai wrote: > > [...] > >> Though, I can't see any really benefit of this approach apart from "beautify" > > > > This is not about beautifying! This is about making the code long term > > maintainable. As you can see it is just too easy to break it with the > > current scheme. And that is bad especially when the code is broken > > because of an optimization. > > > > Understood, but the code is now fixed. If there is something fundamentally > broken in the future, it may be a good time then to create a looks like > hundred-line cleanup patch for long-term maintenance at the same time to fix > real bugs. Yeah, so revert = fix and redisign the thing to make the code more robust longterm + allow to catch more allocation. I really fail to see why this has to be repeated several times in this thread. Really.
On Tue, 2019-01-08 at 09:20 +0100, Michal Hocko wrote: > On Mon 07-01-19 20:53:08, Qian Cai wrote: > > > > > > On 1/7/19 1:43 PM, Michal Hocko wrote: > > > On Fri 04-01-19 15:18:08, Qian Cai wrote: > > > [...] > > > > Though, I can't see any really benefit of this approach apart from > > > > "beautify" > > > > > > This is not about beautifying! This is about making the code long term > > > maintainable. As you can see it is just too easy to break it with the > > > current scheme. And that is bad especially when the code is broken > > > because of an optimization. > > > > > > > Understood, but the code is now fixed. If there is something fundamentally > > broken in the future, it may be a good time then to create a looks like > > hundred-line cleanup patch for long-term maintenance at the same time to fix > > real bugs. > > Yeah, so revert = fix and redisign the thing to make the code more > robust longterm + allow to catch more allocation. I really fail to see > why this has to be repeated several times in this thread. Really. > Again, this will introduce a immediately regression (arguably small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would start to miss tens of thousands early page allocation call sites. I think the disagreement comes from that you want to deal with this passively rather than proactively that you said "I am pretty sure we will hear about that when that happens. And act accordingly", but I think it is better to fix it now rather than later with a 4-line ifdef which you don't like. I suppose someone else needs to make a judgment call for this as we are in a "you can't convince me and I can't convince you" situation right now.
It's unclear (to me) where we stand with this patch. Shold we proceed with v3 for now, or is something else planned? From: Qian Cai <cai@lca.pw> Subject: mm/page_owner: fix for deferred struct page init When booting a system with "page_owner=on", start_kernel page_ext_init invoke_init_callbacks init_section_page_ext init_page_owner init_early_allocated_pages init_zones_in_node init_pages_in_zone lookup_page_ext page_to_nid The issue here is that page_to_nid() will not work since some page flags have no node information until later in page_alloc_init_late() due to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds access with an invalid nid. [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50 [ 8.672603] index 7 is out of range for type 'zone [5]' Also, kernel will panic since flags were poisoned earlier with, CONFIG_DEBUG_VM_PGFLAGS=y CONFIG_NODE_NOT_IN_PAGE_FLAGS=n start_kernel setup_arch pagetable_init paging_init sparse_init sparse_init_nid memblock_alloc_try_nid_raw Although later it tries to set page flags for pages in reserved bootmem regions, mm_init mem_init memblock_free_all free_low_memory_core_early reserve_bootmem_region there could still have some freed pages from the page allocator but yet to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been dealt with a bit in page_ext_init(). * Take into account DEFERRED_STRUCT_PAGE_INIT. */ if (early_pfn_to_nid(pfn) != nid) continue; However, it did not handle it well in init_pages_in_zone() which end up calling page_to_nid(). [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) [ 11.926498] page_owner info is not active (free page?) [ 12.329560] kernel BUG at include/linux/mm.h:990! [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520 Since there is no other routines depend on page_ext_init() in start_kernel(), just move it after page_alloc_init_late() to ensure that there is no deferred pages need to de dealt with. If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init() earlier, so page owner could catch more early page allocation call sites. This gives us a good compromise between catching good and bad call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT. [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ Link: http://lkml.kernel.org/r/20181220203156.43441-1-cai@lca.pw Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init) Signed-off-by: Qian Cai <cai@lca.pw> Suggested-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- init/main.c | 5 +++++ mm/page_ext.c | 3 +-- 2 files changed, 6 insertions(+), 2 deletions(-) --- a/init/main.c~mm-page_owner-fix-for-deferred-struct-page-init +++ a/init/main.c @@ -695,7 +695,9 @@ asmlinkage __visible void __init start_k initrd_start = 0; } #endif +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT page_ext_init(); +#endif kmemleak_init(); setup_per_cpu_pageset(); numa_policy_init(); @@ -1131,6 +1133,9 @@ static noinline void __init kernel_init_ sched_init_smp(); page_alloc_init_late(); +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + page_ext_init(); +#endif do_basic_setup(); --- a/mm/page_ext.c~mm-page_owner-fix-for-deferred-struct-page-init +++ a/mm/page_ext.c @@ -399,9 +399,8 @@ void __init page_ext_init(void) * -------------pfn--------------> * N0 | N1 | N2 | N0 | N1 | N2|.... * - * Take into account DEFERRED_STRUCT_PAGE_INIT. */ - if (early_pfn_to_nid(pfn) != nid) + if (pfn_to_nid(pfn) != nid) continue; if (init_section_page_ext(pfn, nid)) goto oom;
On 1/8/19 5:02 PM, Andrew Morton wrote: > > It's unclear (to me) where we stand with this patch. Shold we proceed > with v3 for now, or is something else planned? I don't have anything else plan for this right now. Michal particular don't like that 4-line ifdef which supposes to avoid an immediately regression (arguably small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would start to miss tens of thousands early page allocation call sites. So, feel free to choose v2 of this which has no ifdef if you agree with Michal too. I am fine either way.
On Tue 08-01-19 17:13:41, Qian Cai wrote: > > > On 1/8/19 5:02 PM, Andrew Morton wrote: > > > > It's unclear (to me) where we stand with this patch. Shold we proceed > > with v3 for now, or is something else planned? > > I don't have anything else plan for this right now. Michal particular don't like > that 4-line ifdef which supposes to avoid an immediately regression (arguably > small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected > that would start to miss tens of thousands early page allocation call sites. So, > feel free to choose v2 of this which has no ifdef if you agree with Michal too. > I am fine either way. Yes I would prefer to revert the faulty commit (fe53ca54270) and work on a more robust page_owner initialization to cover earlier allocations on top of that. Not that I would insist but this would be a more straightforward approach and I hope it will result in a better long term maintainable code in the end.
diff --git a/init/main.c b/init/main.c index 2b7b7fe173c9..5d9904370f76 100644 --- a/init/main.c +++ b/init/main.c @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void) initrd_start = 0; } #endif +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT page_ext_init(); +#endif kmemleak_init(); setup_per_cpu_pageset(); numa_policy_init(); @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void) sched_init_smp(); page_alloc_init_late(); +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + page_ext_init(); +#endif do_basic_setup(); diff --git a/mm/page_ext.c b/mm/page_ext.c index ae44f7adbe07..d76fd51e312a 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -399,9 +399,8 @@ void __init page_ext_init(void) * -------------pfn--------------> * N0 | N1 | N2 | N0 | N1 | N2|.... * - * Take into account DEFERRED_STRUCT_PAGE_INIT. */ - if (early_pfn_to_nid(pfn) != nid) + if (pfn_to_nid(pfn) != nid) continue; if (init_section_page_ext(pfn, nid)) goto oom;
When booting a system with "page_owner=on", start_kernel page_ext_init invoke_init_callbacks init_section_page_ext init_page_owner init_early_allocated_pages init_zones_in_node init_pages_in_zone lookup_page_ext page_to_nid The issue here is that page_to_nid() will not work since some page flags have no node information until later in page_alloc_init_late() due to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds access with an invalid nid. [ 8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50 [ 8.672603] index 7 is out of range for type 'zone [5]' Also, kernel will panic since flags were poisoned earlier with, CONFIG_DEBUG_VM_PGFLAGS=y CONFIG_NODE_NOT_IN_PAGE_FLAGS=n start_kernel setup_arch pagetable_init paging_init sparse_init sparse_init_nid memblock_alloc_try_nid_raw Although later it tries to set page flags for pages in reserved bootmem regions, mm_init mem_init memblock_free_all free_low_memory_core_early reserve_bootmem_region there could still have some freed pages from the page allocator but yet to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been dealt with a bit in page_ext_init(). * Take into account DEFERRED_STRUCT_PAGE_INIT. */ if (early_pfn_to_nid(pfn) != nid) continue; However, it did not handle it well in init_pages_in_zone() which end up calling page_to_nid(). [ 11.917212] page:ffffea0004200000 is uninitialized and poisoned [ 11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff [ 11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff [ 11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) [ 11.926498] page_owner info is not active (free page?) [ 12.329560] kernel BUG at include/linux/mm.h:990! [ 12.337632] RIP: 0010:init_page_owner+0x486/0x520 Since there is no other routines depend on page_ext_init() in start_kernel(), just move it after page_alloc_init_late() to ensure that there is no deferred pages need to de dealt with. If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init() earlier, so page owner could catch more early page allocation call sites. This gives us a good compromise between catching good and bad call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT. [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/ Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init) Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Qian Cai <cai@lca.pw> --- v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n. v2: postpone page_ext_init() to after page_alloc_init_late(). init/main.c | 5 +++++ mm/page_ext.c | 3 +-- 2 files changed, 6 insertions(+), 2 deletions(-)