[v3] mm/page_owner: fix for deferred struct page init
diff mbox series

Message ID 20181220203156.43441-1-cai@lca.pw
State New
Headers show
Series
  • [v3] mm/page_owner: fix for deferred struct page init
Related show

Commit Message

Qian Cai Dec. 20, 2018, 8:31 p.m. UTC
When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
  mem_init
    memblock_free_all
      free_low_memory_core_early
        reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
	continue;

However, it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel(), just move it after page_alloc_init_late() to ensure that
there is no deferred pages need to de dealt with. If deselected
DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
earlier, so page owner could catch more early page allocation call
sites. This gives us a good compromise between catching good and bad
call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.

[1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/

Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Qian Cai <cai@lca.pw>
---

v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.

v2: postpone page_ext_init() to after page_alloc_init_late().

 init/main.c   | 5 +++++
 mm/page_ext.c | 3 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

Comments

William Kucharski Dec. 20, 2018, 9 p.m. UTC | #1
> On Dec 20, 2018, at 1:31 PM, Qian Cai <cai@lca.pw> wrote:
> 
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> 			 * -------------pfn-------------->
> 			 * N0 | N1 | N2 | N0 | N1 | N2|....
> 			 *
> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> 			 */
> -			if (early_pfn_to_nid(pfn) != nid)
> +			if (pfn_to_nid(pfn) != nid)
> 				continue;
> 			if (init_section_page_ext(pfn, nid))
> 				goto oom;
> -- 
> 2.17.2 (Apple Git-113)
> 

Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the same issue seen in v2?
Qian Cai Dec. 20, 2018, 9:04 p.m. UTC | #2
On Thu, 2018-12-20 at 14:00 -0700, William Kucharski wrote:
> > On Dec 20, 2018, at 1:31 PM, Qian Cai <cai@lca.pw> wrote:
> > 
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index ae44f7adbe07..d76fd51e312a 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> > 			 * -------------pfn-------------->
> > 			 * N0 | N1 | N2 | N0 | N1 | N2|....
> > 			 *
> > -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> > 			 */
> > -			if (early_pfn_to_nid(pfn) != nid)
> > +			if (pfn_to_nid(pfn) != nid)
> > 				continue;
> > 			if (init_section_page_ext(pfn, nid))
> > 				goto oom;
> > -- 
> > 2.17.2 (Apple Git-113)
> > 
> 
> Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h
> (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the
> same issue seen in v2?
> 

No. If CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, page_ext_init() is called after
page_alloc_init_late() where all the memory has already been initialized,
so page_to_nid() will work then.
Michal Hocko Jan. 3, 2019, 11:51 a.m. UTC | #3
On Thu 20-12-18 15:31:56, Qian Cai wrote:
> When booting a system with "page_owner=on",
> 
> start_kernel
>   page_ext_init
>     invoke_init_callbacks
>       init_section_page_ext
>         init_page_owner
>           init_early_allocated_pages
>             init_zones_in_node
>               init_pages_in_zone
>                 lookup_page_ext
>                   page_to_nid
> 
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
> 
> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [    8.672603] index 7 is out of range for type 'zone [5]'
> 
> Also, kernel will panic since flags were poisoned earlier with,
> 
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> 
> start_kernel
>   setup_arch
>     pagetable_init
>       paging_init
>         sparse_init
>           sparse_init_nid
>             memblock_alloc_try_nid_raw
> 
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
> 
> mm_init
>   mem_init
>     memblock_free_all
>       free_low_memory_core_early
>         reserve_bootmem_region
> 
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().
> 
> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> if (early_pfn_to_nid(pfn) != nid)
> 	continue;
> 
> However, it did not handle it well in init_pages_in_zone() which end up
> calling page_to_nid().
> 
> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [   11.926498] page_owner info is not active (free page?)
> [   12.329560] kernel BUG at include/linux/mm.h:990!
> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
> 
> Since there is no other routines depend on page_ext_init() in
> start_kernel(), just move it after page_alloc_init_late() to ensure that
> there is no deferred pages need to de dealt with. If deselected
> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
> earlier, so page owner could catch more early page allocation call
> sites. This gives us a good compromise between catching good and bad
> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
> 
> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
> 
> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
> 
> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
> 
> v2: postpone page_ext_init() to after page_alloc_init_late().
> 
>  init/main.c   | 5 +++++
>  mm/page_ext.c | 3 +--
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/init/main.c b/init/main.c
> index 2b7b7fe173c9..5d9904370f76 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
>  		initrd_start = 0;
>  	}
>  #endif
> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  	page_ext_init();
> +#endif
>  	kmemleak_init();
>  	setup_per_cpu_pageset();
>  	numa_policy_init();
> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
>  	sched_init_smp();
>  
>  	page_alloc_init_late();
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +	page_ext_init();
> +#endif
>  
>  	do_basic_setup();

Is this really necessary? Why cannot we simply postpone page_ext_init
unconditioanally?

>  
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>  			 * -------------pfn-------------->
>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>  			 *
> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>  			 */
> -			if (early_pfn_to_nid(pfn) != nid)
> +			if (pfn_to_nid(pfn) != nid)
>  				continue;
>  			if (init_section_page_ext(pfn, nid))
>  				goto oom;

Also this doesn't seem to be related, right?
Qian Cai Jan. 3, 2019, 4:38 p.m. UTC | #4
On 1/3/19 6:51 AM, Michal Hocko wrote:
> On Thu 20-12-18 15:31:56, Qian Cai wrote:
>> When booting a system with "page_owner=on",
>>
>> start_kernel
>>   page_ext_init
>>     invoke_init_callbacks
>>       init_section_page_ext
>>         init_page_owner
>>           init_early_allocated_pages
>>             init_zones_in_node
>>               init_pages_in_zone
>>                 lookup_page_ext
>>                   page_to_nid
>>
>> The issue here is that page_to_nid() will not work since some page
>> flags have no node information until later in page_alloc_init_late() due
>> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
>> access with an invalid nid.
>>
>> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
>> [    8.672603] index 7 is out of range for type 'zone [5]'
>>
>> Also, kernel will panic since flags were poisoned earlier with,
>>
>> CONFIG_DEBUG_VM_PGFLAGS=y
>> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
>>
>> start_kernel
>>   setup_arch
>>     pagetable_init
>>       paging_init
>>         sparse_init
>>           sparse_init_nid
>>             memblock_alloc_try_nid_raw
>>
>> Although later it tries to set page flags for pages in reserved bootmem
>> regions,
>>
>> mm_init
>>   mem_init
>>     memblock_free_all
>>       free_low_memory_core_early
>>         reserve_bootmem_region
>>
>> there could still have some freed pages from the page allocator but yet
>> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
>> dealt with a bit in page_ext_init().
>>
>> * Take into account DEFERRED_STRUCT_PAGE_INIT.
>> */
>> if (early_pfn_to_nid(pfn) != nid)
>> 	continue;
>>
>> However, it did not handle it well in init_pages_in_zone() which end up
>> calling page_to_nid().
>>
>> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
>> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
>> ffffffffffffffff
>> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
>> ffffffffffffffff
>> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>> [   11.926498] page_owner info is not active (free page?)
>> [   12.329560] kernel BUG at include/linux/mm.h:990!
>> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
>>
>> Since there is no other routines depend on page_ext_init() in
>> start_kernel(), just move it after page_alloc_init_late() to ensure that
>> there is no deferred pages need to de dealt with. If deselected
>> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
>> earlier, so page owner could catch more early page allocation call
>> sites. This gives us a good compromise between catching good and bad
>> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
>>
>> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
>>
>> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
>> Suggested-by: Michal Hocko <mhocko@kernel.org>
>> Signed-off-by: Qian Cai <cai@lca.pw>
>> ---
>>
>> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
>>
>> v2: postpone page_ext_init() to after page_alloc_init_late().
>>
>>  init/main.c   | 5 +++++
>>  mm/page_ext.c | 3 +--
>>  2 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/init/main.c b/init/main.c
>> index 2b7b7fe173c9..5d9904370f76 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
>>  		initrd_start = 0;
>>  	}
>>  #endif
>> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>>  	page_ext_init();
>> +#endif
>>  	kmemleak_init();
>>  	setup_per_cpu_pageset();
>>  	numa_policy_init();
>> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
>>  	sched_init_smp();
>>  
>>  	page_alloc_init_late();
>> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> +	page_ext_init();
>> +#endif
>>  
>>  	do_basic_setup();
> 
> Is this really necessary? Why cannot we simply postpone page_ext_init
> unconditioanally?

As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
to call page_ext_init() earlier, so page owner could catch more early page
allocation call sites."

> 
>>  
>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>> index ae44f7adbe07..d76fd51e312a 100644
>> --- a/mm/page_ext.c
>> +++ b/mm/page_ext.c
>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>>  			 * -------------pfn-------------->
>>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>>  			 *
>> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>>  			 */
>> -			if (early_pfn_to_nid(pfn) != nid)
>> +			if (pfn_to_nid(pfn) != nid)
>>  				continue;
>>  			if (init_section_page_ext(pfn, nid))
>>  				goto oom;
> 
> Also this doesn't seem to be related, right?

No, it is related. Because of this patch, page_ext_init() is called after all
the memory has already been initialized,
so no longer necessary to call early_pfn_to_nid().
Michal Hocko Jan. 3, 2019, 4:59 p.m. UTC | #5
On Thu 03-01-19 11:38:31, Qian Cai wrote:
> On 1/3/19 6:51 AM, Michal Hocko wrote:
> > On Thu 20-12-18 15:31:56, Qian Cai wrote:
> >> When booting a system with "page_owner=on",
> >>
> >> start_kernel
> >>   page_ext_init
> >>     invoke_init_callbacks
> >>       init_section_page_ext
> >>         init_page_owner
> >>           init_early_allocated_pages
> >>             init_zones_in_node
> >>               init_pages_in_zone
> >>                 lookup_page_ext
> >>                   page_to_nid
> >>
> >> The issue here is that page_to_nid() will not work since some page
> >> flags have no node information until later in page_alloc_init_late() due
> >> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> >> access with an invalid nid.
> >>
> >> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> >> [    8.672603] index 7 is out of range for type 'zone [5]'
> >>
> >> Also, kernel will panic since flags were poisoned earlier with,
> >>
> >> CONFIG_DEBUG_VM_PGFLAGS=y
> >> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> >>
> >> start_kernel
> >>   setup_arch
> >>     pagetable_init
> >>       paging_init
> >>         sparse_init
> >>           sparse_init_nid
> >>             memblock_alloc_try_nid_raw
> >>
> >> Although later it tries to set page flags for pages in reserved bootmem
> >> regions,
> >>
> >> mm_init
> >>   mem_init
> >>     memblock_free_all
> >>       free_low_memory_core_early
> >>         reserve_bootmem_region
> >>
> >> there could still have some freed pages from the page allocator but yet
> >> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> >> dealt with a bit in page_ext_init().
> >>
> >> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >> */
> >> if (early_pfn_to_nid(pfn) != nid)
> >> 	continue;
> >>
> >> However, it did not handle it well in init_pages_in_zone() which end up
> >> calling page_to_nid().
> >>
> >> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
> >> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> >> ffffffffffffffff
> >> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> >> ffffffffffffffff
> >> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >> [   11.926498] page_owner info is not active (free page?)
> >> [   12.329560] kernel BUG at include/linux/mm.h:990!
> >> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
> >>
> >> Since there is no other routines depend on page_ext_init() in
> >> start_kernel(), just move it after page_alloc_init_late() to ensure that
> >> there is no deferred pages need to de dealt with. If deselected
> >> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
> >> earlier, so page owner could catch more early page allocation call
> >> sites. This gives us a good compromise between catching good and bad
> >> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
> >>
> >> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
> >>
> >> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
> >> Suggested-by: Michal Hocko <mhocko@kernel.org>
> >> Signed-off-by: Qian Cai <cai@lca.pw>
> >> ---
> >>
> >> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
> >>
> >> v2: postpone page_ext_init() to after page_alloc_init_late().
> >>
> >>  init/main.c   | 5 +++++
> >>  mm/page_ext.c | 3 +--
> >>  2 files changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/init/main.c b/init/main.c
> >> index 2b7b7fe173c9..5d9904370f76 100644
> >> --- a/init/main.c
> >> +++ b/init/main.c
> >> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
> >>  		initrd_start = 0;
> >>  	}
> >>  #endif
> >> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >>  	page_ext_init();
> >> +#endif
> >>  	kmemleak_init();
> >>  	setup_per_cpu_pageset();
> >>  	numa_policy_init();
> >> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
> >>  	sched_init_smp();
> >>  
> >>  	page_alloc_init_late();
> >> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >> +	page_ext_init();
> >> +#endif
> >>  
> >>  	do_basic_setup();
> > 
> > Is this really necessary? Why cannot we simply postpone page_ext_init
> > unconditioanally?
> 
> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
> to call page_ext_init() earlier, so page owner could catch more early page
> allocation call sites."

Do you have any numbers to show how many allocation are we losing that
way? In other words, do we care enough to create an ugly code?

> >> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >> index ae44f7adbe07..d76fd51e312a 100644
> >> --- a/mm/page_ext.c
> >> +++ b/mm/page_ext.c
> >> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >>  			 * -------------pfn-------------->
> >>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
> >>  			 *
> >> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >>  			 */
> >> -			if (early_pfn_to_nid(pfn) != nid)
> >> +			if (pfn_to_nid(pfn) != nid)
> >>  				continue;
> >>  			if (init_section_page_ext(pfn, nid))
> >>  				goto oom;
> > 
> > Also this doesn't seem to be related, right?
> 
> No, it is related. Because of this patch, page_ext_init() is called after all
> the memory has already been initialized,
> so no longer necessary to call early_pfn_to_nid().

Yes, but it looks like a follow up cleanup/optimization to me.
Qian Cai Jan. 3, 2019, 5:38 p.m. UTC | #6
On 1/3/19 11:59 AM, Michal Hocko wrote:
>> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
>> to call page_ext_init() earlier, so page owner could catch more early page
>> allocation call sites."
> 
> Do you have any numbers to show how many allocation are we losing that
> way? In other words, do we care enough to create an ugly code?

Well, I don't have any numbers, but I read that Joonsoo did not really like to
defer page_ext_init() unconditionally.

"because deferring page_ext_init() would make page owner which uses page_ext
miss some early page allocation callsites. Although it already miss some early
page allocation callsites, we don't need to miss more."

https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/

>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index ae44f7adbe07..d76fd51e312a 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>>>>  			 * -------------pfn-------------->
>>>>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>>>>  			 *
>>>> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>>>>  			 */
>>>> -			if (early_pfn_to_nid(pfn) != nid)
>>>> +			if (pfn_to_nid(pfn) != nid)
>>>>  				continue;
>>>>  			if (init_section_page_ext(pfn, nid))
>>>>  				goto oom;
>>>
>>> Also this doesn't seem to be related, right?
>>
>> No, it is related. Because of this patch, page_ext_init() is called after all
>> the memory has already been initialized,
>> so no longer necessary to call early_pfn_to_nid().
> 
> Yes, but it looks like a follow up cleanup/optimization to me.

That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid
in page_ext_init) which also messed up the order of page_ext_init() in
start_kernel(), so this patch basically revert that commit.
Michal Hocko Jan. 3, 2019, 7:07 p.m. UTC | #7
On Thu 03-01-19 12:38:59, Qian Cai wrote:
> On 1/3/19 11:59 AM, Michal Hocko wrote:
> >> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
> >> to call page_ext_init() earlier, so page owner could catch more early page
> >> allocation call sites."
> > 
> > Do you have any numbers to show how many allocation are we losing that
> > way? In other words, do we care enough to create an ugly code?
> 
> Well, I don't have any numbers, but I read that Joonsoo did not really like to
> defer page_ext_init() unconditionally.
> 
> "because deferring page_ext_init() would make page owner which uses page_ext
> miss some early page allocation callsites. Although it already miss some early
> page allocation callsites, we don't need to miss more."

This is quite unspecific.

> https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/
> 
> >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >>>> index ae44f7adbe07..d76fd51e312a 100644
> >>>> --- a/mm/page_ext.c
> >>>> +++ b/mm/page_ext.c
> >>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >>>>  			 * -------------pfn-------------->
> >>>>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
> >>>>  			 *
> >>>> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >>>>  			 */
> >>>> -			if (early_pfn_to_nid(pfn) != nid)
> >>>> +			if (pfn_to_nid(pfn) != nid)
> >>>>  				continue;
> >>>>  			if (init_section_page_ext(pfn, nid))
> >>>>  				goto oom;
> >>>
> >>> Also this doesn't seem to be related, right?
> >>
> >> No, it is related. Because of this patch, page_ext_init() is called after all
> >> the memory has already been initialized,
> >> so no longer necessary to call early_pfn_to_nid().
> > 
> > Yes, but it looks like a follow up cleanup/optimization to me.
> 
> That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid
> in page_ext_init) which also messed up the order of page_ext_init() in
> start_kernel(), so this patch basically revert that commit.

So can we make the revert with an explanation that the patch was wrong?
If we want to make hacks to catch more objects to be tracked then it
would be great to have some numbers in hands.
Qian Cai Jan. 3, 2019, 7:53 p.m. UTC | #8
On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
explanation that the patch was wrong?
> If we want to make hacks to catch more objects to be tracked then it
> would be great to have some numbers in hands.

Well, those numbers are subject to change depends on future start_kernel()
order. Right now, there are many functions could be caught earlier by page owner.

	kmemleak_init();
	debug_objects_mem_init();
	setup_per_cpu_pageset();
	numa_policy_init();
	acpi_early_init();
	if (late_time_init)
		late_time_init();
	sched_clock_init();
	calibrate_delay();
	pid_idr_init();
	anon_vma_init();
#ifdef CONFIG_X86
	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_enter_virtual_mode();
#endif
	thread_stack_cache_init();
	cred_init();
	fork_init();
	proc_caches_init();
	uts_ns_init();
	buffer_init();
	key_init();
	security_init();
	dbg_late_init();
	vfs_caches_init();
	pagecache_init();
	signals_init();
	seq_file_init();
	proc_root_init();
	nsfs_init();
	cpuset_init();
	cgroup_init();
	taskstats_init_early();
	delayacct_init();

	check_bugs();

	acpi_subsystem_init();
	arch_post_acpi_subsys_init();
	sfi_init_late();

	if (efi_enabled(EFI_RUNTIME_SERVICES)) {
		efi_free_boot_services();

	rcu_scheduler_starting();
	/*
	 * Wait until kthreadd is all set-up.
	 */
	wait_for_completion(&kthreadd_done);

	/* Now the scheduler is fully set up and can do blocking allocations */
	gfp_allowed_mask = __GFP_BITS_MASK;

	/*
	 * init can allocate pages on any node
	 */
	set_mems_allowed(node_states[N_MEMORY]);

	cad_pid = task_pid(current);

	smp_prepare_cpus(setup_max_cpus);

	workqueue_init();

	init_mm_internals();

	do_pre_smp_initcalls();
	lockup_detector_init();

	smp_init();
	sched_init_smp();
Michal Hocko Jan. 3, 2019, 8:22 p.m. UTC | #9
On Thu 03-01-19 14:53:47, Qian Cai wrote:
> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
> explanation that the patch was wrong?
> > If we want to make hacks to catch more objects to be tracked then it
> > would be great to have some numbers in hands.
> 
> Well, those numbers are subject to change depends on future start_kernel()
> order. Right now, there are many functions could be caught earlier by page owner.
> 
> 	kmemleak_init();
[...]
> 	sched_init_smp();

The kernel source dump will not tell us much of course. A ball park
number whether we are talking about dozen, hundreds or thousands of
allocations would tell us something at least, doesn't it.

Handwaving that it might help us some is not particurarly useful. We are
already losing some allocations already. Does it matter? Well, that
depends, sometimes we do want to catch an owner of particular page and
it is sad to find nothing. But how many times have you or somebody else
encountered that in practice. That is exactly a useful information to
judge an ugly ifdefery in the code. See my point?
Qian Cai Jan. 3, 2019, 10:22 p.m. UTC | #10
On 1/3/19 3:22 PM, Michal Hocko wrote:
> On Thu 03-01-19 14:53:47, Qian Cai wrote:
>> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
>> explanation that the patch was wrong?
>>> If we want to make hacks to catch more objects to be tracked then it
>>> would be great to have some numbers in hands.
>>
>> Well, those numbers are subject to change depends on future start_kernel()
>> order. Right now, there are many functions could be caught earlier by page owner.
>>
>> 	kmemleak_init();
> [...]
>> 	sched_init_smp();
> 
> The kernel source dump will not tell us much of course. A ball park
> number whether we are talking about dozen, hundreds or thousands of
> allocations would tell us something at least, doesn't it.
> 
> Handwaving that it might help us some is not particurarly useful. We are
> already losing some allocations already. Does it matter? Well, that
> depends, sometimes we do want to catch an owner of particular page and
> it is sad to find nothing. But how many times have you or somebody else
> encountered that in practice. That is exactly a useful information to
> judge an ugly ifdefery in the code. See my point?

Here is the number without DEFERRED_STRUCT_PAGE_INIT.

== page_ext_init() after page_alloc_init_late() ==
Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 7009 pages
Node 0, zone Normal: page owner found early allocated 85827 pages
Node 4, zone Normal: page owner found early allocated 75063 pages

== page_ext_init() before kmemleak_init() ==
Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 6654 pages
Node 0, zone Normal: page owner found early allocated 41907 pages
Node 4, zone Normal: page owner found early allocated 41356 pages

So, it told us that it will miss tens of thousands of early page allocation call
sites.
Michal Hocko Jan. 4, 2019, 1:09 p.m. UTC | #11
On Thu 03-01-19 17:22:29, Qian Cai wrote:
> On 1/3/19 3:22 PM, Michal Hocko wrote:
> > On Thu 03-01-19 14:53:47, Qian Cai wrote:
> >> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
> >> explanation that the patch was wrong?
> >>> If we want to make hacks to catch more objects to be tracked then it
> >>> would be great to have some numbers in hands.
> >>
> >> Well, those numbers are subject to change depends on future start_kernel()
> >> order. Right now, there are many functions could be caught earlier by page owner.
> >>
> >> 	kmemleak_init();
> > [...]
> >> 	sched_init_smp();
> > 
> > The kernel source dump will not tell us much of course. A ball park
> > number whether we are talking about dozen, hundreds or thousands of
> > allocations would tell us something at least, doesn't it.
> > 
> > Handwaving that it might help us some is not particurarly useful. We are
> > already losing some allocations already. Does it matter? Well, that
> > depends, sometimes we do want to catch an owner of particular page and
> > it is sad to find nothing. But how many times have you or somebody else
> > encountered that in practice. That is exactly a useful information to
> > judge an ugly ifdefery in the code. See my point?
> 
> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> 
> == page_ext_init() after page_alloc_init_late() ==
> Node 0, zone DMA: page owner found early allocated 0 pages
> Node 0, zone DMA32: page owner found early allocated 7009 pages
> Node 0, zone Normal: page owner found early allocated 85827 pages
> Node 4, zone Normal: page owner found early allocated 75063 pages
> 
> == page_ext_init() before kmemleak_init() ==
> Node 0, zone DMA: page owner found early allocated 0 pages
> Node 0, zone DMA32: page owner found early allocated 6654 pages
> Node 0, zone Normal: page owner found early allocated 41907 pages
> Node 4, zone Normal: page owner found early allocated 41356 pages
> 
> So, it told us that it will miss tens of thousands of early page allocation call
> sites.

This is an answer for the first part of the question (how much). The
second is _do_we_care_?
Qian Cai Jan. 4, 2019, 3:01 p.m. UTC | #12
On 1/4/19 8:09 AM, Michal Hocko wrote:
>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>
>> == page_ext_init() after page_alloc_init_late() ==
>> Node 0, zone DMA: page owner found early allocated 0 pages
>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>> Node 0, zone Normal: page owner found early allocated 85827 pages
>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>
>> == page_ext_init() before kmemleak_init() ==
>> Node 0, zone DMA: page owner found early allocated 0 pages
>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>> Node 0, zone Normal: page owner found early allocated 41907 pages
>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>
>> So, it told us that it will miss tens of thousands of early page allocation call
>> sites.
> 
> This is an answer for the first part of the question (how much). The
> second is _do_we_care_?

Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
start to miss tens of thousands early page allocation call sites.

The other option I can think of to not hurt your eyes is to rewrite the whole
page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
However, I have a hard-time to convince myself it is a sensible thing to do.
Michal Hocko Jan. 4, 2019, 3:17 p.m. UTC | #13
On Fri 04-01-19 10:01:40, Qian Cai wrote:
> On 1/4/19 8:09 AM, Michal Hocko wrote:
> >> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> >>
> >> == page_ext_init() after page_alloc_init_late() ==
> >> Node 0, zone DMA: page owner found early allocated 0 pages
> >> Node 0, zone DMA32: page owner found early allocated 7009 pages
> >> Node 0, zone Normal: page owner found early allocated 85827 pages
> >> Node 4, zone Normal: page owner found early allocated 75063 pages
> >>
> >> == page_ext_init() before kmemleak_init() ==
> >> Node 0, zone DMA: page owner found early allocated 0 pages
> >> Node 0, zone DMA32: page owner found early allocated 6654 pages
> >> Node 0, zone Normal: page owner found early allocated 41907 pages
> >> Node 4, zone Normal: page owner found early allocated 41356 pages
> >>
> >> So, it told us that it will miss tens of thousands of early page allocation call
> >> sites.
> > 
> > This is an answer for the first part of the question (how much). The
> > second is _do_we_care_?
> 
> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
> start to miss tens of thousands early page allocation call sites.

I am pretty sure we will hear about that when that happens. And act
accordingly.

> The other option I can think of to not hurt your eyes is to rewrite the whole
> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
> However, I have a hard-time to convince myself it is a sensible thing to do.

Or simply make the page_owner initialization only touch the already
initialized memory. Have you explored that option as well?

Look, I am trying to push for a clean solution. Hacks I have seen so
far are not convincing. You have identified a regression and as such I
would consider the most straightforward to revert the buggy commit. If
you want to improve the situation then I would suggest to think about
something that is more robust than ifdefed hacks. It might be more work
but it also might be a better thing long term.

If you think I am asking too much then you are free to ignore my
opinion. I am not a maintainer of the page_owner code.
Qian Cai Jan. 4, 2019, 3:25 p.m. UTC | #14
On 1/4/19 10:17 AM, Michal Hocko wrote:
> On Fri 04-01-19 10:01:40, Qian Cai wrote:
>> On 1/4/19 8:09 AM, Michal Hocko wrote:
>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>>>
>>>> == page_ext_init() after page_alloc_init_late() ==
>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>>>> Node 0, zone Normal: page owner found early allocated 85827 pages
>>>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>>>
>>>> == page_ext_init() before kmemleak_init() ==
>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>>>> Node 0, zone Normal: page owner found early allocated 41907 pages
>>>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>>>
>>>> So, it told us that it will miss tens of thousands of early page allocation call
>>>> sites.
>>>
>>> This is an answer for the first part of the question (how much). The
>>> second is _do_we_care_?
>>
>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
>> start to miss tens of thousands early page allocation call sites.
> 
> I am pretty sure we will hear about that when that happens. And act
> accordingly.
> 
>> The other option I can think of to not hurt your eyes is to rewrite the whole
>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
>> However, I have a hard-time to convince myself it is a sensible thing to do.
> 
> Or simply make the page_owner initialization only touch the already
> initialized memory. Have you explored that option as well?

Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
dealing with all the low-level details,

https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
Michal Hocko Jan. 4, 2019, 3:32 p.m. UTC | #15
On Fri 04-01-19 10:25:12, Qian Cai wrote:
> On 1/4/19 10:17 AM, Michal Hocko wrote:
> > On Fri 04-01-19 10:01:40, Qian Cai wrote:
> >> On 1/4/19 8:09 AM, Michal Hocko wrote:
> >>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> >>>>
> >>>> == page_ext_init() after page_alloc_init_late() ==
> >>>> Node 0, zone DMA: page owner found early allocated 0 pages
> >>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
> >>>> Node 0, zone Normal: page owner found early allocated 85827 pages
> >>>> Node 4, zone Normal: page owner found early allocated 75063 pages
> >>>>
> >>>> == page_ext_init() before kmemleak_init() ==
> >>>> Node 0, zone DMA: page owner found early allocated 0 pages
> >>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
> >>>> Node 0, zone Normal: page owner found early allocated 41907 pages
> >>>> Node 4, zone Normal: page owner found early allocated 41356 pages
> >>>>
> >>>> So, it told us that it will miss tens of thousands of early page allocation call
> >>>> sites.
> >>>
> >>> This is an answer for the first part of the question (how much). The
> >>> second is _do_we_care_?
> >>
> >> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
> >> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
> >> start to miss tens of thousands early page allocation call sites.
> > 
> > I am pretty sure we will hear about that when that happens. And act
> > accordingly.
> > 
> >> The other option I can think of to not hurt your eyes is to rewrite the whole
> >> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
> >> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
> >> However, I have a hard-time to convince myself it is a sensible thing to do.
> > 
> > Or simply make the page_owner initialization only touch the already
> > initialized memory. Have you explored that option as well?
> 
> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
> dealing with all the low-level details,
> 
> https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/

That is obviously not what I've had in mind. We have __init_single_page
which initializes a single struct page. Is there any way to hook
page_ext initialization there?
Qian Cai Jan. 4, 2019, 8:18 p.m. UTC | #16
On 1/4/19 10:32 AM, Michal Hocko wrote:
> On Fri 04-01-19 10:25:12, Qian Cai wrote:
>> On 1/4/19 10:17 AM, Michal Hocko wrote:
>>> On Fri 04-01-19 10:01:40, Qian Cai wrote:
>>>> On 1/4/19 8:09 AM, Michal Hocko wrote:
>>>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>>>>>
>>>>>> == page_ext_init() after page_alloc_init_late() ==
>>>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>>>>>> Node 0, zone Normal: page owner found early allocated 85827 pages
>>>>>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>>>>>
>>>>>> == page_ext_init() before kmemleak_init() ==
>>>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>>>>>> Node 0, zone Normal: page owner found early allocated 41907 pages
>>>>>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>>>>>
>>>>>> So, it told us that it will miss tens of thousands of early page allocation call
>>>>>> sites.
>>>>>
>>>>> This is an answer for the first part of the question (how much). The
>>>>> second is _do_we_care_?
>>>>
>>>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
>>>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
>>>> start to miss tens of thousands early page allocation call sites.
>>>
>>> I am pretty sure we will hear about that when that happens. And act
>>> accordingly.
>>>
>>>> The other option I can think of to not hurt your eyes is to rewrite the whole
>>>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
>>>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
>>>> However, I have a hard-time to convince myself it is a sensible thing to do.
>>>
>>> Or simply make the page_owner initialization only touch the already
>>> initialized memory. Have you explored that option as well?
>>
>> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
>> dealing with all the low-level details,
>>
>> https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
> 
> That is obviously not what I've had in mind. We have __init_single_page
> which initializes a single struct page. Is there any way to hook
> page_ext initialization there?

Well, the current design is,

(1) allocate page_ext physical-contiguous pages for all
    nodes.
(2) page owner gathers all early allocation pages.
(3) page owner is fully operational.

It may be possible to move (1) as early as possible just after vmalloc_init() in
start_kernel() because it depends on vmalloc(), but it still needs to call (2)
as there have had many early allocation pages already.

Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 6654 pages
Node 0, zone Normal: page owner found early allocated 40972 pages
Node 4, zone Normal: page owner found early allocated 40968 pages

But, (2) still depends on DEFERRED_STRUCT_PAGE_INIT. To get ride of this
dependency, it may be possible to add some checking in __init_single_page():

- if not done (1), save those pages to an array and defer
  page_owner early_handle initialization until done (1).

Though, I can't see any really benefit of this approach apart from "beautify"
the code.
Michal Hocko Jan. 7, 2019, 6:43 p.m. UTC | #17
On Fri 04-01-19 15:18:08, Qian Cai wrote:
[...]
> Though, I can't see any really benefit of this approach apart from "beautify"

This is not about beautifying! This is about making the code long term
maintainable. As you can see it is just too easy to break it with the
current scheme. And that is bad especially when the code is broken
because of an optimization.
Qian Cai Jan. 8, 2019, 1:53 a.m. UTC | #18
On 1/7/19 1:43 PM, Michal Hocko wrote:
> On Fri 04-01-19 15:18:08, Qian Cai wrote:
> [...]
>> Though, I can't see any really benefit of this approach apart from "beautify"
> 
> This is not about beautifying! This is about making the code long term
> maintainable. As you can see it is just too easy to break it with the
> current scheme. And that is bad especially when the code is broken
> because of an optimization.
> 

Understood, but the code is now fixed. If there is something fundamentally
broken in the future, it may be a good time then to create a looks like
hundred-line cleanup patch for long-term maintenance at the same time to fix
real bugs.
Michal Hocko Jan. 8, 2019, 8:20 a.m. UTC | #19
On Mon 07-01-19 20:53:08, Qian Cai wrote:
> 
> 
> On 1/7/19 1:43 PM, Michal Hocko wrote:
> > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > [...]
> >> Though, I can't see any really benefit of this approach apart from "beautify"
> > 
> > This is not about beautifying! This is about making the code long term
> > maintainable. As you can see it is just too easy to break it with the
> > current scheme. And that is bad especially when the code is broken
> > because of an optimization.
> > 
> 
> Understood, but the code is now fixed. If there is something fundamentally
> broken in the future, it may be a good time then to create a looks like
> hundred-line cleanup patch for long-term maintenance at the same time to fix
> real bugs.

Yeah, so revert = fix and redisign the thing to make the code more
robust longterm + allow to catch more allocation. I really fail to see
why this has to be repeated several times in this thread. Really.
Qian Cai Jan. 8, 2019, 1:19 p.m. UTC | #20
On Tue, 2019-01-08 at 09:20 +0100, Michal Hocko wrote:
> On Mon 07-01-19 20:53:08, Qian Cai wrote:
> > 
> > 
> > On 1/7/19 1:43 PM, Michal Hocko wrote:
> > > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > > [...]
> > > > Though, I can't see any really benefit of this approach apart from
> > > > "beautify"
> > > 
> > > This is not about beautifying! This is about making the code long term
> > > maintainable. As you can see it is just too easy to break it with the
> > > current scheme. And that is bad especially when the code is broken
> > > because of an optimization.
> > > 
> > 
> > Understood, but the code is now fixed. If there is something fundamentally
> > broken in the future, it may be a good time then to create a looks like
> > hundred-line cleanup patch for long-term maintenance at the same time to fix
> > real bugs.
> 
> Yeah, so revert = fix and redisign the thing to make the code more
> robust longterm + allow to catch more allocation. I really fail to see
> why this has to be repeated several times in this thread. Really.
> 

Again, this will introduce a immediately regression (arguably small) that
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
start to miss tens of thousands early page allocation call sites.

I think the disagreement comes from that you want to deal with this passively
rather than proactively that you said "I am pretty sure we will hear about that
when that happens. And act accordingly", but I think it is better to fix it now
rather than later with a 4-line ifdef which you don't like.

I suppose someone else needs to make a judgment call for this as we are in a
"you can't convince me and I can't convince you" situation right now.
Andrew Morton Jan. 8, 2019, 10:02 p.m. UTC | #21
It's unclear (to me) where we stand with this patch.  Shold we proceed
with v3 for now, or is something else planned?


From: Qian Cai <cai@lca.pw>
Subject: mm/page_owner: fix for deferred struct page init

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
  mem_init
    memblock_free_all
      free_low_memory_core_early
        reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
	continue;

However, it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel(), just move it after page_alloc_init_late() to ensure that
there is no deferred pages need to de dealt with. If deselected
DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
earlier, so page owner could catch more early page allocation call
sites. This gives us a good compromise between catching good and bad
call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.

[1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/

Link: http://lkml.kernel.org/r/20181220203156.43441-1-cai@lca.pw
Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c   |    5 +++++
 mm/page_ext.c |    3 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

--- a/init/main.c~mm-page_owner-fix-for-deferred-struct-page-init
+++ a/init/main.c
@@ -695,7 +695,9 @@ asmlinkage __visible void __init start_k
 		initrd_start = 0;
 	}
 #endif
+#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	page_ext_init();
+#endif
 	kmemleak_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
@@ -1131,6 +1133,9 @@ static noinline void __init kernel_init_
 	sched_init_smp();
 
 	page_alloc_init_late();
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+	page_ext_init();
+#endif
 
 	do_basic_setup();
 
--- a/mm/page_ext.c~mm-page_owner-fix-for-deferred-struct-page-init
+++ a/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
 			 * -------------pfn-------------->
 			 * N0 | N1 | N2 | N0 | N1 | N2|....
 			 *
-			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
 			 */
-			if (early_pfn_to_nid(pfn) != nid)
+			if (pfn_to_nid(pfn) != nid)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
Qian Cai Jan. 8, 2019, 10:13 p.m. UTC | #22
On 1/8/19 5:02 PM, Andrew Morton wrote:
> 
> It's unclear (to me) where we stand with this patch.  Shold we proceed
> with v3 for now, or is something else planned?

I don't have anything else plan for this right now. Michal particular don't like
that 4-line ifdef which supposes to avoid an immediately regression (arguably
small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected
that would start to miss tens of thousands early page allocation call sites. So,
feel free to choose v2 of this which has no ifdef if you agree with Michal too.
I am fine either way.
Michal Hocko Jan. 8, 2019, 10:40 p.m. UTC | #23
On Tue 08-01-19 17:13:41, Qian Cai wrote:
> 
> 
> On 1/8/19 5:02 PM, Andrew Morton wrote:
> > 
> > It's unclear (to me) where we stand with this patch.  Shold we proceed
> > with v3 for now, or is something else planned?
> 
> I don't have anything else plan for this right now. Michal particular don't like
> that 4-line ifdef which supposes to avoid an immediately regression (arguably
> small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected
> that would start to miss tens of thousands early page allocation call sites. So,
> feel free to choose v2 of this which has no ifdef if you agree with Michal too.
> I am fine either way.

Yes I would prefer to revert the faulty commit (fe53ca54270) and work on
a more robust page_owner initialization to cover earlier allocations on
top of that.

Not that I would insist but this would be a more straightforward
approach and I hope it will result in a better long term maintainable
code in the end.

Patch
diff mbox series

diff --git a/init/main.c b/init/main.c
index 2b7b7fe173c9..5d9904370f76 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,9 @@  asmlinkage __visible void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
+#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	page_ext_init();
+#endif
 	kmemleak_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
@@ -1147,6 +1149,9 @@  static noinline void __init kernel_init_freeable(void)
 	sched_init_smp();
 
 	page_alloc_init_late();
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+	page_ext_init();
+#endif
 
 	do_basic_setup();
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..d76fd51e312a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -399,9 +399,8 @@  void __init page_ext_init(void)
 			 * -------------pfn-------------->
 			 * N0 | N1 | N2 | N0 | N1 | N2|....
 			 *
-			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
 			 */
-			if (early_pfn_to_nid(pfn) != nid)
+			if (pfn_to_nid(pfn) != nid)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;