diff mbox

Revert "mm/page_alloc: fix memmap_init_zone pageblock alignment"

Message ID 20180314134431.13241-1-ard.biesheuvel@linaro.org (mailing list archive)
State New, archived
Headers show

Commit Message

Ard Biesheuvel March 14, 2018, 1:44 p.m. UTC
This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

It breaks the boot on my Socionext SynQuacer based system, because
it enters an infinite loop iterating over the pfns.

Adding the following debug output to memmap_init_zone()

  --- a/mm/page_alloc.c
  +++ b/mm/page_alloc.c
  @@ -5365,6 +5365,11 @@
   			 * the valid region but still depends on correct page
   			 * metadata.
   			 */
  +			pr_err("pfn:%lx oldnext:%lx newnext:%lx\n", pfn,
  +				memblock_next_valid_pfn(pfn, end_pfn) - 1,
  +				(memblock_next_valid_pfn(pfn, end_pfn) &
  +					~(pageblock_nr_pages-1)) - 1);
  +
   			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
   					~(pageblock_nr_pages-1)) - 1;
   #endif

results in

   Booting Linux on physical CPU 0x0000000000 [0x410fd034]
   Linux version 4.16.0-rc5-00004-gfc6eabbbf8ef-dirty (ard@dogfood) ...
   Machine model: Socionext Developer Box
   earlycon: pl11 at MMIO 0x000000002a400000 (options '')
   bootconsole [pl11] enabled
   efi: Getting EFI parameters from FDT:
   efi: EFI v2.70 by Linaro
   efi:  SMBIOS 3.0=0xff580000  ESRT=0xf9948198  MEMATTR=0xf83b1a98  RNG=0xff7ac898
   random: fast init done
   efi: seeding entropy pool
   esrt: Reserving ESRT space from 0x00000000f9948198 to 0x00000000f99481d0.
   cma: Reserved 16 MiB at 0x00000000fd800000
   NUMA: No NUMA configuration found
   NUMA: Faking a node at [mem 0x0000000000000000-0x0000000fffffffff]
   NUMA: NODE_DATA [mem 0xffffd8d80-0xffffda87f]
   Zone ranges:
     DMA32    [mem 0x0000000080000000-0x00000000ffffffff]
     Normal   [mem 0x0000000100000000-0x0000000fffffffff]
   Movable zone start for each node
   Early memory node ranges
     node   0: [mem 0x0000000080000000-0x00000000febeffff]
     node   0: [mem 0x00000000febf0000-0x00000000fefcffff]
     node   0: [mem 0x00000000fefd0000-0x00000000ff43ffff]
     node   0: [mem 0x00000000ff440000-0x00000000ff7affff]
     node   0: [mem 0x00000000ff7b0000-0x00000000ffffffff]
     node   0: [mem 0x0000000880000000-0x0000000fffffffff]
   Initmem setup node 0 [mem 0x0000000080000000-0x0000000fffffffff]
   pfn:febf0 oldnext:febf0 newnext:fe9ff
   pfn:febf0 oldnext:febf0 newnext:fe9ff
   pfn:febf0 oldnext:febf0 newnext:fe9ff
   etc etc

and the boot never proceeds after this point.

So the logic is obviously flawed, and so it is best to revert this at
the current -rc stage (unless someone can fix the logic instead)

Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
Cc: Daniel Vacek <neelx@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 mm/page_alloc.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

Comments

Michal Hocko March 14, 2018, 2:13 p.m. UTC | #1
Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
fix your issue? From the debugging info you provided it should because
the patch prevents jumping backwards.

On Wed 14-03-18 13:44:31, Ard Biesheuvel wrote:
> This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.
> 
> It breaks the boot on my Socionext SynQuacer based system, because
> it enters an infinite loop iterating over the pfns.
> 
> Adding the following debug output to memmap_init_zone()
> 
>   --- a/mm/page_alloc.c
>   +++ b/mm/page_alloc.c
>   @@ -5365,6 +5365,11 @@
>    			 * the valid region but still depends on correct page
>    			 * metadata.
>    			 */
>   +			pr_err("pfn:%lx oldnext:%lx newnext:%lx\n", pfn,
>   +				memblock_next_valid_pfn(pfn, end_pfn) - 1,
>   +				(memblock_next_valid_pfn(pfn, end_pfn) &
>   +					~(pageblock_nr_pages-1)) - 1);
>   +
>    			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>    					~(pageblock_nr_pages-1)) - 1;
>    #endif
> 
> results in
> 
>    Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>    Linux version 4.16.0-rc5-00004-gfc6eabbbf8ef-dirty (ard@dogfood) ...
>    Machine model: Socionext Developer Box
>    earlycon: pl11 at MMIO 0x000000002a400000 (options '')
>    bootconsole [pl11] enabled
>    efi: Getting EFI parameters from FDT:
>    efi: EFI v2.70 by Linaro
>    efi:  SMBIOS 3.0=0xff580000  ESRT=0xf9948198  MEMATTR=0xf83b1a98  RNG=0xff7ac898
>    random: fast init done
>    efi: seeding entropy pool
>    esrt: Reserving ESRT space from 0x00000000f9948198 to 0x00000000f99481d0.
>    cma: Reserved 16 MiB at 0x00000000fd800000
>    NUMA: No NUMA configuration found
>    NUMA: Faking a node at [mem 0x0000000000000000-0x0000000fffffffff]
>    NUMA: NODE_DATA [mem 0xffffd8d80-0xffffda87f]
>    Zone ranges:
>      DMA32    [mem 0x0000000080000000-0x00000000ffffffff]
>      Normal   [mem 0x0000000100000000-0x0000000fffffffff]
>    Movable zone start for each node
>    Early memory node ranges
>      node   0: [mem 0x0000000080000000-0x00000000febeffff]
>      node   0: [mem 0x00000000febf0000-0x00000000fefcffff]
>      node   0: [mem 0x00000000fefd0000-0x00000000ff43ffff]
>      node   0: [mem 0x00000000ff440000-0x00000000ff7affff]
>      node   0: [mem 0x00000000ff7b0000-0x00000000ffffffff]
>      node   0: [mem 0x0000000880000000-0x0000000fffffffff]
>    Initmem setup node 0 [mem 0x0000000080000000-0x0000000fffffffff]
>    pfn:febf0 oldnext:febf0 newnext:fe9ff
>    pfn:febf0 oldnext:febf0 newnext:fe9ff
>    pfn:febf0 oldnext:febf0 newnext:fe9ff
>    etc etc
> 
> and the boot never proceeds after this point.
> 
> So the logic is obviously flawed, and so it is best to revert this at
> the current -rc stage (unless someone can fix the logic instead)
> 
> Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
> Cc: Daniel Vacek <neelx@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Paul Burton <paul.burton@imgtec.com>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  mm/page_alloc.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3d974cb2a1a1..cb416723538f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5359,14 +5359,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  			/*
>  			 * Skip to the pfn preceding the next valid one (or
>  			 * end_pfn), such that we hit a valid pfn (or end_pfn)
> -			 * on our next iteration of the loop. Note that it needs
> -			 * to be pageblock aligned even when the region itself
> -			 * is not. move_freepages_block() can shift ahead of
> -			 * the valid region but still depends on correct page
> -			 * metadata.
> +			 * on our next iteration of the loop.
>  			 */
> -			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
> -					~(pageblock_nr_pages-1)) - 1;
> +			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
>  #endif
>  			continue;
>  		}
> -- 
> 2.15.1
>
Ard Biesheuvel March 14, 2018, 2:35 p.m. UTC | #2
On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:
> Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
> fix your issue? From the debugging info you provided it should because
> the patch prevents jumping backwards.
>

The patch does fix the boot hang.

But I am concerned that we are papering over a fundamental flaw in
memblock_next_valid_pfn(). If that does not always produce the next
valid PFN, surely we should be fixing *that* rather than dealing with
it here by rounding, aligning and keeping track of whether we are
advancing or not?

So in my opinion, this patch should still be reverted, and the
underlying issue fixed properly instead.



> On Wed 14-03-18 13:44:31, Ard Biesheuvel wrote:
>> This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.
>>
>> It breaks the boot on my Socionext SynQuacer based system, because
>> it enters an infinite loop iterating over the pfns.
>>
>> Adding the following debug output to memmap_init_zone()
>>
>>   --- a/mm/page_alloc.c
>>   +++ b/mm/page_alloc.c
>>   @@ -5365,6 +5365,11 @@
>>                        * the valid region but still depends on correct page
>>                        * metadata.
>>                        */
>>   +                   pr_err("pfn:%lx oldnext:%lx newnext:%lx\n", pfn,
>>   +                           memblock_next_valid_pfn(pfn, end_pfn) - 1,
>>   +                           (memblock_next_valid_pfn(pfn, end_pfn) &
>>   +                                   ~(pageblock_nr_pages-1)) - 1);
>>   +
>>                       pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>>                                       ~(pageblock_nr_pages-1)) - 1;
>>    #endif
>>
>> results in
>>
>>    Booting Linux on physical CPU 0x0000000000 [0x410fd034]
>>    Linux version 4.16.0-rc5-00004-gfc6eabbbf8ef-dirty (ard@dogfood) ...
>>    Machine model: Socionext Developer Box
>>    earlycon: pl11 at MMIO 0x000000002a400000 (options '')
>>    bootconsole [pl11] enabled
>>    efi: Getting EFI parameters from FDT:
>>    efi: EFI v2.70 by Linaro
>>    efi:  SMBIOS 3.0=0xff580000  ESRT=0xf9948198  MEMATTR=0xf83b1a98  RNG=0xff7ac898
>>    random: fast init done
>>    efi: seeding entropy pool
>>    esrt: Reserving ESRT space from 0x00000000f9948198 to 0x00000000f99481d0.
>>    cma: Reserved 16 MiB at 0x00000000fd800000
>>    NUMA: No NUMA configuration found
>>    NUMA: Faking a node at [mem 0x0000000000000000-0x0000000fffffffff]
>>    NUMA: NODE_DATA [mem 0xffffd8d80-0xffffda87f]
>>    Zone ranges:
>>      DMA32    [mem 0x0000000080000000-0x00000000ffffffff]
>>      Normal   [mem 0x0000000100000000-0x0000000fffffffff]
>>    Movable zone start for each node
>>    Early memory node ranges
>>      node   0: [mem 0x0000000080000000-0x00000000febeffff]
>>      node   0: [mem 0x00000000febf0000-0x00000000fefcffff]
>>      node   0: [mem 0x00000000fefd0000-0x00000000ff43ffff]
>>      node   0: [mem 0x00000000ff440000-0x00000000ff7affff]
>>      node   0: [mem 0x00000000ff7b0000-0x00000000ffffffff]
>>      node   0: [mem 0x0000000880000000-0x0000000fffffffff]
>>    Initmem setup node 0 [mem 0x0000000080000000-0x0000000fffffffff]
>>    pfn:febf0 oldnext:febf0 newnext:fe9ff
>>    pfn:febf0 oldnext:febf0 newnext:fe9ff
>>    pfn:febf0 oldnext:febf0 newnext:fe9ff
>>    etc etc
>>
>> and the boot never proceeds after this point.
>>
>> So the logic is obviously flawed, and so it is best to revert this at
>> the current -rc stage (unless someone can fix the logic instead)
>>
>> Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
>> Cc: Daniel Vacek <neelx@redhat.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Paul Burton <paul.burton@imgtec.com>
>> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>  mm/page_alloc.c | 9 ++-------
>>  1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3d974cb2a1a1..cb416723538f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5359,14 +5359,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>                       /*
>>                        * Skip to the pfn preceding the next valid one (or
>>                        * end_pfn), such that we hit a valid pfn (or end_pfn)
>> -                      * on our next iteration of the loop. Note that it needs
>> -                      * to be pageblock aligned even when the region itself
>> -                      * is not. move_freepages_block() can shift ahead of
>> -                      * the valid region but still depends on correct page
>> -                      * metadata.
>> +                      * on our next iteration of the loop.
>>                        */
>> -                     pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
>> -                                     ~(pageblock_nr_pages-1)) - 1;
>> +                     pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
>>  #endif
>>                       continue;
>>               }
>> --
>> 2.15.1
>>
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko March 14, 2018, 2:54 p.m. UTC | #3
On Wed 14-03-18 14:35:12, Ard Biesheuvel wrote:
> On 14 March 2018 at 14:13, Michal Hocko <mhocko@kernel.org> wrote:
> > Does http://lkml.kernel.org/r/20180313224240.25295-1-neelx@redhat.com
> > fix your issue? From the debugging info you provided it should because
> > the patch prevents jumping backwards.
> >
> 
> The patch does fix the boot hang.
> 
> But I am concerned that we are papering over a fundamental flaw in
> memblock_next_valid_pfn().

It seems that memblock_next_valid_pfn is doing the right thing here. It
is the alignment which moves the pfn back AFAICS. I am not really
impressed about the original patch either, to be completely honest.
It just looks awfully tricky. I still didn't manage to wrap my head
around the original issue though so I do not have much better ideas to
be honest.
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d974cb2a1a1..cb416723538f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5359,14 +5359,9 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			/*
 			 * Skip to the pfn preceding the next valid one (or
 			 * end_pfn), such that we hit a valid pfn (or end_pfn)
-			 * on our next iteration of the loop. Note that it needs
-			 * to be pageblock aligned even when the region itself
-			 * is not. move_freepages_block() can shift ahead of
-			 * the valid region but still depends on correct page
-			 * metadata.
+			 * on our next iteration of the loop.
 			 */
-			pfn = (memblock_next_valid_pfn(pfn, end_pfn) &
-					~(pageblock_nr_pages-1)) - 1;
+			pfn = memblock_next_valid_pfn(pfn, end_pfn) - 1;
 #endif
 			continue;
 		}