
[v3] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

Message ID 20180615072947.GB23273@hori1.linux.bs1.fc.nec.co.jp (mailing list archive)
State New, archived

Commit Message

Naoya Horiguchi June 15, 2018, 7:29 a.m. UTC
Hi,
I updated the patch, so let me share it.

# I'll be offline early next week and may be back on Wednesday.
# Have a nice weekend.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Thu, 14 Jun 2018 16:04:36 +0900
Subject: [PATCH] x86/e820: put !E820_TYPE_RAM regions into memblock.reserved

There is a kernel panic that is triggered when reading /proc/kpageflags
on a kernel booted with the parameter 'memmap=nn[KMG]!ss[KMG]':

  BUG: unable to handle kernel paging request at fffffffffffffffe
  PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
  Oops: 0000 [#1] SMP PTI
  CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
  RIP: 0010:stable_page_flags+0x27/0x3c0
  Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
  RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
  RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
  RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
  R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
  R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
  FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
  Call Trace:
   kpageflags_read+0xc7/0x120
   proc_reg_read+0x3c/0x60
   __vfs_read+0x36/0x170
   vfs_read+0x89/0x130
   ksys_pread64+0x71/0x90
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7efc42e75e23
  Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8d9, which changed how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone. Consider
a VM with 2 NUMA nodes, each with 4GB of memory, where the default
(no memmap= given) memblock layout is as below:

  MEMBLOCK configuration:
   memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x4
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
   memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

If you pass memmap=1G!4G (which exactly covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

  MEMBLOCK configuration:
   memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x3
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

This shrinks node 0's pfn range, because the node range is calculated
from the address ranges in memblock.memory, so some of the struct pages
in the gap are left uninitialized.
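
For reference, the node span is computed roughly as in
get_pfn_range_for_nid() in mm/page_alloc.c (a simplified sketch, not
the verbatim kernel code):

	void get_pfn_range_for_nid(unsigned int nid,
				   unsigned long *start_pfn,
				   unsigned long *end_pfn)
	{
		unsigned long this_start_pfn, this_end_pfn;
		int i;

		*start_pfn = -1UL;
		*end_pfn = 0;

		/* Only ranges present in memblock.memory are visited, so
		 * a range removed by memmap= no longer extends the span. */
		for_each_mem_pfn_range(i, nid, &this_start_pfn,
				       &this_end_pfn, NULL) {
			*start_pfn = min(*start_pfn, this_start_pfn);
			*end_pfn = max(*end_pfn, this_end_pfn);
		}

		if (*start_pfn == -1UL)
			*start_pfn = 0;
	}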

We have a function zero_resv_unavail() which zeroes the struct pages
within the reserved unavailable ranges (i.e. memblock.reserved &&
!memblock.memory). This patch utilizes it to cover all unavailable
ranges by putting them into memblock.reserved.
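
For reference, a condensed sketch of zero_resv_unavail() (simplified
from mm/page_alloc.c of this period; the real function also counts and
reports the zeroed pages):

	void __paginginit zero_resv_unavail(void)
	{
		phys_addr_t start, end;
		unsigned long pfn;
		u64 i;

		/* Walk ranges that are in memblock.reserved but not in
		 * memblock.memory and zero each valid struct page. */
		for_each_resv_unavail_range(i, &start, &end) {
			for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
				if (!pfn_valid(ALIGN_DOWN(pfn,
							  pageblock_nr_pages)))
					continue;
				mm_zero_struct_page(pfn_to_page(pfn));
			}
		}
	}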

Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/kernel/e820.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Comments

Michal Hocko June 15, 2018, 8:41 a.m. UTC | #1
On Fri 15-06-18 07:29:48, Naoya Horiguchi wrote:
> [...]

OK, this makes sense to me. It is definitely much better than the
original attempt.

Unless I am missing something, this should be correct.
Acked-by: Michal Hocko <mhocko@suse.com>

Pavel Tatashin June 15, 2018, 2 p.m. UTC | #2
On 18-06-15 10:41:42, Michal Hocko wrote:
> On Fri 15-06-18 07:29:48, Naoya Horiguchi wrote:
> [...]
> > [...]
> 
> OK, this makes sense to me. It is definitely much better than the
> original attempt.
> 
> Unless I am missing something, this should be correct.
> Acked-by: Michal Hocko <mhocko@suse.com>

First of all thank you Naoya for finding and root causing this issue.

So, with this fix we reserve any hole, and any e820 region that is
neither E820_TYPE_RAM nor E820_TYPE_RESERVED_KERN.  I think this will
work because we do a pfn_valid() check in zero_resv_unavail(), so the
ranges that do not have backing struct pages will be skipped. But I am
worried about the performance implications when the holes of invalid
memory are rather large: we would have to loop through them in
zero_resv_unavail() one pfn at a time.

Therefore, we might also need to optimize zero_resv_unavail() a little like
this:

6407			if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
6408				continue;

Add before "continue":
	pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + pageblock_nr_pages - 1.
At least, this way, we would skip a whole pageblock of invalid memory at a time.
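
In other words, the pfn loop would become something like this (a sketch
of the suggestion only, not a final patch):

	for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
			/* Jump to the last pfn of the invalid pageblock;
			 * the loop's pfn++ lands on the next block. */
			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
				+ pageblock_nr_pages - 1;
			continue;
		}
		mm_zero_struct_page(pfn_to_page(pfn));
	}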

For the patch above:
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>

But, I think the 2nd patch with the optimization above should go along
with this fix.

Thank you,
Pasha

Michal Hocko June 15, 2018, 2:10 p.m. UTC | #3
On Fri 15-06-18 10:00:00, Pavel Tatashin wrote:
[...]
> But, I think the 2nd patch with the optimization above should go along
> with this fix.

Yes, ideally with some numbers.
Oscar Salvador June 15, 2018, 2:33 p.m. UTC | #4
On Fri, Jun 15, 2018 at 10:00:00AM -0400, Pavel Tatashin wrote:
> On 18-06-15 10:41:42, Michal Hocko wrote:
> > [...]
> 
> First of all thank you Naoya for finding and root causing this issue.
> 
> So, with this fix we reserve any hole, and any e820 region that is
> neither E820_TYPE_RAM nor E820_TYPE_RESERVED_KERN.  I think this will
> work because we do a pfn_valid() check in zero_resv_unavail(), so the
> ranges that do not have backing struct pages will be skipped. But I am
> worried about the performance implications when the holes of invalid
> memory are rather large: we would have to loop through them in
> zero_resv_unavail() one pfn at a time.
> 
> Therefore, we might also need to optimize zero_resv_unavail() a little like
> this:
> 
> 6407			if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
> 6408				continue;
> 
> Add before "continue":
> 	pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) + pageblock_nr_pages - 1.
> At least, this way, we would skip a whole pageblock of invalid memory at a time.
> 
> For the patch above:
> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> 
> But, I think the 2nd patch with the optimization above should go along
> with this fix.

Hi Pavel,

I think this makes a lot of sense.
Since Naoya is out until Wednesday, maybe I will give it a shot next week and see if I can gather some numbers.

Best Regards
Oscar Salvador
Pavel Tatashin June 15, 2018, 4:02 p.m. UTC | #5
> Hi Pavel,
>
> I think this makes a lot of sense.
> Since Naoya is out until Wednesday, maybe I give it a shot next week and see if I can gather some numbers.

Hi Oscar,

Thank you for the offer to do this. Since sched_clock() is not yet
initialized at the time zero_resv_unavail() is called, it is difficult
to measure during boot. But I had my x86 early boot timestamp patches
handy, so I could measure it, and thus decided to submit the patch:
http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com

Thank you,
Pavel

Andrew Morton June 18, 2018, 11:36 p.m. UTC | #6
On Fri, 15 Jun 2018 10:00:00 -0400 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> [...]
> 
> But, I think the 2nd patch with the optimization above should go along
> with this fix.

So I expect this patch needs a cc:stable, which I'll add.

The optimization patch seems less important and I'd like to hold that
off for 4.19-rc1?
Pavel Tatashin June 19, 2018, 12:49 a.m. UTC | #7
> So I expect this patch needs a cc:stable, which I'll add.
Yes.
> The optimization patch seems less important and I'd like to hold that
> off for 4.19-rc1?
I agree, the optimization is not as important, and can wait for 4.19.
Pavel Tatashin July 2, 2018, 8:05 p.m. UTC | #8
> So I expect this patch needs a cc:stable, which I'll add.
>
> The optimization patch seems less important and I'd like to hold that
> off for 4.19-rc1?

Hi Andrew,

Should I resend the optimization patch [1] once 4.18 is released, or
will you include it so that I do not need to do anything?

[1] http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com

Thank you,
Pavel
Andrew Morton July 2, 2018, 8:28 p.m. UTC | #9
On Mon, 2 Jul 2018 16:05:04 -0400 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> > So I expect this patch needs a cc:stable, which I'll add.
> >
> > The optimization patch seems less important and I'd like to hold that
> > off for 4.19-rc1?
> 
> Hi Andrew,
> 
> Should I resend the optimization patch [1] once 4.18 is released, or
> will you include it, and I do not need to do anything?
> 
> [1] http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
> 

http://ozlabs.org/~akpm/mmots/broken-out/mm-skip-invalid-pages-block-at-a-time-in-zero_resv_unresv.patch
has been in -mm since Jun 18, so all is well.
Pavel Tatashin July 2, 2018, 8:31 p.m. UTC | #10
> http://ozlabs.org/~akpm/mmots/broken-out/mm-skip-invalid-pages-block-at-a-time-in-zero_resv_unresv.patch
> has been in -mm since Jun 18, so all is well.

Ah, I missed it. Thank you.

Pavel

Patch

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index d1f25c831447..c88c23c658c1 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1248,6 +1248,7 @@  void __init e820__memblock_setup(void)
 {
 	int i;
 	u64 end;
+	u64 addr = 0;
 
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
@@ -1264,13 +1265,21 @@  void __init e820__memblock_setup(void)
 		struct e820_entry *entry = &e820_table->entries[i];
 
 		end = entry->addr + entry->size;
+		if (addr < entry->addr)
+			memblock_reserve(addr, entry->addr - addr);
+		addr = end;
 		if (end != (resource_size_t)end)
 			continue;
 
+		/*
+		 * all !E820_TYPE_RAM ranges (including gap ranges) are put
+		 * into memblock.reserved to make sure that struct pages in
+		 * such regions are not left uninitialized after bootup.
+		 */
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
-			continue;
-
-		memblock_add(entry->addr, entry->size);
+			memblock_reserve(entry->addr, entry->size);
+		else
+			memblock_add(entry->addr, entry->size);
 	}
 
 	/* Throw away partial pages: */