x86/mm: Fix boot with some memory above MAXMEM

Message ID 20200511191721.1416-1-kirill.shutemov@linux.intel.com

Commit Message

Kirill A. Shutemov May 11, 2020, 7:17 p.m. UTC
A 5-level paging capable machine can have memory above 46-bit in the
physical address space. This memory is only addressable in the 5-level
paging mode: we don't have enough virtual address space to create direct
mapping for such memory in the 4-level paging mode.

Currently, we fail boot completely: NULL pointer dereference in
subsection_map_init().

Skip creating a memblock for such memory instead and notify user that
some memory is not addressable.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: stable@vger.kernel.org # v4.14
---

Tested with a hacked QEMU: https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d

---
 arch/x86/kernel/e820.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Comments

Kirill A. Shutemov May 25, 2020, 4:49 a.m. UTC | #1
On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> A 5-level paging capable machine can have memory above 46-bit in the
> physical address space. This memory is only addressable in the 5-level
> paging mode: we don't have enough virtual address space to create direct
> mapping for such memory in the 4-level paging mode.
> 
> Currently, we fail boot completely: NULL pointer dereference in
> subsection_map_init().
> 
> Skip creating a memblock for such memory instead and notify user that
> some memory is not addressable.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
> Cc: stable@vger.kernel.org # v4.14
> ---

Gentle ping.

It's not urgent, but it's a bug fix. Please consider applying.

> Tested with a hacked QEMU: https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> 
> ---
>  arch/x86/kernel/e820.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c5399e80c59c..d320d37d0f95 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
>  
>  void __init e820__memblock_setup(void)
>  {
> +	u64 size, end, not_addressable = 0;
>  	int i;
> -	u64 end;
>  
>  	/*
>  	 * The bootstrap memblock region count maximum is 128 entries
> @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>  			continue;
>  
> -		memblock_add(entry->addr, entry->size);
> +		if (entry->addr >= MAXMEM) {
> +			not_addressable += entry->size;
> +			continue;
> +		}
> +
> +		end = min_t(u64, end, MAXMEM - 1);
> +		size = end - entry->addr;
> +		not_addressable += entry->size - size;
> +		memblock_add(entry->addr, size);
> +	}
> +
> +	if (not_addressable) {
> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
> +		       not_addressable >> 30);
> +		if (!pgtable_l5_enabled())
> +			pr_err("Consider enabling 5-level paging\n");
>  	}
>  
>  	/* Throw away partial pages: */
> -- 
> 2.26.2
> 
>
Mike Rapoport May 25, 2020, 2:59 p.m. UTC | #2
On Mon, May 25, 2020 at 07:49:02AM +0300, Kirill A. Shutemov wrote:
> On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> > A 5-level paging capable machine can have memory above 46-bit in the
> > physical address space. This memory is only addressable in the 5-level
> > paging mode: we don't have enough virtual address space to create direct
> > mapping for such memory in the 4-level paging mode.
> > 
> > Currently, we fail boot completely: NULL pointer dereference in
> > subsection_map_init().
> > 
> > Skip creating a memblock for such memory instead and notify user that
> > some memory is not addressable.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Reviewed-by: Dave Hansen <dave.hansen@intel.com>
> > Cc: stable@vger.kernel.org # v4.14
> > ---
> 
> Gentle ping.
> 
> It's not urgent, but it's a bug fix. Please consider applying.
> 
> > Tested with a hacked QEMU: https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> > 
> > ---
> >  arch/x86/kernel/e820.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index c5399e80c59c..d320d37d0f95 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
> >  
> >  void __init e820__memblock_setup(void)
> >  {
> > +	u64 size, end, not_addressable = 0;
> >  	int i;
> > -	u64 end;
> >  
> >  	/*
> >  	 * The bootstrap memblock region count maximum is 128 entries
> > @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
> >  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> >  			continue;
> >  
> > -		memblock_add(entry->addr, entry->size);
> > +		if (entry->addr >= MAXMEM) {
> > +			not_addressable += entry->size;
> > +			continue;
> > +		}
> > +
> > +		end = min_t(u64, end, MAXMEM - 1);
> > +		size = end - entry->addr;
> > +		not_addressable += entry->size - size;
> > +		memblock_add(entry->addr, size);
> > +	}
> > +
> > +	if (not_addressable) {
> > +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
> > +		       not_addressable >> 30);
> > +		if (!pgtable_l5_enabled())
> > +			pr_err("Consider enabling 5-level paging\n");

Could this happen at all when l5 is enabled?
Does it mean we need kmap() for 64-bit?

> >  	}
> >  
> >  	/* Throw away partial pages: */
> > -- 
> > 2.26.2
> > 
> > 
> 
> -- 
>  Kirill A. Shutemov
>
Kirill A. Shutemov May 25, 2020, 3:08 p.m. UTC | #3
On Mon, May 25, 2020 at 05:59:43PM +0300, Mike Rapoport wrote:
> On Mon, May 25, 2020 at 07:49:02AM +0300, Kirill A. Shutemov wrote:
> > On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> > > A 5-level paging capable machine can have memory above 46-bit in the
> > > physical address space. This memory is only addressable in the 5-level
> > > paging mode: we don't have enough virtual address space to create direct
> > > mapping for such memory in the 4-level paging mode.
> > > 
> > > Currently, we fail boot completely: NULL pointer dereference in
> > > subsection_map_init().
> > > 
> > > Skip creating a memblock for such memory instead and notify user that
> > > some memory is not addressable.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Reviewed-by: Dave Hansen <dave.hansen@intel.com>
> > > Cc: stable@vger.kernel.org # v4.14
> > > ---
> > 
> > Gentle ping.
> > 
> > It's not urgent, but it's a bug fix. Please consider applying.
> > 
> > > Tested with a hacked QEMU: https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> > > 
> > > ---
> > >  arch/x86/kernel/e820.c | 19 +++++++++++++++++--
> > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index c5399e80c59c..d320d37d0f95 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
> > >  
> > >  void __init e820__memblock_setup(void)
> > >  {
> > > +	u64 size, end, not_addressable = 0;
> > >  	int i;
> > > -	u64 end;
> > >  
> > >  	/*
> > >  	 * The bootstrap memblock region count maximum is 128 entries
> > > @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
> > >  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > >  			continue;
> > >  
> > > -		memblock_add(entry->addr, entry->size);
> > > +		if (entry->addr >= MAXMEM) {
> > > +			not_addressable += entry->size;
> > > +			continue;
> > > +		}
> > > +
> > > +		end = min_t(u64, end, MAXMEM - 1);
> > > +		size = end - entry->addr;
> > > +		not_addressable += entry->size - size;
> > > +		memblock_add(entry->addr, size);
> > > +	}
> > > +
> > > +	if (not_addressable) {
> > > +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
> > > +		       not_addressable >> 30);
> > > +		if (!pgtable_l5_enabled())
> > > +			pr_err("Consider enabling 5-level paging\n");
> 
> Could this happen at all when l5 is enabled?
> Does it mean we need kmap() for 64-bit?

It's future-proofing. Who knows what paging modes we'll have in the
future.
Mike Rapoport May 25, 2020, 3:58 p.m. UTC | #4
On Mon, May 25, 2020 at 06:08:20PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 05:59:43PM +0300, Mike Rapoport wrote:
> > On Mon, May 25, 2020 at 07:49:02AM +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 11, 2020 at 10:17:21PM +0300, Kirill A. Shutemov wrote:
> > > > A 5-level paging capable machine can have memory above 46-bit in the
> > > > physical address space. This memory is only addressable in the 5-level
> > > > paging mode: we don't have enough virtual address space to create direct
> > > > mapping for such memory in the 4-level paging mode.
> > > > 
> > > > Currently, we fail boot completely: NULL pointer dereference in
> > > > subsection_map_init().
> > > > 
> > > > Skip creating a memblock for such memory instead and notify user that
> > > > some memory is not addressable.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Reviewed-by: Dave Hansen <dave.hansen@intel.com>
> > > > Cc: stable@vger.kernel.org # v4.14
> > > > ---
> > > 
> > > Gentle ping.
> > > 
> > > It's not urgent, but it's a bug fix. Please consider applying.
> > > 
> > > > Tested with a hacked QEMU: https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> > > > 
> > > > ---
> > > >  arch/x86/kernel/e820.c | 19 +++++++++++++++++--
> > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index c5399e80c59c..d320d37d0f95 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
> > > >  
> > > >  void __init e820__memblock_setup(void)
> > > >  {
> > > > +	u64 size, end, not_addressable = 0;
> > > >  	int i;
> > > > -	u64 end;
> > > >  
> > > >  	/*
> > > >  	 * The bootstrap memblock region count maximum is 128 entries
> > > > @@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
> > > >  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > > >  			continue;
> > > >  
> > > > -		memblock_add(entry->addr, entry->size);
> > > > +		if (entry->addr >= MAXMEM) {
> > > > +			not_addressable += entry->size;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		end = min_t(u64, end, MAXMEM - 1);
> > > > +		size = end - entry->addr;
> > > > +		not_addressable += entry->size - size;
> > > > +		memblock_add(entry->addr, size);
> > > > +	}
> > > > +
> > > > +	if (not_addressable) {
> > > > +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
> > > > +		       not_addressable >> 30);
> > > > +		if (!pgtable_l5_enabled())
> > > > +			pr_err("Consider enabling 5-level paging\n");
> > 
> > Could this happen at all when l5 is enabled?
> > Does it mean we need kmap() for 64-bit?
> 
> It's future-proofing. Who knows what paging modes we'll have in the
> future.

Then maybe

	pr_err("%lldGB of physical memory is not addressable in the %s paging mode\n",
	       not_addressable >> 30, pgtable_l5_enabled() ? "5-level" : "4-level");

"the paging mode" on its own sounds a bit awkward to me.

> -- 
>  Kirill A. Shutemov
Dave Hansen May 26, 2020, 2:27 p.m. UTC | #5
On 5/25/20 8:08 AM, Kirill A. Shutemov wrote:
>>>> +	if (not_addressable) {
>>>> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
>>>> +		       not_addressable >> 30);
>>>> +		if (!pgtable_l5_enabled())
>>>> +			pr_err("Consider enabling 5-level paging\n");
>> Could this happen at all when l5 is enabled?
>> Does it mean we need kmap() for 64-bit?
> It's future-proofing. Who knows what paging modes we'll have in the
> future.

Future-proofing and firmware-proofing. :)

In any case, are we *really* limited to 52 bits of physical memory with
5-level paging?  Previously, we said we were limited to 46 bits, and now
we're saying that the limit is 52 with 5-level paging:

#define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)

The 46 was fine with the 48 bits of address space on 4-level paging
systems since we need 1/2 of the address space for userspace, 1/4 for
the direct map and 1/4 for the vmalloc-and-friends area.  At 46 bits of
address space, we fill up the direct map.

The hardware designers know this and never enumerated a MAXPHYADDR from
CPUID which was higher than what we could cover with 46 bits.  It was
nice and convenient that these two separate things matched:
1. The amount of physical address space addressable in a direct map
   consuming 1/4 of the virtual address space.
2. The CPU-enumerated MAXPHYADDR which among other things dictates how
   much physical address space is addressable in a PTE.

But, with 5-level paging, things are a little different.  The limit in
addressable memory because of running out of the direct map actually
happens at 55 bits (57-2=55, analogous to the 4-level 48-2=46).

So shouldn't it technically be this:

#define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 55 : 46)

?
Kirill A. Shutemov June 2, 2020, 11:18 p.m. UTC | #6
On Tue, May 26, 2020 at 07:27:15AM -0700, Dave Hansen wrote:
> On 5/25/20 8:08 AM, Kirill A. Shutemov wrote:
> >>>> +	if (not_addressable) {
> >>>> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
> >>>> +		       not_addressable >> 30);
> >>>> +		if (!pgtable_l5_enabled())
> >>>> +			pr_err("Consider enabling 5-level paging\n");
> >> Could this happen at all when l5 is enabled?
> >> Does it mean we need kmap() for 64-bit?
> > It's future-proofing. Who knows what paging modes we'll have in the
> > future.
> 
> Future-proofing and firmware-proofing. :)
> 
> In any case, are we *really* limited to 52 bits of physical memory with
> 5-level paging?

Yes. It's architectural. SDM says "MAXPHYADDR is at most 52" (Vol 3A,
4.1.4).

I guess it could be extended with an opt-in feature and relevant changes
to the page table structure, but as of today there's no such thing.

> Previously, we said we were limited to 46 bits, and now
> we're saying that the limit is 52 with 5-level paging:
> 
> #define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
> 
> The 46 was fine with the 48 bits of address space on 4-level paging
> systems since we need 1/2 of the address space for userspace, 1/4 for
> the direct map and 1/4 for the vmalloc-and-friends area.  At 46 bits of
> address space, we fill up the direct map.
> 
> The hardware designers know this and never enumerated a MAXPHYADDR from
> CPUID which was higher than what we could cover with 46 bits.  It was
> nice and convenient that these two separate things matched:
> 1. The amount of physical address space addressable in a direct map
>    consuming 1/4 of the virtual address space.
> 2. The CPU-enumerated MAXPHYADDR which among other things dictates how
>    much physical address space is addressable in a PTE.
> 
> But, with 5-level paging, things are a little different.  The limit in
> addressable memory because of running out of the direct map actually
> happens at 55 bits (57-2=55, analogous to the 4-level 48-2=46).
> 
> So shouldn't it technically be this:
> 
> #define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 55 : 46)
> 
> ?

Bits above 52 are ignored in the page table entries and accessible to
software. Some of them got claimed by HW features (XD-bit, protection
keys), but such features require explicit opt-in on software side.

Kernel could claim bits 53-55 for the physical address, but it doesn't
get us anything: if future hardware provided such a feature, it would
require an opt-in anyway. On the other hand, claiming them now means we
cannot use them for other purposes as software bits. I don't see the
point.
Dave Hansen June 3, 2020, 7:18 p.m. UTC | #7
On 6/2/20 4:18 PM, Kirill A. Shutemov wrote:
> On Tue, May 26, 2020 at 07:27:15AM -0700, Dave Hansen wrote:
>> On 5/25/20 8:08 AM, Kirill A. Shutemov wrote:
>>>>>> +	if (not_addressable) {
>>>>>> +		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
>>>>>> +		       not_addressable >> 30);
>>>>>> +		if (!pgtable_l5_enabled())
>>>>>> +			pr_err("Consider enabling 5-level paging\n");
>>>> Could this happen at all when l5 is enabled?
>>>> Does it mean we need kmap() for 64-bit?
>>> It's future-proofing. Who knows what paging modes we'll have in the
>>> future.
>>
>> Future-proofing and firmware-proofing. :)
>>
>> In any case, are we *really* limited to 52 bits of physical memory with
>> 5-level paging?
> 
> Yes. It's architectural. SDM says "MAXPHYADDR is at most 52" (Vol 3A,
> 4.1.4).

Right you are.

I'm glad it's in the architecture.  Makes all of this a lot easier!

>> So shouldn't it technically be this:
>>
>> #define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 55 : 46)
>>
>> ?
> 
> Bits above 52 are ignored in the page table entries and accessible to
> software. Some of them got claimed by HW features (XD-bit, protection
> keys), but such features require explicit opt-in on software side.
> 
> Kernel could claim bits 53-55 for the physical address, but it doesn't
> get us anything: if future hardware provided such a feature, it would
> require an opt-in anyway. On the other hand, claiming them now means we
> cannot use them for other purposes as software bits. I don't see the
> point.

Yep, agreed.

Patch

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c5399e80c59c..d320d37d0f95 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1280,8 +1280,8 @@  void __init e820__memory_setup(void)
 
 void __init e820__memblock_setup(void)
 {
+	u64 size, end, not_addressable = 0;
 	int i;
-	u64 end;
 
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
@@ -1307,7 +1307,22 @@  void __init e820__memblock_setup(void)
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
 			continue;
 
-		memblock_add(entry->addr, entry->size);
+		if (entry->addr >= MAXMEM) {
+			not_addressable += entry->size;
+			continue;
+		}
+
+		end = min_t(u64, end, MAXMEM - 1);
+		size = end - entry->addr;
+		not_addressable += entry->size - size;
+		memblock_add(entry->addr, size);
+	}
+
+	if (not_addressable) {
+		pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
+		       not_addressable >> 30);
+		if (!pgtable_l5_enabled())
+			pr_err("Consider enabling 5-level paging\n");
 	}
 
 	/* Throw away partial pages: */