diff mbox series

[PATCHv3,1/3] x86/mm: Move LDT remap out of KASLR region on 5-level paging

Message ID 20181026122856.66224-2-kirill.shutemov@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Fix couple of issues with LDT remap for PTI | expand

Commit Message

Kirill A . Shutemov Oct. 26, 2018, 12:28 p.m. UTC
On 5-level paging LDT remap area is placed in the middle of
KASLR randomization region and it can overlap with direct mapping,
vmalloc or vmap area.

Let's move LDT just before direct mapping which makes it safe for KASLR.
This also allows us to unify layout between 4- and 5-level paging.

We don't touch 4 pgd slot gap just before the direct mapping reserved
for a hypervisor, but move direct mapping by one slot instead.

The LDT mapping is per-mm, so we cannot move it into P4D page table next
to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: f55f0501cbf6 ("x86/pti: Put the LDT in its own PGD if PTI is on")
---
 Documentation/x86/x86_64/mm.txt         | 34 +++++++++++++------------
 arch/x86/include/asm/page_64_types.h    | 12 +++++----
 arch/x86/include/asm/pgtable_64_types.h |  4 +--
 arch/x86/xen/mmu_pv.c                   |  6 ++---
 4 files changed, 29 insertions(+), 27 deletions(-)

Comments

Andy Lutomirski Nov. 2, 2018, 9:07 p.m. UTC | #1
On Fri, Oct 26, 2018 at 5:29 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On 5-level paging LDT remap area is placed in the middle of
> KASLR randomization region and it can overlap with direct mapping,
> vmalloc or vmap area.
>
> Let's move LDT just before direct mapping which makes it safe for KASLR.
> This also allows us to unify layout between 4- and 5-level paging.
>
> We don't touch 4 pgd slot gap just before the direct mapping reserved
> for a hypervisor, but move direct mapping by one slot instead.
>
> The LDT mapping is per-mm, so we cannot move it into P4D page table next
> to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
> paging.

Reviewed-by: Andy Lutomirski <luto@kernel.org>

(assuming it passes tests with 4-level and 5-level.  my test setup is
current busted, and i'm bisecting it.)
Baoquan He Nov. 10, 2018, 12:29 p.m. UTC | #2
On 10/26/18 at 03:28pm, Kirill A. Shutemov wrote:
> On 5-level paging LDT remap area is placed in the middle of
> KASLR randomization region and it can overlap with direct mapping,
> vmalloc or vmap area.
             ~~~
		We usually call it vmemmap.
> 
> Let's move LDT just before direct mapping which makes it safe for KASLR.
> This also allows us to unify layout between 4- and 5-level paging.
...

> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 702898633b00..75bff98928a8 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
>  ____________________________________________________________|___________________________________________________________
>                    |            |                  |         |
>   ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> - ffff880000000000 | -120    TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> - ffffc80000000000 |  -56    TB | ffffc8ffffffffff |    1 TB | ... unused hole
> + ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
> + ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> + ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole

Hi Kirill,

Thanks for this fix. One small concern is whether we can put LDT
remap in other place, e.g shrink KASAN area and save one pgd size for
it, Just from Redhat's enterprise relase point of view, we don't
enable CONFIG_KASAN, and LDT is rarely used for server, now cutting one
block from the direct mapping area and moving it up one pgd slot seems a
little too abrupt. Does KASAN really cost 16 TB in 4-level and 8 PB in
5-level? After all the direct mapping is the core mapping and has been
there always, LDT remap is kind of not so core and important mapping.
Just a very perceptual feeling.

Other than this, this patch looks good to me.

Thanks
Baoquan
Kirill A . Shutemov Nov. 23, 2018, 3:58 p.m. UTC | #3
On Sat, Nov 10, 2018 at 08:29:05PM +0800, Baoquan He wrote:
> > diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> > index 702898633b00..75bff98928a8 100644
> > --- a/Documentation/x86/x86_64/mm.txt
> > +++ b/Documentation/x86/x86_64/mm.txt
> > @@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
> >  ____________________________________________________________|___________________________________________________________
> >                    |            |                  |         |
> >   ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> > - ffff880000000000 | -120    TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> > - ffffc80000000000 |  -56    TB | ffffc8ffffffffff |    1 TB | ... unused hole
> > + ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
> > + ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> > + ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
> 
> Hi Kirill,
> 
> Thanks for this fix. One small concern is whether we can put LDT
> remap in other place, e.g shrink KASAN area and save one pgd size for
> it, Just from Redhat's enterprise relase point of view, we don't
> enable CONFIG_KASAN, and LDT is rarely used for server, now cutting one
> block from the direct mapping area and moving it up one pgd slot seems a
> little too abrupt. Does KASAN really cost 16 TB in 4-level and 8 PB in
> 5-level? After all the direct mapping is the core mapping and has been
> there always, LDT remap is kind of not so core and important mapping.
> Just a very perceptual feeling.

Sorry for late reply.

KASAN requires one byte of shadow memory per 8 bytes of target memory, so,
yeah, we need 16 TiB of virtual address space with 4-level paging.

With 5-level, we might save some address space as the limit for physical
address space if 52-bit, not 55. I dedicated 55-bit address space because
it was easier: just scale 4-level layout by factor of 9 and you'll get all
nicely aligned without much thought (PGD translates to PGD, etc).

There is also complication with KASAN layout. We have to have the same
KASAN_SHADOW_OFFSET between 4- and 5-level paging to make boot time
switching between paging modes work. The offset cannot be changed at
runtime: it used as parameter to compiler. That's the reason KASAN area
alignment looks strange.

A possibly better solution would be to actually include LDT in KASLR:
randomize the area along with direct mapping, vmalloc and vmemmap.
But it's more complexity than I found reasonable for a fix.

Do you want to try this? :)
Baoquan He Dec. 3, 2018, 3:01 a.m. UTC | #4
Hi Kirill,

On 11/23/18 at 06:58pm, Kirill A. Shutemov wrote:
> > Thanks for this fix. One small concern is whether we can put LDT
> > remap in other place, e.g shrink KASAN area and save one pgd size for
> > it, Just from Redhat's enterprise relase point of view, we don't
> > enable CONFIG_KASAN, and LDT is rarely used for server, now cutting one
> > block from the direct mapping area and moving it up one pgd slot seems a
> > little too abrupt. Does KASAN really cost 16 TB in 4-level and 8 PB in
> > 5-level? After all the direct mapping is the core mapping and has been
> > there always, LDT remap is kind of not so core and important mapping.
> > Just a very perceptual feeling.
> 
> KASAN requires one byte of shadow memory per 8 bytes of target memory, so,
> yeah, we need 16 TiB of virtual address space with 4-level paging.
> 
> With 5-level, we might save some address space as the limit for physical
> address space if 52-bit, not 55. I dedicated 55-bit address space because
> it was easier: just scale 4-level layout by factor of 9 and you'll get all
> nicely aligned without much thought (PGD translates to PGD, etc).
> 
> There is also complication with KASAN layout. We have to have the same
> KASAN_SHADOW_OFFSET between 4- and 5-level paging to make boot time
> switching between paging modes work. The offset cannot be changed at
> runtime: it used as parameter to compiler. That's the reason KASAN area
> alignment looks strange.

Thanks for explanation. KASAN area can't be touched as you said.

> 
> A possibly better solution would be to actually include LDT in KASLR:
> randomize the area along with direct mapping, vmalloc and vmemmap.
> But it's more complexity than I found reasonable for a fix.
> 
> Do you want to try this? :)

                                                           |
Seems the unused hole between vmemmap and KASAN can be used. e.g put LDT
remap in -20.5 TB place like below. And meanwhile 
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff888000000000 | -120    TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -56    TB | ffffc8ffffffffff |    1 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |  0.5 TB | ... unused hole
 ffffeb0000000000 |  -20.5  TB | ffffebffffffffff |  0.5 TB | LDT remap for PTI 
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________

In non-KASLR case, only 0.5 TB left as hole between vmemmap and LDT.
Meanwhile since LDT remap only costs 128 KB at most at the beginning,
the left area can be seen as guard hole between it and KASAN.

And yes, in KASLR case, we have to take it with the old three regions
together to randomize.

It looks do-able, not sure if the test case is complicated or not, if
not hard, I can have a try. And I have some internal bugs, can focus on
this later. I saw you posted another patchset to fix xen issue, it may
not be needed any more if we take this way?

And not sure if other people have different idea.

Thanks
Baoquan
Kirill A . Shutemov Dec. 3, 2018, 9:26 a.m. UTC | #5
On Mon, Dec 03, 2018 at 11:01:00AM +0800, Baoquan He wrote:
> It looks do-able, not sure if the test case is complicated or not, if
> not hard, I can have a try. And I have some internal bugs, can focus on
> this later. I saw you posted another patchset to fix xen issue, it may
> not be needed any more if we take this way?

Well, it depends on what is the first in the KALSR group. The fix will not
be needed if direct mapping comes the first.

But I would rather go with the patch anyway. The hypervisor hole is part
of ABI and we should not calculate it based on other movable entity
(direct mapping, LDT remap, whatever). It's too fragile.
diff mbox series

Patch

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 702898633b00..75bff98928a8 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -34,23 +34,24 @@  __________________|____________|__________________|_________|___________________
 ____________________________________________________________|___________________________________________________________
                   |            |                  |         |
  ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
- ffff880000000000 | -120    TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
- ffffc80000000000 |  -56    TB | ffffc8ffffffffff |    1 TB | ... unused hole
+ ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
+ ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
+ ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
  ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
  ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
  ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
  ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
  ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
- fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
-                  |            |                  |         | vaddr_end for KASLR
- fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
- fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | LDT remap for PTI
- ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
-                                                            | Identical layout to the 47-bit one from here on:
+                                                            | Identical layout to the 56-bit one from here on:
 ____________________________________________________________|____________________________________________________________
                   |            |                  |         |
+ fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
  ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
  ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
  ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
@@ -83,7 +84,7 @@  Notes:
 __________________|____________|__________________|_________|___________________________________________________________
                   |            |                  |         |
  0000800000000000 |  +64    PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
-                  |            |                  |         |     virtual memory addresses up to the -128 TB
+                  |            |                  |         |     virtual memory addresses up to the -64 PB
                   |            |                  |         |     starting offset of kernel mappings.
 __________________|____________|__________________|_________|___________________________________________________________
                                                             |
@@ -91,23 +92,24 @@  __________________|____________|__________________|_________|___________________
 ____________________________________________________________|___________________________________________________________
                   |            |                  |         |
  ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
- ff10000000000000 |  -60    PB | ff8fffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
- ff90000000000000 |  -28    PB | ff9fffffffffffff |    4 PB | LDT remap for PTI
+ ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+ ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
+ ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
  ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
  ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
  ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
  ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
  ffdf000000000000 |   -8.25 PB | fffffdffffffffff |   ~8 PB | KASAN shadow memory
- fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
-                  |            |                  |         | vaddr_end for KASLR
- fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
- fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
- ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
                                                             | Identical layout to the 47-bit one from here on:
 ____________________________________________________________|____________________________________________________________
                   |            |                  |         |
+ fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
  ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
  ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
  ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index cd0cf1c568b4..8f657286d599 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -33,12 +33,14 @@ 
 
 /*
  * Set __PAGE_OFFSET to the most negative possible address +
- * PGDIR_SIZE*16 (pgd slot 272).  The gap is to allow a space for a
- * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
- * what Xen requires.
+ * PGDIR_SIZE*17 (pgd slot 273).
+ *
+ * The gap is to allow a space for LDT remap for PTI (1 pgd slot) and space for
+ * a hypervisor (16 slots). Choosing 16 slots for a hypervisor is arbitrary,
+ * but it's what Xen requires.
  */
-#define __PAGE_OFFSET_BASE_L5	_AC(0xff10000000000000, UL)
-#define __PAGE_OFFSET_BASE_L4	_AC(0xffff880000000000, UL)
+#define __PAGE_OFFSET_BASE_L5	_AC(0xff11000000000000, UL)
+#define __PAGE_OFFSET_BASE_L4	_AC(0xffff888000000000, UL)
 
 #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
 #define __PAGE_OFFSET           page_offset_base
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..84bd9bdc1987 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -111,9 +111,7 @@  extern unsigned int ptrs_per_p4d;
  */
 #define MAXMEM			(1UL << MAX_PHYSMEM_BITS)
 
-#define LDT_PGD_ENTRY_L4	-3UL
-#define LDT_PGD_ENTRY_L5	-112UL
-#define LDT_PGD_ENTRY		(pgtable_l5_enabled() ? LDT_PGD_ENTRY_L5 : LDT_PGD_ENTRY_L4)
+#define LDT_PGD_ENTRY		-240UL
 #define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 70ea598a37d2..7a2a74c2dd30 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1905,7 +1905,7 @@  void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	init_top_pgt[0] = __pgd(0);
 
 	/* Pre-constructed entries are in pfn, so convert to mfn */
-	/* L4[272] -> level3_ident_pgt  */
+	/* L4[273] -> level3_ident_pgt  */
 	/* L4[511] -> level3_kernel_pgt */
 	convert_pfn_mfn(init_top_pgt);
 
@@ -1925,8 +1925,8 @@  void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	addr[0] = (unsigned long)pgd;
 	addr[1] = (unsigned long)l3;
 	addr[2] = (unsigned long)l2;
-	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
-	 * Both L4[272][0] and L4[511][510] have entries that point to the same
+	/* Graft it onto L4[273][0]. Note that we creating an aliasing problem:
+	 * Both L4[273][0] and L4[511][510] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you