Message ID | 20181026122856.66224-2-kirill.shutemov@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Fix couple of issues with LDT remap for PTI |
On Fri, Oct 26, 2018 at 5:29 AM Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
>
> On 5-level paging LDT remap area is placed in the middle of
> KASLR randomization region and it can overlap with direct mapping,
> vmalloc or vmap area.
>
> Let's move LDT just before direct mapping which makes it safe for KASLR.
> This also allows us to unify layout between 4- and 5-level paging.
>
> We don't touch 4 pgd slot gap just before the direct mapping reserved
> for a hypervisor, but move direct mapping by one slot instead.
>
> The LDT mapping is per-mm, so we cannot move it into P4D page table next
> to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
> paging.

Reviewed-by: Andy Lutomirski <luto@kernel.org>

(Assuming it passes tests with 4-level and 5-level. My test setup is
currently busted, and I'm bisecting it.)

On 10/26/18 at 03:28pm, Kirill A. Shutemov wrote:
> On 5-level paging LDT remap area is placed in the middle of
> KASLR randomization region and it can overlap with direct mapping,
> vmalloc or vmap area.
             ~~~~ We usually call it vmemmap.
>
> Let's move LDT just before direct mapping which makes it safe for KASLR.
> This also allows us to unify layout between 4- and 5-level paging.
...
> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 702898633b00..75bff98928a8 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
>  ____________________________________________________________|___________________________________________________________
>                    |            |                  |         |
>   ffff800000000000 |    -128 TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> - ffff880000000000 |    -120 TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> - ffffc80000000000 |     -56 TB | ffffc8ffffffffff |    1 TB | ... unused hole
> + ffff880000000000 |    -120 TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
> + ffff888000000000 |  -119.5 TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> + ffffc88000000000 |   -55.5 TB | ffffc8ffffffffff |  0.5 TB | ... unused hole

Hi Kirill,

Thanks for this fix. One small concern is whether we can put the LDT
remap in another place, e.g. shrink the KASAN area and save one pgd slot
for it. Just from the point of view of Red Hat's enterprise release, we
don't enable CONFIG_KASAN, and LDT is rarely used on servers, so cutting
one block from the direct mapping area and moving it up one pgd slot
seems a little too abrupt. Does KASAN really cost 16 TB in 4-level and
8 PB in 5-level? After all, the direct mapping is the core mapping and
has always been there, while the LDT remap is not such a core or
important mapping. Just a gut feeling.

Other than this, this patch looks good to me.

Thanks
Baoquan

On Sat, Nov 10, 2018 at 08:29:05PM +0800, Baoquan He wrote:
> > diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> > index 702898633b00..75bff98928a8 100644
> > --- a/Documentation/x86/x86_64/mm.txt
> > +++ b/Documentation/x86/x86_64/mm.txt
> > @@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
> >  ____________________________________________________________|___________________________________________________________
> >                    |            |                  |         |
> >   ffff800000000000 |    -128 TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> > - ffff880000000000 |    -120 TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> > - ffffc80000000000 |     -56 TB | ffffc8ffffffffff |    1 TB | ... unused hole
> > + ffff880000000000 |    -120 TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
> > + ffff888000000000 |  -119.5 TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> > + ffffc88000000000 |   -55.5 TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
>
> Hi Kirill,
>
> Thanks for this fix. One small concern is whether we can put the LDT
> remap in another place, e.g. shrink the KASAN area and save one pgd slot
> for it. Just from the point of view of Red Hat's enterprise release, we
> don't enable CONFIG_KASAN, and LDT is rarely used on servers, so cutting
> one block from the direct mapping area and moving it up one pgd slot
> seems a little too abrupt. Does KASAN really cost 16 TB in 4-level and
> 8 PB in 5-level? After all, the direct mapping is the core mapping and
> has always been there, while the LDT remap is not such a core or
> important mapping. Just a gut feeling.

Sorry for the late reply.

KASAN requires one byte of shadow memory per 8 bytes of target memory, so,
yeah, we need 16 TiB of virtual address space with 4-level paging.

With 5-level, we might save some address space, as the limit for the
physical address space is 52-bit, not 55. I dedicated a 55-bit address
space because it was easier: just scale the 4-level layout by 9 bits
(a factor of 512) and you'll get everything nicely aligned without much
thought (PGD translates to PGD, etc).

There is also a complication with the KASAN layout. We have to have the
same KASAN_SHADOW_OFFSET between 4- and 5-level paging to make boot-time
switching between paging modes work. The offset cannot be changed at
runtime: it is used as a parameter to the compiler. That's the reason the
KASAN area alignment looks strange.

A possibly better solution would be to actually include LDT in KASLR:
randomize the area along with direct mapping, vmalloc and vmemmap.
But it's more complexity than I found reasonable for a fix.

Do you want to try this? :)

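As a quick sanity check of the figures in this exchange, the sketch below recomputes the KASAN shadow size from the 1:8 ratio quoted above. It is a standalone userspace illustration, not kernel code. The helper names are hypothetical, and the kernel-half sizes (2^47 bytes with 4-level paging, 2^56 bytes with 5-level paging) are assumptions taken from the -128 TB and -64 PB starting offsets in the mm.txt tables.

```c
#include <assert.h>
#include <stdint.h>

/* One shadow byte covers 8 bytes of memory, i.e. a right shift by 3.
 * (Hypothetical name for this sketch.) */
#define SHADOW_SCALE_SHIFT 3

/* Shadow bytes needed to cover a kernel half of 2^kernel_va_bits bytes. */
static uint64_t kasan_shadow_bytes(unsigned int kernel_va_bits)
{
	return (UINT64_C(1) << kernel_va_bits) >> SHADOW_SCALE_SHIFT;
}

/* Assumed kernel-half sizes: 128 TB (4-level) and 64 PB (5-level). */
static void check_kasan_shadow_sizes(void)
{
	const uint64_t tb = UINT64_C(1) << 40;
	const uint64_t pb = UINT64_C(1) << 50;

	assert(kasan_shadow_bytes(47) == 16 * tb);	/* 4-level: 16 TB */
	assert(kasan_shadow_bytes(56) == 8 * pb);	/* 5-level: 8 PB */
}
```

Under those assumptions the results match the 16 TB and 8 PB that the question asks about, so the shadow area cannot shrink without changing the ratio or the covered range.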
Hi Kirill,

On 11/23/18 at 06:58pm, Kirill A. Shutemov wrote:
> > Thanks for this fix. One small concern is whether we can put the LDT
> > remap in another place, e.g. shrink the KASAN area and save one pgd slot
> > for it. Just from the point of view of Red Hat's enterprise release, we
> > don't enable CONFIG_KASAN, and LDT is rarely used on servers, so cutting
> > one block from the direct mapping area and moving it up one pgd slot
> > seems a little too abrupt. Does KASAN really cost 16 TB in 4-level and
> > 8 PB in 5-level? After all, the direct mapping is the core mapping and
> > has always been there, while the LDT remap is not such a core or
> > important mapping. Just a gut feeling.
>
> KASAN requires one byte of shadow memory per 8 bytes of target memory, so,
> yeah, we need 16 TiB of virtual address space with 4-level paging.
>
> With 5-level, we might save some address space, as the limit for the
> physical address space is 52-bit, not 55. I dedicated a 55-bit address
> space because it was easier: just scale the 4-level layout by 9 bits
> (a factor of 512) and you'll get everything nicely aligned without much
> thought (PGD translates to PGD, etc).
>
> There is also a complication with the KASAN layout. We have to have the
> same KASAN_SHADOW_OFFSET between 4- and 5-level paging to make boot-time
> switching between paging modes work. The offset cannot be changed at
> runtime: it is used as a parameter to the compiler. That's the reason the
> KASAN area alignment looks strange.

Thanks for the explanation. The KASAN area can't be touched, as you said.

>
> A possibly better solution would be to actually include LDT in KASLR:
> randomize the area along with direct mapping, vmalloc and vmemmap.
> But it's more complexity than I found reasonable for a fix.
>
> Do you want to try this? :)

It seems the unused hole between vmemmap and KASAN can be used, e.g. put
the LDT remap at -20.5 TB, like below:

____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 |    -128 TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 |    -120 TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc80000000000 |     -56 TB | ffffc8ffffffffff |    1 TB | ... unused hole
 ffffc90000000000 |     -55 TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |     -23 TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |     -22 TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |     -21 TB | ffffeb7fffffffff |  0.5 TB | ... unused hole
 ffffeb8000000000 |   -20.5 TB | ffffebffffffffff |  0.5 TB | LDT remap for PTI
 ffffec0000000000 |     -20 TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________

In the non-KASLR case, only 0.5 TB is left as a hole between vmemmap and
LDT. Meanwhile, since the LDT remap only costs 128 KB at most at the
beginning, the rest of the area can be seen as a guard hole between it
and KASAN. And yes, in the KASLR case, we have to randomize it together
with the old three regions.

It looks do-able, not sure if the test case is complicated or not; if
not hard, I can have a try. I have some internal bugs and can focus on
this later. I saw you posted another patchset to fix the Xen issue; it
may not be needed any more if we take this way?

And not sure if other people have a different idea.

Thanks
Baoquan

On Mon, Dec 03, 2018 at 11:01:00AM +0800, Baoquan He wrote:
> It looks do-able, not sure if the test case is complicated or not; if
> not hard, I can have a try. I have some internal bugs and can focus on
> this later. I saw you posted another patchset to fix the Xen issue; it
> may not be needed any more if we take this way?

Well, it depends on what comes first in the KASLR group. The fix will
not be needed if the direct mapping comes first.

But I would rather go with the patch anyway. The hypervisor hole is part
of the ABI and we should not calculate it based on another movable entity
(direct mapping, LDT remap, whatever). It's too fragile.

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 702898633b00..75bff98928a8 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
 ____________________________________________________________|___________________________________________________________
                   |            |                  |         |
  ffff800000000000 |    -128 TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
- ffff880000000000 |    -120 TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
- ffffc80000000000 |     -56 TB | ffffc8ffffffffff |    1 TB | ... unused hole
+ ffff880000000000 |    -120 TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
+ ffff888000000000 |  -119.5 TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
+ ffffc88000000000 |   -55.5 TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
  ffffc90000000000 |     -55 TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
  ffffe90000000000 |     -23 TB | ffffe9ffffffffff |    1 TB | ... unused hole
  ffffea0000000000 |     -22 TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
  ffffeb0000000000 |     -21 TB | ffffebffffffffff |    1 TB | ... unused hole
  ffffec0000000000 |     -20 TB | fffffbffffffffff |   16 TB | KASAN shadow memory
- fffffc0000000000 |      -4 TB | fffffdffffffffff |    2 TB | ... unused hole
-                  |            |                  |         | vaddr_end for KASLR
- fffffe0000000000 |      -2 TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
- fffffe8000000000 |    -1.5 TB | fffffeffffffffff |  0.5 TB | LDT remap for PTI
- ffffff0000000000 |      -1 TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
-                                                            | Identical layout to the 47-bit one from here on:
+                                                            | Identical layout to the 56-bit one from here on:
 ____________________________________________________________|____________________________________________________________
                   |            |                  |         |
+ fffffc0000000000 |      -4 TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |      -2 TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |    -1.5 TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |      -1 TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
  ffffff8000000000 |    -512 GB | ffffffeeffffffff |  444 GB | ... unused hole
  ffffffef00000000 |     -68 GB | fffffffeffffffff |   64 GB | EFI region mapping space
  ffffffff00000000 |      -4 GB | ffffffff7fffffff |    2 GB | ... unused hole
@@ -83,7 +84,7 @@ Notes:
 __________________|____________|__________________|_________|___________________________________________________________
                   |            |                  |         |
  0000800000000000 |     +64 PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
-                  |            |                  |         |     virtual memory addresses up to the -128 TB
+                  |            |                  |         |     virtual memory addresses up to the -64 PB
                   |            |                  |         |     starting offset of kernel mappings.
 __________________|____________|__________________|_________|___________________________________________________________
                                                             |
@@ -91,23 +92,24 @@ __________________|____________|__________________|_________|___________________
 ____________________________________________________________|___________________________________________________________
                   |            |                  |         |
  ff00000000000000 |     -64 PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
- ff10000000000000 |     -60 PB | ff8fffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
- ff90000000000000 |     -28 PB | ff9fffffffffffff |    4 PB | LDT remap for PTI
+ ff10000000000000 |     -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+ ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
+ ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
  ffa0000000000000 |     -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
  ffd2000000000000 |   -11.5 PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
  ffd4000000000000 |     -11 PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
  ffd6000000000000 |   -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
  ffdf000000000000 |   -8.25 PB | fffffdffffffffff |   ~8 PB | KASAN shadow memory
- fffffc0000000000 |      -4 TB | fffffdffffffffff |    2 TB | ... unused hole
-                  |            |                  |         | vaddr_end for KASLR
- fffffe0000000000 |      -2 TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
- fffffe8000000000 |    -1.5 TB | fffffeffffffffff |  0.5 TB | ... unused hole
- ffffff0000000000 |      -1 TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
                                                             | Identical layout to the 47-bit one from here on:
 ____________________________________________________________|____________________________________________________________
                   |            |                  |         |
+ fffffc0000000000 |      -4 TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |      -2 TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |    -1.5 TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |      -1 TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
  ffffff8000000000 |    -512 GB | ffffffeeffffffff |  444 GB | ... unused hole
  ffffffef00000000 |     -68 GB | fffffffeffffffff |   64 GB | EFI region mapping space
  ffffffff00000000 |      -4 GB | ffffffff7fffffff |    2 GB | ... unused hole
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index cd0cf1c568b4..8f657286d599 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -33,12 +33,14 @@
 
 /*
  * Set __PAGE_OFFSET to the most negative possible address +
- * PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a
- * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's
- * what Xen requires.
+ * PGDIR_SIZE*17 (pgd slot 273).
+ *
+ * The gap is to allow a space for LDT remap for PTI (1 pgd slot) and space for
+ * a hypervisor (16 slots). Choosing 16 slots for a hypervisor is arbitrary,
+ * but it's what Xen requires.
  */
-#define __PAGE_OFFSET_BASE_L5	_AC(0xff10000000000000, UL)
-#define __PAGE_OFFSET_BASE_L4	_AC(0xffff880000000000, UL)
+#define __PAGE_OFFSET_BASE_L5	_AC(0xff11000000000000, UL)
+#define __PAGE_OFFSET_BASE_L4	_AC(0xffff888000000000, UL)
 
 #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
 #define __PAGE_OFFSET		page_offset_base
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..84bd9bdc1987 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -111,9 +111,7 @@ extern unsigned int ptrs_per_p4d;
  */
 #define MAXMEM			(1UL << MAX_PHYSMEM_BITS)
 
-#define LDT_PGD_ENTRY_L4	-3UL
-#define LDT_PGD_ENTRY_L5	-112UL
-#define LDT_PGD_ENTRY		(pgtable_l5_enabled() ? LDT_PGD_ENTRY_L5 : LDT_PGD_ENTRY_L4)
+#define LDT_PGD_ENTRY		-240UL
 #define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 70ea598a37d2..7a2a74c2dd30 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1905,7 +1905,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	init_top_pgt[0] = __pgd(0);
 
 	/* Pre-constructed entries are in pfn, so convert to mfn */
-	/* L4[272] -> level3_ident_pgt */
+	/* L4[273] -> level3_ident_pgt */
 	/* L4[511] -> level3_kernel_pgt */
 	convert_pfn_mfn(init_top_pgt);
 
@@ -1925,8 +1925,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	addr[0] = (unsigned long)pgd;
 	addr[1] = (unsigned long)l3;
 	addr[2] = (unsigned long)l2;
-	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
-	 * Both L4[272][0] and L4[511][510] have entries that point to the same
+	/* Graft it onto L4[273][0]. Note that we creating an aliasing problem:
+	 * Both L4[273][0] and L4[511][510] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you

On 5-level paging LDT remap area is placed in the middle of
KASLR randomization region and it can overlap with direct mapping,
vmalloc or vmap area.

Let's move LDT just before direct mapping which makes it safe for KASLR.
This also allows us to unify layout between 4- and 5-level paging.

We don't touch 4 pgd slot gap just before the direct mapping reserved
for a hypervisor, but move direct mapping by one slot instead.

The LDT mapping is per-mm, so we cannot move it into P4D page table next
to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: f55f0501cbf6 ("x86/pti: Put the LDT in its own PGD if PTI is on")
---
 Documentation/x86/x86_64/mm.txt         | 34 +++++++++++++------------
 arch/x86/include/asm/page_64_types.h    | 12 +++++----
 arch/x86/include/asm/pgtable_64_types.h |  4 +--
 arch/x86/xen/mmu_pv.c                   |  6 ++---
 4 files changed, 29 insertions(+), 27 deletions(-)
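
The slot arithmetic behind the new constants can be checked with a small standalone sketch. It is a userspace illustration only, not kernel code: the helper names are hypothetical, and the PGDIR_SHIFT values of 39 (4-level) and 48 (5-level) with 512 entries per PGD are assumptions consistent with the 0.5 TB and 0.25 PB slot sizes shown in mm.txt.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for this sketch: one pgd slot covers 2^39 bytes with
 * 4-level paging and 2^48 bytes with 5-level paging. */
#define SKETCH_PGDIR_SHIFT_L4	39
#define SKETCH_PGDIR_SHIFT_L5	48
#define SKETCH_PTRS_PER_PGD	512

/* Mirrors (LDT_PGD_ENTRY << PGDIR_SHIFT): a negative entry counts pgd
 * slots down from the top of the 64-bit address space. */
static uint64_t pgd_entry_to_addr(int64_t entry, unsigned int pgdir_shift)
{
	return (uint64_t)entry << pgdir_shift;
}

/* Index of the top-level page-table slot that maps addr. */
static unsigned int addr_to_pgd_index(uint64_t addr, unsigned int pgdir_shift)
{
	return (unsigned int)((addr >> pgdir_shift) & (SKETCH_PTRS_PER_PGD - 1));
}

static void check_ldt_layout(void)
{
	/* -240 lands on the LDT remap base in both paging modes. */
	assert(pgd_entry_to_addr(-240, SKETCH_PGDIR_SHIFT_L4) ==
	       UINT64_C(0xffff880000000000));
	assert(pgd_entry_to_addr(-240, SKETCH_PGDIR_SHIFT_L5) ==
	       UINT64_C(0xff10000000000000));

	/* LDT remap sits in slot 272; the direct mapping now starts in 273. */
	assert(addr_to_pgd_index(UINT64_C(0xffff880000000000),
				 SKETCH_PGDIR_SHIFT_L4) == 272);
	assert(addr_to_pgd_index(UINT64_C(0xffff888000000000),
				 SKETCH_PGDIR_SHIFT_L4) == 273);
	assert(addr_to_pgd_index(UINT64_C(0xff11000000000000),
				 SKETCH_PGDIR_SHIFT_L5) == 273);
}
```

Because the same entry number (-240) resolves to the slot right after the hypervisor hole in both modes, a single LDT_PGD_ENTRY value is enough and the pgtable_l5_enabled() selection can go away.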