Message ID: 20170907173609.22696-4-tycho@docker.com (mailing list archive)
State: New, archived
> -----Original Message----- > From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On > Behalf Of Tycho Andersen > Sent: Thursday, September 7, 2017 10:36 AM > To: linux-kernel@vger.kernel.org > Cc: linux-mm@kvack.org; kernel-hardening@lists.openwall.com; Marco Benatto > <marco.antonio.780@gmail.com>; Juerg Haefliger > <juerg.haefliger@canonical.com>; x86@kernel.org; Tycho Andersen > <tycho@docker.com> > Subject: [PATCH v6 03/11] mm, x86: Add support for eXclusive Page Frame > Ownership (XPFO) > > From: Juerg Haefliger <juerg.haefliger@canonical.com> > > This patch adds support for XPFO which protects against 'ret2dir' kernel attacks. > The basic idea is to enforce exclusive ownership of page frames by either the > kernel or userspace, unless explicitly requested by the kernel. Whenever a page > destined for userspace is allocated, it is unmapped from physmap (the kernel's > page table). When such a page is reclaimed from userspace, it is mapped back to > physmap. > > Additional fields in the page_ext struct are used for XPFO housekeeping, > specifically: > - two flags to distinguish user vs. kernel pages and to tag unmapped > pages. > - a reference counter to balance kmap/kunmap operations. > - a lock to serialize access to the XPFO fields. > > This patch is based on the work of Vasileios P. Kemerlis et al. who published their > work in this paper: > http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf > > v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush > the tlb entry on all CPUs when unmapping it in kunmap > * handle lookup_page_ext()/lookup_xpfo() returning NULL > * drop lots of BUG()s in favor of WARN() > * don't disable irqs in xpfo_kmap/xpfo_kunmap, export > __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to > pass it > > CC: x86@kernel.org > Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu> > Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com> > Signed-off-by: Tycho Andersen <tycho@docker.com> > Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com> > --- > Documentation/admin-guide/kernel-parameters.txt | 2 + > arch/x86/Kconfig | 1 + > arch/x86/include/asm/pgtable.h | 25 +++ > arch/x86/mm/Makefile | 1 + > arch/x86/mm/pageattr.c | 22 +-- > arch/x86/mm/xpfo.c | 114 ++++++++++++ > include/linux/highmem.h | 15 +- > include/linux/xpfo.h | 42 +++++ > mm/Makefile | 1 + > mm/page_alloc.c | 2 + > mm/page_ext.c | 4 + > mm/xpfo.c | 222 ++++++++++++++++++++++++ > security/Kconfig | 19 ++ > 13 files changed, 449 insertions(+), 21 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index d9c171ce4190..444d83183f75 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -2736,6 +2736,8 @@ > > nox2apic [X86-64,APIC] Do not enable x2APIC mode. > > + noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on. > + > cpu0_hotplug [X86] Turn on CPU0 hotplug feature when > CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. > Some features depend on CPU0. Known dependencies <... snip> A bit more description for system administrators would be very useful. Perhaps something like: noxpfo [XPFO,X86-64] Disable eXclusive Page Frame Ownership (XPFO) Physical pages mapped into user applications will also be mapped in the kernel's address space as if CONFIG_XPFO was not enabled. Patch 05 should also update kernel-parameters.txt and add "ARM64" to the config option list for noxpfo.
On Thu, Sep 07, 2017 at 06:33:09PM +0000, Ralph Campbell wrote: > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -2736,6 +2736,8 @@ > > > > nox2apic [X86-64,APIC] Do not enable x2APIC mode. > > > > + noxpfo [X86-64] Disable XPFO when CONFIG_XPFO is on. > > + > > cpu0_hotplug [X86] Turn on CPU0 hotplug feature when > > CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off. > > Some features depend on CPU0. Known dependencies > <... snip> > > A bit more description for system administrators would be very useful. > Perhaps something like: > > noxpfo [XPFO,X86-64] Disable eXclusive Page Frame Ownership (XPFO) > Physical pages mapped into user applications will also be mapped > in the kernel's address space as if CONFIG_XPFO was not enabled. > > Patch 05 should also update kernel-parameters.txt and add "ARM64" to the config option list for noxpfo. Nice catch, thanks. I'll fix both. Cheers, Tycho
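For context, a boot parameter like noxpfo is normally wired up with early_param() and gates the xpfo_inited static key mentioned later in this thread. A minimal sketch follows; the real handler lives in mm/xpfo.c, which is not quoted in this excerpt, so the function names and the exact init hook are assumptions.

#include <linux/init.h>
#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(xpfo_inited);

static bool xpfo_disabled __initdata;

/* "noxpfo" on the command line simply prevents the key from ever being enabled */
static int __init noxpfo_param(char *str)
{
	xpfo_disabled = true;
	return 0;
}
early_param("noxpfo", noxpfo_param);

/* illustrative init hook; the patch does this from its page_ext setup path */
void __init xpfo_init(void)
{
	if (!xpfo_disabled)
		static_branch_enable(&xpfo_inited);
}

Since every xpfo_* hook starts with static_branch_unlikely(&xpfo_inited) and bails out when the key is off, booting with noxpfo behaves as if CONFIG_XPFO were not enabled, which is exactly what the suggested documentation wording describes.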
I think this patch needs to be split into the generic mm code, and the x86 arch code at least. > +/* > + * The current flushing context - we pass it instead of 5 arguments: > + */ > +struct cpa_data { > + unsigned long *vaddr; > + pgd_t *pgd; > + pgprot_t mask_set; > + pgprot_t mask_clr; > + unsigned long numpages; > + int flags; > + unsigned long pfn; > + unsigned force_split : 1; > + int curpage; > + struct page **pages; > +}; Fitting these 10 variables into 5 arguments would require an awesome compression scheme anyway :) > + if (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0) Overly long line. > +#include <linux/xpfo.h> > > #include <asm/cacheflush.h> > > @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr) > #ifndef ARCH_HAS_KMAP > static inline void *kmap(struct page *page) > { > + void *kaddr; > + > might_sleep(); > - return page_address(page); > + kaddr = page_address(page); > + xpfo_kmap(kaddr, page); > + return kaddr; > } > > static inline void kunmap(struct page *page) > { > + xpfo_kunmap(page_address(page), page); > } > > static inline void *kmap_atomic(struct page *page) > { > + void *kaddr; > + > preempt_disable(); > pagefault_disable(); > - return page_address(page); > + kaddr = page_address(page); > + xpfo_kmap(kaddr, page); > + return kaddr; > } It seems to me like we should simply direct to pure xpfo implementations for the !HIGHMEM && XPFO case. - that is just have the prototypes for kmap, kunmap and co in linux/highmem.h and implement them in xpfo under those names. Instead of sprinkling them around. > +DEFINE_STATIC_KEY_FALSE(xpfo_inited); s/inited/initialized/g ? > + bool "Enable eXclusive Page Frame Ownership (XPFO)" > + default n default n is the default, so you can remove this line.
On Fri, Sep 08, 2017 at 12:51:40AM -0700, Christoph Hellwig wrote: > > +#include <linux/xpfo.h> > > > > #include <asm/cacheflush.h> > > > > @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr) > > #ifndef ARCH_HAS_KMAP > > static inline void *kmap(struct page *page) > > { > > + void *kaddr; > > + > > might_sleep(); > > - return page_address(page); > > + kaddr = page_address(page); > > + xpfo_kmap(kaddr, page); > > + return kaddr; > > } > > > > static inline void kunmap(struct page *page) > > { > > + xpfo_kunmap(page_address(page), page); > > } > > > > static inline void *kmap_atomic(struct page *page) > > { > > + void *kaddr; > > + > > preempt_disable(); > > pagefault_disable(); > > - return page_address(page); > > + kaddr = page_address(page); > > + xpfo_kmap(kaddr, page); > > + return kaddr; > > } > > It seems to me like we should simply direct to pure xpfo > implementations for the !HIGHMEM && XPFO case. - that is > just have the prototypes for kmap, kunmap and co in > linux/highmem.h and implement them in xpfo under those names. > > Instead of sprinkling them around. Ok, IIUC we'll still need a #ifdef CONFIG_XPFO in this file, but at least the implementations here won't have a diff. I'll make this change, and all the others you've suggested. Thanks! Tycho
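Concretely, the restructuring suggested above could look roughly like this for the !HIGHMEM case: prototypes stay in linux/highmem.h under CONFIG_XPFO and the bodies move to mm/xpfo.c. This is a sketch of the suggestion, not the actual v7 diff.

#ifdef CONFIG_XPFO
/* Implemented in mm/xpfo.c; they wrap page_address() with xpfo_kmap()/xpfo_kunmap(). */
void *kmap(struct page *page);
void kunmap(struct page *page);
void *kmap_atomic(struct page *page);
#else
static inline void *kmap(struct page *page)
{
	might_sleep();
	return page_address(page);
}

static inline void kunmap(struct page *page)
{
}

static inline void *kmap_atomic(struct page *page)
{
	preempt_disable();
	pagefault_disable();
	return page_address(page);
}
#endif

As Tycho notes, the #ifdef CONFIG_XPFO still lives in this header, but the function bodies no longer carry a diff.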
On 09/07/2017 10:36 AM, Tycho Andersen wrote: > +static inline struct xpfo *lookup_xpfo(struct page *page) > +{ > + struct page_ext *page_ext = lookup_page_ext(page); > + > + if (unlikely(!page_ext)) { > + WARN(1, "xpfo: failed to get page ext"); > + return NULL; > + } > + > + return (void *)page_ext + page_xpfo_ops.offset; > +} > + Just drop the WARN. On my arm64 UEFI machine this spews warnings under most normal operation. This should be normal for some situations but I haven't had the time to dig into why this is so pronounced on arm64. Thanks, Laura
Hi Tycho, On 2017/9/8 1:36, Tycho Andersen wrote: > From: Juerg Haefliger <juerg.haefliger@canonical.com> > > This patch adds support for XPFO which protects against 'ret2dir' kernel > attacks. The basic idea is to enforce exclusive ownership of page frames > by either the kernel or userspace, unless explicitly requested by the > kernel. Whenever a page destined for userspace is allocated, it is > unmapped from physmap (the kernel's page table). When such a page is > reclaimed from userspace, it is mapped back to physmap. > > Additional fields in the page_ext struct are used for XPFO housekeeping, > specifically: > - two flags to distinguish user vs. kernel pages and to tag unmapped > pages. > - a reference counter to balance kmap/kunmap operations. > - a lock to serialize access to the XPFO fields. > > This patch is based on the work of Vasileios P. Kemerlis et al. who > published their work in this paper: > http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf > > [...] > +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) > +{ > + int i, flush_tlb = 0; > + struct xpfo *xpfo; > + > + if (!static_branch_unlikely(&xpfo_inited)) > + return; > + > + for (i = 0; i < (1 << order); i++) { > + xpfo = lookup_xpfo(page + i); > + if (!xpfo) > + continue; > + > + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags), > + "xpfo: unmapped page being allocated\n"); > + > + /* Initialize the map lock and map counter */ > + if (unlikely(!xpfo->inited)) { > + spin_lock_init(&xpfo->maplock); > + atomic_set(&xpfo->mapcount, 0); > + xpfo->inited = true; > + } > + WARN(atomic_read(&xpfo->mapcount), > + "xpfo: already mapped page being allocated\n"); > + > + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { > + /* > + * Tag the page as a user page and flush the TLB if it > + * was previously allocated to the kernel. > + */ > + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) > + flush_tlb = 1; I'm not sure whether I am miss anything, however, when the page was previously allocated to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate the page to user now Yisheng Xie Thanks
Hi Yisheng, On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote: > > +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) > > +{ > > + int i, flush_tlb = 0; > > + struct xpfo *xpfo; > > + > > + if (!static_branch_unlikely(&xpfo_inited)) > > + return; > > + > > + for (i = 0; i < (1 << order); i++) { > > + xpfo = lookup_xpfo(page + i); > > + if (!xpfo) > > + continue; > > + > > + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags), > > + "xpfo: unmapped page being allocated\n"); > > + > > + /* Initialize the map lock and map counter */ > > + if (unlikely(!xpfo->inited)) { > > + spin_lock_init(&xpfo->maplock); > > + atomic_set(&xpfo->mapcount, 0); > > + xpfo->inited = true; > > + } > > + WARN(atomic_read(&xpfo->mapcount), > > + "xpfo: already mapped page being allocated\n"); > > + > > + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { > > + /* > > + * Tag the page as a user page and flush the TLB if it > > + * was previously allocated to the kernel. > > + */ > > + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) > > + flush_tlb = 1; > > I'm not sure whether I am miss anything, however, when the page was previously allocated > to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate > the page to user now Yes, I think you're right. Oddly, the XPFO_READ_USER test works correctly for me, but I think (?) should not because of this bug... Tycho
On Sat, Sep 09, 2017 at 08:35:17AM -0700, Laura Abbott wrote: > On 09/07/2017 10:36 AM, Tycho Andersen wrote: > > +static inline struct xpfo *lookup_xpfo(struct page *page) > > +{ > > + struct page_ext *page_ext = lookup_page_ext(page); > > + > > + if (unlikely(!page_ext)) { > > + WARN(1, "xpfo: failed to get page ext"); > > + return NULL; > > + } > > + > > + return (void *)page_ext + page_xpfo_ops.offset; > > +} > > + > > Just drop the WARN. On my arm64 UEFI machine this spews warnings > under most normal operation. This should be normal for some > situations but I haven't had the time to dig into why this > is so pronounced on arm64. Will do, thanks! If you figure out under what conditions it's normal, I'd be curious :) Tycho
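For reference, dropping the WARN leaves the helper as below; callers already treat a NULL return as "no XPFO state for this page".

static inline struct xpfo *lookup_xpfo(struct page *page)
{
	struct page_ext *page_ext = lookup_page_ext(page);

	if (unlikely(!page_ext))
		return NULL;

	return (void *)page_ext + page_xpfo_ops.offset;
}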
On 09/11/2017 04:50 PM, Tycho Andersen wrote: > Hi Yisheng, > > On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote: >>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) >>> +{ >>> + int i, flush_tlb = 0; >>> + struct xpfo *xpfo; >>> + >>> + if (!static_branch_unlikely(&xpfo_inited)) >>> + return; >>> + >>> + for (i = 0; i < (1 << order); i++) { >>> + xpfo = lookup_xpfo(page + i); >>> + if (!xpfo) >>> + continue; >>> + >>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags), >>> + "xpfo: unmapped page being allocated\n"); >>> + >>> + /* Initialize the map lock and map counter */ >>> + if (unlikely(!xpfo->inited)) { >>> + spin_lock_init(&xpfo->maplock); >>> + atomic_set(&xpfo->mapcount, 0); >>> + xpfo->inited = true; >>> + } >>> + WARN(atomic_read(&xpfo->mapcount), >>> + "xpfo: already mapped page being allocated\n"); >>> + >>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { >>> + /* >>> + * Tag the page as a user page and flush the TLB if it >>> + * was previously allocated to the kernel. >>> + */ >>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) >>> + flush_tlb = 1; >> >> I'm not sure whether I am miss anything, however, when the page was previously allocated >> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate >> the page to user now >> > Yes, I think you're right. Oddly, the XPFO_READ_USER test works > correctly for me, but I think (?) should not because of this bug... IIRC, this is an optimization carried forward from the initial implementation. The assumption is that the kernel will map the user buffer so it's not unmapped on allocation but only on the first (and subsequent) call of kunmap. I.e.: - alloc -> noop - kmap -> noop - kunmap -> unmapped from the kernel - kmap -> mapped into the kernel - kunmap -> unmapped from the kernel and so on until: - free -> mapped back into the kernel I'm not sure if that make sense though since it leaves a window. ...Juerg > Tycho >
On Mon, Sep 11, 2017 at 06:03:55PM +0200, Juerg Haefliger wrote: > > > On 09/11/2017 04:50 PM, Tycho Andersen wrote: > > Hi Yisheng, > > > > On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote: > >>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) > >>> +{ > >>> + int i, flush_tlb = 0; > >>> + struct xpfo *xpfo; > >>> + > >>> + if (!static_branch_unlikely(&xpfo_inited)) > >>> + return; > >>> + > >>> + for (i = 0; i < (1 << order); i++) { > >>> + xpfo = lookup_xpfo(page + i); > >>> + if (!xpfo) > >>> + continue; > >>> + > >>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags), > >>> + "xpfo: unmapped page being allocated\n"); > >>> + > >>> + /* Initialize the map lock and map counter */ > >>> + if (unlikely(!xpfo->inited)) { > >>> + spin_lock_init(&xpfo->maplock); > >>> + atomic_set(&xpfo->mapcount, 0); > >>> + xpfo->inited = true; > >>> + } > >>> + WARN(atomic_read(&xpfo->mapcount), > >>> + "xpfo: already mapped page being allocated\n"); > >>> + > >>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { > >>> + /* > >>> + * Tag the page as a user page and flush the TLB if it > >>> + * was previously allocated to the kernel. > >>> + */ > >>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) > >>> + flush_tlb = 1; > >> > >> I'm not sure whether I am miss anything, however, when the page was previously allocated > >> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate > >> the page to user now > >> > > Yes, I think you're right. Oddly, the XPFO_READ_USER test works > > correctly for me, but I think (?) should not because of this bug... > > IIRC, this is an optimization carried forward from the initial > implementation. The assumption is that the kernel will map the user > buffer so it's not unmapped on allocation but only on the first (and Does the kernel always map it, though? e.g. in the case of XPFO_READ_USER, I'm not sure where the kernel would do a kmap() of the test's user buffer. Tycho > subsequent) call of kunmap. I.e.: > - alloc -> noop > - kmap -> noop > - kunmap -> unmapped from the kernel > - kmap -> mapped into the kernel > - kunmap -> unmapped from the kernel > and so on until: > - free -> mapped back into the kernel > > I'm not sure if that make sense though since it leaves a window. > > ...Juerg > > > > > Tycho > >
Hi all, On Thu, Sep 07, 2017 at 11:36:01AM -0600, Tycho Andersen wrote: > > +inline void xpfo_flush_kernel_tlb(struct page *page, int order) > +{ > + int level; > + unsigned long size, kaddr; > + > + kaddr = (unsigned long)page_address(page); > + > + if (unlikely(!lookup_address(kaddr, &level))) { > + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level); > + return; > + } > + > + switch (level) { > + case PG_LEVEL_4K: > + size = PAGE_SIZE; > + break; > + case PG_LEVEL_2M: > + size = PMD_SIZE; > + break; > + case PG_LEVEL_1G: > + size = PUD_SIZE; > + break; > + default: > + WARN(1, "xpfo: unsupported page level %x\n", level); > + return; > + } > + > + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); Marco was testing and got the stack trace below. The issue is that on x86, flush_tlb_kernel_range uses on_each_cpu, which causes the WARN() below. Since this is called from xpfo_kmap/unmap in this interrupt handler, the WARN() triggers. I'm not sure what to do about this -- based on the discussion in v6 we need to flush the TLBs for all CPUs -- but we can't do that with interrupts disabled, which basically means with this we wouldn't be able to map/unmap pages in interrupts. Any thoughts? Tycho [ 2.712912] ------------[ cut here ]------------ [ 2.712922] WARNING: CPU: 0 PID: 0 at kernel/smp.c:414 smp_call_function_many+0x9a/0x270 [ 2.712923] Modules linked in: sd_mod ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect virtio_console sysimgblt virtio_blk fb_sys_fops ttm drm 8139too ata_piix libata 8139cp virtio_pci virtio_ring virtio mii crc32c_intel i2c_core serio_raw floppy dm_mirror dm_region_hash dm_log dm_mod [ 2.712939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #8 [ 2.712940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014 [ 2.712941] task: ffffffff81c10480 task.stack: ffffffff81c00000 [ 2.712943] RIP: 0010:smp_call_function_many+0x9a/0x270 [ 2.712944] RSP: 0018:ffff88023fc03b38 EFLAGS: 00010046 [ 2.712945] RAX: 0000000000000000 RBX: ffffffff81072a50 RCX: 0000000000000001 [ 2.712946] RDX: ffff88023fc03ba8 RSI: ffffffff81072a50 RDI: ffffffff81e22320 [ 2.712947] RBP: ffff88023fc03b70 R08: 0000000000000970 R09: 0000000000000063 [ 2.712948] R10: ffff880000000970 R11: 0000000000000000 R12: ffff88023fc03ba8 [ 2.712949] R13: 0000000000000000 R14: ffff8802332b8e18 R15: ffffffff81e22320 [ 2.712950] FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [ 2.712951] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.712951] CR2: 00007fde22f6b000 CR3: 000000022727b000 CR4: 00000000003406f0 [ 2.712954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2.712955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2.712955] Call Trace: [ 2.712959] <IRQ> [ 2.712964] ? 
x86_configure_nx+0x50/0x50 [ 2.712966] on_each_cpu+0x2d/0x60 [ 2.712967] flush_tlb_kernel_range+0x79/0x80 [ 2.712969] xpfo_flush_kernel_tlb+0xaa/0xe0 [ 2.712975] xpfo_kunmap+0xa8/0xc0 [ 2.712981] swiotlb_bounce+0xd1/0x1c0 [ 2.712982] swiotlb_tbl_unmap_single+0x10f/0x120 [ 2.712984] unmap_single+0x20/0x30 [ 2.712985] swiotlb_unmap_sg_attrs+0x46/0x70 [ 2.712991] __ata_qc_complete+0xfa/0x150 [libata] [ 2.712994] ata_qc_complete+0xd2/0x2e0 [libata] [ 2.712998] ata_hsm_qc_complete+0x6f/0x90 [libata] [ 2.713004] ata_sff_hsm_move+0xae/0x6b0 [libata] [ 2.713009] __ata_sff_port_intr+0x8e/0x100 [libata] [ 2.713013] ata_bmdma_port_intr+0x2f/0xd0 [libata] [ 2.713019] ata_bmdma_interrupt+0x161/0x1b0 [libata] [ 2.713022] __handle_irq_event_percpu+0x3c/0x190 [ 2.713024] handle_irq_event_percpu+0x32/0x80 [ 2.713026] handle_irq_event+0x3b/0x60 [ 2.713027] handle_edge_irq+0x8f/0x190 [ 2.713029] handle_irq+0xab/0x120 [ 2.713032] ? _local_bh_enable+0x21/0x30 [ 2.713039] do_IRQ+0x48/0xd0 [ 2.713040] common_interrupt+0x93/0x93 [ 2.713042] RIP: 0010:native_safe_halt+0x6/0x10 [ 2.713043] RSP: 0018:ffffffff81c03de0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffc1 [ 2.713044] RAX: 0000000000000000 RBX: ffffffff81c10480 RCX: 0000000000000000 [ 2.713045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 2.713046] RBP: ffffffff81c03de0 R08: 00000000b656100b R09: 0000000000000000 [ 2.713047] R10: 0000000000000006 R11: 0000000000000005 R12: 0000000000000000 [ 2.713047] R13: ffffffff81c10480 R14: 0000000000000000 R15: 0000000000000000 [ 2.713048] </IRQ> [ 2.713050] default_idle+0x1e/0x100 [ 2.713052] arch_cpu_idle+0xf/0x20 [ 2.713053] default_idle_call+0x2c/0x40 [ 2.713055] do_idle+0x158/0x1e0 [ 2.713056] cpu_startup_entry+0x73/0x80 [ 2.713058] rest_init+0xb8/0xc0 [ 2.713070] start_kernel+0x4a2/0x4c3 [ 2.713072] ? set_init_arg+0x5a/0x5a [ 2.713074] ? early_idt_handler_array+0x120/0x120 [ 2.713075] x86_64_start_reservations+0x2a/0x2c [ 2.713077] x86_64_start_kernel+0x14c/0x16f [ 2.713079] secondary_startup_64+0x9f/0x9f [ 2.713080] Code: 44 3b 35 1e 6f d0 00 7c 26 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 8b 05 63 38 fc 00 85 c0 75 be 80 3d 20 0d d0 00 00 75 b5 <0f> ff eb b1 48 c7 c2 20 23 e2 81 4c 89 fe 44 89 f7 e8 20 b5 62 [ 2.713105] ---[ end trace 4d101d4c176c16b0 ]---
I think the point here is that, since we are running in interrupt context, we are using k{map,unmap}_atomic instead of k{map,unmap}, but after the flush_tlb_kernel_range() change we don't have an XPFO atomic version of either. My suggestion here is, at least for x86, to split this into xpfo_k{un}map and xpfo_k{un}map_atomic, where the atomic variants would only flush the local TLB. This would create a window where the same page would be mapped both for user and kernel on other CPUs due to stale remote TLB entries. Is there any other suggestion for this scenario? Thanks, - Marco Benatto On Mon, Sep 11, 2017 at 3:32 PM, Tycho Andersen <tycho@docker.com> wrote: > Hi all, > > On Thu, Sep 07, 2017 at 11:36:01AM -0600, Tycho Andersen wrote: >> >> +inline void xpfo_flush_kernel_tlb(struct page *page, int order) >> +{ >> + int level; >> + unsigned long size, kaddr; >> + >> + kaddr = (unsigned long)page_address(page); >> + >> + if (unlikely(!lookup_address(kaddr, &level))) { >> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level); >> + return; >> + } >> + >> + switch (level) { >> + case PG_LEVEL_4K: >> + size = PAGE_SIZE; >> + break; >> + case PG_LEVEL_2M: >> + size = PMD_SIZE; >> + break; >> + case PG_LEVEL_1G: >> + size = PUD_SIZE; >> + break; >> + default: >> + WARN(1, "xpfo: unsupported page level %x\n", level); >> + return; >> + } >> + >> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); > > Marco was testing and got the stack trace below. The issue is that on x86, > flush_tlb_kernel_range uses on_each_cpu, which causes the WARN() below. Since > this is called from xpfo_kmap/unmap in this interrupt handler, the WARN() > triggers. > > I'm not sure what to do about this -- based on the discussion in v6 we need to > flush the TLBs for all CPUs -- but we can't do that with interrupts disabled, > which basically means with this we wouldn't be able to map/unmap pages in > interrupts. > > Any thoughts?
> > Tycho > > [ 2.712912] ------------[ cut here ]------------ > [ 2.712922] WARNING: CPU: 0 PID: 0 at kernel/smp.c:414 > smp_call_function_many+0x9a/0x270 > [ 2.712923] Modules linked in: sd_mod ata_generic pata_acpi qxl > drm_kms_helper syscopyarea sysfillrect virtio_console sysimgblt > virtio_blk fb_sys_fops ttm drm 8139too ata_piix libata 8139cp > virtio_pci virtio_ring virtio mii crc32c_intel i2c_core serio_raw > floppy dm_mirror dm_region_hash dm_log dm_mod > [ 2.712939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #8 > [ 2.712940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.9.3-1.fc25 04/01/2014 > [ 2.712941] task: ffffffff81c10480 task.stack: ffffffff81c00000 > [ 2.712943] RIP: 0010:smp_call_function_many+0x9a/0x270 > [ 2.712944] RSP: 0018:ffff88023fc03b38 EFLAGS: 00010046 > [ 2.712945] RAX: 0000000000000000 RBX: ffffffff81072a50 RCX: 0000000000000001 > [ 2.712946] RDX: ffff88023fc03ba8 RSI: ffffffff81072a50 RDI: ffffffff81e22320 > [ 2.712947] RBP: ffff88023fc03b70 R08: 0000000000000970 R09: 0000000000000063 > [ 2.712948] R10: ffff880000000970 R11: 0000000000000000 R12: ffff88023fc03ba8 > [ 2.712949] R13: 0000000000000000 R14: ffff8802332b8e18 R15: ffffffff81e22320 > [ 2.712950] FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) > knlGS:0000000000000000 > [ 2.712951] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 2.712951] CR2: 00007fde22f6b000 CR3: 000000022727b000 CR4: 00000000003406f0 > [ 2.712954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 2.712955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 2.712955] Call Trace: > [ 2.712959] <IRQ> > [ 2.712964] ? x86_configure_nx+0x50/0x50 > [ 2.712966] on_each_cpu+0x2d/0x60 > [ 2.712967] flush_tlb_kernel_range+0x79/0x80 > [ 2.712969] xpfo_flush_kernel_tlb+0xaa/0xe0 > [ 2.712975] xpfo_kunmap+0xa8/0xc0 > [ 2.712981] swiotlb_bounce+0xd1/0x1c0 > [ 2.712982] swiotlb_tbl_unmap_single+0x10f/0x120 > [ 2.712984] unmap_single+0x20/0x30 > [ 2.712985] swiotlb_unmap_sg_attrs+0x46/0x70 > [ 2.712991] __ata_qc_complete+0xfa/0x150 [libata] > [ 2.712994] ata_qc_complete+0xd2/0x2e0 [libata] > [ 2.712998] ata_hsm_qc_complete+0x6f/0x90 [libata] > [ 2.713004] ata_sff_hsm_move+0xae/0x6b0 [libata] > [ 2.713009] __ata_sff_port_intr+0x8e/0x100 [libata] > [ 2.713013] ata_bmdma_port_intr+0x2f/0xd0 [libata] > [ 2.713019] ata_bmdma_interrupt+0x161/0x1b0 [libata] > [ 2.713022] __handle_irq_event_percpu+0x3c/0x190 > [ 2.713024] handle_irq_event_percpu+0x32/0x80 > [ 2.713026] handle_irq_event+0x3b/0x60 > [ 2.713027] handle_edge_irq+0x8f/0x190 > [ 2.713029] handle_irq+0xab/0x120 > [ 2.713032] ? 
_local_bh_enable+0x21/0x30 > [ 2.713039] do_IRQ+0x48/0xd0 > [ 2.713040] common_interrupt+0x93/0x93 > [ 2.713042] RIP: 0010:native_safe_halt+0x6/0x10 > [ 2.713043] RSP: 0018:ffffffff81c03de0 EFLAGS: 00000246 ORIG_RAX: > ffffffffffffffc1 > [ 2.713044] RAX: 0000000000000000 RBX: ffffffff81c10480 RCX: 0000000000000000 > [ 2.713045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 > [ 2.713046] RBP: ffffffff81c03de0 R08: 00000000b656100b R09: 0000000000000000 > [ 2.713047] R10: 0000000000000006 R11: 0000000000000005 R12: 0000000000000000 > [ 2.713047] R13: ffffffff81c10480 R14: 0000000000000000 R15: 0000000000000000 > [ 2.713048] </IRQ> > [ 2.713050] default_idle+0x1e/0x100 > [ 2.713052] arch_cpu_idle+0xf/0x20 > [ 2.713053] default_idle_call+0x2c/0x40 > [ 2.713055] do_idle+0x158/0x1e0 > [ 2.713056] cpu_startup_entry+0x73/0x80 > [ 2.713058] rest_init+0xb8/0xc0 > [ 2.713070] start_kernel+0x4a2/0x4c3 > [ 2.713072] ? set_init_arg+0x5a/0x5a > [ 2.713074] ? early_idt_handler_array+0x120/0x120 > [ 2.713075] x86_64_start_reservations+0x2a/0x2c > [ 2.713077] x86_64_start_kernel+0x14c/0x16f > [ 2.713079] secondary_startup_64+0x9f/0x9f > [ 2.713080] Code: 44 3b 35 1e 6f d0 00 7c 26 48 83 c4 10 5b 41 5c > 41 5d 41 5e 41 5f 5d c3 8b 05 63 38 fc 00 85 c0 75 be 80 3d 20 0d d0 > 00 00 75 b5 <0f> ff eb b1 48 c7 c2 20 23 e2 81 4c 89 fe 44 89 f7 e8 20 > b5 62 > [ 2.713105] ---[ end trace 4d101d4c176c16b0 ]---
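For reference, Marco's local-flush idea would look roughly like the following on x86. This is only a sketch: the helper name is hypothetical, it deliberately accepts the stale-remote-TLB window he describes, and __flush_tlb_one() is the per-address local invalidation that the v6 changelog says was replaced by flush_tlb_kernel_range().

/* Local-CPU-only flush, usable from atomic/IRQ context because it sends no IPIs. */
static void xpfo_flush_kernel_tlb_local(struct page *page, int order)
{
	unsigned long kaddr = (unsigned long)page_address(page);
	int i;

	for (i = 0; i < (1 << order); i++)
		__flush_tlb_one(kaddr + i * PAGE_SIZE);
}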
On 2017/9/12 0:03, Juerg Haefliger wrote: > > > On 09/11/2017 04:50 PM, Tycho Andersen wrote: >> Hi Yisheng, >> >> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote: >>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) >>>> +{ >>>> + int i, flush_tlb = 0; >>>> + struct xpfo *xpfo; >>>> + >>>> + if (!static_branch_unlikely(&xpfo_inited)) >>>> + return; >>>> + >>>> + for (i = 0; i < (1 << order); i++) { >>>> + xpfo = lookup_xpfo(page + i); >>>> + if (!xpfo) >>>> + continue; >>>> + >>>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags), >>>> + "xpfo: unmapped page being allocated\n"); >>>> + >>>> + /* Initialize the map lock and map counter */ >>>> + if (unlikely(!xpfo->inited)) { >>>> + spin_lock_init(&xpfo->maplock); >>>> + atomic_set(&xpfo->mapcount, 0); >>>> + xpfo->inited = true; >>>> + } >>>> + WARN(atomic_read(&xpfo->mapcount), >>>> + "xpfo: already mapped page being allocated\n"); >>>> + >>>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { >>>> + /* >>>> + * Tag the page as a user page and flush the TLB if it >>>> + * was previously allocated to the kernel. >>>> + */ >>>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) >>>> + flush_tlb = 1; >>> >>> I'm not sure whether I am miss anything, however, when the page was previously allocated >>> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate >>> the page to user now >>> >> Yes, I think you're right. Oddly, the XPFO_READ_USER test works Hi Tycho, Could you share this test? I'd like to know how it works. Thanks >> correctly for me, but I think (?) should not because of this bug... > > IIRC, this is an optimization carried forward from the initial > implementation. Hi Juerg, hmm.. If below is the first version, then it seems this exist from the first version: https://patchwork.kernel.org/patch/8437451/ > The assumption is that the kernel will map the user > buffer so it's not unmapped on allocation but only on the first (and > subsequent) call of kunmap. IMO, before a page is allocated, it is in buddy system, which means it is free and no other 'map' on the page except direct map. Then if the page is allocated to user, XPFO should unmap the direct map. otherwise the ret2dir may works at this window before it is freed. Or maybe I'm still missing anything. Thanks Yisheng Xie > I.e.: > - alloc -> noop > - kmap -> noop > - kunmap -> unmapped from the kernel > - kmap -> mapped into the kernel > - kunmap -> unmapped from the kernel > and so on until: > - free -> mapped back into the kernel > > I'm not sure if that make sense though since it leaves a window. > > ...Juerg > > > >> Tycho >> > > . >
On Tue, Sep 12, 2017 at 04:05:22PM +0800, Yisheng Xie wrote: > > > On 2017/9/12 0:03, Juerg Haefliger wrote: > > > > > > On 09/11/2017 04:50 PM, Tycho Andersen wrote: > >> Hi Yisheng, > >> > >> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote: > >>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) > >>>> +{ > >>>> + int i, flush_tlb = 0; > >>>> + struct xpfo *xpfo; > >>>> + > >>>> + if (!static_branch_unlikely(&xpfo_inited)) > >>>> + return; > >>>> + > >>>> + for (i = 0; i < (1 << order); i++) { > >>>> + xpfo = lookup_xpfo(page + i); > >>>> + if (!xpfo) > >>>> + continue; > >>>> + > >>>> + WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags), > >>>> + "xpfo: unmapped page being allocated\n"); > >>>> + > >>>> + /* Initialize the map lock and map counter */ > >>>> + if (unlikely(!xpfo->inited)) { > >>>> + spin_lock_init(&xpfo->maplock); > >>>> + atomic_set(&xpfo->mapcount, 0); > >>>> + xpfo->inited = true; > >>>> + } > >>>> + WARN(atomic_read(&xpfo->mapcount), > >>>> + "xpfo: already mapped page being allocated\n"); > >>>> + > >>>> + if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) { > >>>> + /* > >>>> + * Tag the page as a user page and flush the TLB if it > >>>> + * was previously allocated to the kernel. > >>>> + */ > >>>> + if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) > >>>> + flush_tlb = 1; > >>> > >>> I'm not sure whether I am miss anything, however, when the page was previously allocated > >>> to kernel, should we unmap the physmap (the kernel's page table) here? For we allocate > >>> the page to user now > >>> > >> Yes, I think you're right. Oddly, the XPFO_READ_USER test works > > Hi Tycho, > Could you share this test? I'd like to know how it works. See the last patch in the series. > >> correctly for me, but I think (?) should not because of this bug... > > > > IIRC, this is an optimization carried forward from the initial > > implementation. > Hi Juerg, > > hmm.. If below is the first version, then it seems this exist from the first version: > https://patchwork.kernel.org/patch/8437451/ > > > The assumption is that the kernel will map the user > > buffer so it's not unmapped on allocation but only on the first (and > > subsequent) call of kunmap. > > IMO, before a page is allocated, it is in buddy system, which means it is free > and no other 'map' on the page except direct map. Then if the page is allocated > to user, XPFO should unmap the direct map. otherwise the ret2dir may works at > this window before it is freed. Or maybe I'm still missing anything. I agree that it seems broken. I'm just not sure why the test doesn't fail. It's certainly worth understanding. Tycho
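For readers following along, the fix being discussed (and reportedly folded into v7) is to actually unmap the page from the physmap when it is handed to userspace, rather than only tagging it. A sketch of the relevant hunk in xpfo_alloc_pages(), with the details taken as assumptions from this thread rather than from the v7 patch itself:

if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
	/*
	 * Tag the page as a user page, unmap it from the kernel's
	 * page table and flush the TLB if it was previously
	 * allocated to the kernel.
	 */
	if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags)) {
		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
		set_kpte(page_address(page + i), page + i, __pgprot(0));
		flush_tlb = 1;
	}
}

With this, the ret2dir window between allocation and the first kunmap that Yisheng points out is closed, because the direct-map alias is gone as soon as the page is tagged as a user page.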
On 09/07/2017 10:36 AM, Tycho Andersen wrote: ... > Whenever a page destined for userspace is allocated, it is > unmapped from physmap (the kernel's page table). When such a page is > reclaimed from userspace, it is mapped back to physmap. I'm looking for the code where it's unmapped at allocation to userspace. I see TLB flushing and 'struct xpfo' manipulation, but I don't see the unmapping. Where is that occurring? How badly does this hurt performance? Since we (generally) have different migrate types for user and kernel allocation, I can imagine that a given page *generally* doesn't oscillate between user and kernel-allocated, but I'm curious how it works in practice. Doesn't the IPI load from the TLB flushes eat you alive? It's a bit scary to have such a deep code path under the main allocator. This all seems insanely expensive. It will *barely* work on an allocation-heavy workload on a desktop. I'm pretty sure the locking will just fall over entirely on any reasonably-sized server. I really have to wonder whether there are better ret2dir defenses than this. The allocator just seems like the *wrong* place to be doing this because it's such a hot path. > + cpa.vaddr = kaddr; > + cpa.pages = &page; > + cpa.mask_set = prot; > + cpa.mask_clr = msk_clr; > + cpa.numpages = 1; > + cpa.flags = 0; > + cpa.curpage = 0; > + cpa.force_split = 0; > + > + > + do_split = try_preserve_large_page(pte, (unsigned Is this safe to do without a TLB flush? I thought we had plenty of bugs in CPUs around having multiple entries for the same page in the TLB at once. We're *REALLY* careful when we split large pages for THP, and I'm surprised we don't do the same here. Why do you even bother keeping large pages around? Won't the entire kernel just degrade to using 4k everywhere, eventually? > + if (do_split) { > + struct page *base; > + > + base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, Ugh, GFP_ATOMIC. That's nasty. Do you really want this allocation to fail all the time? GFP_ATOMIC could really be called GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :) You probably want to do what the THP code does here and keep a spare page around, then allocate it before you take the locks. > +inline void xpfo_flush_kernel_tlb(struct page *page, int order) > +{ > + int level; > + unsigned long size, kaddr; > + > + kaddr = (unsigned long)page_address(page); > + > + if (unlikely(!lookup_address(kaddr, &level))) { > + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level); > + return; > + } > + > + switch (level) { > + case PG_LEVEL_4K: > + size = PAGE_SIZE; > + break; > + case PG_LEVEL_2M: > + size = PMD_SIZE; > + break; > + case PG_LEVEL_1G: > + size = PUD_SIZE; > + break; > + default: > + WARN(1, "xpfo: unsupported page level %x\n", level); > + return; > + } > + > + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); > +} I'm not sure flush_tlb_kernel_range() is the best primitive to be calling here. Let's say you walk the page tables and find level=PG_LEVEL_1G. You call flush_tlb_kernel_range(), you will be above tlb_single_page_flush_ceiling, and you will do a full TLB flush. But, with a 1GB page, you could have just used a single INVLPG and skipped the global flush. I guess the cost of the IPI is way more than the flush itself, but it's still a shame to toss the entire TLB when you don't have to. I also think the TLB flush should be done closer to the page table manipulation that it is connected to. It's hard to figure out whether the flush is the right one otherwise. 
Also, the "(1 << order) * size" thing looks goofy to me. Let's say you are flushing a order=1 (8k) page and its mapped in a 1GB mapping. You flush 2GB. Is that intentional? > + > +void xpfo_free_pages(struct page *page, int order) > +{ ... > + /* > + * Map the page back into the kernel if it was previously > + * allocated to user space. > + */ > + if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) { > + clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags); > + set_kpte(page_address(page + i), page + i, > + PAGE_KERNEL); > + } > + } > +} This seems like a bad idea, performance-wise. Kernel and userspace pages tend to be separated by migrate types. So, a given physical page will tend to be used as kernel *or* for userspace. With this nugget, every time a userspace page is freed, we will go to the trouble of making it *back* into a kernel page. Then, when it is allocated again (probably as userspace), we will re-make it into a userspace page. That seems horribly inefficient. Also, this weakens the security guarantees. Let's say you're mounting a ret2dir attack. You populate a page with your evil data and you know the kernel address for the page. All you have to do is coordinate your attack with freeing the page. You can control when it gets freed. Now, the xpfo_free_pages() helpfully just mapped your attack code back into the kernel. Why not *just* do these moves at allocation time?
Hi Dave, Thanks for taking a look! On Wed, Sep 20, 2017 at 08:48:36AM -0700, Dave Hansen wrote: > On 09/07/2017 10:36 AM, Tycho Andersen wrote: > ... > > Whenever a page destined for userspace is allocated, it is > > unmapped from physmap (the kernel's page table). When such a page is > > reclaimed from userspace, it is mapped back to physmap. > > I'm looking for the code where it's unmapped at allocation to userspace. > I see TLB flushing and 'struct xpfo' manipulation, but I don't see the > unmapping. Where is that occurring? This is discussed here: https://lkml.org/lkml/2017/9/11/289 but, you're right that it's wrong in some cases. I've fixed it up for v7: https://lkml.org/lkml/2017/9/12/512 > How badly does this hurt performance? Since we (generally) have > different migrate types for user and kernel allocation, I can imagine > that a given page *generally* doesn't oscillate between user and > kernel-allocated, but I'm curious how it works in practice. Doesn't the > IPI load from the TLB flushes eat you alive? > > It's a bit scary to have such a deep code path under the main allocator. > > This all seems insanely expensive. It will *barely* work on an > allocation-heavy workload on a desktop. I'm pretty sure the locking > will just fall over entirely on any reasonably-sized server. Basically, yes :(. I presented some numbers at LSS, but the gist was on a 2.4x slowdown on a 24 core/48 thread Xeon E5-2650, and a 1.4x slowdown on a 4 core/8 thread E3-1240. The story seems a little bit better on ARM, but I'm struggling to get it to boot on a box with more than 4 cores, so I can't draw a better picture yet. > I really have to wonder whether there are better ret2dir defenses than > this. The allocator just seems like the *wrong* place to be doing this > because it's such a hot path. This might be crazy, but what if we defer flushing of the kernel ranges until just before we return to userspace? We'd still manipulate the prot/xpfo bits for the pages, but then just keep a list of which ranges need to be flushed, and do the right thing before we return. This leaves a little window between the actual allocation and the flush, but userspace would need another thread in its threadgroup to predict the next allocation, write the bad stuff there, and do the exploit all in that window. I'm of course open to other suggestions. I'm new :) > > + cpa.vaddr = kaddr; > > + cpa.pages = &page; > > + cpa.mask_set = prot; > > + cpa.mask_clr = msk_clr; > > + cpa.numpages = 1; > > + cpa.flags = 0; > > + cpa.curpage = 0; > > + cpa.force_split = 0; > > + > > + > > + do_split = try_preserve_large_page(pte, (unsigned > > Is this safe to do without a TLB flush? I thought we had plenty of bugs > in CPUs around having multiple entries for the same page in the TLB at > once. We're *REALLY* careful when we split large pages for THP, and I'm > surprised we don't do the same here. It looks like on some code paths we do flush, and some we don't. Sounds like it's not safe to do without a flush, so I'll see about adding one. > Why do you even bother keeping large pages around? Won't the entire > kernel just degrade to using 4k everywhere, eventually? Isn't that true of large pages in general? Is there something about xpfo that makes this worse? I thought this would only split things if they had already been split somewhere else, and the protection can't apply to the whole huge page. > > + if (do_split) { > > + struct page *base; > > + > > + base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, > > Ugh, GFP_ATOMIC. That's nasty. 
Do you really want this allocation to > fail all the time? GFP_ATOMIC could really be called > GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :) > > You probably want to do what the THP code does here and keep a spare > page around, then allocate it before you take the locks. Sounds like a good idea, thanks. > > +inline void xpfo_flush_kernel_tlb(struct page *page, int order) > > +{ > > + int level; > > + unsigned long size, kaddr; > > + > > + kaddr = (unsigned long)page_address(page); > > + > > + if (unlikely(!lookup_address(kaddr, &level))) { > > + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level); > > + return; > > + } > > + > > + switch (level) { > > + case PG_LEVEL_4K: > > + size = PAGE_SIZE; > > + break; > > + case PG_LEVEL_2M: > > + size = PMD_SIZE; > > + break; > > + case PG_LEVEL_1G: > > + size = PUD_SIZE; > > + break; > > + default: > > + WARN(1, "xpfo: unsupported page level %x\n", level); > > + return; > > + } > > + > > + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); > > +} > > I'm not sure flush_tlb_kernel_range() is the best primitive to be > calling here. > > Let's say you walk the page tables and find level=PG_LEVEL_1G. You call > flush_tlb_kernel_range(), you will be above > tlb_single_page_flush_ceiling, and you will do a full TLB flush. But, > with a 1GB page, you could have just used a single INVLPG and skipped > the global flush. > > I guess the cost of the IPI is way more than the flush itself, but it's > still a shame to toss the entire TLB when you don't have to. Ok, do you think it's worth making a new helper for others to use? Or should I just keep the logic in this function? > I also think the TLB flush should be done closer to the page table > manipulation that it is connected to. It's hard to figure out whether > the flush is the right one otherwise. Yes, sounds good. > Also, the "(1 << order) * size" thing looks goofy to me. Let's say you > are flushing a order=1 (8k) page and its mapped in a 1GB mapping. You > flush 2GB. Is that intentional? I don't think so; seems like we should be flushing (1 << order) * PAGE_SIZE instead. > > + > > +void xpfo_free_pages(struct page *page, int order) > > +{ > ... > > + /* > > + * Map the page back into the kernel if it was previously > > + * allocated to user space. > > + */ > > + if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) { > > + clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags); > > + set_kpte(page_address(page + i), page + i, > > + PAGE_KERNEL); > > + } > > + } > > +} > > This seems like a bad idea, performance-wise. Kernel and userspace > pages tend to be separated by migrate types. So, a given physical page > will tend to be used as kernel *or* for userspace. With this nugget, > every time a userspace page is freed, we will go to the trouble of > making it *back* into a kernel page. Then, when it is allocated again > (probably as userspace), we will re-make it into a userspace page. That > seems horribly inefficient. > > Also, this weakens the security guarantees. Let's say you're mounting a > ret2dir attack. You populate a page with your evil data and you know > the kernel address for the page. All you have to do is coordinate your > attack with freeing the page. You can control when it gets freed. Now, > the xpfo_free_pages() helpfully just mapped your attack code back into > the kernel. > > Why not *just* do these moves at allocation time? Yes, this is a great point, thanks. 
I think this can be a no-op, and with the fixed up v7 logic for alloc pages that I linked to above it should work out correctly. Tycho
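Following the two points above (flush the order-sized range, and keep the flush next to the page-table write), the flush helper reduces to something like the sketch below. It assumes that invalidating any address inside a large mapping drops that TLB entry, so walking the page tables for the mapping level is no longer needed here.

inline void xpfo_flush_kernel_tlb(struct page *page, int order)
{
	unsigned long kaddr = (unsigned long)page_address(page);

	flush_tlb_kernel_range(kaddr, kaddr + ((unsigned long)PAGE_SIZE << order));
}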
On 09/20/2017 03:34 PM, Tycho Andersen wrote: >> I really have to wonder whether there are better ret2dir defenses than >> this. The allocator just seems like the *wrong* place to be doing this >> because it's such a hot path. > > This might be crazy, but what if we defer flushing of the kernel > ranges until just before we return to userspace? We'd still manipulate > the prot/xpfo bits for the pages, but then just keep a list of which > ranges need to be flushed, and do the right thing before we return. > This leaves a little window between the actual allocation and the > flush, but userspace would need another thread in its threadgroup to > predict the next allocation, write the bad stuff there, and do the > exploit all in that window. I think the common case is still that you enter the kernel, allocate a single page (or very few) and then exit. So, you don't really reduce the total number of flushes. Just think of this in terms of IPIs to do the remote TLB flushes. A CPU can do roughly 1 million page faults and allocations a second. Say you have a 2-socket x 28-core x 2 hyperthead system = 112 CPU threads. That's 111M IPI interrupts/second, just for the TLB flushes, *ON* *EACH* *CPU*. I think the only thing that will really help here is if you batch the allocations. For instance, you could make sure that the per-cpu-pageset lists always contain either all kernel or all user data. Then remap the entire list at once and do a single flush after the entire list is consumed. >> Why do you even bother keeping large pages around? Won't the entire >> kernel just degrade to using 4k everywhere, eventually? > > Isn't that true of large pages in general? Is there something about > xpfo that makes this worse? I thought this would only split things if > they had already been split somewhere else, and the protection can't > apply to the whole huge page. Even though the kernel gives out 4k pages, it still *maps* them in the kernel linear direct map with the largest size available. My 16GB laptop, for instance, has 3GB of 2MB transparent huge pages, but the rest is used as 4k pages. Yet, from /proc/meminfo: DirectMap4k: 665280 kB DirectMap2M: 11315200 kB DirectMap1G: 4194304 kB Your code pretty much forces 4k pages coming out of the allocator to be mapped with 4k mappings. >>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order) >>> +{ >>> + int level; >>> + unsigned long size, kaddr; >>> + >>> + kaddr = (unsigned long)page_address(page); >>> + >>> + if (unlikely(!lookup_address(kaddr, &level))) { >>> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level); >>> + return; >>> + } >>> + >>> + switch (level) { >>> + case PG_LEVEL_4K: >>> + size = PAGE_SIZE; >>> + break; >>> + case PG_LEVEL_2M: >>> + size = PMD_SIZE; >>> + break; >>> + case PG_LEVEL_1G: >>> + size = PUD_SIZE; >>> + break; >>> + default: >>> + WARN(1, "xpfo: unsupported page level %x\n", level); >>> + return; >>> + } >>> + >>> + flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); >>> +} >> >> I'm not sure flush_tlb_kernel_range() is the best primitive to be >> calling here. >> >> Let's say you walk the page tables and find level=PG_LEVEL_1G. You call >> flush_tlb_kernel_range(), you will be above >> tlb_single_page_flush_ceiling, and you will do a full TLB flush. But, >> with a 1GB page, you could have just used a single INVLPG and skipped >> the global flush. >> >> I guess the cost of the IPI is way more than the flush itself, but it's >> still a shame to toss the entire TLB when you don't have to. 
> > Ok, do you think it's worth making a new helper for others to use? Or > should I just keep the logic in this function? I'd just leave it in place. Most folks already have a PTE when they do the invalidation, so this is a bit of a weirdo.
On 09/07/2017 10:36 AM, Tycho Andersen wrote: > + /* > + * Map the page back into the kernel if it was previously > + * allocated to user space. > + */ > + if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) { > + clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags); > + set_kpte(page_address(page + i), page + i, > + PAGE_KERNEL); > + } > + } It might also be a really good idea to clear the page here. Otherwise, the page still might have attack code in it and now it is mapped into the kernel again, ready to be exploited. Think of it this way: pages either trusted data and are mapped all the time, or they have potentially bad data and are unmapped mostly. If we want to take a bad page and map it always, we have to make sure the contents are not evil. 0's are not evil. > static inline void *kmap(struct page *page) > { > + void *kaddr; > + > might_sleep(); > - return page_address(page); > + kaddr = page_address(page); > + xpfo_kmap(kaddr, page); > + return kaddr; > } The time between kmap() and kunmap() is potentially a really long operation. I think we, for instance, keep some pages kmap()'d while we do I/O to them, or wait for I/O elsewhere. IOW, this will map predictable data at a predictable location and it will do it for a long time. While that's better than the current state (mapped always), it still seems rather risky. Could you, for instance, turn kmap(page) into vmap(&page, 1, ...)? That way, at least the address may be different each time. Even if an attacker knows the physical address, they don't know where it will be mapped.
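A sketch of the page-scrubbing suggestion for xpfo_free_pages(), assuming the free-time remap is kept at all (elsewhere in the thread it is proposed to become a no-op): map the page back into the kernel first, then zero it, so stale user-controlled data is never re-exposed through the physmap.

if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
	clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
	set_kpte(page_address(page + i), page + i, PAGE_KERNEL);
	/* scrub before the kernel can read or execute the page again */
	clear_page(page_address(page + i));
}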
On Wed, Sep 20, 2017 at 04:21:15PM -0700, Dave Hansen wrote: > On 09/20/2017 03:34 PM, Tycho Andersen wrote: > >> I really have to wonder whether there are better ret2dir defenses than > >> this. The allocator just seems like the *wrong* place to be doing this > >> because it's such a hot path. > > > > This might be crazy, but what if we defer flushing of the kernel > > ranges until just before we return to userspace? We'd still manipulate > > the prot/xpfo bits for the pages, but then just keep a list of which > > ranges need to be flushed, and do the right thing before we return. > > This leaves a little window between the actual allocation and the > > flush, but userspace would need another thread in its threadgroup to > > predict the next allocation, write the bad stuff there, and do the > > exploit all in that window. > > I think the common case is still that you enter the kernel, allocate a > single page (or very few) and then exit. So, you don't really reduce > the total number of flushes. > > Just think of this in terms of IPIs to do the remote TLB flushes. A CPU > can do roughly 1 million page faults and allocations a second. Say you > have a 2-socket x 28-core x 2 hyperthead system = 112 CPU threads. > That's 111M IPI interrupts/second, just for the TLB flushes, *ON* *EACH* > *CPU*. Since we only need to flush when something switches from a userspace to a kernel page or back, hopefully it's not this bad, but point taken. > I think the only thing that will really help here is if you batch the > allocations. For instance, you could make sure that the per-cpu-pageset > lists always contain either all kernel or all user data. Then remap the > entire list at once and do a single flush after the entire list is consumed. Just so I understand, the idea would be that we only flush when the type of allocation alternates, so: kmalloc(..., GFP_KERNEL); kmalloc(..., GFP_KERNEL); /* remap+flush here */ kmalloc(..., GFP_HIGHUSER); /* remap+flush here */ kmalloc(..., GFP_KERNEL); ? Tycho
On 09/20/2017 05:09 PM, Tycho Andersen wrote: >> I think the only thing that will really help here is if you batch the >> allocations. For instance, you could make sure that the per-cpu-pageset >> lists always contain either all kernel or all user data. Then remap the >> entire list at once and do a single flush after the entire list is consumed. > Just so I understand, the idea would be that we only flush when the > type of allocation alternates, so: > > kmalloc(..., GFP_KERNEL); > kmalloc(..., GFP_KERNEL); > /* remap+flush here */ > kmalloc(..., GFP_HIGHUSER); > /* remap+flush here */ > kmalloc(..., GFP_KERNEL); Not really. We keep a free list per migrate type, and a per_cpu_pages (pcp) list per migratetype: > struct per_cpu_pages { > int count; /* number of pages in the list */ > int high; /* high watermark, emptying needed */ > int batch; /* chunk size for buddy add/remove */ > > /* Lists of pages, one per migrate type stored on the pcp-lists */ > struct list_head lists[MIGRATE_PCPTYPES]; > }; The migratetype is derived from the GFP flags in gfpflags_to_migratetype(). In general, GFP_HIGHUSER and GFP_KERNEL come from different migratetypes, so they come from different free lists. In your case above, the GFP_HIGHUSER allocation come through the MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the MIGRATE_UNMOVABLE one. Since we add a bunch of pages to those lists at once, you could do all the mapping/unmapping/flushing on a bunch of pages at once Or, you could hook your code into the places where the migratetype of memory is changed (set_pageblock_migratetype(), plus where we fall back). Those changes are much more rare than page allocation.
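A very rough sketch of the batching idea described above, with a hypothetical function name and hook point (the real per_cpu_pages refill paths are in mm/page_alloc.c): when a batch of pages lands on a pcp list destined for userspace, unmap them all and issue a single flush covering the batch. Pages on pcp lists are linked through page->lru.

static void xpfo_unmap_pcp_batch(struct list_head *batch)
{
	struct page *page;
	unsigned long start = ULONG_MAX, end = 0;

	list_for_each_entry(page, batch, lru) {
		unsigned long kaddr = (unsigned long)page_address(page);

		set_kpte((void *)kaddr, page, __pgprot(0));
		start = min(start, kaddr);
		end = max(end, kaddr + PAGE_SIZE);
	}

	if (start < end)
		flush_tlb_kernel_range(start, end);	/* one flush for the whole batch */
}

The flushed range may span more than the batch itself if the pages are not physically contiguous, but over-flushing is only a performance cost, and it is still one IPI round per batch instead of one per page.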
At a high level, does this approach keep an attacker from being able to determine the address of data in the linear map, or does it keep them from being able to *exploit* it? Can you have a ret2dir attack if the attacker doesn't know the address, for instance?
On Wed, Sep 20, 2017 at 05:28:11PM -0700, Dave Hansen wrote: > At a high level, does this approach keep an attacker from being able to > determine the address of data in the linear map, or does it keep them > from being able to *exploit* it? It keeps them from exploiting it, by faulting when a physmap alias is used. > Can you have a ret2dir attack if the attacker doesn't know the > address, for instance? Yes, through a technique similar to heap spraying. The original paper has a study of this, section 5.2 outlines the attack and 7.2 describes their success rate: http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf Tycho
On Wed, Sep 20, 2017 at 05:27:02PM -0700, Dave Hansen wrote: > On 09/20/2017 05:09 PM, Tycho Andersen wrote: > >> I think the only thing that will really help here is if you batch the > >> allocations. For instance, you could make sure that the per-cpu-pageset > >> lists always contain either all kernel or all user data. Then remap the > >> entire list at once and do a single flush after the entire list is consumed. > > Just so I understand, the idea would be that we only flush when the > > type of allocation alternates, so: > > > > kmalloc(..., GFP_KERNEL); > > kmalloc(..., GFP_KERNEL); > > /* remap+flush here */ > > kmalloc(..., GFP_HIGHUSER); > > /* remap+flush here */ > > kmalloc(..., GFP_KERNEL); > > Not really. We keep a free list per migrate type, and a per_cpu_pages > (pcp) list per migratetype: > > > struct per_cpu_pages { > > int count; /* number of pages in the list */ > > int high; /* high watermark, emptying needed */ > > int batch; /* chunk size for buddy add/remove */ > > > > /* Lists of pages, one per migrate type stored on the pcp-lists */ > > struct list_head lists[MIGRATE_PCPTYPES]; > > }; > > The migratetype is derived from the GFP flags in > gfpflags_to_migratetype(). In general, GFP_HIGHUSER and GFP_KERNEL come > from different migratetypes, so they come from different free lists. > > In your case above, the GFP_HIGHUSER allocation come through the > MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the > MIGRATE_UNMOVABLE one. Since we add a bunch of pages to those lists at > once, you could do all the mapping/unmapping/flushing on a bunch of > pages at once > > Or, you could hook your code into the places where the migratetype of > memory is changed (set_pageblock_migratetype(), plus where we fall > back). Those changes are much more rare than page allocation. I see, thanks for all this discussion. It has been very helpful! Tycho
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..444d83183f75 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2736,6 +2736,8 @@
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	noxpfo		[X86-64] Disable XPFO when CONFIG_XPFO is on.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 323cb065be5e..d78a0d538900 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -185,6 +185,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 77037b6f1caa..c2eb40f7a74b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1238,6 +1238,31 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
 	return __pte_access_permitted(pud_val(pud), write);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+	unsigned long	*vaddr;
+	pgd_t		*pgd;
+	pgprot_t	mask_set;
+	pgprot_t	mask_clr;
+	unsigned long	numpages;
+	int		flags;
+	unsigned long	pfn;
+	unsigned	force_split : 1;
+	int		curpage;
+	struct page	**pages;
+};
+
+
+int
+try_preserve_large_page(pte_t *kpte, unsigned long address,
+			struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+		   struct page *base);
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 0fbdcb64f9f8..89ba6d25fb51 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -39,3 +39,4 @@ obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
 
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 757b0bcdf712..f25d07191e60 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -27,28 +27,12 @@
 #include <asm/set_memory.h>
 
 /*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
-	unsigned long	*vaddr;
-	pgd_t		*pgd;
-	pgprot_t	mask_set;
-	pgprot_t	mask_clr;
-	unsigned long	numpages;
-	int		flags;
-	unsigned long	pfn;
-	unsigned	force_split : 1;
-	int		curpage;
-	struct page	**pages;
-};
-
-/*
  * Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
  * using cpa_lock. So that we don't allow any other cpu, with stale large tlb
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  */
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);
 
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
@@ -512,7 +496,7 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 #endif
 }
 
-static int
+int
 try_preserve_large_page(pte_t *kpte, unsigned long address,
 			struct cpa_data *cpa)
 {
@@ -648,7 +632,7 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 	return do_split;
 }
 
-static int
+int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 		   struct page *base)
 {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..6794d6724ab5
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,114 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+	unsigned int level;
+	pgprot_t msk_clr;
+	pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+	if (unlikely(!pte)) {
+		WARN(1, "xpfo: invalid address %p\n", kaddr);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+		break;
+	case PG_LEVEL_2M:
+	case PG_LEVEL_1G: {
+		struct cpa_data cpa = { };
+		int do_split;
+
+		if (level == PG_LEVEL_2M)
+			msk_clr = pmd_pgprot(*(pmd_t*)pte);
+		else
+			msk_clr = pud_pgprot(*(pud_t*)pte);
+
+		cpa.vaddr = kaddr;
+		cpa.pages = &page;
+		cpa.mask_set = prot;
+		cpa.mask_clr = msk_clr;
+		cpa.numpages = 1;
+		cpa.flags = 0;
+		cpa.curpage = 0;
+		cpa.force_split = 0;
+
+
+		do_split = try_preserve_large_page(pte, (unsigned long)kaddr,
+						   &cpa);
+		if (do_split) {
+			struct page *base;
+
+			base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, 0);
+			if (!base) {
+				WARN(1, "xpfo: failed to split large page\n");
+				break;
+			}
+
+			if (!debug_pagealloc_enabled())
+				spin_lock(&cpa_lock);
+			if (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)
+				WARN(1, "xpfo: failed to split large page\n");
+			if (!debug_pagealloc_enabled())
+				spin_unlock(&cpa_lock);
+		}
+
+		break;
+	}
+	case PG_LEVEL_512G:
+		/* fallthrough, splitting infrastructure doesn't
+		 * support 512G pages. */
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+	}
+
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+	int level;
+	unsigned long size, kaddr;
+
+	kaddr = (unsigned long)page_address(page);
+
+	if (unlikely(!lookup_address(kaddr, &level))) {
+		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		size = PAGE_SIZE;
+		break;
+	case PG_LEVEL_2M:
+		size = PMD_SIZE;
+		break;
+	case PG_LEVEL_1G:
+		size = PUD_SIZE;
+		break;
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+		return;
+	}
+
+	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..442c58ee930e
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+			     enum dma_data_direction dir);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_kmap(void *kaddr, struct page *page);
+void xpfo_kunmap(void *kaddr, struct page *page);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/mm/Makefile b/mm/Makefile
index 411bd24d4a7c..0be67cac8f6c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1423da8dd16f..09fdf1bad21f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1059,6 +1059,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_pages(page, order);
 
 	return true;
 }
@@ -1758,6 +1759,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_pages(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044b09a..4899df1f5d66 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -65,6 +66,9 @@ static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..bff24afcaa2e
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,222 @@
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+/* XPFO page state flags */
+enum xpfo_flags {
+	XPFO_PAGE_USER,		/* Page is allocated to user-space */
+	XPFO_PAGE_UNMAPPED,	/* Page is unmapped from the linear map */
+};
+
+/* Per-page XPFO house-keeping data */
+struct xpfo {
+	unsigned long flags;	/* Page state */
+	bool inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+};
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool xpfo_disabled __initdata;
+
+static int __init noxpfo_param(char *str)
+{
+	xpfo_disabled = true;
+
+	return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+static bool __init need_xpfo(void)
+{
+	if (xpfo_disabled) {
+		printk(KERN_INFO "XPFO disabled\n");
+		return false;
+	}
+
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.size = sizeof(struct xpfo),
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+static inline struct xpfo *lookup_xpfo(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext)) {
+		WARN(1, "xpfo: failed to get page ext");
+		return NULL;
+	}
+
+	return (void *)page_ext + page_xpfo_ops.offset;
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo)
+			continue;
+
+		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+		     "xpfo: unmapped page being allocated\n");
+
+		/* Initialize the map lock and map counter */
+		if (unlikely(!xpfo->inited)) {
+			spin_lock_init(&xpfo->maplock);
+			atomic_set(&xpfo->mapcount, 0);
+			xpfo->inited = true;
+		}
+		WARN(atomic_read(&xpfo->mapcount),
+		     "xpfo: already mapped page being allocated\n");
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Tag the page as a user page and flush the TLB if it
+			 * was previously allocated to the kernel.
+			 */
+			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+		}
+	}
+
+	if (flush_tlb)
+		xpfo_flush_kernel_tlb(page, order);
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+	int i;
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo || unlikely(!xpfo->inited)) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page.
+			 */
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
+			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+			set_kpte(page_address(page + i), page + i,
+				 PAGE_KERNEL);
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	xpfo = lookup_xpfo(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!xpfo || unlikely(!xpfo->inited) ||
+	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+		return;
+
+	spin_lock(&xpfo->maplock);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
+	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+		set_kpte(kaddr, page, PAGE_KERNEL);
+
+	spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	xpfo = lookup_xpfo(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!xpfo || unlikely(!xpfo->inited) ||
+	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+		return;
+
+	spin_lock(&xpfo->maplock);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&xpfo->mapcount) == 0) {
+		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+		     "xpfo: unmapping already unmapped page\n");
+		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+		set_kpte(kaddr, page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, 0);
+	}
+
+	spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..be5145eeed7d 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@ menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on ARCH_SUPPORTS_XPFO
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n