
[v6,03/11] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

Message ID 20170907173609.22696-4-tycho@docker.com (mailing list archive)
State New, archived

Commit Message

Tycho Andersen Sept. 7, 2017, 5:36 p.m. UTC
From: Juerg Haefliger <juerg.haefliger@canonical.com>

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
    pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.
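
For reference, the per-page housekeeping state kept in page_ext looks
roughly like this (sketch only; see the patch body for the exact
definition):

  struct xpfo {
          unsigned long flags;    /* XPFO_PAGE_USER, XPFO_PAGE_UNMAPPED */
          bool inited;            /* maplock/mapcount initialized yet? */
          atomic_t mapcount;      /* balances kmap/kunmap operations */
          spinlock_t maplock;     /* serializes access to the XPFO fields */
  };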

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
      the tlb entry on all CPUs when unmapping it in kunmap
    * handle lookup_page_ext()/lookup_xpfo() returning NULL
    * drop lots of BUG()s in favor of WARN()
    * don't disable irqs in xpfo_kmap/xpfo_kunmap, export
      __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
      pass it

CC: x86@kernel.org
Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@docker.com>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/x86/Kconfig                                |   1 +
 arch/x86/include/asm/pgtable.h                  |  25 +++
 arch/x86/mm/Makefile                            |   1 +
 arch/x86/mm/pageattr.c                          |  22 +--
 arch/x86/mm/xpfo.c                              | 114 ++++++++++++
 include/linux/highmem.h                         |  15 +-
 include/linux/xpfo.h                            |  42 +++++
 mm/Makefile                                     |   1 +
 mm/page_alloc.c                                 |   2 +
 mm/page_ext.c                                   |   4 +
 mm/xpfo.c                                       | 222 ++++++++++++++++++++++++
 security/Kconfig                                |  19 ++
 13 files changed, 449 insertions(+), 21 deletions(-)

Comments

Ralph Campbell Sept. 7, 2017, 6:33 p.m. UTC | #1
> -----Original Message-----
> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On
> Behalf Of Tycho Andersen
> Sent: Thursday, September 7, 2017 10:36 AM
> To: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org; kernel-hardening@lists.openwall.com; Marco Benatto
> <marco.antonio.780@gmail.com>; Juerg Haefliger
> <juerg.haefliger@canonical.com>; x86@kernel.org; Tycho Andersen
> <tycho@docker.com>
> Subject: [PATCH v6 03/11] mm, x86: Add support for eXclusive Page Frame
> Ownership (XPFO)
> 
> From: Juerg Haefliger <juerg.haefliger@canonical.com>
> 
> This patch adds support for XPFO which protects against 'ret2dir' kernel attacks.
> The basic idea is to enforce exclusive ownership of page frames by either the
> kernel or userspace, unless explicitly requested by the kernel. Whenever a page
> destined for userspace is allocated, it is unmapped from physmap (the kernel's
> page table). When such a page is reclaimed from userspace, it is mapped back to
> physmap.
> 
> Additional fields in the page_ext struct are used for XPFO housekeeping,
> specifically:
>   - two flags to distinguish user vs. kernel pages and to tag unmapped
>     pages.
>   - a reference counter to balance kmap/kunmap operations.
>   - a lock to serialize access to the XPFO fields.
> 
> This patch is based on the work of Vasileios P. Kemerlis et al. who published their
> work in this paper:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> 
> v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
>       the tlb entry on all CPUs when unmapping it in kunmap
>     * handle lookup_page_ext()/lookup_xpfo() returning NULL
>     * drop lots of BUG()s in favor of WARN()
>     * don't disable irqs in xpfo_kmap/xpfo_kunmap, export
>       __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
>       pass it
> 
> CC: x86@kernel.org
> Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
> Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
> Signed-off-by: Tycho Andersen <tycho@docker.com>
> Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   2 +
>  arch/x86/Kconfig                                |   1 +
>  arch/x86/include/asm/pgtable.h                  |  25 +++
>  arch/x86/mm/Makefile                            |   1 +
>  arch/x86/mm/pageattr.c                          |  22 +--
>  arch/x86/mm/xpfo.c                              | 114 ++++++++++++
>  include/linux/highmem.h                         |  15 +-
>  include/linux/xpfo.h                            |  42 +++++
>  mm/Makefile                                     |   1 +
>  mm/page_alloc.c                                 |   2 +
>  mm/page_ext.c                                   |   4 +
>  mm/xpfo.c                                       | 222 ++++++++++++++++++++++++
>  security/Kconfig                                |  19 ++
>  13 files changed, 449 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> index d9c171ce4190..444d83183f75 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2736,6 +2736,8 @@
> 
>  	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
> 
> +	noxpfo		[X86-64] Disable XPFO when CONFIG_XPFO is on.
> +
>  	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
>  			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>  			Some features depend on CPU0. Known dependencies
<... snip>

A bit more description for system administrators would be very useful.
Perhaps something like:

noxpfo		[XPFO,X86-64] Disable eXclusive Page Frame Ownership (XPFO)
                             Physical pages mapped into user applications will also be mapped
                             in the kernel's address space as if CONFIG_XPFO was not enabled.

Patch 05 should also update kernel-parameters.txt and add "ARM64" to the config option list for noxpfo.
Tycho Andersen Sept. 7, 2017, 6:50 p.m. UTC | #2
On Thu, Sep 07, 2017 at 06:33:09PM +0000, Ralph Campbell wrote:
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2736,6 +2736,8 @@
> > 
> >  	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
> > 
> > +	noxpfo		[X86-64] Disable XPFO when CONFIG_XPFO is on.
> > +
> >  	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
> >  			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
> >  			Some features depend on CPU0. Known dependencies
> <... snip>
> 
> A bit more description for system administrators would be very useful.
> Perhaps something like:
> 
> noxpfo		[XPFO,X86-64] Disable eXclusive Page Frame Ownership (XPFO)
>                              Physical pages mapped into user applications will also be mapped
>                              in the kernel's address space as if CONFIG_XPFO was not enabled.
> 
> Patch 05 should also update kernel-parameters.txt and add "ARM64" to the config option list for noxpfo.

Nice catch, thanks. I'll fix both.

Cheers,

Tycho
Christoph Hellwig Sept. 8, 2017, 7:51 a.m. UTC | #3
I think this patch needs to be split into the generic mm code, and
the x86 arch code at least.

> +/*
> + * The current flushing context - we pass it instead of 5 arguments:
> + */
> +struct cpa_data {
> +	unsigned long	*vaddr;
> +	pgd_t		*pgd;
> +	pgprot_t	mask_set;
> +	pgprot_t	mask_clr;
> +	unsigned long	numpages;
> +	int		flags;
> +	unsigned long	pfn;
> +	unsigned	force_split : 1;
> +	int		curpage;
> +	struct page	**pages;
> +};

Fitting these 10 variables into 5 arguments would require an awesome
compression scheme anyway :)

> +			if  (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)

Overly long line.

> +#include <linux/xpfo.h>
>  
>  #include <asm/cacheflush.h>
>  
> @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
>  #ifndef ARCH_HAS_KMAP
>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }
>  
>  static inline void kunmap(struct page *page)
>  {
> +	xpfo_kunmap(page_address(page), page);
>  }
>  
>  static inline void *kmap_atomic(struct page *page)
>  {
> +	void *kaddr;
> +
>  	preempt_disable();
>  	pagefault_disable();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }

It seems to me like we should simply direct to pure xpfo
implementations for the !HIGHMEM && XPFO case. - that is
just have the prototypes for kmap, kunmap and co in
linux/highmem.h and implement them in xpfo under those names.

Instead of sprinkling them around.

> +DEFINE_STATIC_KEY_FALSE(xpfo_inited);

s/inited/initialized/g ?

> +	bool "Enable eXclusive Page Frame Ownership (XPFO)"
> +	default n

default n is the default, so you can remove this line.
Tycho Andersen Sept. 8, 2017, 2:58 p.m. UTC | #4
On Fri, Sep 08, 2017 at 12:51:40AM -0700, Christoph Hellwig wrote:
> > +#include <linux/xpfo.h>
> >  
> >  #include <asm/cacheflush.h>
> >  
> > @@ -55,24 +56,34 @@ static inline struct page *kmap_to_page(void *addr)
> >  #ifndef ARCH_HAS_KMAP
> >  static inline void *kmap(struct page *page)
> >  {
> > +	void *kaddr;
> > +
> >  	might_sleep();
> > -	return page_address(page);
> > +	kaddr = page_address(page);
> > +	xpfo_kmap(kaddr, page);
> > +	return kaddr;
> >  }
> >  
> >  static inline void kunmap(struct page *page)
> >  {
> > +	xpfo_kunmap(page_address(page), page);
> >  }
> >  
> >  static inline void *kmap_atomic(struct page *page)
> >  {
> > +	void *kaddr;
> > +
> >  	preempt_disable();
> >  	pagefault_disable();
> > -	return page_address(page);
> > +	kaddr = page_address(page);
> > +	xpfo_kmap(kaddr, page);
> > +	return kaddr;
> >  }
> 
> It seems to me like we should simply direct to pure xpfo
> implementations for the !HIGHMEM && XPFO case. - that is
> just have the prototypes for kmap, kunmap and co in
> linux/highmem.h and implement them in xpfo under those names.
> 
> Instead of sprinkling them around.

Ok, IIUC we'll still need a #ifdef CONFIG_XPFO in this file, but at
least the implementations here won't have a diff. I'll make this
change, and all the others you've suggested.
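
Something like this, roughly (just a sketch of the arrangement you're
describing, not the final code):

  /* include/linux/highmem.h */
  #if defined(CONFIG_XPFO) && !defined(CONFIG_HIGHMEM)
  /* pure XPFO versions; the definitions live in mm/xpfo.c */
  void *kmap(struct page *page);
  void kunmap(struct page *page);
  void *kmap_atomic(struct page *page);
  #elif !defined(ARCH_HAS_KMAP)
  /* the existing !HIGHMEM inlines stay as they are today */
  static inline void *kmap(struct page *page)
  {
          might_sleep();
          return page_address(page);
  }
  /* ... kunmap(), kmap_atomic(), etc. unchanged ... */
  #endif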

Thanks!

Tycho
Laura Abbott Sept. 9, 2017, 3:35 p.m. UTC | #5
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> +static inline struct xpfo *lookup_xpfo(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext)) {
> +		WARN(1, "xpfo: failed to get page ext");
> +		return NULL;
> +	}
> +
> +	return (void *)page_ext + page_xpfo_ops.offset;
> +}
> +

Just drop the WARN. On my arm64 UEFI machine this spews warnings
under most normal operation. This should be normal for some
situations but I haven't had the time to dig into why this
is so pronounced on arm64.

Thanks,
Laura
Xie Yisheng Sept. 11, 2017, 7:24 a.m. UTC | #6
Hi Tycho,

On 2017/9/8 1:36, Tycho Andersen wrote:
> From: Juerg Haefliger <juerg.haefliger@canonical.com>
> 
> This patch adds support for XPFO which protects against 'ret2dir' kernel
> attacks. The basic idea is to enforce exclusive ownership of page frames
> by either the kernel or userspace, unless explicitly requested by the
> kernel. Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.
> 
> Additional fields in the page_ext struct are used for XPFO housekeeping,
> specifically:
>   - two flags to distinguish user vs. kernel pages and to tag unmapped
>     pages.
>   - a reference counter to balance kmap/kunmap operations.
>   - a lock to serialize access to the XPFO fields.
> 
> This patch is based on the work of Vasileios P. Kemerlis et al. who
> published their work in this paper:
>   http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
> 
> [...]
> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> +{
> +	int i, flush_tlb = 0;
> +	struct xpfo *xpfo;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++)  {
> +		xpfo = lookup_xpfo(page + i);
> +		if (!xpfo)
> +			continue;
> +
> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> +		     "xpfo: unmapped page being allocated\n");
> +
> +		/* Initialize the map lock and map counter */
> +		if (unlikely(!xpfo->inited)) {
> +			spin_lock_init(&xpfo->maplock);
> +			atomic_set(&xpfo->mapcount, 0);
> +			xpfo->inited = true;
> +		}
> +		WARN(atomic_read(&xpfo->mapcount),
> +		     "xpfo: already mapped page being allocated\n");
> +
> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> +			/*
> +			 * Tag the page as a user page and flush the TLB if it
> +			 * was previously allocated to the kernel.
> +			 */
> +			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> +				flush_tlb = 1;

I'm not sure whether I am missing anything; however, when the page was previously
allocated to the kernel, should we unmap it from the physmap (the kernel's page table)
here? Since we are allocating the page to userspace now.

Thanks,
Yisheng Xie
Tycho Andersen Sept. 11, 2017, 2:50 p.m. UTC | #7
Hi Yisheng,

On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
> > +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> > +{
> > +	int i, flush_tlb = 0;
> > +	struct xpfo *xpfo;
> > +
> > +	if (!static_branch_unlikely(&xpfo_inited))
> > +		return;
> > +
> > +	for (i = 0; i < (1 << order); i++)  {
> > +		xpfo = lookup_xpfo(page + i);
> > +		if (!xpfo)
> > +			continue;
> > +
> > +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> > +		     "xpfo: unmapped page being allocated\n");
> > +
> > +		/* Initialize the map lock and map counter */
> > +		if (unlikely(!xpfo->inited)) {
> > +			spin_lock_init(&xpfo->maplock);
> > +			atomic_set(&xpfo->mapcount, 0);
> > +			xpfo->inited = true;
> > +		}
> > +		WARN(atomic_read(&xpfo->mapcount),
> > +		     "xpfo: already mapped page being allocated\n");
> > +
> > +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> > +			/*
> > +			 * Tag the page as a user page and flush the TLB if it
> > +			 * was previously allocated to the kernel.
> > +			 */
> > +			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> > +				flush_tlb = 1;
> 
> I'm not sure whether I am miss anything, however, when the page was previously allocated
> to kernel,  should we unmap the physmap (the kernel's page table) here? For we allocate
> the page to user now

Yes, I think you're right. Oddly, the XPFO_READ_USER test works
correctly for me, but I think (?) it should not because of this bug...
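
Presumably the user-page branch above also needs to actually drop the
physmap entry, i.e. something roughly like (untested):

		if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
			flush_tlb = 1;

		/* and actually drop the physmap entry: */
		set_kpte(page_address(page + i), page + i, __pgprot(0));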

Tycho
Tycho Andersen Sept. 11, 2017, 3:03 p.m. UTC | #8
On Sat, Sep 09, 2017 at 08:35:17AM -0700, Laura Abbott wrote:
> On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> > +static inline struct xpfo *lookup_xpfo(struct page *page)
> > +{
> > +	struct page_ext *page_ext = lookup_page_ext(page);
> > +
> > +	if (unlikely(!page_ext)) {
> > +		WARN(1, "xpfo: failed to get page ext");
> > +		return NULL;
> > +	}
> > +
> > +	return (void *)page_ext + page_xpfo_ops.offset;
> > +}
> > +
> 
> Just drop the WARN. On my arm64 UEFI machine this spews warnings
> under most normal operation. This should be normal for some
> situations but I haven't had the time to dig into why this
> is so pronounced on arm64.

Will do, thanks! If you figure out under what conditions it's normal,
I'd be curious :)

Tycho
Juerg Haefliger Sept. 11, 2017, 4:03 p.m. UTC | #9
On 09/11/2017 04:50 PM, Tycho Andersen wrote:
> Hi Yisheng,
> 
> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
>>> +{
>>> +	int i, flush_tlb = 0;
>>> +	struct xpfo *xpfo;
>>> +
>>> +	if (!static_branch_unlikely(&xpfo_inited))
>>> +		return;
>>> +
>>> +	for (i = 0; i < (1 << order); i++)  {
>>> +		xpfo = lookup_xpfo(page + i);
>>> +		if (!xpfo)
>>> +			continue;
>>> +
>>> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
>>> +		     "xpfo: unmapped page being allocated\n");
>>> +
>>> +		/* Initialize the map lock and map counter */
>>> +		if (unlikely(!xpfo->inited)) {
>>> +			spin_lock_init(&xpfo->maplock);
>>> +			atomic_set(&xpfo->mapcount, 0);
>>> +			xpfo->inited = true;
>>> +		}
>>> +		WARN(atomic_read(&xpfo->mapcount),
>>> +		     "xpfo: already mapped page being allocated\n");
>>> +
>>> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
>>> +			/*
>>> +			 * Tag the page as a user page and flush the TLB if it
>>> +			 * was previously allocated to the kernel.
>>> +			 */
>>> +			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
>>> +				flush_tlb = 1;
>>
>> I'm not sure whether I am miss anything, however, when the page was previously allocated
>> to kernel,  should we unmap the physmap (the kernel's page table) here? For we allocate
>> the page to user now
>> 
> Yes, I think you're right. Oddly, the XPFO_READ_USER test works
> correctly for me, but I think (?) should not because of this bug...

IIRC, this is an optimization carried forward from the initial
implementation. The assumption is that the kernel will map the user
buffer so it's not unmapped on allocation but only on the first (and
subsequent) call of kunmap. I.e.:
 - alloc  -> noop
 - kmap   -> noop
 - kunmap -> unmapped from the kernel
 - kmap   -> mapped into the kernel
 - kunmap -> unmapped from the kernel
and so on until:
 - free   -> mapped back into the kernel

I'm not sure if that makes sense though, since it leaves a window.

...Juerg



> Tycho
>
Tycho Andersen Sept. 11, 2017, 4:59 p.m. UTC | #10
On Mon, Sep 11, 2017 at 06:03:55PM +0200, Juerg Haefliger wrote:
> 
> 
> On 09/11/2017 04:50 PM, Tycho Andersen wrote:
> > Hi Yisheng,
> > 
> > On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
> >>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> >>> +{
> >>> +	int i, flush_tlb = 0;
> >>> +	struct xpfo *xpfo;
> >>> +
> >>> +	if (!static_branch_unlikely(&xpfo_inited))
> >>> +		return;
> >>> +
> >>> +	for (i = 0; i < (1 << order); i++)  {
> >>> +		xpfo = lookup_xpfo(page + i);
> >>> +		if (!xpfo)
> >>> +			continue;
> >>> +
> >>> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> >>> +		     "xpfo: unmapped page being allocated\n");
> >>> +
> >>> +		/* Initialize the map lock and map counter */
> >>> +		if (unlikely(!xpfo->inited)) {
> >>> +			spin_lock_init(&xpfo->maplock);
> >>> +			atomic_set(&xpfo->mapcount, 0);
> >>> +			xpfo->inited = true;
> >>> +		}
> >>> +		WARN(atomic_read(&xpfo->mapcount),
> >>> +		     "xpfo: already mapped page being allocated\n");
> >>> +
> >>> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> >>> +			/*
> >>> +			 * Tag the page as a user page and flush the TLB if it
> >>> +			 * was previously allocated to the kernel.
> >>> +			 */
> >>> +			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> >>> +				flush_tlb = 1;
> >>
> >> I'm not sure whether I am miss anything, however, when the page was previously allocated
> >> to kernel,  should we unmap the physmap (the kernel's page table) here? For we allocate
> >> the page to user now
> >> 
> > Yes, I think you're right. Oddly, the XPFO_READ_USER test works
> > correctly for me, but I think (?) should not because of this bug...
> 
> IIRC, this is an optimization carried forward from the initial
> implementation. The assumption is that the kernel will map the user
> buffer so it's not unmapped on allocation but only on the first (and

Does the kernel always map it, though? e.g. in the case of
XPFO_READ_USER, I'm not sure where the kernel would do a kmap() of the
test's user buffer.

Tycho

> subsequent) call of kunmap. I.e.:
>  - alloc  -> noop
>  - kmap   -> noop
>  - kunmap -> unmapped from the kernel
>  - kmap   -> mapped into the kernel
>  - kunmap -> unmapped from the kernel
> and so on until:
>  - free   -> mapped back into the kernel
> 
> I'm not sure if that make sense though since it leaves a window.
> 
> ...Juerg
> 
> 
> 
> > Tycho
> >
Tycho Andersen Sept. 11, 2017, 6:32 p.m. UTC | #11
Hi all,

On Thu, Sep 07, 2017 at 11:36:01AM -0600, Tycho Andersen wrote:
>
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> +	int level;
> +	unsigned long size, kaddr;
> +
> +	kaddr = (unsigned long)page_address(page);
> +
> +	if (unlikely(!lookup_address(kaddr, &level))) {
> +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> +		return;
> +	}
> +
> +	switch (level) {
> +	case PG_LEVEL_4K:
> +		size = PAGE_SIZE;
> +		break;
> +	case PG_LEVEL_2M:
> +		size = PMD_SIZE;
> +		break;
> +	case PG_LEVEL_1G:
> +		size = PUD_SIZE;
> +		break;
> +	default:
> +		WARN(1, "xpfo: unsupported page level %x\n", level);
> +		return;
> +	}
> +
> +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);

Marco was testing and got the stack trace below. The issue is that on x86,
flush_tlb_kernel_range uses on_each_cpu, which causes the WARN() below. Since
this is called from xpfo_kmap/unmap in this interrupt handler, the WARN()
triggers.

I'm not sure what to do about this -- based on the discussion in v6 we need to
flush the TLBs for all CPUs -- but we can't do that with interrupts disabled,
which basically means with this we wouldn't be able to map/unmap pages in
interrupts.

Any thoughts?

Tycho

[    2.712912] ------------[ cut here ]------------
[    2.712922] WARNING: CPU: 0 PID: 0 at kernel/smp.c:414
smp_call_function_many+0x9a/0x270
[    2.712923] Modules linked in: sd_mod ata_generic pata_acpi qxl
drm_kms_helper syscopyarea sysfillrect virtio_console sysimgblt
virtio_blk fb_sys_fops ttm drm 8139too ata_piix libata 8139cp
virtio_pci virtio_ring virtio mii crc32c_intel i2c_core serio_raw
floppy dm_mirror dm_region_hash dm_log dm_mod
[    2.712939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #8
[    2.712940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.9.3-1.fc25 04/01/2014
[    2.712941] task: ffffffff81c10480 task.stack: ffffffff81c00000
[    2.712943] RIP: 0010:smp_call_function_many+0x9a/0x270
[    2.712944] RSP: 0018:ffff88023fc03b38 EFLAGS: 00010046
[    2.712945] RAX: 0000000000000000 RBX: ffffffff81072a50 RCX: 0000000000000001
[    2.712946] RDX: ffff88023fc03ba8 RSI: ffffffff81072a50 RDI: ffffffff81e22320
[    2.712947] RBP: ffff88023fc03b70 R08: 0000000000000970 R09: 0000000000000063
[    2.712948] R10: ffff880000000970 R11: 0000000000000000 R12: ffff88023fc03ba8
[    2.712949] R13: 0000000000000000 R14: ffff8802332b8e18 R15: ffffffff81e22320
[    2.712950] FS:  0000000000000000(0000) GS:ffff88023fc00000(0000)
knlGS:0000000000000000
[    2.712951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.712951] CR2: 00007fde22f6b000 CR3: 000000022727b000 CR4: 00000000003406f0
[    2.712954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.712955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.712955] Call Trace:
[    2.712959]  <IRQ>
[    2.712964]  ? x86_configure_nx+0x50/0x50
[    2.712966]  on_each_cpu+0x2d/0x60
[    2.712967]  flush_tlb_kernel_range+0x79/0x80
[    2.712969]  xpfo_flush_kernel_tlb+0xaa/0xe0
[    2.712975]  xpfo_kunmap+0xa8/0xc0
[    2.712981]  swiotlb_bounce+0xd1/0x1c0
[    2.712982]  swiotlb_tbl_unmap_single+0x10f/0x120
[    2.712984]  unmap_single+0x20/0x30
[    2.712985]  swiotlb_unmap_sg_attrs+0x46/0x70
[    2.712991]  __ata_qc_complete+0xfa/0x150 [libata]
[    2.712994]  ata_qc_complete+0xd2/0x2e0 [libata]
[    2.712998]  ata_hsm_qc_complete+0x6f/0x90 [libata]
[    2.713004]  ata_sff_hsm_move+0xae/0x6b0 [libata]
[    2.713009]  __ata_sff_port_intr+0x8e/0x100 [libata]
[    2.713013]  ata_bmdma_port_intr+0x2f/0xd0 [libata]
[    2.713019]  ata_bmdma_interrupt+0x161/0x1b0 [libata]
[    2.713022]  __handle_irq_event_percpu+0x3c/0x190
[    2.713024]  handle_irq_event_percpu+0x32/0x80
[    2.713026]  handle_irq_event+0x3b/0x60
[    2.713027]  handle_edge_irq+0x8f/0x190
[    2.713029]  handle_irq+0xab/0x120
[    2.713032]  ? _local_bh_enable+0x21/0x30
[    2.713039]  do_IRQ+0x48/0xd0
[    2.713040]  common_interrupt+0x93/0x93
[    2.713042] RIP: 0010:native_safe_halt+0x6/0x10
[    2.713043] RSP: 0018:ffffffff81c03de0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffc1
[    2.713044] RAX: 0000000000000000 RBX: ffffffff81c10480 RCX: 0000000000000000
[    2.713045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    2.713046] RBP: ffffffff81c03de0 R08: 00000000b656100b R09: 0000000000000000
[    2.713047] R10: 0000000000000006 R11: 0000000000000005 R12: 0000000000000000
[    2.713047] R13: ffffffff81c10480 R14: 0000000000000000 R15: 0000000000000000
[    2.713048]  </IRQ>
[    2.713050]  default_idle+0x1e/0x100
[    2.713052]  arch_cpu_idle+0xf/0x20
[    2.713053]  default_idle_call+0x2c/0x40
[    2.713055]  do_idle+0x158/0x1e0
[    2.713056]  cpu_startup_entry+0x73/0x80
[    2.713058]  rest_init+0xb8/0xc0
[    2.713070]  start_kernel+0x4a2/0x4c3
[    2.713072]  ? set_init_arg+0x5a/0x5a
[    2.713074]  ? early_idt_handler_array+0x120/0x120
[    2.713075]  x86_64_start_reservations+0x2a/0x2c
[    2.713077]  x86_64_start_kernel+0x14c/0x16f
[    2.713079]  secondary_startup_64+0x9f/0x9f
[    2.713080] Code: 44 3b 35 1e 6f d0 00 7c 26 48 83 c4 10 5b 41 5c
41 5d 41 5e 41 5f 5d c3 8b 05 63 38 fc 00 85 c0 75 be 80 3d 20 0d d0
00 00 75 b5 <0f> ff eb b1 48 c7 c2 20 23 e2 81 4c 89 fe 44 89 f7 e8 20
b5 62
[    2.713105] ---[ end trace 4d101d4c176c16b0 ]---
Marco Benatto Sept. 11, 2017, 9:54 p.m. UTC | #12
I think the point here is that, since we are running in interrupt context, we
are using k{map,unmap}_atomic instead of k{map,unmap}, but after the
flush_tlb_kernel_range() change we don't have an XPFO atomic version of
either.

My suggestion here is, at least for x86, to split this into xpfo_k{un}map
and xpfo_k{un}map_atomic, where the atomic variants would only flush the
local TLB. This would create a window where the same page would be mapped
both for user and kernel on other CPUs due to stale remote TLB entries.

Is there any other suggestion for this scenario?
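
Roughly, with made-up names and untested (and ignoring the maplock for
brevity), the atomic variant would look something like:

void xpfo_kunmap_atomic(void *kaddr, struct page *page)
{
	struct xpfo *xpfo = lookup_xpfo(page);

	if (!xpfo)
		return;

	if (atomic_dec_return(&xpfo->mapcount) == 0 &&
	    test_bit(XPFO_PAGE_USER, &xpfo->flags)) {
		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
		set_kpte(kaddr, page, __pgprot(0));
		/* local CPU only; remote TLBs may keep a stale entry */
		__flush_tlb_one((unsigned long)kaddr);
	}
}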

Thanks,

- Marco Benatto

On Mon, Sep 11, 2017 at 3:32 PM, Tycho Andersen <tycho@docker.com> wrote:
> Hi all,
>
> On Thu, Sep 07, 2017 at 11:36:01AM -0600, Tycho Andersen wrote:
>>
>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>> +{
>> +     int level;
>> +     unsigned long size, kaddr;
>> +
>> +     kaddr = (unsigned long)page_address(page);
>> +
>> +     if (unlikely(!lookup_address(kaddr, &level))) {
>> +             WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
>> +             return;
>> +     }
>> +
>> +     switch (level) {
>> +     case PG_LEVEL_4K:
>> +             size = PAGE_SIZE;
>> +             break;
>> +     case PG_LEVEL_2M:
>> +             size = PMD_SIZE;
>> +             break;
>> +     case PG_LEVEL_1G:
>> +             size = PUD_SIZE;
>> +             break;
>> +     default:
>> +             WARN(1, "xpfo: unsupported page level %x\n", level);
>> +             return;
>> +     }
>> +
>> +     flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>
> Marco was testing and got the stack trace below. The issue is that on x86,
> flush_tlb_kernel_range uses on_each_cpu, which causes the WARN() below. Since
> this is called from xpfo_kmap/unmap in this interrupt handler, the WARN()
> triggers.
>
> I'm not sure what to do about this -- based on the discussion in v6 we need to
> flush the TLBs for all CPUs -- but we can't do that with interrupts disabled,
> which basically means with this we wouldn't be able to map/unmap pages in
> interrupts.
>
> Any thoughts?
>
> Tycho
>
> [    2.712912] ------------[ cut here ]------------
> [    2.712922] WARNING: CPU: 0 PID: 0 at kernel/smp.c:414
> smp_call_function_many+0x9a/0x270
> [    2.712923] Modules linked in: sd_mod ata_generic pata_acpi qxl
> drm_kms_helper syscopyarea sysfillrect virtio_console sysimgblt
> virtio_blk fb_sys_fops ttm drm 8139too ata_piix libata 8139cp
> virtio_pci virtio_ring virtio mii crc32c_intel i2c_core serio_raw
> floppy dm_mirror dm_region_hash dm_log dm_mod
> [    2.712939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0+ #8
> [    2.712940] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.9.3-1.fc25 04/01/2014
> [    2.712941] task: ffffffff81c10480 task.stack: ffffffff81c00000
> [    2.712943] RIP: 0010:smp_call_function_many+0x9a/0x270
> [    2.712944] RSP: 0018:ffff88023fc03b38 EFLAGS: 00010046
> [    2.712945] RAX: 0000000000000000 RBX: ffffffff81072a50 RCX: 0000000000000001
> [    2.712946] RDX: ffff88023fc03ba8 RSI: ffffffff81072a50 RDI: ffffffff81e22320
> [    2.712947] RBP: ffff88023fc03b70 R08: 0000000000000970 R09: 0000000000000063
> [    2.712948] R10: ffff880000000970 R11: 0000000000000000 R12: ffff88023fc03ba8
> [    2.712949] R13: 0000000000000000 R14: ffff8802332b8e18 R15: ffffffff81e22320
> [    2.712950] FS:  0000000000000000(0000) GS:ffff88023fc00000(0000)
> knlGS:0000000000000000
> [    2.712951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    2.712951] CR2: 00007fde22f6b000 CR3: 000000022727b000 CR4: 00000000003406f0
> [    2.712954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    2.712955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    2.712955] Call Trace:
> [    2.712959]  <IRQ>
> [    2.712964]  ? x86_configure_nx+0x50/0x50
> [    2.712966]  on_each_cpu+0x2d/0x60
> [    2.712967]  flush_tlb_kernel_range+0x79/0x80
> [    2.712969]  xpfo_flush_kernel_tlb+0xaa/0xe0
> [    2.712975]  xpfo_kunmap+0xa8/0xc0
> [    2.712981]  swiotlb_bounce+0xd1/0x1c0
> [    2.712982]  swiotlb_tbl_unmap_single+0x10f/0x120
> [    2.712984]  unmap_single+0x20/0x30
> [    2.712985]  swiotlb_unmap_sg_attrs+0x46/0x70
> [    2.712991]  __ata_qc_complete+0xfa/0x150 [libata]
> [    2.712994]  ata_qc_complete+0xd2/0x2e0 [libata]
> [    2.712998]  ata_hsm_qc_complete+0x6f/0x90 [libata]
> [    2.713004]  ata_sff_hsm_move+0xae/0x6b0 [libata]
> [    2.713009]  __ata_sff_port_intr+0x8e/0x100 [libata]
> [    2.713013]  ata_bmdma_port_intr+0x2f/0xd0 [libata]
> [    2.713019]  ata_bmdma_interrupt+0x161/0x1b0 [libata]
> [    2.713022]  __handle_irq_event_percpu+0x3c/0x190
> [    2.713024]  handle_irq_event_percpu+0x32/0x80
> [    2.713026]  handle_irq_event+0x3b/0x60
> [    2.713027]  handle_edge_irq+0x8f/0x190
> [    2.713029]  handle_irq+0xab/0x120
> [    2.713032]  ? _local_bh_enable+0x21/0x30
> [    2.713039]  do_IRQ+0x48/0xd0
> [    2.713040]  common_interrupt+0x93/0x93
> [    2.713042] RIP: 0010:native_safe_halt+0x6/0x10
> [    2.713043] RSP: 0018:ffffffff81c03de0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffffc1
> [    2.713044] RAX: 0000000000000000 RBX: ffffffff81c10480 RCX: 0000000000000000
> [    2.713045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [    2.713046] RBP: ffffffff81c03de0 R08: 00000000b656100b R09: 0000000000000000
> [    2.713047] R10: 0000000000000006 R11: 0000000000000005 R12: 0000000000000000
> [    2.713047] R13: ffffffff81c10480 R14: 0000000000000000 R15: 0000000000000000
> [    2.713048]  </IRQ>
> [    2.713050]  default_idle+0x1e/0x100
> [    2.713052]  arch_cpu_idle+0xf/0x20
> [    2.713053]  default_idle_call+0x2c/0x40
> [    2.713055]  do_idle+0x158/0x1e0
> [    2.713056]  cpu_startup_entry+0x73/0x80
> [    2.713058]  rest_init+0xb8/0xc0
> [    2.713070]  start_kernel+0x4a2/0x4c3
> [    2.713072]  ? set_init_arg+0x5a/0x5a
> [    2.713074]  ? early_idt_handler_array+0x120/0x120
> [    2.713075]  x86_64_start_reservations+0x2a/0x2c
> [    2.713077]  x86_64_start_kernel+0x14c/0x16f
> [    2.713079]  secondary_startup_64+0x9f/0x9f
> [    2.713080] Code: 44 3b 35 1e 6f d0 00 7c 26 48 83 c4 10 5b 41 5c
> 41 5d 41 5e 41 5f 5d c3 8b 05 63 38 fc 00 85 c0 75 be 80 3d 20 0d d0
> 00 00 75 b5 <0f> ff eb b1 48 c7 c2 20 23 e2 81 4c 89 fe 44 89 f7 e8 20
> b5 62
> [    2.713105] ---[ end trace 4d101d4c176c16b0 ]---
Xie Yisheng Sept. 12, 2017, 8:05 a.m. UTC | #13
On 2017/9/12 0:03, Juerg Haefliger wrote:
> 
> 
> On 09/11/2017 04:50 PM, Tycho Andersen wrote:
>> Hi Yisheng,
>>
>> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
>>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
>>>> +{
>>>> +	int i, flush_tlb = 0;
>>>> +	struct xpfo *xpfo;
>>>> +
>>>> +	if (!static_branch_unlikely(&xpfo_inited))
>>>> +		return;
>>>> +
>>>> +	for (i = 0; i < (1 << order); i++)  {
>>>> +		xpfo = lookup_xpfo(page + i);
>>>> +		if (!xpfo)
>>>> +			continue;
>>>> +
>>>> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
>>>> +		     "xpfo: unmapped page being allocated\n");
>>>> +
>>>> +		/* Initialize the map lock and map counter */
>>>> +		if (unlikely(!xpfo->inited)) {
>>>> +			spin_lock_init(&xpfo->maplock);
>>>> +			atomic_set(&xpfo->mapcount, 0);
>>>> +			xpfo->inited = true;
>>>> +		}
>>>> +		WARN(atomic_read(&xpfo->mapcount),
>>>> +		     "xpfo: already mapped page being allocated\n");
>>>> +
>>>> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
>>>> +			/*
>>>> +			 * Tag the page as a user page and flush the TLB if it
>>>> +			 * was previously allocated to the kernel.
>>>> +			 */
>>>> +			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
>>>> +				flush_tlb = 1;
>>>
>>> I'm not sure whether I am miss anything, however, when the page was previously allocated
>>> to kernel,  should we unmap the physmap (the kernel's page table) here? For we allocate
>>> the page to user now
>>>
>> Yes, I think you're right. Oddly, the XPFO_READ_USER test works

Hi Tycho,
Could you share this test? I'd like to know how it works.

Thanks

>> correctly for me, but I think (?) should not because of this bug...
> 
> IIRC, this is an optimization carried forward from the initial
> implementation. 
Hi Juerg,

Hmm, if the link below is the first version, then it seems this has existed since the first version:
https://patchwork.kernel.org/patch/8437451/

> The assumption is that the kernel will map the user
> buffer so it's not unmapped on allocation but only on the first (and
> subsequent) call of kunmap.

IMO, before a page is allocated, it is in the buddy system, which means it is free
and there is no mapping of the page other than the direct map. Then, if the page is
allocated to userspace, XPFO should unmap it from the direct map; otherwise ret2dir
may work in this window before the page is freed. Or maybe I'm still missing something.

Thanks
Yisheng Xie

>  I.e.:
>  - alloc  -> noop
>  - kmap   -> noop
>  - kunmap -> unmapped from the kernel
>  - kmap   -> mapped into the kernel
>  - kunmap -> unmapped from the kernel
> and so on until:
>  - free   -> mapped back into the kernel
> 
> I'm not sure if that make sense though since it leaves a window.
> 
> ...Juerg
> 
> 
> 
>> Tycho
>>
> 
> .
>
Tycho Andersen Sept. 12, 2017, 2:36 p.m. UTC | #14
On Tue, Sep 12, 2017 at 04:05:22PM +0800, Yisheng Xie wrote:
> 
> 
> On 2017/9/12 0:03, Juerg Haefliger wrote:
> > 
> > 
> > On 09/11/2017 04:50 PM, Tycho Andersen wrote:
> >> Hi Yisheng,
> >>
> >> On Mon, Sep 11, 2017 at 03:24:09PM +0800, Yisheng Xie wrote:
> >>>> +void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
> >>>> +{
> >>>> +	int i, flush_tlb = 0;
> >>>> +	struct xpfo *xpfo;
> >>>> +
> >>>> +	if (!static_branch_unlikely(&xpfo_inited))
> >>>> +		return;
> >>>> +
> >>>> +	for (i = 0; i < (1 << order); i++)  {
> >>>> +		xpfo = lookup_xpfo(page + i);
> >>>> +		if (!xpfo)
> >>>> +			continue;
> >>>> +
> >>>> +		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
> >>>> +		     "xpfo: unmapped page being allocated\n");
> >>>> +
> >>>> +		/* Initialize the map lock and map counter */
> >>>> +		if (unlikely(!xpfo->inited)) {
> >>>> +			spin_lock_init(&xpfo->maplock);
> >>>> +			atomic_set(&xpfo->mapcount, 0);
> >>>> +			xpfo->inited = true;
> >>>> +		}
> >>>> +		WARN(atomic_read(&xpfo->mapcount),
> >>>> +		     "xpfo: already mapped page being allocated\n");
> >>>> +
> >>>> +		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
> >>>> +			/*
> >>>> +			 * Tag the page as a user page and flush the TLB if it
> >>>> +			 * was previously allocated to the kernel.
> >>>> +			 */
> >>>> +			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
> >>>> +				flush_tlb = 1;
> >>>
> >>> I'm not sure whether I am miss anything, however, when the page was previously allocated
> >>> to kernel,  should we unmap the physmap (the kernel's page table) here? For we allocate
> >>> the page to user now
> >>>
> >> Yes, I think you're right. Oddly, the XPFO_READ_USER test works
> 
> Hi Tycho,
> Could you share this test? I'd like to know how it works.

See the last patch in the series.

> >> correctly for me, but I think (?) should not because of this bug...
> > 
> > IIRC, this is an optimization carried forward from the initial
> > implementation. 
> Hi Juerg,
> 
> hmm.. If below is the first version, then it seems this exist from the first version:
> https://patchwork.kernel.org/patch/8437451/
> 
> > The assumption is that the kernel will map the user
> > buffer so it's not unmapped on allocation but only on the first (and
> > subsequent) call of kunmap.
> 
> IMO, before a page is allocated, it is in buddy system, which means it is free
> and no other 'map' on the page except direct map. Then if the page is allocated
> to user, XPFO should unmap the direct map. otherwise the ret2dir may works at
> this window before it is freed. Or maybe I'm still missing anything.

I agree that it seems broken. I'm just not sure why the test doesn't
fail. It's certainly worth understanding.

Tycho
Dave Hansen Sept. 20, 2017, 3:48 p.m. UTC | #15
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
...
> Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.

I'm looking for the code where it's unmapped at allocation to userspace.
 I see TLB flushing and 'struct xpfo' manipulation, but I don't see the
unmapping.  Where is that occurring?

How badly does this hurt performance?  Since we (generally) have
different migrate types for user and kernel allocation, I can imagine
that a given page *generally* doesn't oscillate between user and
kernel-allocated, but I'm curious how it works in practice.  Doesn't the
IPI load from the TLB flushes eat you alive?

It's a bit scary to have such a deep code path under the main allocator.

This all seems insanely expensive.  It will *barely* work on an
allocation-heavy workload on a desktop.  I'm pretty sure the locking
will just fall over entirely on any reasonably-sized server.

I really have to wonder whether there are better ret2dir defenses than
this.  The allocator just seems like the *wrong* place to be doing this
because it's such a hot path.

> +		cpa.vaddr = kaddr;
> +		cpa.pages = &page;
> +		cpa.mask_set = prot;
> +		cpa.mask_clr = msk_clr;
> +		cpa.numpages = 1;
> +		cpa.flags = 0;
> +		cpa.curpage = 0;
> +		cpa.force_split = 0;
> +
> +
> +		do_split = try_preserve_large_page(pte, (unsigned 

Is this safe to do without a TLB flush?  I thought we had plenty of bugs
in CPUs around having multiple entries for the same page in the TLB at
once.  We're *REALLY* careful when we split large pages for THP, and I'm
surprised we don't do the same here.

Why do you even bother keeping large pages around?  Won't the entire
kernel just degrade to using 4k everywhere, eventually?

> +		if (do_split) {
> +			struct page *base;
> +
> +			base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, 

Ugh, GFP_ATOMIC.  That's nasty.  Do you really want this allocation to
fail all the time?  GFP_ATOMIC could really be called
GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :)

You probably want to do what the THP code does here and keep a spare
page around, then allocate it before you take the locks.

> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> +	int level;
> +	unsigned long size, kaddr;
> +
> +	kaddr = (unsigned long)page_address(page);
> +
> +	if (unlikely(!lookup_address(kaddr, &level))) {
> +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> +		return;
> +	}
> +
> +	switch (level) {
> +	case PG_LEVEL_4K:
> +		size = PAGE_SIZE;
> +		break;
> +	case PG_LEVEL_2M:
> +		size = PMD_SIZE;
> +		break;
> +	case PG_LEVEL_1G:
> +		size = PUD_SIZE;
> +		break;
> +	default:
> +		WARN(1, "xpfo: unsupported page level %x\n", level);
> +		return;
> +	}
> +
> +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +}

I'm not sure flush_tlb_kernel_range() is the best primitive to be
calling here.

Let's say you walk the page tables and find level=PG_LEVEL_1G.  You call
flush_tlb_kernel_range(), you will be above
tlb_single_page_flush_ceiling, and you will do a full TLB flush.  But,
with a 1GB page, you could have just used a single INVLPG and skipped
the global flush.

I guess the cost of the IPI is way more than the flush itself, but it's
still a shame to toss the entire TLB when you don't have to.

I also think the TLB flush should be done closer to the page table
manipulation that it is connected to.  It's hard to figure out whether
the flush is the right one otherwise.

Also, the "(1 << order) * size" thing looks goofy to me.  Let's say you
are flushing a order=1 (8k) page and its mapped in a 1GB mapping.  You
flush 2GB.  Is that intentional?

> +
> +void xpfo_free_pages(struct page *page, int order)
> +{
...
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> +			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> +			set_kpte(page_address(page + i), page + i,
> +				 PAGE_KERNEL);
> +		}
> +	}
> +}

This seems like a bad idea, performance-wise.  Kernel and userspace
pages tend to be separated by migrate types.  So, a given physical page
will tend to be used as kernel *or* for userspace.  With this nugget,
every time a userspace page is freed, we will go to the trouble of
making it *back* into a kernel page.  Then, when it is allocated again
(probably as userspace), we will re-make it into a userspace page.  That
seems horribly inefficient.

Also, this weakens the security guarantees.  Let's say you're mounting a
ret2dir attack.  You populate a page with your evil data and you know
the kernel address for the page.  All you have to do is coordinate your
attack with freeing the page.  You can control when it gets freed.  Now,
the xpfo_free_pages() helpfully just mapped your attack code back into
the kernel.

Why not *just* do these moves at allocation time?
Tycho Andersen Sept. 20, 2017, 10:34 p.m. UTC | #16
Hi Dave,

Thanks for taking a look!

On Wed, Sep 20, 2017 at 08:48:36AM -0700, Dave Hansen wrote:
> On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> ...
> > Whenever a page destined for userspace is allocated, it is
> > unmapped from physmap (the kernel's page table). When such a page is
> > reclaimed from userspace, it is mapped back to physmap.
> 
> I'm looking for the code where it's unmapped at allocation to userspace.
>  I see TLB flushing and 'struct xpfo' manipulation, but I don't see the
> unmapping.  Where is that occurring?

This is discussed here: https://lkml.org/lkml/2017/9/11/289 but,
you're right that it's wrong in some cases. I've fixed it up for v7:
https://lkml.org/lkml/2017/9/12/512

> How badly does this hurt performance?  Since we (generally) have
> different migrate types for user and kernel allocation, I can imagine
> that a given page *generally* doesn't oscillate between user and
> kernel-allocated, but I'm curious how it works in practice.  Doesn't the
> IPI load from the TLB flushes eat you alive?
>
> It's a bit scary to have such a deep code path under the main allocator.
> 
> This all seems insanely expensive.  It will *barely* work on an
> allocation-heavy workload on a desktop.  I'm pretty sure the locking
> will just fall over entirely on any reasonably-sized server.

Basically, yes :(. I presented some numbers at LSS, but the gist was a
2.4x slowdown on a 24 core/48 thread Xeon E5-2650, and a 1.4x
slowdown on a 4 core/8 thread E3-1240. The story seems a little bit
better on ARM, but I'm struggling to get it to boot on a box with more
than 4 cores, so I can't draw a better picture yet.

> I really have to wonder whether there are better ret2dir defenses than
> this.  The allocator just seems like the *wrong* place to be doing this
> because it's such a hot path.

This might be crazy, but what if we defer flushing of the kernel
ranges until just before we return to userspace? We'd still manipulate
the prot/xpfo bits for the pages, but then just keep a list of which
ranges need to be flushed, and do the right thing before we return.
This leaves a little window between the actual allocation and the
flush, but userspace would need another thread in its threadgroup to
predict the next allocation, write the bad stuff there, and do the
exploit all in that window.
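
Something like this, very roughly (made-up names, ignoring preemption
and migration for the sake of the sketch):

struct xpfo_deferred {
	unsigned long start;
	unsigned long end;
};
static DEFINE_PER_CPU(struct xpfo_deferred, xpfo_pending);

/* called from xpfo_kmap/kunmap instead of flushing immediately */
static void xpfo_defer_flush(unsigned long start, unsigned long end)
{
	struct xpfo_deferred *p = this_cpu_ptr(&xpfo_pending);

	if (p->start == p->end) {
		/* nothing pending yet */
		p->start = start;
		p->end = end;
	} else {
		/* widen the pending range */
		p->start = min(p->start, start);
		p->end = max(p->end, end);
	}
}

/* hooked into the return-to-userspace path */
void xpfo_flush_pending(void)
{
	struct xpfo_deferred *p = this_cpu_ptr(&xpfo_pending);

	if (p->start != p->end) {
		flush_tlb_kernel_range(p->start, p->end);
		p->start = p->end = 0;
	}
}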

I'm of course open to other suggestions. I'm new :)

> > +		cpa.vaddr = kaddr;
> > +		cpa.pages = &page;
> > +		cpa.mask_set = prot;
> > +		cpa.mask_clr = msk_clr;
> > +		cpa.numpages = 1;
> > +		cpa.flags = 0;
> > +		cpa.curpage = 0;
> > +		cpa.force_split = 0;
> > +
> > +
> > +		do_split = try_preserve_large_page(pte, (unsigned 
> 
> Is this safe to do without a TLB flush?  I thought we had plenty of bugs
> in CPUs around having multiple entries for the same page in the TLB at
> once.  We're *REALLY* careful when we split large pages for THP, and I'm
> surprised we don't do the same here.

It looks like on some code paths we do flush, and some we don't.
Sounds like it's not safe to do without a flush, so I'll see about
adding one.

> Why do you even bother keeping large pages around?  Won't the entire
> kernel just degrade to using 4k everywhere, eventually?

Isn't that true of large pages in general? Is there something about
xpfo that makes this worse? I thought this would only split things if
they had already been split somewhere else, and the protection can't
apply to the whole huge page.

> > +		if (do_split) {
> > +			struct page *base;
> > +
> > +			base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, 
> 
> Ugh, GFP_ATOMIC.  That's nasty.  Do you really want this allocation to
> fail all the time?  GFP_ATOMIC could really be called
> GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :)
> 
> You probably want to do what the THP code does here and keep a spare
> page around, then allocate it before you take the locks.

Sounds like a good idea, thanks.

> > +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> > +{
> > +	int level;
> > +	unsigned long size, kaddr;
> > +
> > +	kaddr = (unsigned long)page_address(page);
> > +
> > +	if (unlikely(!lookup_address(kaddr, &level))) {
> > +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> > +		return;
> > +	}
> > +
> > +	switch (level) {
> > +	case PG_LEVEL_4K:
> > +		size = PAGE_SIZE;
> > +		break;
> > +	case PG_LEVEL_2M:
> > +		size = PMD_SIZE;
> > +		break;
> > +	case PG_LEVEL_1G:
> > +		size = PUD_SIZE;
> > +		break;
> > +	default:
> > +		WARN(1, "xpfo: unsupported page level %x\n", level);
> > +		return;
> > +	}
> > +
> > +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> > +}
> 
> I'm not sure flush_tlb_kernel_range() is the best primitive to be
> calling here.
> 
> Let's say you walk the page tables and find level=PG_LEVEL_1G.  You call
> flush_tlb_kernel_range(), you will be above
> tlb_single_page_flush_ceiling, and you will do a full TLB flush.  But,
> with a 1GB page, you could have just used a single INVLPG and skipped
> the global flush.
> 
> I guess the cost of the IPI is way more than the flush itself, but it's
> still a shame to toss the entire TLB when you don't have to.

Ok, do you think it's worth making a new helper for others to use? Or
should I just keep the logic in this function?

> I also think the TLB flush should be done closer to the page table
> manipulation that it is connected to.  It's hard to figure out whether
> the flush is the right one otherwise.

Yes, sounds good.

> Also, the "(1 << order) * size" thing looks goofy to me.  Let's say you
> are flushing a order=1 (8k) page and its mapped in a 1GB mapping.  You
> flush 2GB.  Is that intentional?

I don't think so; seems like we should be flushing
(1 << order) * PAGE_SIZE instead.
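
i.e. something like:

	flush_tlb_kernel_range(kaddr, kaddr + (1UL << order) * PAGE_SIZE);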

> > +
> > +void xpfo_free_pages(struct page *page, int order)
> > +{
> ...
> > +		/*
> > +		 * Map the page back into the kernel if it was previously
> > +		 * allocated to user space.
> > +		 */
> > +		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> > +			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> > +			set_kpte(page_address(page + i), page + i,
> > +				 PAGE_KERNEL);
> > +		}
> > +	}
> > +}
> 
> This seems like a bad idea, performance-wise.  Kernel and userspace
> pages tend to be separated by migrate types.  So, a given physical page
> will tend to be used as kernel *or* for userspace.  With this nugget,
> every time a userspace page is freed, we will go to the trouble of
> making it *back* into a kernel page.  Then, when it is allocated again
> (probably as userspace), we will re-make it into a userspace page.  That
> seems horribly inefficient.
> 
> Also, this weakens the security guarantees.  Let's say you're mounting a
> ret2dir attack.  You populate a page with your evil data and you know
> the kernel address for the page.  All you have to do is coordinate your
> attack with freeing the page.  You can control when it gets freed.  Now,
> the xpfo_free_pages() helpfully just mapped your attack code back into
> the kernel.
> 
> Why not *just* do these moves at allocation time?

Yes, this is a great point, thanks. I think this can be a no-op, and
with the fixed up v7 logic for alloc pages that I linked to above it
should work out correctly.

Tycho
Dave Hansen Sept. 20, 2017, 11:21 p.m. UTC | #17
On 09/20/2017 03:34 PM, Tycho Andersen wrote:
>> I really have to wonder whether there are better ret2dir defenses than
>> this.  The allocator just seems like the *wrong* place to be doing this
>> because it's such a hot path.
> 
> This might be crazy, but what if we defer flushing of the kernel
> ranges until just before we return to userspace? We'd still manipulate
> the prot/xpfo bits for the pages, but then just keep a list of which
> ranges need to be flushed, and do the right thing before we return.
> This leaves a little window between the actual allocation and the
> flush, but userspace would need another thread in its threadgroup to
> predict the next allocation, write the bad stuff there, and do the
> exploit all in that window.

I think the common case is still that you enter the kernel, allocate a
single page (or very few) and then exit.  So, you don't really reduce
the total number of flushes.

Just think of this in terms of IPIs to do the remote TLB flushes.  A CPU
can do roughly 1 million page faults and allocations a second.  Say you
have a 2-socket x 28-core x 2 hyperthread system = 112 CPU threads.
That's 111M IPI interrupts/second, just for the TLB flushes, *ON* *EACH*
*CPU*.

I think the only thing that will really help here is if you batch the
allocations.  For instance, you could make sure that the per-cpu-pageset
lists always contain either all kernel or all user data.  Then remap the
entire list at once and do a single flush after the entire list is consumed.

>> Why do you even bother keeping large pages around?  Won't the entire
>> kernel just degrade to using 4k everywhere, eventually?
> 
> Isn't that true of large pages in general? Is there something about
> xpfo that makes this worse? I thought this would only split things if
> they had already been split somewhere else, and the protection can't
> apply to the whole huge page.

Even though the kernel gives out 4k pages, it still *maps* them in the
kernel linear direct map with the largest size available.  My 16GB
laptop, for instance, has 3GB of 2MB transparent huge pages, but the
rest is used as 4k pages.  Yet, from /proc/meminfo:

DirectMap4k:      665280 kB
DirectMap2M:    11315200 kB
DirectMap1G:     4194304 kB

Your code pretty much forces 4k pages coming out of the allocator to be
mapped with 4k mappings.
>>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>>> +{
>>> +	int level;
>>> +	unsigned long size, kaddr;
>>> +
>>> +	kaddr = (unsigned long)page_address(page);
>>> +
>>> +	if (unlikely(!lookup_address(kaddr, &level))) {
>>> +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
>>> +		return;
>>> +	}
>>> +
>>> +	switch (level) {
>>> +	case PG_LEVEL_4K:
>>> +		size = PAGE_SIZE;
>>> +		break;
>>> +	case PG_LEVEL_2M:
>>> +		size = PMD_SIZE;
>>> +		break;
>>> +	case PG_LEVEL_1G:
>>> +		size = PUD_SIZE;
>>> +		break;
>>> +	default:
>>> +		WARN(1, "xpfo: unsupported page level %x\n", level);
>>> +		return;
>>> +	}
>>> +
>>> +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>>> +}
>>
>> I'm not sure flush_tlb_kernel_range() is the best primitive to be
>> calling here.
>>
>> Let's say you walk the page tables and find level=PG_LEVEL_1G.  You call
>> flush_tlb_kernel_range(), you will be above
>> tlb_single_page_flush_ceiling, and you will do a full TLB flush.  But,
>> with a 1GB page, you could have just used a single INVLPG and skipped
>> the global flush.
>>
>> I guess the cost of the IPI is way more than the flush itself, but it's
>> still a shame to toss the entire TLB when you don't have to.
> 
> Ok, do you think it's worth making a new helper for others to use? Or
> should I just keep the logic in this function?

I'd just leave it in place.  Most folks already have a PTE when they do
the invalidation, so this is a bit of a weirdo.
Dave Hansen Sept. 21, 2017, 12:03 a.m. UTC | #18
On 09/07/2017 10:36 AM, Tycho Andersen wrote:
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> +			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> +			set_kpte(page_address(page + i), page + i,
> +				 PAGE_KERNEL);
> +		}
> +	}

It might also be a really good idea to clear the page here.  Otherwise,
the page still might have attack code in it and now it is mapped into
the kernel again, ready to be exploited.

Think of it this way: pages either trusted data and are mapped all the
time, or they have potentially bad data and are unmapped mostly.  If we
want to take a bad page and map it always, we have to make sure the
contents are not evil.  0's are not evil.
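
i.e., something roughly like:

		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
			set_kpte(page_address(page + i), page + i,
				 PAGE_KERNEL);
			/* scrub the old userspace contents before the
			 * page goes back to general kernel use */
			clear_highpage(page + i);
		}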

>  static inline void *kmap(struct page *page)
>  {
> +	void *kaddr;
> +
>  	might_sleep();
> -	return page_address(page);
> +	kaddr = page_address(page);
> +	xpfo_kmap(kaddr, page);
> +	return kaddr;
>  }

The window between kmap() and kunmap() is potentially really long.  I
think we, for instance, keep some pages kmap()'d while we
do I/O to them, or wait for I/O elsewhere.

IOW, this will map predictable data at a predictable location and it
will do it for a long time.  While that's better than the current state
(mapped always), it still seems rather risky.

Could you, for instance, turn kmap(page) into vmap(&page, 1, ...)?  That
way, at least the address may be different each time.  Even if an
attacker knows the physical address, they don't know where it will be
mapped.
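
Roughly (hand-waving away kmap_atomic() and the bookkeeping kunmap()
would need in order to find the vmap address again):

static inline void *kmap(struct page *page)
{
	might_sleep();
	/* map at whatever vmalloc address vmap() picks, rather than
	 * the predictable physmap address; kunmap() would vunmap() it */
	return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
}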
Tycho Andersen Sept. 21, 2017, 12:09 a.m. UTC | #19
On Wed, Sep 20, 2017 at 04:21:15PM -0700, Dave Hansen wrote:
> On 09/20/2017 03:34 PM, Tycho Andersen wrote:
> >> I really have to wonder whether there are better ret2dir defenses than
> >> this.  The allocator just seems like the *wrong* place to be doing this
> >> because it's such a hot path.
> > 
> > This might be crazy, but what if we defer flushing of the kernel
> > ranges until just before we return to userspace? We'd still manipulate
> > the prot/xpfo bits for the pages, but then just keep a list of which
> > ranges need to be flushed, and do the right thing before we return.
> > This leaves a little window between the actual allocation and the
> > flush, but userspace would need another thread in its threadgroup to
> > predict the next allocation, write the bad stuff there, and do the
> > exploit all in that window.
> 
> I think the common case is still that you enter the kernel, allocate a
> single page (or very few) and then exit.  So, you don't really reduce
> the total number of flushes.
> 
> Just think of this in terms of IPIs to do the remote TLB flushes.  A CPU
> can do roughly 1 million page faults and allocations a second.  Say you
> have a 2-socket x 28-core x 2 hyperthread system = 112 CPU threads.
> That's 111M IPI interrupts/second, just for the TLB flushes, *ON* *EACH*
> *CPU*.

Since we only need to flush when something switches from a userspace
to a kernel page or back, hopefully it's not this bad, but point
taken.

> I think the only thing that will really help here is if you batch the
> allocations.  For instance, you could make sure that the per-cpu-pageset
> lists always contain either all kernel or all user data.  Then remap the
> entire list at once and do a single flush after the entire list is consumed.

Just so I understand, the idea would be that we only flush when the
type of allocation alternates, so:

kmalloc(..., GFP_KERNEL);
kmalloc(..., GFP_KERNEL);
/* remap+flush here */
kmalloc(..., GFP_HIGHUSER);
/* remap+flush here */
kmalloc(..., GFP_KERNEL);

?

Tycho
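
For concreteness, a hypothetical sketch of the "defer the flush until
we return to userspace" idea quoted above.  Every name here
(xpfo_deferred_flush, xpfo_pending, xpfo_defer_flush,
xpfo_flush_pending) is invented for illustration, and the sketch
ignores preemption and the fact that a single pending range per CPU is
too coarse for real use.

struct xpfo_deferred_flush {
	unsigned long start;
	unsigned long end;
};

static DEFINE_PER_CPU(struct xpfo_deferred_flush, xpfo_pending);

/* Record a kernel range whose mapping changed; flush it later. */
static void xpfo_defer_flush(unsigned long start, unsigned long end)
{
	struct xpfo_deferred_flush *p = this_cpu_ptr(&xpfo_pending);

	p->start = p->start ? min(p->start, start) : start;
	p->end = max(p->end, end);
}

/* Called on the exit-to-user path: one flush for the whole batch. */
static void xpfo_flush_pending(void)
{
	struct xpfo_deferred_flush *p = this_cpu_ptr(&xpfo_pending);

	if (p->end) {
		flush_tlb_kernel_range(p->start, p->end);
		p->start = p->end = 0;
	}
}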
Dave Hansen Sept. 21, 2017, 12:27 a.m. UTC | #20
On 09/20/2017 05:09 PM, Tycho Andersen wrote:
>> I think the only thing that will really help here is if you batch the
>> allocations.  For instance, you could make sure that the per-cpu-pageset
>> lists always contain either all kernel or all user data.  Then remap the
>> entire list at once and do a single flush after the entire list is consumed.
> Just so I understand, the idea would be that we only flush when the
> type of allocation alternates, so:
> 
> kmalloc(..., GFP_KERNEL);
> kmalloc(..., GFP_KERNEL);
> /* remap+flush here */
> kmalloc(..., GFP_HIGHUSER);
> /* remap+flush here */
> kmalloc(..., GFP_KERNEL);

Not really.  We keep a free list per migrate type, and a per_cpu_pages
(pcp) list per migratetype:

> struct per_cpu_pages {
>         int count;              /* number of pages in the list */
>         int high;               /* high watermark, emptying needed */
>         int batch;              /* chunk size for buddy add/remove */
> 
>         /* Lists of pages, one per migrate type stored on the pcp-lists */
>         struct list_head lists[MIGRATE_PCPTYPES];
> };

The migratetype is derived from the GFP flags in
gfpflags_to_migratetype().  In general, GFP_HIGHUSER and GFP_KERNEL come
from different migratetypes, so they come from different free lists.

In your case above, the GFP_HIGHUSER allocations come through the
MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the
MIGRATE_UNMOVABLE one.  Since we add a bunch of pages to those lists at
once, you could do all the mapping/unmapping/flushing on a bunch of
pages at once.

Or, you could hook your code into the places where the migratetype of
memory is changed (set_pageblock_migratetype(), plus where we fall
back).  Those changes are much more rare than page allocation.
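
To make the batching concrete, a sketch of a per-pcp-list remap pass.
Assumptions: pages on a pcp free list are linked through page->lru, and
flushing one range that spans non-contiguous pages is acceptable
(flush_tlb_kernel_range() degrades to a full flush above the ceiling
anyway).  xpfo_remap_batch() is a made-up name, not part of the patch.

/*
 * Remap or unmap a whole pcp batch in one pass and do a single TLB
 * flush at the end, instead of one flush per page.
 */
static void xpfo_remap_batch(struct list_head *pcp_list, bool to_user)
{
	struct page *page;
	unsigned long kaddr, start = ULONG_MAX, end = 0;

	list_for_each_entry(page, pcp_list, lru) {
		kaddr = (unsigned long)page_address(page);
		set_kpte((void *)kaddr, page,
			 to_user ? __pgprot(0) : PAGE_KERNEL);
		start = min(start, kaddr);
		end = max(end, kaddr + PAGE_SIZE);
	}

	if (end)
		flush_tlb_kernel_range(start, end);
}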
Dave Hansen Sept. 21, 2017, 12:28 a.m. UTC | #21
At a high level, does this approach keep an attacker from being able to
determine the address of data in the linear map, or does it keep them
from being able to *exploit* it?  Can you have a ret2dir attack if the
attacker doesn't know the address, for instance?
Tycho Andersen Sept. 21, 2017, 1:04 a.m. UTC | #22
On Wed, Sep 20, 2017 at 05:28:11PM -0700, Dave Hansen wrote:
> At a high level, does this approach keep an attacker from being able to
> determine the address of data in the linear map, or does it keep them
> from being able to *exploit* it?

It keeps them from exploiting it, by faulting when a physmap alias is
used.

> Can you have a ret2dir attack if the attacker doesn't know the
> address, for instance?

Yes, through a technique similar to heap spraying. The original paper
has a study of this; section 5.2 outlines the attack and section 7.2
describes their success rate:

http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

Tycho
Tycho Andersen Sept. 21, 2017, 1:37 a.m. UTC | #23
On Wed, Sep 20, 2017 at 05:27:02PM -0700, Dave Hansen wrote:
> On 09/20/2017 05:09 PM, Tycho Andersen wrote:
> >> I think the only thing that will really help here is if you batch the
> >> allocations.  For instance, you could make sure that the per-cpu-pageset
> >> lists always contain either all kernel or all user data.  Then remap the
> >> entire list at once and do a single flush after the entire list is consumed.
> > Just so I understand, the idea would be that we only flush when the
> > type of allocation alternates, so:
> > 
> > kmalloc(..., GFP_KERNEL);
> > kmalloc(..., GFP_KERNEL);
> > /* remap+flush here */
> > kmalloc(..., GFP_HIGHUSER);
> > /* remap+flush here */
> > kmalloc(..., GFP_KERNEL);
> 
> Not really.  We keep a free list per migrate type, and a per_cpu_pages
> (pcp) list per migratetype:
> 
> > struct per_cpu_pages {
> >         int count;              /* number of pages in the list */
> >         int high;               /* high watermark, emptying needed */
> >         int batch;              /* chunk size for buddy add/remove */
> > 
> >         /* Lists of pages, one per migrate type stored on the pcp-lists */
> >         struct list_head lists[MIGRATE_PCPTYPES];
> > };
> 
> The migratetype is derived from the GFP flags in
> gfpflags_to_migratetype().  In general, GFP_HIGHUSER and GFP_KERNEL come
> from different migratetypes, so they come from different free lists.
> 
> In your case above, the GFP_HIGHUSER allocations come through the
> MIGRATE_MOVABLE pcp list while the GFP_KERNEL ones come from the
> MIGRATE_UNMOVABLE one.  Since we add a bunch of pages to those lists at
> once, you could do all the mapping/unmapping/flushing on a bunch of
> pages at once.
> 
> Or, you could hook your code into the places where the migratetype of
> memory is changed (set_pageblock_migratetype(), plus where we fall
> back).  Those changes are much more rare than page allocation.

I see, thanks for all this discussion. It has been very helpful!

Tycho

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d9c171ce4190..444d83183f75 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2736,6 +2736,8 @@ 
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	noxpfo		[X86-64] Disable XPFO when CONFIG_XPFO is on.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 323cb065be5e..d78a0d538900 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -185,6 +185,7 @@  config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_SUPPORTS_XPFO		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 77037b6f1caa..c2eb40f7a74b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1238,6 +1238,31 @@  static inline bool pud_access_permitted(pud_t pud, bool write)
 	return __pte_access_permitted(pud_val(pud), write);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+	unsigned long	*vaddr;
+	pgd_t		*pgd;
+	pgprot_t	mask_set;
+	pgprot_t	mask_clr;
+	unsigned long	numpages;
+	int		flags;
+	unsigned long	pfn;
+	unsigned	force_split : 1;
+	int		curpage;
+	struct page	**pages;
+};
+
+
+int
+try_preserve_large_page(pte_t *kpte, unsigned long address,
+			struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+		   struct page *base);
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 0fbdcb64f9f8..89ba6d25fb51 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -39,3 +39,4 @@  obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
 
+obj-$(CONFIG_XPFO)		+= xpfo.o
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 757b0bcdf712..f25d07191e60 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -27,28 +27,12 @@ 
 #include <asm/set_memory.h>
 
 /*
- * The current flushing context - we pass it instead of 5 arguments:
- */
-struct cpa_data {
-	unsigned long	*vaddr;
-	pgd_t		*pgd;
-	pgprot_t	mask_set;
-	pgprot_t	mask_clr;
-	unsigned long	numpages;
-	int		flags;
-	unsigned long	pfn;
-	unsigned	force_split : 1;
-	int		curpage;
-	struct page	**pages;
-};
-
-/*
  * Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
  * using cpa_lock. So that we don't allow any other cpu, with stale large tlb
  * entries change the page attribute in parallel to some other cpu
  * splitting a large page entry along with changing the attribute.
  */
-static DEFINE_SPINLOCK(cpa_lock);
+DEFINE_SPINLOCK(cpa_lock);
 
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
@@ -512,7 +496,7 @@  static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 #endif
 }
 
-static int
+int
 try_preserve_large_page(pte_t *kpte, unsigned long address,
 			struct cpa_data *cpa)
 {
@@ -648,7 +632,7 @@  try_preserve_large_page(pte_t *kpte, unsigned long address,
 	return do_split;
 }
 
-static int
+int
 __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 		   struct page *base)
 {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
new file mode 100644
index 000000000000..6794d6724ab5
--- /dev/null
+++ b/arch/x86/mm/xpfo.c
@@ -0,0 +1,114 @@ 
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+
+#include <asm/tlbflush.h>
+
+extern spinlock_t cpa_lock;
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+	unsigned int level;
+	pgprot_t msk_clr;
+	pte_t *pte = lookup_address((unsigned long)kaddr, &level);
+
+	if (unlikely(!pte)) {
+		WARN(1, "xpfo: invalid address %p\n", kaddr);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		set_pte_atomic(pte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
+		break;
+	case PG_LEVEL_2M:
+	case PG_LEVEL_1G: {
+		struct cpa_data cpa = { };
+		int do_split;
+
+		if (level == PG_LEVEL_2M)
+			msk_clr = pmd_pgprot(*(pmd_t*)pte);
+		else
+			msk_clr = pud_pgprot(*(pud_t*)pte);
+
+		cpa.vaddr = kaddr;
+		cpa.pages = &page;
+		cpa.mask_set = prot;
+		cpa.mask_clr = msk_clr;
+		cpa.numpages = 1;
+		cpa.flags = 0;
+		cpa.curpage = 0;
+		cpa.force_split = 0;
+
+
+		do_split = try_preserve_large_page(pte, (unsigned long)kaddr,
+						   &cpa);
+		if (do_split) {
+			struct page *base;
+
+			base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, 0);
+			if (!base) {
+				WARN(1, "xpfo: failed to split large page\n");
+				break;
+			}
+
+			if (!debug_pagealloc_enabled())
+				spin_lock(&cpa_lock);
+			if  (__split_large_page(&cpa, pte, (unsigned long)kaddr, base) < 0)
+				WARN(1, "xpfo: failed to split large page\n");
+			if (!debug_pagealloc_enabled())
+				spin_unlock(&cpa_lock);
+		}
+
+		break;
+	}
+	case PG_LEVEL_512G:
+		/* fallthrough, splitting infrastructure doesn't
+		 * support 512G pages. */
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+	}
+
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+	int level;
+	unsigned long size, kaddr;
+
+	kaddr = (unsigned long)page_address(page);
+
+	if (unlikely(!lookup_address(kaddr, &level))) {
+		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
+		return;
+	}
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		size = PAGE_SIZE;
+		break;
+	case PG_LEVEL_2M:
+		size = PMD_SIZE;
+		break;
+	case PG_LEVEL_1G:
+		size = PUD_SIZE;
+		break;
+	default:
+		WARN(1, "xpfo: unsupported page level %x\n", level);
+		return;
+	}
+
+	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index bb3f3297062a..7a17c166532f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -7,6 +7,7 @@ 
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -55,24 +56,34 @@  static inline struct page *kmap_to_page(void *addr)
 #ifndef ARCH_HAS_KMAP
 static inline void *kmap(struct page *page)
 {
+	void *kaddr;
+
 	might_sleep();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 
 static inline void kunmap(struct page *page)
 {
+	xpfo_kunmap(page_address(page), page);
 }
 
 static inline void *kmap_atomic(struct page *page)
 {
+	void *kaddr;
+
 	preempt_disable();
 	pagefault_disable();
-	return page_address(page);
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
 }
 #define kmap_atomic_prot(page, prot)	kmap_atomic(page)
 
 static inline void __kunmap_atomic(void *addr)
 {
+	xpfo_kunmap(addr, virt_to_page(addr));
 	pagefault_enable();
 	preempt_enable();
 }
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..442c58ee930e
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,42 @@ 
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#ifdef CONFIG_XPFO
+
+extern struct page_ext_operations page_xpfo_ops;
+
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
+				    enum dma_data_direction dir);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_kmap(void *kaddr, struct page *page);
+void xpfo_kunmap(void *kaddr, struct page *page);
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#endif /* _LINUX_XPFO_H */
diff --git a/mm/Makefile b/mm/Makefile
index 411bd24d4a7c..0be67cac8f6c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@  obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1423da8dd16f..09fdf1bad21f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1059,6 +1059,7 @@  static __always_inline bool free_pages_prepare(struct page *page,
 	kernel_poison_pages(page, 1 << order, 0);
 	kernel_map_pages(page, 1 << order, 0);
 	kasan_free_pages(page, order);
+	xpfo_free_pages(page, order);
 
 	return true;
 }
@@ -1758,6 +1759,7 @@  inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_pages(page, order, gfp_flags);
 	set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044b09a..4899df1f5d66 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@ 
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
 #include <linux/page_idle.h>
+#include <linux/xpfo.h>
 
 /*
  * struct page extension
@@ -65,6 +66,9 @@  static struct page_ext_operations *page_ext_ops[] = {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
+#ifdef CONFIG_XPFO
+	&page_xpfo_ops,
+#endif
 };
 
 static unsigned long total_usage;
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..bff24afcaa2e
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,222 @@ 
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_ext.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+/* XPFO page state flags */
+enum xpfo_flags {
+	XPFO_PAGE_USER,		/* Page is allocated to user-space */
+	XPFO_PAGE_UNMAPPED,	/* Page is unmapped from the linear map */
+};
+
+/* Per-page XPFO house-keeping data */
+struct xpfo {
+	unsigned long flags;	/* Page state */
+	bool inited;		/* Map counter and lock initialized */
+	atomic_t mapcount;	/* Counter for balancing map/unmap requests */
+	spinlock_t maplock;	/* Lock to serialize map/unmap requests */
+};
+
+DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+
+static bool xpfo_disabled __initdata;
+
+static int __init noxpfo_param(char *str)
+{
+	xpfo_disabled = true;
+
+	return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+static bool __init need_xpfo(void)
+{
+	if (xpfo_disabled) {
+		printk(KERN_INFO "XPFO disabled\n");
+		return false;
+	}
+
+	return true;
+}
+
+static void init_xpfo(void)
+{
+	printk(KERN_INFO "XPFO enabled\n");
+	static_branch_enable(&xpfo_inited);
+}
+
+struct page_ext_operations page_xpfo_ops = {
+	.size = sizeof(struct xpfo),
+	.need = need_xpfo,
+	.init = init_xpfo,
+};
+
+static inline struct xpfo *lookup_xpfo(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext)) {
+		WARN(1, "xpfo: failed to get page ext");
+		return NULL;
+	}
+
+	return (void *)page_ext + page_xpfo_ops.offset;
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
+{
+	int i, flush_tlb = 0;
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo)
+			continue;
+
+		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+		     "xpfo: unmapped page being allocated\n");
+
+		/* Initialize the map lock and map counter */
+		if (unlikely(!xpfo->inited)) {
+			spin_lock_init(&xpfo->maplock);
+			atomic_set(&xpfo->mapcount, 0);
+			xpfo->inited = true;
+		}
+		WARN(atomic_read(&xpfo->mapcount),
+		     "xpfo: already mapped page being allocated\n");
+
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Tag the page as a user page and flush the TLB if it
+			 * was previously allocated to the kernel.
+			 */
+			if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+				flush_tlb = 1;
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			clear_bit(XPFO_PAGE_USER, &xpfo->flags);
+		}
+	}
+
+	if (flush_tlb)
+		xpfo_flush_kernel_tlb(page, order);
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+	int i;
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		xpfo = lookup_xpfo(page + i);
+		if (!xpfo || unlikely(!xpfo->inited)) {
+			/*
+			 * The page was allocated before page_ext was
+			 * initialized, so it is a kernel page.
+			 */
+			continue;
+		}
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
+			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+			set_kpte(page_address(page + i), page + i,
+				 PAGE_KERNEL);
+		}
+	}
+}
+
+void xpfo_kmap(void *kaddr, struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	xpfo = lookup_xpfo(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!xpfo || unlikely(!xpfo->inited) ||
+	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+		return;
+
+	spin_lock(&xpfo->maplock);
+
+	/*
+	 * The page was previously allocated to user space, so map it back
+	 * into the kernel. No TLB flush required.
+	 */
+	if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
+	    test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
+		set_kpte(kaddr, page, PAGE_KERNEL);
+
+	spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kmap);
+
+void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	struct xpfo *xpfo;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	xpfo = lookup_xpfo(page);
+
+	/*
+	 * The page was allocated before page_ext was initialized (which means
+	 * it's a kernel page) or it's allocated to the kernel, so nothing to
+	 * do.
+	 */
+	if (!xpfo || unlikely(!xpfo->inited) ||
+	    !test_bit(XPFO_PAGE_USER, &xpfo->flags))
+		return;
+
+	spin_lock(&xpfo->maplock);
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from the
+	 * kernel, flush the TLB and tag it as a user page.
+	 */
+	if (atomic_dec_return(&xpfo->mapcount) == 0) {
+		WARN(test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags),
+		     "xpfo: unmapping already unmapped page\n");
+		set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+		set_kpte(kaddr, page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, 0);
+	}
+
+	spin_unlock(&xpfo->maplock);
+}
+EXPORT_SYMBOL(xpfo_kunmap);
diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..be5145eeed7d 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,25 @@  menu "Security options"
 
 source security/keys/Kconfig
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	default n
+	depends on ARCH_SUPPORTS_XPFO
+	select PAGE_EXTENSION
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n