
[RFC,v9,03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

Message ID: f1ac3700970365fb979533294774af0b0dd84b3b.1554248002.git.khalid.aziz@oracle.com (mailing list archive)
State: New, archived
Series: Add support for eXclusive Page Frame Ownership

Commit Message

Khalid Aziz April 3, 2019, 5:34 p.m. UTC
From: Juerg Haefliger <juerg.haefliger@canonical.com>

This patch adds basic support infrastructure for XPFO which protects
against 'ret2dir' kernel attacks. The basic idea is to enforce exclusive
ownership of page frames by either the kernel or userspace, unless
explicitly requested by the kernel. Whenever a page destined for
userspace is allocated, it is unmapped from physmap (the kernel's page
table). When such a page is reclaimed from userspace, it is mapped back
to physmap. Individual architectures can enable full XPFO support using
this infrastructure by supplying architecture specific pieces.

Additional fields in the page struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
    pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

CC: x86@kernel.org
Suggested-by: Vasileios P. Kemerlis <vpk@cs.columbia.edu>
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Marco Benatto <marco.antonio.780@gmail.com>
[jsteckli@amazon.de: encode all XPFO info in struct page]
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 v9: * Do not use page extensions. Encode all xpfo information in struct
      page (Julian Stecklina).
    * Split architecture specific code into its own separate patch
    * Move kmap*() to include/linux/xpfo.h for cleaner code as suggested
      for an earlier version of this patch
    * Use irq versions of spin_lock to address possible deadlock around
      xpfo_lock caused by interrupts.
    * Incorporated various feedback provided on v6 patch way back.

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
      the tlb entry on all CPUs when unmapping it in kunmap
    * handle lookup_page_ext()/lookup_xpfo() returning NULL
    * drop lots of BUG()s in favor of WARN()
    * don't disable irqs in xpfo_kmap/xpfo_kunmap, export
      __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
      pass it

 .../admin-guide/kernel-parameters.txt         |   6 +
 include/linux/highmem.h                       |  31 +---
 include/linux/mm_types.h                      |   8 +
 include/linux/page-flags.h                    |  23 ++-
 include/linux/xpfo.h                          | 147 ++++++++++++++++++
 include/trace/events/mmflags.h                |  10 +-
 mm/Makefile                                   |   1 +
 mm/compaction.c                               |   2 +-
 mm/internal.h                                 |   2 +-
 mm/page_alloc.c                               |  10 +-
 mm/page_isolation.c                           |   2 +-
 mm/xpfo.c                                     | 106 +++++++++++++
 security/Kconfig                              |  27 ++++
 13 files changed, 337 insertions(+), 38 deletions(-)
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c
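
As a rough illustration of the allocation-time behaviour described in the
commit message (pages headed to userspace leave the physmap when allocated
and return to it when freed), a minimal sketch might look like the
following. set_kpte() and the PageXpfoUser/PageXpfoUnmapped flags are taken
from the patch; the hook names, the GFP_HIGHUSER test and the loop
structure are assumptions made for the sketch, not the posted code.

	/* Sketch only: mark user-destined pages and drop their physmap entry. */
	static void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
	{
		int i;

		/* Only pages that can end up in userspace are of interest. */
		if ((gfp & GFP_HIGHUSER) != GFP_HIGHUSER)
			return;

		for (i = 0; i < (1 << order); i++) {
			SetPageXpfoUser(page + i);
			SetPageXpfoUnmapped(page + i);
			/* Remove the direct-map (physmap) translation. */
			set_kpte(page_address(page + i), page + i, __pgprot(0));
		}
	}

	/* Sketch only: restore the physmap entry once the page returns to the kernel. */
	static void xpfo_free_pages(struct page *page, int order)
	{
		int i;

		for (i = 0; i < (1 << order); i++) {
			if (TestClearPageXpfoUser(page + i)) {
				ClearPageXpfoUnmapped(page + i);
				set_kpte(page_address(page + i), page + i,
					 PAGE_KERNEL);
			}
		}
	}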

Comments

Peter Zijlstra April 4, 2019, 7:21 a.m. UTC | #1
On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2c471a2c43fa..d17d33f36a01 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -204,6 +204,14 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +
> +#ifdef CONFIG_XPFO
> +	/* Counts the number of times this page has been kmapped. */
> +	atomic_t xpfo_mapcount;
> +
> +	/* Serialize kmap/kunmap of this page */
> +	spinlock_t xpfo_lock;

NAK, see ALLOC_SPLIT_PTLOCKS

spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
sure you want spinlock_t and not raw_spinlock_t ? For
CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.

> +#endif

Growing the page-frame by 8 bytes (in the good case) is really sad,
that's a _lot_ of memory.

>  } _struct_page_alignment;
Peter Zijlstra April 4, 2019, 7:43 a.m. UTC | #2
You must be so glad I no longer use kmap_atomic from NMI context :-)

On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> +static inline void xpfo_kmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	if (!PageXpfoUser(page))
> +		return;
> +
> +	/*
> +	 * The page was previously allocated to user space, so
> +	 * map it back into the kernel if needed. No TLB flush required.
> +	 */
> +	spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> +	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
> +		TestClearPageXpfoUnmapped(page))
> +		set_kpte(kaddr, page, PAGE_KERNEL);
> +
> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);

That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
2 on likely the same cacheline. And mostly all pointless.

This patch makes xpfo_mapcount an atomic, but then all modifications are
under the spinlock, what gives?

Anyway, a possibly saner sequence might be:

	if (atomic_inc_not_zero(&page->xpfo_mapcount))
		return;

	spin_lock_irqsave(&page->xpfo_lock, flag);
	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
	    TestClearPageXpfoUnmapped(page))
		set_kpte(kaddr, page, PAGE_KERNEL);
	spin_unlock_irqrestore(&page->xpfo_lock, flags);

> +}
> +
> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> +	unsigned long flags;
> +
> +	if (!static_branch_unlikely(&xpfo_inited))
> +		return;
> +
> +	if (!PageXpfoUser(page))
> +		return;
> +
> +	/*
> +	 * The page is to be allocated back to user space, so unmap it from
> +	 * the kernel, flush the TLB and tag it as a user page.
> +	 */
> +	spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> +	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
> +#ifdef CONFIG_XPFO_DEBUG
> +		WARN_ON(PageXpfoUnmapped(page));
> +#endif
> +		SetPageXpfoUnmapped(page);
> +		set_kpte(kaddr, page, __pgprot(0));
> +		xpfo_flush_kernel_tlb(page, 0);

You didn't speak about the TLB invalidation anywhere. But basically this
is one that x86 cannot do.

> +	}
> +
> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);

Idem:

	if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
		return;

	....


> +}

Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
is identical to !atomic_read(&page->xpfo_mapcount).
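
Putting the two suggested fast paths together with the slow paths from the
patch, the shape Peter is describing would be roughly the sketch below. All
names come from the patch above; the combination itself is illustrative and
does not claim to settle the locking and TLB questions raised in this
thread.

	static inline void xpfo_kmap(void *kaddr, struct page *page)
	{
		unsigned long flags;

		if (!static_branch_unlikely(&xpfo_inited) || !PageXpfoUser(page))
			return;

		/* Fast path: already mapped, just take another reference. */
		if (atomic_inc_not_zero(&page->xpfo_mapcount))
			return;

		spin_lock_irqsave(&page->xpfo_lock, flags);
		if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
		    TestClearPageXpfoUnmapped(page))
			set_kpte(kaddr, page, PAGE_KERNEL);
		spin_unlock_irqrestore(&page->xpfo_lock, flags);
	}

	static inline void xpfo_kunmap(void *kaddr, struct page *page)
	{
		unsigned long flags;

		if (!static_branch_unlikely(&xpfo_inited) || !PageXpfoUser(page))
			return;

		/* Fast path: not the last mapping, the kernel alias stays. */
		if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
			return;

		spin_lock_irqsave(&page->xpfo_lock, flags);
		if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
			SetPageXpfoUnmapped(page);
			set_kpte(kaddr, page, __pgprot(0));
			xpfo_flush_kernel_tlb(page, 0);
		}
		spin_unlock_irqrestore(&page->xpfo_lock, flags);
	}

The common case (a page that is already mapped) then costs a single atomic
operation instead of a lock round trip plus two atomics.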
Peter Zijlstra April 4, 2019, 9:25 a.m. UTC | #3
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2c471a2c43fa..d17d33f36a01 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -204,6 +204,14 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +
> > +#ifdef CONFIG_XPFO
> > +	/* Counts the number of times this page has been kmapped. */
> > +	atomic_t xpfo_mapcount;
> > +
> > +	/* Serialize kmap/kunmap of this page */
> > +	spinlock_t xpfo_lock;
> 
> NAK, see ALLOC_SPLIT_PTLOCKS
> 
> spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
> sure you want spinlock_t and not raw_spinlock_t ? For
> CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.
> 
> > +#endif
> 
> Growing the page-frame by 8 bytes (in the good case) is really sad,
> that's a _lot_ of memory.

So if you use the original kmap_atomic/kmap code from i386 and create
an alias per user you can do away with all that.

Now, that leaves you with the fixmap kmap_atomic code, which I also
hate, but it gets rid of a lot of the ugly you introduce in these here
patches.

As to the fixmap kmap_atomic; so fundamentally the PTEs are only used on
a single CPU and therefore CPU local TLB invalidation _should_ suffice.

However, speculation...

Another CPU can speculatively hit upon a fixmap entry for another CPU
and populate its own TLB entry. Then the TLB invalidate is
insufficient, it leaves a stale entry in a remote TLB.

If the remote CPU then re-uses that fixmap slot to alias another page,
we have two CPUs with different translations for the same VA, a
condition that AMD CPUs dislike enough to machine check on (IIRC).

Actually hitting that is incredibly difficult (we have to have
speculation, fixmap reuse and not get a full TLB invalidate in between),
but, afaict, not impossible.

Your monstrosity from the last patch avoids this particular issue by not
aliasing in this manner, but it comes at the cost of this page-frame
bloat. Also, I'm still not sure there's not other problems with it.

Bah..
Tycho Andersen April 4, 2019, 2:48 p.m. UTC | #4
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2c471a2c43fa..d17d33f36a01 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -204,6 +204,14 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +
> > +#ifdef CONFIG_XPFO
> > +	/* Counts the number of times this page has been kmapped. */
> > +	atomic_t xpfo_mapcount;
> > +
> > +	/* Serialize kmap/kunmap of this page */
> > +	spinlock_t xpfo_lock;
> 
> NAK, see ALLOC_SPLIT_PTLOCKS
> 
> spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
> sure you want spinlock_t and not raw_spinlock_t ? For
> CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.
> 
> > +#endif
> 
> Growing the page-frame by 8 bytes (in the good case) is really sad,
> that's a _lot_ of memory.

Originally we had this in page_ext, it's not really clear to me why we
moved it out.

Julien?

Tycho
Khalid Aziz April 4, 2019, 3:15 p.m. UTC | #5
On 4/4/19 1:43 AM, Peter Zijlstra wrote:
> 
> You must be so glad I no longer use kmap_atomic from NMI context :-)
> 
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> +static inline void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return;
>> +
>> +	if (!PageXpfoUser(page))
>> +		return;
>> +
>> +	/*
>> +	 * The page was previously allocated to user space, so
>> +	 * map it back into the kernel if needed. No TLB flush required.
>> +	 */
>> +	spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
>> +		TestClearPageXpfoUnmapped(page))
>> +		set_kpte(kaddr, page, PAGE_KERNEL);
>> +
>> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
> 2 on likely the same cacheline. And mostly all pointless.
> 
> This patch makes xpfo_mapcount an atomic, but then all modifications are
> under the spinlock, what gives?
> 
> Anyway, a possibly saner sequence might be:
> 
> 	if (atomic_inc_not_zero(&page->xpfo_mapcount))
> 		return;
> 
> 	spin_lock_irqsave(&page->xpfo_lock, flag);
> 	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
> 	    TestClearPageXpfoUnmapped(page))
> 		set_kpte(kaddr, page, PAGE_KERNEL);
> 	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
>> +}
>> +
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +	unsigned long flags;
>> +
>> +	if (!static_branch_unlikely(&xpfo_inited))
>> +		return;
>> +
>> +	if (!PageXpfoUser(page))
>> +		return;
>> +
>> +	/*
>> +	 * The page is to be allocated back to user space, so unmap it from
>> +	 * the kernel, flush the TLB and tag it as a user page.
>> +	 */
>> +	spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> +#ifdef CONFIG_XPFO_DEBUG
>> +		WARN_ON(PageXpfoUnmapped(page));
>> +#endif
>> +		SetPageXpfoUnmapped(page);
>> +		set_kpte(kaddr, page, __pgprot(0));
>> +		xpfo_flush_kernel_tlb(page, 0);
> 
> You didn't speak about the TLB invalidation anywhere. But basically this
> is one that x86 cannot do.
> 
>> +	}
>> +
>> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> Idem:
> 
> 	if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
> 		return;
> 
> 	....
> 
> 
>> +}
> 
> Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
> is identical to !atomic_read(&page->xpfo_mapcount).
> 

Thanks Peter. I really appreciate your review. Your feedback helps make
this code better and closer to where I can feel comfortable not calling
it RFC any more.

The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get
uncomfortable with it. As you pointed out about calling kmap_atomic from
NMI context, that just makes the kmap_atomic code look even worse. I
pointed out one problem with this code in cover letter and suggested a
rewrite. I see these problems with this code:

1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir
attack security hole again even if just for the duration of kmap. A kmap
can stay around for some time if the page is being used for I/O.

2. This code uses spinlock which leads to problems. If it does not
disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables
IRQ, I think it can still deadlock around pgd_lock.

I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
the page at a new virtual address similar to what kmap_high for i386
does. This avoids re-opening the ret2dir security hole. We can also
possibly do away with xpfo_lock, saving bytes in the page frame, and the
not-so-sane code sequence can go away.

Good point about PG_xpfo_unmapped flag. You are right that it can be
replaced with !atomic_read(&page->xpfo_mapcount).

Thanks,
Khalid
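
For readers unfamiliar with the kmap_high() approach Khalid mentions, the
direction is to hand the kernel a temporary private alias instead of
re-opening the physmap entry. A very rough, hypothetical sketch follows;
vmap() stands in for a dedicated pkmap-style area purely to keep the
example self-contained, and none of this is posted code.

	#include <linux/vmalloc.h>

	/* Map a fresh kernel alias; the physmap entry stays unmapped. */
	static void *xpfo_kmap_alias(struct page *page)
	{
		return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	}

	/* Drop the alias; TLB maintenance is handled by the vmalloc layer. */
	static void xpfo_kunmap_alias(void *kaddr)
	{
		vunmap(kaddr);
	}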
Peter Zijlstra April 4, 2019, 5:01 p.m. UTC | #6
On Thu, Apr 04, 2019 at 09:15:46AM -0600, Khalid Aziz wrote:
> Thanks Peter. I really appreciate your review. Your feedback helps make
> this code better and closer to where I can feel comfortable not calling
> it RFC any more.
> 
> The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get
> uncomfortable with it. As you pointed out about calling kmap_atomic from
> NMI context, that just makes the kmap_atomic code look even worse. I
> pointed out one problem with this code in cover letter and suggested a
> rewrite. I see these problems with this code:

Well, I no longer use it from NMI context, but I did do that for a
while. We now have a giant heap of magic in the NMI path that allows us
to take faults from NMI context (please don't ask), this means we can
mostly do copy_from_user_inatomic() now.

> 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir
> attack security hole again even if just for the duration of kmap. A kmap
> can stay around for some time if the page is being used for I/O.

Correct.

> 2. This code uses spinlock which leads to problems. If it does not
> disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables
> IRQ, I think it can still deadlock around pgd_lock.

I've not spotted that inversion yet, but then I didn't look at the lock
usage outside of k{,un}map_xpfo().

> I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
> the page at a new virtual address similar to what kmap_high for i386
> does. This avoids re-opening the ret2dir security hole. We can also
> possibly do away with xpfo_lock, saving bytes in the page frame, and the
> not-so-sane code sequence can go away.

Right, the TLB invalidation issues are still tricky, even there :/
Ingo Molnar April 17, 2019, 4:15 p.m. UTC | #7
[ Sorry, had to trim the Cc: list from hell. Tried to keep all the 
  mailing lists and all x86 developers. ]

* Khalid Aziz <khalid.aziz@oracle.com> wrote:

> From: Juerg Haefliger <juerg.haefliger@canonical.com>
> 
> This patch adds basic support infrastructure for XPFO which protects 
> against 'ret2dir' kernel attacks. The basic idea is to enforce 
> exclusive ownership of page frames by either the kernel or userspace, 
> unless explicitly requested by the kernel. Whenever a page destined for 
> userspace is allocated, it is unmapped from physmap (the kernel's page 
> table). When such a page is reclaimed from userspace, it is mapped back 
> to physmap. Individual architectures can enable full XPFO support using 
> this infrastructure by supplying architecture specific pieces.

I have a higher level, meta question:

Is there any updated analysis outlining why this XPFO overhead would be 
required on x86-64 kernels running on SMAP/SMEP CPUs which should be all 
recent Intel and AMD CPUs, and with kernel that mark all direct kernel 
mappings as non-executable - which should be all reasonably modern 
kernels later than v4.0 or so?

I.e. the original motivation of the XPFO patches was to prevent execution 
of direct kernel mappings. Is this motivation still present if those 
mappings are non-executable?

(Sorry if this has been asked and answered in previous discussions.)

Thanks,

	Ingo
Khalid Aziz April 17, 2019, 4:49 p.m. UTC | #8
On 4/17/19 10:15 AM, Ingo Molnar wrote:
> 
> [ Sorry, had to trim the Cc: list from hell. Tried to keep all the 
>   mailing lists and all x86 developers. ]
> 
> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>> From: Juerg Haefliger <juerg.haefliger@canonical.com>
>>
>> This patch adds basic support infrastructure for XPFO which protects 
>> against 'ret2dir' kernel attacks. The basic idea is to enforce 
>> exclusive ownership of page frames by either the kernel or userspace, 
>> unless explicitly requested by the kernel. Whenever a page destined for 
>> userspace is allocated, it is unmapped from physmap (the kernel's page 
>> table). When such a page is reclaimed from userspace, it is mapped back 
>> to physmap. Individual architectures can enable full XPFO support using 
>> this infrastructure by supplying architecture specific pieces.
> 
> I have a higher level, meta question:
> 
> Is there any updated analysis outlining why this XPFO overhead would be 
> required on x86-64 kernels running on SMAP/SMEP CPUs which should be all 
> recent Intel and AMD CPUs, and with kernel that mark all direct kernel 
> mappings as non-executable - which should be all reasonably modern 
> kernels later than v4.0 or so?
> 
> I.e. the original motivation of the XPFO patches was to prevent execution 
> of direct kernel mappings. Is this motivation still present if those 
> mappings are non-executable?
> 
> (Sorry if this has been asked and answered in previous discussions.)

Hi Ingo,

That is a good question. Because of the cost of XPFO, we have to be very
sure we need this protection. The paper from Vasileios, Michalis and
Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
and 6.2.

Thanks,
Khalid
Ingo Molnar April 17, 2019, 5:09 p.m. UTC | #9
* Khalid Aziz <khalid.aziz@oracle.com> wrote:

> > I.e. the original motivation of the XPFO patches was to prevent execution 
> > of direct kernel mappings. Is this motivation still present if those 
> > mappings are non-executable?
> > 
> > (Sorry if this has been asked and answered in previous discussions.)
> 
> Hi Ingo,
> 
> That is a good question. Because of the cost of XPFO, we have to be very
> sure we need this protection. The paper from Vasileios, Michalis and
> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> and 6.2.

So it would be nice if you could generally summarize external arguments 
when defending a patchset, instead of me having to dig through a PDF 
which not only causes me to spend time that you probably already spent 
reading that PDF, but I might also interpret it incorrectly. ;-)

The PDF you cited says this:

  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
   in many platforms, including x86-64.  In our example, the content of 
   user address 0xBEEF000 is also accessible through kernel address 
   0xFFFF87FF9F080000 as plain, executable code."

Is this actually true of modern x86-64 kernels? We've locked down W^X 
protections in general.

I.e. this conclusion:

  "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and 
   triggering the kernel to dereference it, an attacker can directly 
   execute shell code with kernel privileges."

... appears to be predicated on imperfect W^X protections on the x86-64 
kernel.

Do such holes exist on the latest x86-64 kernel? If yes, is there a 
reason to believe that these W^X holes cannot be fixed, or that any fix 
would be more expensive than XPFO?

Thanks,

	Ingo
Nadav Amit April 17, 2019, 5:19 p.m. UTC | #10
> On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
> 
> 
> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>> of direct kernel mappings. Is this motivation still present if those 
>>> mappings are non-executable?
>>> 
>>> (Sorry if this has been asked and answered in previous discussions.)
>> 
>> Hi Ingo,
>> 
>> That is a good question. Because of the cost of XPFO, we have to be very
>> sure we need this protection. The paper from Vasileios, Michalis and
>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>> and 6.2.
> 
> So it would be nice if you could generally summarize external arguments 
> when defending a patchset, instead of me having to dig through a PDF 
> which not only causes me to spend time that you probably already spent 
> reading that PDF, but I might also interpret it incorrectly. ;-)
> 
> The PDF you cited says this:
> 
>  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>   in many platforms, including x86-64.  In our example, the content of 
>   user address 0xBEEF000 is also accessible through kernel address 
>   0xFFFF87FF9F080000 as plain, executable code."
> 
> Is this actually true of modern x86-64 kernels? We've locked down W^X 
> protections in general.

As I was curious, I looked at the paper. Here is a quote from it:

"In x86-64, however, the permissions of physmap are not in sane state.
Kernels up to v3.8.13 violate the W^X property by mapping the entire region
as “readable, writeable, and executable” (RWX)—only very recent kernels
(≥v3.9) use the more conservative RW mapping.”
Ingo Molnar April 17, 2019, 5:26 p.m. UTC | #11
* Nadav Amit <nadav.amit@gmail.com> wrote:

> > On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > 
> > 
> > * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > 
> >>> I.e. the original motivation of the XPFO patches was to prevent execution 
> >>> of direct kernel mappings. Is this motivation still present if those 
> >>> mappings are non-executable?
> >>> 
> >>> (Sorry if this has been asked and answered in previous discussions.)
> >> 
> >> Hi Ingo,
> >> 
> >> That is a good question. Because of the cost of XPFO, we have to be very
> >> sure we need this protection. The paper from Vasileios, Michalis and
> >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
> >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> >> and 6.2.
> > 
> > So it would be nice if you could generally summarize external arguments 
> > when defending a patchset, instead of me having to dig through a PDF 
> > which not only causes me to spend time that you probably already spent 
> > reading that PDF, but I might also interpret it incorrectly. ;-)
> > 
> > The PDF you cited says this:
> > 
> >  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
> >   in many platforms, including x86-64.  In our example, the content of 
> >   user address 0xBEEF000 is also accessible through kernel address 
> >   0xFFFF87FF9F080000 as plain, executable code."
> > 
> > Is this actually true of modern x86-64 kernels? We've locked down W^X 
> > protections in general.
> 
> As I was curious, I looked at the paper. Here is a quote from it:
> 
> "In x86-64, however, the permissions of physmap are not in sane state.
> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
> as “readable, writeable, and executable” (RWX)—only very recent kernels
> (≥v3.9) use the more conservative RW mapping.”

But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
kernel in any sense of the word. For any proposed patchset with 
significant complexity and non-trivial costs the benchmark version 
threshold is the "current upstream kernel".

So does that quote address my followup questions:

> Is this actually true of modern x86-64 kernels? We've locked down W^X
> protections in general.
>
> I.e. this conclusion:
>
>   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
>    triggering the kernel to dereference it, an attacker can directly
>    execute shell code with kernel privileges."
>
> ... appears to be predicated on imperfect W^X protections on the x86-64
> kernel.
>
> Do such holes exist on the latest x86-64 kernel? If yes, is there a
> reason to believe that these W^X holes cannot be fixed, or that any fix
> would be more expensive than XPFO?

?

What you are proposing here is a XPFO patch-set against recent kernels 
with significant runtime overhead, so my questions about the W^X holes 
are warranted.

Thanks,

	Ingo
Khalid Aziz April 17, 2019, 5:33 p.m. UTC | #12
On 4/17/19 11:09 AM, Ingo Molnar wrote:
> 
> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>> of direct kernel mappings. Is this motivation still present if those 
>>> mappings are non-executable?
>>>
>>> (Sorry if this has been asked and answered in previous discussions.)
>>
>> Hi Ingo,
>>
>> That is a good question. Because of the cost of XPFO, we have to be very
>> sure we need this protection. The paper from Vasileios, Michalis and
>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>> and 6.2.
> 
> So it would be nice if you could generally summarize external arguments 
> when defending a patchset, instead of me having to dig through a PDF 
> which not only causes me to spend time that you probably already spent 
> reading that PDF, but I might also interpret it incorrectly. ;-)

Sorry, you are right. Even though that paper explains it well, a summary
is always useful.

> 
> The PDF you cited says this:
> 
>   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>    in many platforms, including x86-64.  In our example, the content of 
>    user address 0xBEEF000 is also accessible through kernel address 
>    0xFFFF87FF9F080000 as plain, executable code."
> 
> Is this actually true of modern x86-64 kernels? We've locked down W^X 
> protections in general.
> 
> I.e. this conclusion:
> 
>   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and 
>    triggering the kernel to dereference it, an attacker can directly 
>    execute shell code with kernel privileges."
> 
> ... appears to be predicated on imperfect W^X protections on the x86-64 
> kernel.
> 
> Do such holes exist on the latest x86-64 kernel? If yes, is there a 
> reason to believe that these W^X holes cannot be fixed, or that any fix 
> would be more expensive than XPFO?

Even if physmap is not executable, return-oriented programming (ROP) can
still be used to launch an attack. Instead of placing executable code at
user address 0xBEEF000, attacker can place an ROP payload there. kfptr
is then overwritten to point to a stack-pivoting gadget. Using the
physmap address aliasing, the ROP payload becomes kernel-mode stack. The
execution can then be hijacked upon execution of ret instruction. This
is a gist of the subsection titled "Non-executable physmap" under
section 6.2 and it looked convincing enough to me. If you have a
different take on this, I am very interested in your point of view.

Thanks,
Khalid
Nadav Amit April 17, 2019, 5:44 p.m. UTC | #13
> On Apr 17, 2019, at 10:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
> 
> 
> * Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>>> On Apr 17, 2019, at 10:09 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>> 
>>> 
>>> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> 
>>>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>>>> of direct kernel mappings. Is this motivation still present if those 
>>>>> mappings are non-executable?
>>>>> 
>>>>> (Sorry if this has been asked and answered in previous discussions.)
>>>> 
>>>> Hi Ingo,
>>>> 
>>>> That is a good question. Because of the cost of XPFO, we have to be very
>>>> sure we need this protection. The paper from Vasileios, Michalis and
>>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>>>> and 6.2.
>>> 
>>> So it would be nice if you could generally summarize external arguments 
>>> when defending a patchset, instead of me having to dig through a PDF 
>>> which not only causes me to spend time that you probably already spent 
>>> reading that PDF, but I might also interpret it incorrectly. ;-)
>>> 
>>> The PDF you cited says this:
>>> 
>>> "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>>>  in many platforms, including x86-64.  In our example, the content of 
>>>  user address 0xBEEF000 is also accessible through kernel address 
>>>  0xFFFF87FF9F080000 as plain, executable code."
>>> 
>>> Is this actually true of modern x86-64 kernels? We've locked down W^X 
>>> protections in general.
>> 
>> As I was curious, I looked at the paper. Here is a quote from it:
>> 
>> "In x86-64, however, the permissions of physmap are not in sane state.
>> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
>> as “readable, writeable, and executable” (RWX)—only very recent kernels
>> (≥v3.9) use the more conservative RW mapping.”
> 
> But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
> kernel in any sense of the word. For any proposed patchset with 
> significant complexity and non-trivial costs the benchmark version 
> threshold is the "current upstream kernel".
> 
> So does that quote address my followup questions:
> 
>> Is this actually true of modern x86-64 kernels? We've locked down W^X
>> protections in general.
>> 
>> I.e. this conclusion:
>> 
>>  "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
>>   triggering the kernel to dereference it, an attacker can directly
>>   execute shell code with kernel privileges."
>> 
>> ... appears to be predicated on imperfect W^X protections on the x86-64
>> kernel.
>> 
>> Do such holes exist on the latest x86-64 kernel? If yes, is there a
>> reason to believe that these W^X holes cannot be fixed, or that any fix
>> would be more expensive than XPFO?
> 
> ?
> 
> What you are proposing here is a XPFO patch-set against recent kernels 
> with significant runtime overhead, so my questions about the W^X holes 
> are warranted.
> 

Just to clarify - I am an innocent bystander and have no part in this work.
I was just looking (again) at the paper, as I was curious due to the recent
patches that I sent that improve W^X protection.
Andy Lutomirski April 17, 2019, 7:49 p.m. UTC | #14
On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 4/17/19 11:09 AM, Ingo Molnar wrote:
> >
> > * Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >
> >>> I.e. the original motivation of the XPFO patches was to prevent execution
> >>> of direct kernel mappings. Is this motivation still present if those
> >>> mappings are non-executable?
> >>>
> >>> (Sorry if this has been asked and answered in previous discussions.)
> >>
> >> Hi Ingo,
> >>
> >> That is a good question. Because of the cost of XPFO, we have to be very
> >> sure we need this protection. The paper from Vasileios, Michalis and
> >> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
> >> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
> >> and 6.2.
> >
> > So it would be nice if you could generally summarize external arguments
> > when defending a patchset, instead of me having to dig through a PDF
> > which not only causes me to spend time that you probably already spent
> > reading that PDF, but I might also interpret it incorrectly. ;-)
>
> Sorry, you are right. Even though that paper explains it well, a summary
> is always useful.
>
> >
> > The PDF you cited says this:
> >
> >   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced
> >    in many platforms, including x86-64.  In our example, the content of
> >    user address 0xBEEF000 is also accessible through kernel address
> >    0xFFFF87FF9F080000 as plain, executable code."
> >
> > Is this actually true of modern x86-64 kernels? We've locked down W^X
> > protections in general.
> >
> > I.e. this conclusion:
> >
> >   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
> >    triggering the kernel to dereference it, an attacker can directly
> >    execute shell code with kernel privileges."
> >
> > ... appears to be predicated on imperfect W^X protections on the x86-64
> > kernel.
> >
> > Do such holes exist on the latest x86-64 kernel? If yes, is there a
> > reason to believe that these W^X holes cannot be fixed, or that any fix
> > would be more expensive than XPFO?
>
> Even if physmap is not executable, return-oriented programming (ROP) can
> still be used to launch an attack. Instead of placing executable code at
> user address 0xBEEF000, attacker can place an ROP payload there. kfptr
> is then overwritten to point to a stack-pivoting gadget. Using the
> physmap address aliasing, the ROP payload becomes kernel-mode stack. The
> execution can then be hijacked upon execution of ret instruction. This
> is a gist of the subsection titled "Non-executable physmap" under
> section 6.2 and it looked convincing enough to me. If you have a
> different take on this, I am very interested in your point of view.

My issue with all this is that XPFO is really very expensive.  I think
that, if we're going to seriously consider upstreaming expensive
exploit mitigations like this, we should consider others first, in
particular CFI techniques.  grsecurity's RAP would be a great start.
I also proposed using a gcc plugin (or upstream gcc feature) to add
some instrumentation to any code that pops RSP to verify that the
resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
This will make ROP quite a bit harder.
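
In rough terms, the check such instrumentation would emit after any pop
into RSP could look like the sketch below. This is purely illustrative:
THREAD_SIZE is the existing kernel constant, while the helper name and its
placement are assumptions, not part of any posted plugin.

	static __always_inline void check_rsp_delta(unsigned long old_sp,
						    unsigned long new_sp)
	{
		/* An unsigned delta outside the current stack suggests a pivot. */
		if (unlikely(new_sp - old_sp >= THREAD_SIZE))
			BUG();
	}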
Tycho Andersen April 17, 2019, 7:52 p.m. UTC | #15
On Wed, Apr 17, 2019 at 12:49:04PM -0700, Andy Lutomirski wrote:
> I also proposed using a gcc plugin (or upstream gcc feature) to add
> some instrumentation to any code that pops RSP to verify that the
> resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
> This will make ROP quite a bit harder.

I've been playing around with this for a bit, and hope to have
something to post Soon :)

Tycho
Khalid Aziz April 17, 2019, 8:12 p.m. UTC | #16
On 4/17/19 1:49 PM, Andy Lutomirski wrote:
> On Wed, Apr 17, 2019 at 10:33 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 4/17/19 11:09 AM, Ingo Molnar wrote:
>>>
>>> * Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>>>> I.e. the original motivation of the XPFO patches was to prevent execution
>>>>> of direct kernel mappings. Is this motivation still present if those
>>>>> mappings are non-executable?
>>>>>
>>>>> (Sorry if this has been asked and answered in previous discussions.)
>>>>
>>>> Hi Ingo,
>>>>
>>>> That is a good question. Because of the cost of XPFO, we have to be very
>>>> sure we need this protection. The paper from Vasileios, Michalis and
>>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>>>> and 6.2.
>>>
>>> So it would be nice if you could generally summarize external arguments
>>> when defending a patchset, instead of me having to dig through a PDF
>>> which not only causes me to spend time that you probably already spent
>>> reading that PDF, but I might also interpret it incorrectly. ;-)
>>
>> Sorry, you are right. Even though that paper explains it well, a summary
>> is always useful.
>>
>>>
>>> The PDF you cited says this:
>>>
>>>   "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced
>>>    in many platforms, including x86-64.  In our example, the content of
>>>    user address 0xBEEF000 is also accessible through kernel address
>>>    0xFFFF87FF9F080000 as plain, executable code."
>>>
>>> Is this actually true of modern x86-64 kernels? We've locked down W^X
>>> protections in general.
>>>
>>> I.e. this conclusion:
>>>
>>>   "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
>>>    triggering the kernel to dereference it, an attacker can directly
>>>    execute shell code with kernel privileges."
>>>
>>> ... appears to be predicated on imperfect W^X protections on the x86-64
>>> kernel.
>>>
>>> Do such holes exist on the latest x86-64 kernel? If yes, is there a
>>> reason to believe that these W^X holes cannot be fixed, or that any fix
>>> would be more expensive than XPFO?
>>
>> Even if physmap is not executable, return-oriented programming (ROP) can
>> still be used to launch an attack. Instead of placing executable code at
>> user address 0xBEEF000, attacker can place an ROP payload there. kfptr
>> is then overwritten to point to a stack-pivoting gadget. Using the
>> physmap address aliasing, the ROP payload becomes kernel-mode stack. The
>> execution can then be hijacked upon execution of ret instruction. This
>> is a gist of the subsection titled "Non-executable physmap" under
>> section 6.2 and it looked convincing enough to me. If you have a
>> different take on this, I am very interested in your point of view.
> 
> My issue with all this is that XPFO is really very expensive.  I think
> that, if we're going to seriously consider upstreaming expensive
> exploit mitigations like this, we should consider others first, in
> particular CFI techniques.  grsecurity's RAP would be a great start.
> I also proposed using a gcc plugin (or upstream gcc feature) to add
> some instrumentation to any code that pops RSP to verify that the
> resulting (unsigned) change in RSP is between 0 and THREAD_SIZE bytes.
> This will make ROP quite a bit harder.
> 

Yes, XPFO is expensive. I have been able to reduce the overhead of XPFO
from 2537% to 28% (on large servers) but 28% is still quite significant.
Alternative mitigation techniques with lower impact would easily be more
acceptable as long as they provide the same level of protection. If we have
to go with XPFO, we will continue to look for more performance
improvement to bring that number down further from 28%. Hopefully what
Tycho is working on will yield better results. I am continuing to look
for improvements to XPFO in parallel.

Thanks,
Khalid
Thomas Gleixner April 17, 2019, 9:19 p.m. UTC | #17
On Wed, 17 Apr 2019, Nadav Amit wrote:
> > On Apr 17, 2019, at 10:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> As I was curious, I looked at the paper. Here is a quote from it:
> >> 
> >> "In x86-64, however, the permissions of physmap are not in sane state.
> >> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
> >> as “readable, writeable, and executable” (RWX)—only very recent kernels
> >> (≥v3.9) use the more conservative RW mapping.”
> > 
> > But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
> > kernel in any sense of the word. For any proposed patchset with 
> > significant complexity and non-trivial costs the benchmark version 
> > threshold is the "current upstream kernel".
> > 
> > So does that quote address my followup questions:
> > 
> >> Is this actually true of modern x86-64 kernels? We've locked down W^X
> >> protections in general.
> >> 
> >> I.e. this conclusion:
> >> 
> >>  "Therefore, by simply overwriting kfptr with 0xFFFF87FF9F080000 and
> >>   triggering the kernel to dereference it, an attacker can directly
> >>   execute shell code with kernel privileges."
> >> 
> >> ... appears to be predicated on imperfect W^X protections on the x86-64
> >> kernel.
> >> 
> >> Do such holes exist on the latest x86-64 kernel? If yes, is there a
> >> reason to believe that these W^X holes cannot be fixed, or that any fix
> >> would be more expensive than XPFO?
> > 
> > ?
> > 
> > What you are proposing here is a XPFO patch-set against recent kernels 
> > with significant runtime overhead, so my questions about the W^X holes 
> > are warranted.
> > 
> 
> Just to clarify - I am an innocent bystander and have no part in this work.
> I was just looking (again) at the paper, as I was curious due to the recent
> patches that I sent that improve W^X protection.

It's not necessarily a W+X issue. The user space text is mapped in the
kernel as well and even if it is mapped RX then this can happen. So any
kernel mappings of user space text need to be mapped NX!

Thanks,

	tglx
Linus Torvalds April 17, 2019, 11:18 p.m. UTC | #18
On Wed, Apr 17, 2019, 14:20 Thomas Gleixner <tglx@linutronix.de> wrote:

>
> It's not necessarily a W+X issue. The user space text is mapped in the
> kernel as well and even if it is mapped RX then this can happen. So any
> kernel mappings of user space text need to be mapped NX!


With SMEP, user space pages are always NX.

I really think SM[AE]P is something we can already take for granted. People
who have old CPUs without it are simply not serious about security anyway.
There is no point in saying "we can do it badly in software".

       Linus

>
<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Apr 17, 2019, 14:20 Thomas Gleixner &lt;<a href="mailto:tglx@linutronix.de">tglx@linutronix.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
It&#39;s not necessarily a W+X issue. The user space text is mapped in the<br>
kernel as well and even if it is mapped RX then this can happen. So any<br>
kernel mappings of user space text need to be mapped NX!</blockquote></div></div><div dir="auto"><br></div><div dir="auto">With SMEP, user space pages are always NX.</div><div dir="auto"><br></div><div dir="auto">I really think SM[AE]P is something we can already take for granted. People who have old CPU&#39;s workout it are simply not serious about security anyway. There is no point in saying &quot;we can do it badly in software&quot;.</div><div dir="auto"><br></div><div dir="auto">       Linus</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> </blockquote></div></div><div dir="auto"></div></div>
Thomas Gleixner April 17, 2019, 11:42 p.m. UTC | #19
On Wed, 17 Apr 2019, Linus Torvalds wrote:

> On Wed, Apr 17, 2019, 14:20 Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> >
> > It's not necessarily a W+X issue. The user space text is mapped in the
> > kernel as well and even if it is mapped RX then this can happen. So any
> > kernel mappings of user space text need to be mapped NX!
> 
> With SMEP, user space pages are always NX.

We talk past each other. The user space page in the ring3 valid virtual
address space (non negative) is of course protected by SMEP.

The attack utilizes the kernel linear mapping of the physical
memory. I.e. user space address 0x43210 has a kernel equivalent at
0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
kernel address and that is mapped X --> game over. SMEP does not help
there.

From the top of my head I'd say this is a non issue as those kernel address
space mappings _should_ be NX, but we got bitten by _should_ in the past:)

Thanks,

	tglx
Linus Torvalds April 17, 2019, 11:52 p.m. UTC | #20
On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 17 Apr 2019, Linus Torvalds wrote:
>
> > With SMEP, user space pages are always NX.
>
> We talk past each other. The user space page in the ring3 valid virtual
> address space (non negative) is of course protected by SMEP.
>
> The attack utilizes the kernel linear mapping of the physical
> memory. I.e. user space address 0x43210 has a kernel equivalent at
> 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
> kernel address and that is mapped X --> game over. SMEP does not help
> there.

Oh, agreed.

But that would simply be a kernel bug. We should only map kernel pages
executable when we have kernel code in them, and we should certainly
not allow those pages to be mapped writably in user space.

That kind of "executable in kernel, writable in user" would be a
horrendous and major bug.

So i think it's a non-issue.

> From the top of my head I'd say this is a non issue as those kernel address
> space mappings _should_ be NX, but we got bitten by _should_ in the past:)

I do agree that bugs can happen, obviously, and we might have missed something.

But in the context of XPFO, I would argue (*very* strongly) that the
likelihood of the above kind of bug is absolutely *miniscule* compared
to the likelihood that we'd have something wrong in the software
implementation of XPFO.

So if the argument is "we might have bugs in software", then I think
that's an argument _against_ XPFO rather than for it.

                Linus
Andy Lutomirski April 18, 2019, 4:41 a.m. UTC | #21
On Wed, Apr 17, 2019 at 5:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Wed, 17 Apr 2019, Linus Torvalds wrote:
> >
> > > With SMEP, user space pages are always NX.
> >
> > We talk past each other. The user space page in the ring3 valid virtual
> > address space (non negative) is of course protected by SMEP.
> >
> > The attack utilizes the kernel linear mapping of the physical
> > memory. I.e. user space address 0x43210 has a kernel equivalent at
> > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
> > kernel address and that is mapped X --> game over. SMEP does not help
> > there.
>
> Oh, agreed.
>
> But that would simply be a kernel bug. We should only map kernel pages
> executable when we have kernel code in them, and we should certainly
> not allow those pages to be mapped writably in user space.
>
> That kind of "executable in kernel, writable in user" would be a
> horrendous and major bug.
>
> So i think it's a non-issue.
>
> > From the top of my head I'd say this is a non issue as those kernel address
> > space mappings _should_ be NX, but we got bitten by _should_ in the past:)
>
> I do agree that bugs can happen, obviously, and we might have missed something.
>
> But in the context of XPFO, I would argue (*very* strongly) that the
> likelihood of the above kind of bug is absolutely *miniscule* compared
> to the likelihood that we'd have something wrong in the software
> implementation of XPFO.
>
> So if the argument is "we might have bugs in software", then I think
> that's an argument _against_ XPFO rather than for it.
>

I don't think this type of NX goof was ever the argument for XPFO.
The main argument I've heard is that a malicious user program writes a
ROP payload into user memory (regular anonymous user memory) and then
gets the kernel to erroneously set RSP (*not* RIP) to point there.

I find this argument fairly weak for a couple reasons.  First, if
we're worried about this, let's do in-kernel CFI, not XPFO, to
mitigate it.  Second, I don't see why the exact same attack can't be
done using, say, page cache, and unless I'm missing something, XPFO
doesn't protect page cache.  Or network buffers, or pipe buffers, etc.
Kees Cook April 18, 2019, 5:41 a.m. UTC | #22
On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
> I don't think this type of NX goof was ever the argument for XPFO.
> The main argument I've heard is that a malicious user program writes a
> ROP payload into user memory (regular anonymous user memory) and then
> gets the kernel to erroneously set RSP (*not* RIP) to point there.

Well, more than just ROP. Any of the various attack primitives. The NX
stuff is about moving RIP: SMEP-bypassing. But there is still basic
SMAP-bypassing for putting a malicious structure in userspace and
having the kernel access it via the linear mapping, etc.

> I find this argument fairly weak for a couple reasons.  First, if
> we're worried about this, let's do in-kernel CFI, not XPFO, to

CFI is getting much closer. Getting the kernel happy under Clang, LTO,
and CFI is under active development. (It's functional for arm64
already, and pieces have been getting upstreamed.)

> mitigate it.  Second, I don't see why the exact same attack can't be
> done using, say, page cache, and unless I'm missing something, XPFO
> doesn't protect page cache.  Or network buffers, or pipe buffers, etc.

My understanding is that it's much easier to feel out the linear
mapping address than for the others. But yes, all of those same attack
primitives are possible in other memory areas (though most are NX),
and plenty of exploits have done such things.
Thomas Gleixner April 18, 2019, 6:14 a.m. UTC | #23
On Wed, 17 Apr 2019, Linus Torvalds wrote:
> On Wed, Apr 17, 2019 at 4:42 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 17 Apr 2019, Linus Torvalds wrote:
> > > With SMEP, user space pages are always NX.
> >
> > We talk past each other. The user space page in the ring3 valid virtual
> > address space (non negative) is of course protected by SMEP.
> >
> > The attack utilizes the kernel linear mapping of the physical
> > memory. I.e. user space address 0x43210 has a kernel equivalent at
> > 0xfxxxxxxxxxx. So if the attack manages to trick the kernel to that valid
> > kernel address and that is mapped X --> game over. SMEP does not help
> > there.
> 
> Oh, agreed.
> 
> But that would simply be a kernel bug. We should only map kernel pages
> executable when we have kernel code in them, and we should certainly
> not allow those pages to be mapped writably in user space.
> 
> That kind of "executable in kernel, writable in user" would be a
> horrendous and major bug.
> 
> So i think it's a non-issue.

Pretty much.

> > From the top of my head I'd say this is a non issue as those kernel address
> > space mappings _should_ be NX, but we got bitten by _should_ in the past:)
> 
> I do agree that bugs can happen, obviously, and we might have missed something.
>
> But in the context of XPFO, I would argue (*very* strongly) that the
> likelihood of the above kind of bug is absolutely *miniscule* compared
> to the likelihood that we'd have something wrong in the software
> implementation of XPFO.
> 
> So if the argument is "we might have bugs in software", then I think
> that's an argument _against_ XPFO rather than for it.

No argument from my side. We better spend time to make sure that a bogus
kernel side X mapping is caught, like we catch other things.

Thanks,

	tglx
Khalid Aziz April 18, 2019, 2:34 p.m. UTC | #24
On 4/17/19 11:41 PM, Kees Cook wrote:
> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
>> I don't think this type of NX goof was ever the argument for XPFO.
>> The main argument I've heard is that a malicious user program writes a
>> ROP payload into user memory (regular anonymous user memory) and then
>> gets the kernel to erroneously set RSP (*not* RIP) to point there.
> 
> Well, more than just ROP. Any of the various attack primitives. The NX
> stuff is about moving RIP: SMEP-bypassing. But there is still basic
> SMAP-bypassing for putting a malicious structure in userspace and
> having the kernel access it via the linear mapping, etc.
> 
>> I find this argument fairly weak for a couple reasons.  First, if
>> we're worried about this, let's do in-kernel CFI, not XPFO, to
> 
> CFI is getting much closer. Getting the kernel happy under Clang, LTO,
> and CFI is under active development. (It's functional for arm64
> already, and pieces have been getting upstreamed.)
> 

CFI theoretically offers protection with fairly low overhead. I have not
played much with CFI in clang. I agree with Linus that probability of
bugs in XPFO implementation itself is a cause of concern. If CFI in
Clang can provide us the same level of protection as XPFO does, I
wouldn't want to push for an expensive change like XPFO.

If Clang/CFI can't get us there for extended period of time, does it
make sense to continue to poke at XPFO?

Thanks,
Khalid
Khalid Aziz April 22, 2019, 7:30 p.m. UTC | #25
On 4/18/19 8:34 AM, Khalid Aziz wrote:
> On 4/17/19 11:41 PM, Kees Cook wrote:
>> On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> I don't think this type of NX goof was ever the argument for XPFO.
>>> The main argument I've heard is that a malicious user program writes a
>>> ROP payload into user memory (regular anonymous user memory) and then
>>> gets the kernel to erroneously set RSP (*not* RIP) to point there.
>>
>> Well, more than just ROP. Any of the various attack primitives. The NX
>> stuff is about moving RIP: SMEP-bypassing. But there is still basic
>> SMAP-bypassing for putting a malicious structure in userspace and
>> having the kernel access it via the linear mapping, etc.
>>
>>> I find this argument fairly weak for a couple reasons.  First, if
>>> we're worried about this, let's do in-kernel CFI, not XPFO, to
>>
>> CFI is getting much closer. Getting the kernel happy under Clang, LTO,
>> and CFI is under active development. (It's functional for arm64
>> already, and pieces have been getting upstreamed.)
>>
> 
> CFI theoretically offers protection with fairly low overhead. I have not
> played much with CFI in clang. I agree with Linus that probability of
> bugs in XPFO implementation itself is a cause of concern. If CFI in
> Clang can provide us the same level of protection as XPFO does, I
> wouldn't want to push for an expensive change like XPFO.
> 
> If Clang/CFI can't get us there for extended period of time, does it
> make sense to continue to poke at XPFO?

Any feedback on continued effort on XPFO? If it makes sense to have XPFO
available as a solution for the ret2dir issue in case Clang/CFI does not
work out, I will continue to refine it.

--
Khalid
Kees Cook April 22, 2019, 10:23 p.m. UTC | #26
On Thu, Apr 18, 2019 at 7:35 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 4/17/19 11:41 PM, Kees Cook wrote:
> > On Wed, Apr 17, 2019 at 11:41 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> I don't think this type of NX goof was ever the argument for XPFO.
> >> The main argument I've heard is that a malicious user program writes a
> >> ROP payload into user memory (regular anonymous user memory) and then
> >> gets the kernel to erroneously set RSP (*not* RIP) to point there.
> >
> > Well, more than just ROP. Any of the various attack primitives. The NX
> > stuff is about moving RIP: SMEP-bypassing. But there is still basic
> > SMAP-bypassing for putting a malicious structure in userspace and
> > having the kernel access it via the linear mapping, etc.
> >
> >> I find this argument fairly weak for a couple reasons.  First, if
> >> we're worried about this, let's do in-kernel CFI, not XPFO, to
> >
> > CFI is getting much closer. Getting the kernel happy under Clang, LTO,
> > and CFI is under active development. (It's functional for arm64
> > already, and pieces have been getting upstreamed.)
> >
>
> CFI theoretically offers protection with fairly low overhead. I have not
> played much with CFI in clang. I agree with Linus that probability of
> bugs in XPFO implementation itself is a cause of concern. If CFI in
> Clang can provide us the same level of protection as XPFO does, I
> wouldn't want to push for an expensive change like XPFO.
>
> If Clang/CFI can't get us there for extended period of time, does it
> make sense to continue to poke at XPFO?

Well, I think CFI will certainly vastly narrow the execution paths
available to an attacker, but what I continue to see XPFO as useful for
is stopping attacks that need to locate something in memory (i.e. not
ret2dir but, say, read2dir). It's arguable that such attacks would
just use the heap, stack, etc. to hold such things, but the linear map
remains relatively easy to find/target. But I agree: the protection is
getting more and more narrow (especially with CFI coming down the
pipe), and if it's still a 28% hit, that's not going to be tenable for
anyone but the truly paranoid. :)

All that said, there isn't very good backward-edge CFI protection
(i.e. ROP defense) on x86 in Clang. The forward edge looks decent, but
requires LTO, etc. My point is that there is still a long path to
gaining CFI upstream.
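
For readers unfamiliar with the forward/backward distinction, here is a small
hand-written userspace illustration. It is not from the thread or the patch;
it only shows the kind of call site that Clang's -fsanitize=cfi (which needs
-flto and -fvisibility=hidden) instruments, versus the return edge it does
not cover.

/*
 * Illustration only (userspace, not from the patch).
 * Build sketch: clang -flto -fvisibility=hidden -fsanitize=cfi cfi-demo.c
 */
#include <stdio.h>

static void log_event(int n)
{
	printf("event %d\n", n);
}

int main(void)
{
	void (*handler)(int) = log_event;

	handler(42);	/* forward edge: indirect call, instrumented by CFI */
	return 0;	/* backward edge: return address, not covered here  */
}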
Waiman Long May 1, 2019, 2:49 p.m. UTC | #27
On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 858b6c0b9a15..9b36da94760e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2997,6 +2997,12 @@
>
>      nox2apic    [X86-64,APIC] Do not enable x2APIC mode.
>
> +    noxpfo        [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
> +            when CONFIG_XPFO is on. Physical pages mapped into
> +            user applications will also be mapped in the
> +            kernel's address space as if CONFIG_XPFO was not
> +            enabled.
> +
>      cpu0_hotplug    [X86] Turn on CPU0 hotplug feature when
>              CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>              Some features depend on CPU0. Known dependencies are:

Given the big performance impact that XPFO can have, it should be off by
default when configured. Instead, the xpfo option should be used to
enable it.

Cheers,
Longman
Khalid Aziz May 1, 2019, 3:18 p.m. UTC | #28
On 5/1/19 8:49 AM, Waiman Long wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 858b6c0b9a15..9b36da94760e 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -2997,6 +2997,12 @@
>>
>>       nox2apic    [X86-64,APIC] Do not enable x2APIC mode.
>>
>> +    noxpfo        [XPFO] Disable eXclusive Page Frame Ownership (XPFO)
>> +            when CONFIG_XPFO is on. Physical pages mapped into
>> +            user applications will also be mapped in the
>> +            kernel's address space as if CONFIG_XPFO was not
>> +            enabled.
>> +
>>       cpu0_hotplug    [X86] Turn on CPU0 hotplug feature when
>>               CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
>>               Some features depend on CPU0. Known dependencies are:
> 
> Given the big performance impact that XPFO can have, it should be off
> by default when configured. Instead, the xpfo option should be used to
> enable it.

Agreed. I plan to disable it by default in the next version of the
patch. This is likely to end up being a feature for extremely
security-conscious folks only, unless I or someone else comes up with a
further significant performance boost.

Thanks,
Khalid
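
A minimal sketch of what such an opt-in variant could look like. This is not
from the thread; it simply reuses the xpfo_inited static key and early_param()
plumbing that mm/xpfo.c in the patch below already provides, and an actual
respin may differ.

/* Sketch only: make XPFO opt-in via an "xpfo" boot parameter. */
DEFINE_STATIC_KEY_FALSE(xpfo_inited);
EXPORT_SYMBOL_GPL(xpfo_inited);

static int __init xpfo_param(char *str)
{
	/* Passing "xpfo" on the kernel command line enables the feature. */
	static_branch_enable(&xpfo_inited);

	return 0;
}
early_param("xpfo", xpfo_param);

bool __init xpfo_enabled(void)
{
	return static_branch_unlikely(&xpfo_inited);
}

With the key defaulting to false, the existing
static_branch_unlikely(&xpfo_inited) checks in xpfo_kmap()/xpfo_kunmap()
would fall through to the no-overhead path unless the feature is requested.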

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 858b6c0b9a15..9b36da94760e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2997,6 +2997,12 @@ 
 
 	nox2apic	[X86-64,APIC] Do not enable x2APIC mode.
 
+	noxpfo		[XPFO] Disable eXclusive Page Frame Ownership (XPFO)
+			when CONFIG_XPFO is on. Physical pages mapped into
+			user applications will also be mapped in the
+			kernel's address space as if CONFIG_XPFO was not
+			enabled.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..59a1a5fa598d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -8,6 +8,7 @@ 
 #include <linux/mm.h>
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
+#include <linux/xpfo.h>
 
 #include <asm/cacheflush.h>
 
@@ -77,36 +78,6 @@  static inline struct page *kmap_to_page(void *addr)
 
 static inline unsigned long totalhigh_pages(void) { return 0UL; }
 
-#ifndef ARCH_HAS_KMAP
-static inline void *kmap(struct page *page)
-{
-	might_sleep();
-	return page_address(page);
-}
-
-static inline void kunmap(struct page *page)
-{
-}
-
-static inline void *kmap_atomic(struct page *page)
-{
-	preempt_disable();
-	pagefault_disable();
-	return page_address(page);
-}
-#define kmap_atomic_prot(page, prot)	kmap_atomic(page)
-
-static inline void __kunmap_atomic(void *addr)
-{
-	pagefault_enable();
-	preempt_enable();
-}
-
-#define kmap_atomic_pfn(pfn)	kmap_atomic(pfn_to_page(pfn))
-
-#define kmap_flush_unused()	do {} while(0)
-#endif
-
 #endif /* CONFIG_HIGHMEM */
 
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..d17d33f36a01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -204,6 +204,14 @@  struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+
+#ifdef CONFIG_XPFO
+	/* Counts the number of times this page has been kmapped. */
+	atomic_t xpfo_mapcount;
+
+	/* Serialize kmap/kunmap of this page */
+	spinlock_t xpfo_lock;
+#endif
 } _struct_page_alignment;
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 39b4494e29f1..3622e8c33522 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,10 @@  enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#ifdef CONFIG_XPFO
+	PG_xpfo_user,		/* Page is allocated to user-space */
+	PG_xpfo_unmapped,	/* Page is unmapped from the linear map */
 #endif
 	__NR_PAGEFLAGS,
 
@@ -398,6 +402,22 @@  TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+#define __PG_XPFO_USER	(1UL << PG_xpfo_user)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#define __PG_XPFO_UNMAPPED	(1UL << PG_xpfo_unmapped)
+#else
+#define __PG_XPFO_USER		0
+PAGEFLAG_FALSE(XpfoUser)
+#define __PG_XPFO_UNMAPPED	0
+PAGEFLAG_FALSE(XpfoUnmapped)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
@@ -780,7 +800,8 @@  static inline void ClearPageSlabPfmemalloc(struct page *page)
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON & ~__PG_XPFO_USER & \
+	 ~__PG_XPFO_UNMAPPED)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
new file mode 100644
index 000000000000..93a1b5aceca3
--- /dev/null
+++ b/include/linux/xpfo.h
@@ -0,0 +1,147 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#ifndef _LINUX_XPFO_H
+#define _LINUX_XPFO_H
+
+#include <linux/types.h>
+#include <linux/dma-direction.h>
+#include <linux/uaccess.h>
+
+struct page;
+
+#ifdef CONFIG_XPFO
+
+DECLARE_STATIC_KEY_TRUE(xpfo_inited);
+
+/* Architecture specific implementations */
+void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
+void xpfo_flush_kernel_tlb(struct page *page, int order);
+
+void xpfo_init_single_page(struct page *page);
+
+static inline void xpfo_kmap(void *kaddr, struct page *page)
+{
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	if (!PageXpfoUser(page))
+		return;
+
+	/*
+	 * The page was previously allocated to user space, so
+	 * map it back into the kernel if needed. No TLB flush required.
+	 */
+	spin_lock_irqsave(&page->xpfo_lock, flags);
+
+	if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
+		TestClearPageXpfoUnmapped(page))
+		set_kpte(kaddr, page, PAGE_KERNEL);
+
+	spin_unlock_irqrestore(&page->xpfo_lock, flags);
+}
+
+static inline void xpfo_kunmap(void *kaddr, struct page *page)
+{
+	unsigned long flags;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	if (!PageXpfoUser(page))
+		return;
+
+	/*
+	 * The page is to be allocated back to user space, so unmap it from
+	 * the kernel, flush the TLB and tag it as a user page.
+	 */
+	spin_lock_irqsave(&page->xpfo_lock, flags);
+
+	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
+#ifdef CONFIG_XPFO_DEBUG
+		WARN_ON(PageXpfoUnmapped(page));
+#endif
+		SetPageXpfoUnmapped(page);
+		set_kpte(kaddr, page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, 0);
+	}
+
+	spin_unlock_irqrestore(&page->xpfo_lock, flags);
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map);
+void xpfo_free_pages(struct page *page, int order);
+
+#else /* !CONFIG_XPFO */
+
+static inline void xpfo_init_single_page(struct page *page) { }
+
+static inline void xpfo_kmap(void *kaddr, struct page *page) { }
+static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
+static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp,
+				    bool will_map) { }
+static inline void xpfo_free_pages(struct page *page, int order) { }
+
+static inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) { }
+static inline void xpfo_flush_kernel_tlb(struct page *page, int order) { }
+
+#endif /* CONFIG_XPFO */
+
+#if (!defined(CONFIG_HIGHMEM)) && (!defined(ARCH_HAS_KMAP))
+static inline void *kmap(struct page *page)
+{
+	void *kaddr;
+
+	might_sleep();
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
+}
+
+static inline void kunmap(struct page *page)
+{
+	xpfo_kunmap(page_address(page), page);
+}
+
+static inline void *kmap_atomic(struct page *page)
+{
+	void *kaddr;
+
+	preempt_disable();
+	pagefault_disable();
+	kaddr = page_address(page);
+	xpfo_kmap(kaddr, page);
+	return kaddr;
+}
+
+#define kmap_atomic_prot(page, prot)	kmap_atomic(page)
+
+static inline void __kunmap_atomic(void *addr)
+{
+	xpfo_kunmap(addr, virt_to_page(addr));
+	pagefault_enable();
+	preempt_enable();
+}
+
+#define kmap_atomic_pfn(pfn)	kmap_atomic(pfn_to_page(pfn))
+
+#define kmap_flush_unused()	do {} while (0)
+
+#endif
+
+#endif /* _LINUX_XPFO_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..6bb000bb366f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@ 
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_XPFO
+#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_XPFO(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -105,7 +111,9 @@  IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_XPFO(PG_xpfo_user,		"xpfo_user"	)		\
+IF_HAVE_PG_XPFO(PG_xpfo_unmapped,	"xpfo_unmapped" ) 		\
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..e99e1e6ae5ae 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/mm/compaction.c b/mm/compaction.c
index ef29490b0f46..fdd5d9783adb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -78,7 +78,7 @@  static void map_pages(struct list_head *list)
 		order = page_private(page);
 		nr_pages = 1 << order;
 
-		post_alloc_hook(page, order, __GFP_MOVABLE);
+		post_alloc_hook(page, order, __GFP_MOVABLE, false);
 		if (order)
 			split_page(page, order);
 
diff --git a/mm/internal.h b/mm/internal.h
index f4a7bb02decf..e076e51376df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -165,7 +165,7 @@  extern void memblock_free_pages(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned int order);
 extern void post_alloc_hook(struct page *page, unsigned int order,
-					gfp_t gfp_flags);
+					gfp_t gfp_flags, bool will_map);
 extern int user_min_free_kbytes;
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b9f577b1a2a..2e0dda1322a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1062,6 +1062,7 @@  static __always_inline bool free_pages_prepare(struct page *page,
 	if (bad)
 		return false;
 
+	xpfo_free_pages(page, order);
 	page_cpupid_reset_last(page);
 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	reset_page_owner(page, order);
@@ -1229,6 +1230,7 @@  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 	if (!is_highmem_idx(zone))
 		set_page_address(page, __va(pfn << PAGE_SHIFT));
 #endif
+	xpfo_init_single_page(page);
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1938,7 +1940,7 @@  static bool check_new_pages(struct page *page, unsigned int order)
 }
 
 inline void post_alloc_hook(struct page *page, unsigned int order,
-				gfp_t gfp_flags)
+				gfp_t gfp_flags, bool will_map)
 {
 	set_page_private(page, 0);
 	set_page_refcounted(page);
@@ -1947,6 +1949,7 @@  inline void post_alloc_hook(struct page *page, unsigned int order,
 	kernel_map_pages(page, 1 << order, 1);
 	kernel_poison_pages(page, 1 << order, 1);
 	kasan_alloc_pages(page, order);
+	xpfo_alloc_pages(page, order, gfp_flags, will_map);
 	set_page_owner(page, order, gfp_flags);
 }
 
@@ -1954,10 +1957,11 @@  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
 							unsigned int alloc_flags)
 {
 	int i;
+	bool needs_zero = !free_pages_prezeroed() && (gfp_flags & __GFP_ZERO);
 
-	post_alloc_hook(page, order, gfp_flags);
+	post_alloc_hook(page, order, gfp_flags, needs_zero);
 
-	if (!free_pages_prezeroed() && (gfp_flags & __GFP_ZERO))
+	if (needs_zero)
 		for (i = 0; i < (1 << order); i++)
 			clear_highpage(page + i);
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ce323e56b34d..d8730dd134a9 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -137,7 +137,7 @@  static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 	if (isolated_page) {
-		post_alloc_hook(page, order, __GFP_MOVABLE);
+		post_alloc_hook(page, order, __GFP_MOVABLE, false);
 		__free_pages(page, order);
 	}
 }
diff --git a/mm/xpfo.c b/mm/xpfo.c
new file mode 100644
index 000000000000..b74fee0479e7
--- /dev/null
+++ b/mm/xpfo.c
@@ -0,0 +1,106 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Docker, Inc.
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger <juerg.haefliger@hpe.com>
+ *   Vasileios P. Kemerlis <vpk@cs.brown.edu>
+ *   Tycho Andersen <tycho@docker.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/xpfo.h>
+
+#include <asm/tlbflush.h>
+
+DEFINE_STATIC_KEY_TRUE(xpfo_inited);
+EXPORT_SYMBOL_GPL(xpfo_inited);
+
+static int __init noxpfo_param(char *str)
+{
+	static_branch_disable(&xpfo_inited);
+
+	return 0;
+}
+
+early_param("noxpfo", noxpfo_param);
+
+bool __init xpfo_enabled(void)
+{
+	if (!static_branch_unlikely(&xpfo_inited))
+		return false;
+	else
+		return true;
+}
+
+void __meminit xpfo_init_single_page(struct page *page)
+{
+	spin_lock_init(&page->xpfo_lock);
+}
+
+void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp, bool will_map)
+{
+	int i;
+	bool flush_tlb = false;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++)  {
+#ifdef CONFIG_XPFO_DEBUG
+		WARN_ON(PageXpfoUser(page + i));
+		WARN_ON(PageXpfoUnmapped(page + i));
+		lockdep_assert_held(&(page + i)->xpfo_lock);
+		WARN_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
+		if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
+			/*
+			 * Tag the page as a user page and flush the TLB if it
+			 * was previously allocated to the kernel.
+			 */
+			if ((!TestSetPageXpfoUser(page + i)) || !will_map) {
+				SetPageXpfoUnmapped(page + i);
+				flush_tlb = true;
+			}
+		} else {
+			/* Tag the page as a non-user (kernel) page */
+			ClearPageXpfoUser(page + i);
+		}
+	}
+
+	if (flush_tlb) {
+		set_kpte(page_address(page), page, __pgprot(0));
+		xpfo_flush_kernel_tlb(page, order);
+	}
+}
+
+void xpfo_free_pages(struct page *page, int order)
+{
+	int i;
+
+	if (!static_branch_unlikely(&xpfo_inited))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+#ifdef CONFIG_XPFO_DEBUG
+		WARN_ON(atomic_read(&(page + i)->xpfo_mapcount));
+#endif
+
+		/*
+		 * Map the page back into the kernel if it was previously
+		 * allocated to user space.
+		 */
+		if (TestClearPageXpfoUser(page + i)) {
+			ClearPageXpfoUnmapped(page + i);
+			set_kpte(page_address(page + i), page + i,
+				 PAGE_KERNEL);
+		}
+	}
+}
diff --git a/security/Kconfig b/security/Kconfig
index e4fe2f3c2c65..3636ba7e2615 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -6,6 +6,33 @@  menu "Security options"
 
 source "security/keys/Kconfig"
 
+config ARCH_SUPPORTS_XPFO
+	bool
+
+config XPFO
+	bool "Enable eXclusive Page Frame Ownership (XPFO)"
+	depends on ARCH_SUPPORTS_XPFO
+	help
+	  This option offers protection against 'ret2dir' kernel attacks.
+	  When enabled, every time a page frame is allocated to user space, it
+	  is unmapped from the direct mapped RAM region in kernel space
+	  (physmap). Similarly, when a page frame is freed/reclaimed, it is
+	  mapped back to physmap.
+
+	  There is a slight performance impact when this option is enabled.
+
+	  If in doubt, say "N".
+
+config XPFO_DEBUG
+       bool "Enable debugging of XPFO"
+       depends on XPFO
+       help
+         Enables additional checking of XPFO data structures that help find
+	 bugs in the XPFO implementation. This option comes with a slight
+	 performance cost.
+
+	 If in doubt, say "N".
+
 config SECURITY_DMESG_RESTRICT
 	bool "Restrict unprivileged access to the kernel syslog"
 	default n
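
To see where the new hooks fire in practice, here is a hedged sketch of a
typical kmap()/kunmap() user; it is not part of the patch. With CONFIG_XPFO
enabled and the page owned by user space, kmap() re-establishes the physmap
mapping via xpfo_kmap(), and kunmap() removes it again and flushes the TLB
via xpfo_kunmap() once the map count drops back to zero.

#include <linux/highmem.h>
#include <linux/string.h>

/* Sketch only: copy the contents of a (possibly user-owned) page. */
static void read_page_contents(struct page *page, void *dst)
{
	void *vaddr = kmap(page);	/* xpfo_kmap(): map back into physmap */

	memcpy(dst, vaddr, PAGE_SIZE);
	kunmap(page);			/* xpfo_kunmap(): unmap + TLB flush */
}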