[RFC,v7,00/16] Add support for eXclusive Page Frame Ownership

Message ID cover.1547153058.git.khalid.aziz@oracle.com (mailing list archive)

Message

Khalid Aziz Jan. 10, 2019, 9:09 p.m. UTC
I am continuing to build on the work Juerg, Tycho and Julian have done
on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were flushed
actively after an XPFO TLB update. The benchmark for measuring
performance is a kernel build using parallel make. To get full
protection from ret2dir attacks, we must flush stale TLB entries. The
performance penalty from flushing stale TLB entries goes up as the
number of cores goes up. On a desktop class machine with only 4 cores,
enabling TLB flush for stale entries causes system time for "make
-j4" to go up by a factor of 2.614x, but on a larger machine with 96
cores, system time with "make -j60" goes up by a factor of 26.366x!
I have been working on reducing this performance penalty.

I implemented a solution to reduce the performance penalty, and it
has had a large impact. When the XPFO code flushes stale TLB entries,
it does so for all CPUs on the system, which may include CPUs that
have no matching TLB entries or that may never be scheduled to run
the userspace task causing the TLB flush. The problem is made worse
by the fact that if the number of entries being flushed exceeds
tlb_single_page_flush_ceiling, the result is a full TLB flush on
every CPU. A rogue process can launch a ret2dir attack only from a
CPU whose TLB holds the dual physmap mapping for its pages. We can
hence defer the TLB flush on a CPU until a process that would have
caused a TLB flush is scheduled on that CPU. I have added a cpumask
to task_struct which is then used to post a pending TLB flush on CPUs
other than the one the process is running on. This cpumask is checked
when a process migrates to a new CPU, and the TLB is flushed at that
time. I measured system time for a parallel make with an unmodified
4.20 kernel, 4.20 with the XPFO patches before this optimization, and
then again after applying this optimization. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20				915.183s
4.20+XPFO			24129.354s	26.366x
4.20+XPFO+Deferred flush	1216.987s	 1.330x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20				607.671s
4.20+XPFO			1588.646s	2.614x
4.20+XPFO+Deferred flush	794.473s	1.307x
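
To make the mechanism concrete, here is a minimal sketch of the
deferred-flush idea, with invented names rather than the actual code
from patch 16: flush the local CPU right away and post a pending
flush for everyone else.

/*
 * Sketch only; field and function names are illustrative. task_struct
 * is assumed to have gained a cpumask, e.g.:
 *
 *	struct task_struct {
 *		...
 *		cpumask_t pending_xpfo_flush;
 *	};
 */
static void xpfo_post_deferred_flush(struct task_struct *tsk,
				     unsigned long start, unsigned long end)
{
	/* Flush the stale physmap entries on the CPU we are running on. */
	flush_tlb_kernel_range(start, end);

	/*
	 * Instead of sending IPIs, mark every other CPU as owing this
	 * task a full TLB flush before it runs the task again.
	 */
	cpumask_setall(&tsk->pending_xpfo_flush);
	cpumask_clear_cpu(smp_processor_id(), &tsk->pending_xpfo_flush);
}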

30+% overhead is still very high and there is room for improvement.
Dave Hansen had suggested batch updating TLB entries and Tycho had
created an initial implementation, but I have not been able to get
that to work correctly. I am still working on it and I suspect we
will see a noticeable improvement in performance with that. In the
code I added, I post a pending full TLB flush to all other CPUs even
when the number of TLB entries being flushed on the current CPU does
not exceed tlb_single_page_flush_ceiling. There has to be a better
way to do this; I just haven't found an efficient way to implement a
delayed, limited TLB flush on other CPUs.

I am not entirely sure if switch_mm_irqs_off() is indeed the right
place to perform the pending TLB flush for a CPU. Any feedback on
that will be very helpful. Delaying full TLB flushes on other CPUs
seems to help tremendously, so if there is a better way to implement
the same thing than what I have done in patch 16, I am open to
ideas.
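
For reference, the check described above would look roughly like the
following; again the names are invented and this is only a sketch of
where the pending flush could be consumed, not the patch 16 code.

void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
			struct task_struct *tsk)
{
	unsigned int cpu = smp_processor_id();

	/*
	 * Sketch: if this CPU still owes the incoming task a flush
	 * posted earlier, do it before the task can run here.
	 */
	if (tsk && cpumask_test_cpu(cpu, &tsk->pending_xpfo_flush)) {
		cpumask_clear_cpu(cpu, &tsk->pending_xpfo_flush);
		__flush_tlb_all();
	}

	/* ... existing switch_mm_irqs_off() body ... */
}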

Performance with this patch set is good enough to use it as a
starting point for further refinement before we merge it into the
mainline kernel, hence the RFC.

Since not flushing stale TLB entries creates a false sense of
security, I would recommend making the TLB flush mandatory and
eliminating the "xpfotlbflush" kernel parameter (patch "mm, x86: omit
TLB flushing by default for XPFO page table modifications").

What remains to be done beyond this patch series:

1. Performance improvements
2. Remove xpfotlbflush parameter
3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot in 4.20.
4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.


---------------------------------------------------------

Juerg Haefliger (5):
  mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm: Add support for XPFO
  arm64/mm, xpfo: temporarily map dcache regions
  lkdtm: Add test for XPFO

Julian Stecklina (4):
  mm, x86: omit TLB flushing by default for XPFO page table
    modifications
  xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
  xpfo, mm: optimize spinlock usage in xpfo_kunmap
  EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap

Khalid Aziz (2):
  xpfo, mm: Fix hang when booting with "xpfotlbflush"
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

Tycho Andersen (5):
  mm: add MAP_HUGETLB support to vm_mmap
  x86: always set IF before oopsing from page fault
  xpfo: add primitives for mapping underlying memory
  arm64/mm: disable section/contiguous mappings if XPFO is enabled
  mm: add a user_virt_to_phys symbol

 .../admin-guide/kernel-parameters.txt         |   2 +
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/mm/Makefile                        |   2 +
 arch/arm64/mm/flush.c                         |   7 +
 arch/arm64/mm/mmu.c                           |   2 +-
 arch/arm64/mm/xpfo.c                          |  58 ++++
 arch/x86/Kconfig                              |   1 +
 arch/x86/include/asm/pgtable.h                |  26 ++
 arch/x86/include/asm/tlbflush.h               |   1 +
 arch/x86/mm/Makefile                          |   2 +
 arch/x86/mm/fault.c                           |  10 +
 arch/x86/mm/pageattr.c                        |  23 +-
 arch/x86/mm/tlb.c                             |  27 ++
 arch/x86/mm/xpfo.c                            | 171 ++++++++++++
 drivers/misc/lkdtm/Makefile                   |   1 +
 drivers/misc/lkdtm/core.c                     |   3 +
 drivers/misc/lkdtm/lkdtm.h                    |   5 +
 drivers/misc/lkdtm/xpfo.c                     | 194 ++++++++++++++
 include/linux/highmem.h                       |  15 +-
 include/linux/mm.h                            |   2 +
 include/linux/mm_types.h                      |   8 +
 include/linux/page-flags.h                    |  13 +
 include/linux/sched.h                         |   9 +
 include/linux/xpfo.h                          |  90 +++++++
 include/trace/events/mmflags.h                |  10 +-
 kernel/dma/swiotlb.c                          |   3 +-
 mm/Makefile                                   |   1 +
 mm/mmap.c                                     |  19 +-
 mm/page_alloc.c                               |   3 +
 mm/util.c                                     |  32 +++
 mm/xpfo.c                                     | 247 ++++++++++++++++++
 security/Kconfig                              |  29 ++
 32 files changed, 974 insertions(+), 43 deletions(-)
 create mode 100644 arch/arm64/mm/xpfo.c
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 drivers/misc/lkdtm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

Comments

Kees Cook Jan. 10, 2019, 11:07 p.m. UTC | #1
On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush. Problem is made worse by
> the fact that if number of entries being flushed exceeds
> tlb_single_page_flush_ceiling, it results in a full TLB flush on
> every CPU. A rogue process can launch a ret2dir attack only from a
> CPU that has dual mapping for its pages in physmap in its TLB. We
> can hence defer TLB flush on a CPU until a process that would have
> caused a TLB flush is scheduled on that CPU. I have added a cpumask
> to task_struct which is then used to post pending TLB flush on CPUs
> other than the one a process is running on. This cpumask is checked
> when a process migrates to a new CPU and TLB is flushed at that
> time. I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before this optimization and then
> again after applying this optimization. Here are the results:
>
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
>
> 4.20                            915.183s
> 4.20+XPFO                       24129.354s      26.366x
> 4.20+XPFO+Deferred flush        1216.987s        1.330xx
>
>
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
>
> 4.20                            607.671s
> 4.20+XPFO                       1588.646s       2.614x
> 4.20+XPFO+Deferred flush        794.473s        1.307xx

Well that's an impressive improvement! Nice work. :)

(Are the cpumask improvements possible to be extended to other TLB
flushing needs? i.e. could there be other performance gains with that
code even for a non-XPFO system?)

> 30+% overhead is still very high and there is room for improvement.
> Dave Hansen had suggested batch updating TLB entries and Tycho had
> created an initial implementation but I have not been able to get
> that to work correctly. I am still working on it and I suspect we
> will see a noticeable improvement in performance with that. In the
> code I added, I post a pending full TLB flush to all other CPUs even
> when number of TLB entries being flushed on current CPU does not
> exceed tlb_single_page_flush_ceiling. There has to be a better way
> to do this. I just haven't found an efficient way to implemented
> delayed limited TLB flush on other CPUs.
>
> I am not entirely sure if switch_mm_irqs_off() is indeed the right
> place to perform the pending TLB flush for a CPU. Any feedback on
> that will be very helpful. Delaying full TLB flushes on other CPUs
> seems to help tremendously, so if there is a better way to implement
> the same thing than what I have done in patch 16, I am open to
> ideas.

Dave, Andy, Ingo, Thomas, does anyone have time to look this over?

> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
>
> Since not flushing stale TLB entries creates a false sense of
> security, I would recommend making TLB flush mandatory and eliminate
> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
> flushing by default for XPFO page table modifications").

At this point, yes, that does seem to make sense.

> What remains to be done beyond this patch series:
>
> 1. Performance improvements
> 2. Remove xpfotlbflush parameter
> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
>    from Juerg. I dropped it for now since swiotlb code for ARM has
>    changed a lot in 4.20.
> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
>    CPUs" to other architectures besides x86.

This seems like a good plan.

I've put this series in one of my trees so that 0day will find it and
grind tests...
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/xpfo/v7

Thanks!
Dave Hansen Jan. 10, 2019, 11:40 p.m. UTC | #2
First of all, thanks for picking this back up.  It looks to be going in
a very positive direction!

On 1/10/19 1:09 PM, Khalid Aziz wrote:
> I implemented a solution to reduce performance penalty and
> that has had large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush.
...
> A rogue process can launch a ret2dir attack only from a CPU that has 
> dual mapping for its pages in physmap in its TLB. We can hence defer 
> TLB flush on a CPU until a process that would have caused a TLB
> flush is scheduled on that CPU.

This logic is a bit suspect to me.  Imagine a situation where we have
two attacker processes: one which is causing a page to go from
kernel->user (and be unmapped from the kernel) and a second process
that *was* accessing that page.

The second process could easily have the page's old TLB entry.  It could
abuse that entry as long as that CPU doesn't context switch
(switch_mm_irqs_off()) or otherwise flush the TLB entry.

As for where to flush the TLB...  As you know, using synchronous IPIs is
obviously the most bulletproof from a mitigation perspective.  If you
can batch the IPIs, you can get the overhead down, but you need to do
the flushes for a bunch of pages at once, which I think is what you were
exploring but haven't gotten working yet.

Anything else you do will have *some* reduced mitigation value, which
isn't a deal-breaker (to me at least).  Some ideas:

Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S.  Every time that
gets called, we've (potentially) just done a user->kernel transition and
might benefit from flushing the TLB.  We're always doing a CR3 write (on
Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
X86_CR3_PCID_NOFLUSH_BIT is set.  So, when you need a TLB flush, you
would set a bit that ADJUST_KERNEL_CR3 would see on the next
user->kernel transition on *each* CPU.  Potentially, multiple TLB
flushes could be coalesced this way.  The downside of this is that
you're exposed to the old TLB entries if a flush is needed while you are
already *in* the kernel.

You could also potentially do this from C code, like in the syscall
entry code, or in sensitive places, like when you're returning from a
guest after a VMEXIT in the kvm code.
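
(Purely as a C-level illustration of that idea, with invented names;
the real hook would live in the entry/CR3 code described above:)

static DEFINE_PER_CPU(bool, xpfo_flush_on_kernel_entry);

/* Posted instead of an immediate remote TLB flush. */
static void xpfo_request_deferred_flush(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		per_cpu(xpfo_flush_on_kernel_entry, cpu) = true;
}

/* Conceptually runs on each user->kernel transition, on each CPU. */
static void xpfo_flush_if_pending(void)
{
	if (this_cpu_read(xpfo_flush_on_kernel_entry)) {
		this_cpu_write(xpfo_flush_on_kernel_entry, false);
		__flush_tlb_all();
	}
}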
Khalid Aziz Jan. 11, 2019, 12:20 a.m. UTC | #3
Thanks for looking this over.

On 1/10/19 4:07 PM, Kees Cook wrote:
> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush. Problem is made worse by
>> the fact that if number of entries being flushed exceeds
>> tlb_single_page_flush_ceiling, it results in a full TLB flush on
>> every CPU. A rogue process can launch a ret2dir attack only from a
>> CPU that has dual mapping for its pages in physmap in its TLB. We
>> can hence defer TLB flush on a CPU until a process that would have
>> caused a TLB flush is scheduled on that CPU. I have added a cpumask
>> to task_struct which is then used to post pending TLB flush on CPUs
>> other than the one a process is running on. This cpumask is checked
>> when a process migrates to a new CPU and TLB is flushed at that
>> time. I measured system time for parallel make with unmodified 4.20
>> kernel, 4.20 with XPFO patches before this optimization and then
>> again after applying this optimization. Here are the results:
>>
>> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
>> make -j60 all
>>
>> 4.20                            915.183s
>> 4.20+XPFO                       24129.354s      26.366x
>> 4.20+XPFO+Deferred flush        1216.987s        1.330xx
>>
>>
>> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
>> make -j4 all
>>
>> 4.20                            607.671s
>> 4.20+XPFO                       1588.646s       2.614x
>> 4.20+XPFO+Deferred flush        794.473s        1.307xx
> 
> Well that's an impressive improvement! Nice work. :)
> 
> (Are the cpumask improvements possible to be extended to other TLB
> flushing needs? i.e. could there be other performance gains with that
> code even for a non-XPFO system?)

It may be usable for other situations as well but I have not given it
any thought yet. I will take a look.

> 
>> 30+% overhead is still very high and there is room for improvement.
>> Dave Hansen had suggested batch updating TLB entries and Tycho had
>> created an initial implementation but I have not been able to get
>> that to work correctly. I am still working on it and I suspect we
>> will see a noticeable improvement in performance with that. In the
>> code I added, I post a pending full TLB flush to all other CPUs even
>> when number of TLB entries being flushed on current CPU does not
>> exceed tlb_single_page_flush_ceiling. There has to be a better way
>> to do this. I just haven't found an efficient way to implemented
>> delayed limited TLB flush on other CPUs.
>>
>> I am not entirely sure if switch_mm_irqs_off() is indeed the right
>> place to perform the pending TLB flush for a CPU. Any feedback on
>> that will be very helpful. Delaying full TLB flushes on other CPUs
>> seems to help tremendously, so if there is a better way to implement
>> the same thing than what I have done in patch 16, I am open to
>> ideas.
> 
> Dave, Andy, Ingo, Thomas, does anyone have time to look this over?
> 
>> Performance with this patch set is good enough to use these as
>> starting point for further refinement before we merge it into main
>> kernel, hence RFC.
>>
>> Since not flushing stale TLB entries creates a false sense of
>> security, I would recommend making TLB flush mandatory and eliminate
>> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
>> flushing by default for XPFO page table modifications").
> 
> At this point, yes, that does seem to make sense.
> 
>> What remains to be done beyond this patch series:
>>
>> 1. Performance improvements
>> 2. Remove xpfotlbflush parameter
>> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
>>    from Juerg. I dropped it for now since swiotlb code for ARM has
>>    changed a lot in 4.20.
>> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
>>    CPUs" to other architectures besides x86.
> 
> This seems like a good plan.
> 
> I've put this series in one of my tree so that 0day will find it and
> grind tests...
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/xpfo/v7

Thanks for doing that!

--
Khalid
Andy Lutomirski Jan. 11, 2019, 12:44 a.m. UTC | #4
On Thu, Jan 10, 2019 at 3:07 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> > I implemented a solution to reduce performance penalty and
> > that has had large impact. When XPFO code flushes stale TLB entries,
> > it does so for all CPUs on the system which may include CPUs that
> > may not have any matching TLB entries or may never be scheduled to
> > run the userspace task causing TLB flush. Problem is made worse by
> > the fact that if number of entries being flushed exceeds
> > tlb_single_page_flush_ceiling, it results in a full TLB flush on
> > every CPU. A rogue process can launch a ret2dir attack only from a
> > CPU that has dual mapping for its pages in physmap in its TLB. We
> > can hence defer TLB flush on a CPU until a process that would have
> > caused a TLB flush is scheduled on that CPU. I have added a cpumask
> > to task_struct which is then used to post pending TLB flush on CPUs
> > other than the one a process is running on. This cpumask is checked
> > when a process migrates to a new CPU and TLB is flushed at that
> > time. I measured system time for parallel make with unmodified 4.20
> > kernel, 4.20 with XPFO patches before this optimization and then
> > again after applying this optimization. Here are the results:

I wasn't cc'd on the patch, so I don't know the exact details.

I'm assuming that "ret2dir" means that you corrupt the kernel into
using a direct-map page as its stack.  If so, then I don't see why the
task in whose context the attack is launched needs to be the same
process as the one that has the page mapped for user access.

My advice would be to attempt an entirely different optimization: try
to avoid putting pages *back* into the direct map when they're freed
until there is an actual need to use them for kernel purposes.

How are you handling page cache?  Presumably MAP_SHARED PROT_WRITE
pages are still in the direct map so that IO works.
Peter Zijlstra Jan. 11, 2019, 9:59 a.m. UTC | #5
On Thu, Jan 10, 2019 at 03:40:04PM -0800, Dave Hansen wrote:
> Anything else you do will have *some* reduced mitigation value, which
> isn't a deal-breaker (to me at least).  Some ideas:
> 
> Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S.  Every time that
> gets called, we've (potentially) just done a user->kernel transition and
> might benefit from flushing the TLB.  We're always doing a CR3 write (on
> Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
> X86_CR3_PCID_NOFLUSH_BIT is set.  So, when you need a TLB flush, you
> would set a bit that ADJUST_KERNEL_CR3 would see on the next
> user->kernel transition on *each* CPU.  Potentially, multiple TLB
> flushes could be coalesced this way.  The downside of this is that
> you're exposed to the old TLB entries if a flush is needed while you are
> already *in* the kernel.

I would really prefer not to depend on the PTI crud for new stuff. We
really want to get rid of that code on unaffected CPUs.
Khalid Aziz Jan. 11, 2019, 6:21 p.m. UTC | #6
Hi Dave,

Thanks for looking at this and providing feedback.

On 1/10/19 4:40 PM, Dave Hansen wrote:
> First of all, thanks for picking this back up.  It looks to be going in
> a very positive direction!
> 
> On 1/10/19 1:09 PM, Khalid Aziz wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush.
> ...
>> A rogue process can launch a ret2dir attack only from a CPU that has 
>> dual mapping for its pages in physmap in its TLB. We can hence defer 
>> TLB flush on a CPU until a process that would have caused a TLB
>> flush is scheduled on that CPU.
> 
> This logic is a bit suspect to me.  Imagine a situation where we have
> two attacker processes: one which is causing page to go from
> kernel->user (and be unmapped from the kernel) and a second process that
> *was* accessing that page.
> 
> The second process could easily have the page's old TLB entry.  It could
> abuse that entry as long as that CPU doesn't context switch
> (switch_mm_irqs_off()) or otherwise flush the TLB entry.

That is an interesting scenario. Working through this scenario, the
physmap TLB entry for a page is flushed on the local processor when
the page is allocated to userspace, in xpfo_alloc_pages(). When
userspace passes the page back into the kernel, that page is mapped
into kernel space using a va from the kmap pool in xpfo_kmap(), which
can be different for each new mapping of the same page. The physical
page is unmapped from the kernel on the way back from kernel to
userspace by xpfo_kunmap(). So two processes on different CPUs sharing
the same physical page might not see the same virtual address for that
page while they are in the kernel, as long as it is an address from
the kmap pool. A ret2dir attack relies upon being able to craft a
predictable virtual address in the kernel physmap for a physical page
and redirect execution to that address. Does that sound right?

Now what happens if only one of these cooperating processes allocates
the page, places a malicious payload on that page, and passes the
address of this page to the other process, which can deduce the
physmap address for the page through /proc and exploit the physmap
entry for the page on its CPU? That must be the scenario you are
referring to.
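
(Roughly, the lifecycle I am describing, as a much simplified sketch;
the generic kmap_atomic()/kunmap_atomic() calls stand in for the real
xpfo_kmap()/xpfo_kunmap(), which differ in detail:)

/* Page is handed to userspace: its physmap alias goes away. */
static void xpfo_page_to_user_sketch(struct page *page)
{
	unsigned long kaddr = (unsigned long)page_address(page);

	/* ... clear the kernel PTE covering kaddr ... */
	flush_tlb_kernel_range(kaddr, kaddr + PAGE_SIZE);	/* local CPU */
}

/* Kernel briefly needs the contents (kmap path). */
static void *xpfo_kmap_sketch(struct page *page)
{
	return kmap_atomic(page);	/* short-lived kernel mapping */
}

static void xpfo_kunmap_sketch(void *vaddr)
{
	kunmap_atomic(vaddr);		/* torn down before returning to user */
}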

> 
> As for where to flush the TLB...  As you know, using synchronous IPIs is
> obviously the most bulletproof from a mitigation perspective.  If you
> can batch the IPIs, you can get the overhead down, but you need to do
> the flushes for a bunch of pages at once, which I think is what you were
> exploring but haven't gotten working yet.
> 
> Anything else you do will have *some* reduced mitigation value, which
> isn't a deal-breaker (to me at least).  Some ideas:

Even without batched IPIs working reliably, I was able to measure the
performance impact of this partially working solution. With just
batched IPIs and no delayed TLB flushes, performance improved by a
factor of 2. The 26x system time went down to 12x-13x, but it was
still too high and a non-starter. Combining batched IPIs with delayed
TLB flushes improved performance to about 1.1x, as opposed to 1.33x
with delayed TLB flushes alone. Those numbers are very rough since
the batching implementation is incomplete.

> 
> Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S.  Every time that
> gets called, we've (potentially) just done a user->kernel transition and
> might benefit from flushing the TLB.  We're always doing a CR3 write (on
> Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
> X86_CR3_PCID_NOFLUSH_BIT is set.  So, when you need a TLB flush, you
> would set a bit that ADJUST_KERNEL_CR3 would see on the next
> user->kernel transition on *each* CPU.  Potentially, multiple TLB
> flushes could be coalesced this way.  The downside of this is that
> you're exposed to the old TLB entries if a flush is needed while you are
> already *in* the kernel.
> 
> You could also potentially do this from C code, like in the syscall
> entry code, or in sensitive places, like when you're returning from a
> guest after a VMEXIT in the kvm code.
> 

Good suggestions. Thanks.

I think the benefit will be highest from batching TLB flushes. I see
a lot of time consumed by full TLB flushes on other processors when
the local processor did only a limited TLB flush. I will continue to
debug the batched TLB updates.

--
Khalid
Dave Hansen Jan. 11, 2019, 8:42 p.m. UTC | #7
>> The second process could easily have the page's old TLB entry.  It could
>> abuse that entry as long as that CPU doesn't context switch
>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
> 
> That is an interesting scenario. Working through this scenario, physmap
> TLB entry for a page is flushed on the local processor when the page is
> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
> page back into kernel, that page is mapped into kernel space using a va
> from kmap pool in xpfo_kmap() which can be different for each new
> mapping of the same page. The physical page is unmapped from kernel on
> the way back from kernel to userspace by xpfo_kunmap(). So two processes
> on different CPUs sharing same physical page might not be seeing the
> same virtual address for that page while they are in the kernel, as long
> as it is an address from kmap pool. ret2dir attack relies upon being
> able to craft a predictable virtual address in the kernel physmap for a
> physical page and redirect execution to that address. Does that sound right?

All processes share one set of kernel page tables.  Or did your
patches change that somehow in a way that I missed?

Since they share the page tables, they implicitly share kmap*()
mappings.  kmap_atomic() is not *used* by more than one CPU, but the
mapping is accessible and at least exists for all processors.

I'm basically assuming that any entry mapped in a shared page table is
exploitable on any CPU regardless of where we logically *want* it to be
used.
Andy Lutomirski Jan. 11, 2019, 9:06 p.m. UTC | #8
On Fri, Jan 11, 2019 at 12:42 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> >> The second process could easily have the page's old TLB entry.  It could
> >> abuse that entry as long as that CPU doesn't context switch
> >> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
> >
> > That is an interesting scenario. Working through this scenario, physmap
> > TLB entry for a page is flushed on the local processor when the page is
> > allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
> > page back into kernel, that page is mapped into kernel space using a va
> > from kmap pool in xpfo_kmap() which can be different for each new
> > mapping of the same page. The physical page is unmapped from kernel on
> > the way back from kernel to userspace by xpfo_kunmap(). So two processes
> > on different CPUs sharing same physical page might not be seeing the
> > same virtual address for that page while they are in the kernel, as long
> > as it is an address from kmap pool. ret2dir attack relies upon being
> > able to craft a predictable virtual address in the kernel physmap for a
> > physical page and redirect execution to that address. Does that sound right?
>
> All processes share one set of kernel page tables.  Or, did your patches
> change that somehow that I missed?
>
> Since they share the page tables, they implicitly share kmap*()
> mappings.  kmap_atomic() is not *used* by more than one CPU, but the
> mapping is accessible and at least exists for all processors.
>
> I'm basically assuming that any entry mapped in a shared page table is
> exploitable on any CPU regardless of where we logically *want* it to be
> used.
>
>

We can, very easily, have kernel mappings that are private to a given
mm.  Maybe this is useful here.
Khalid Aziz Jan. 11, 2019, 9:45 p.m. UTC | #9
On 1/10/19 5:44 PM, Andy Lutomirski wrote:
> On Thu, Jan 10, 2019 at 3:07 PM Kees Cook <keescook@chromium.org> wrote:
>>
>> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> I implemented a solution to reduce performance penalty and
>>> that has had large impact. When XPFO code flushes stale TLB entries,
>>> it does so for all CPUs on the system which may include CPUs that
>>> may not have any matching TLB entries or may never be scheduled to
>>> run the userspace task causing TLB flush. Problem is made worse by
>>> the fact that if number of entries being flushed exceeds
>>> tlb_single_page_flush_ceiling, it results in a full TLB flush on
>>> every CPU. A rogue process can launch a ret2dir attack only from a
>>> CPU that has dual mapping for its pages in physmap in its TLB. We
>>> can hence defer TLB flush on a CPU until a process that would have
>>> caused a TLB flush is scheduled on that CPU. I have added a cpumask
>>> to task_struct which is then used to post pending TLB flush on CPUs
>>> other than the one a process is running on. This cpumask is checked
>>> when a process migrates to a new CPU and TLB is flushed at that
>>> time. I measured system time for parallel make with unmodified 4.20
>>> kernel, 4.20 with XPFO patches before this optimization and then
>>> again after applying this optimization. Here are the results:
> 
> I wasn't cc'd on the patch, so I don't know the exact details.
> 
> I'm assuming that "ret2dir" means that you corrupt the kernel into
> using a direct-map page as its stack.  If so, then I don't see why the
> task in whose context the attack is launched needs to be the same
> process as the one that has the page mapped for user access.

You are right. More work is needed to refine delayed TLB flush to close
this gap.

> 
> My advice would be to attempt an entirely different optimization: try
> to avoid putting pages *back* into the direct map when they're freed
> until there is an actual need to use them for kernel purposes.

I had thought about that, but it turns out the performance impact
comes from the initial allocation of the page and the resulting TLB
flushes, not from putting the pages back into the direct map. The way
we could benefit from not adding pages back to the direct map is to
change page allocation to prefer pages that are not in the direct
map. That way we incur the cost of TLB flushes initially but then
satisfy multiple allocation requests after that from those "xpfo
cost"-free pages. More changes will be needed to pick which of these
pages can be added back to the direct map without degenerating into
the worst-case scenario of a page bouncing constantly between this
list of preferred pages and the direct-mapped pages. It started to
get complex enough that I decided to put this in my back pocket and
attempt simpler approaches first :)
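
(The rough shape of that idea, purely as a sketch with invented names
and no real page allocator integration:)

static LIST_HEAD(xpfo_unmapped_free_list);
static DEFINE_SPINLOCK(xpfo_unmapped_lock);

/*
 * Prefer free pages whose direct map entry is already gone, so no new
 * unmap and TLB flush is needed for this allocation.
 */
static struct page *xpfo_alloc_user_page(gfp_t gfp)
{
	struct page *page = NULL;

	spin_lock(&xpfo_unmapped_lock);
	if (!list_empty(&xpfo_unmapped_free_list)) {
		page = list_first_entry(&xpfo_unmapped_free_list,
					struct page, lru);
		list_del(&page->lru);
	}
	spin_unlock(&xpfo_unmapped_lock);

	if (!page)
		page = alloc_page(gfp);	/* normal path pays the flush cost */

	return page;
}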

> 
> How are you handing page cache?  Presumably MAP_SHARED PROT_WRITE
> pages are still in the direct map so that IO works.
> 

Since Juerg wrote the actual implementation of XPFO, he probably
understands it better. XPFO tackles only the page allocation requests
from userspace and does not touch page cache pages.

--
Khalid
Khalid Aziz Jan. 11, 2019, 11:23 p.m. UTC | #10
On 1/11/19 1:42 PM, Dave Hansen wrote:
>>> The second process could easily have the page's old TLB entry.  It could
>>> abuse that entry as long as that CPU doesn't context switch
>>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>>
>> That is an interesting scenario. Working through this scenario, physmap
>> TLB entry for a page is flushed on the local processor when the page is
>> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
>> page back into kernel, that page is mapped into kernel space using a va
>> from kmap pool in xpfo_kmap() which can be different for each new
>> mapping of the same page. The physical page is unmapped from kernel on
>> the way back from kernel to userspace by xpfo_kunmap(). So two processes
>> on different CPUs sharing same physical page might not be seeing the
>> same virtual address for that page while they are in the kernel, as long
>> as it is an address from kmap pool. ret2dir attack relies upon being
>> able to craft a predictable virtual address in the kernel physmap for a
>> physical page and redirect execution to that address. Does that sound right?
> 
> All processes share one set of kernel page tables.  Or, did your patches
> change that somehow that I missed?
> 
> Since they share the page tables, they implicitly share kmap*()
> mappings.  kmap_atomic() is not *used* by more than one CPU, but the
> mapping is accessible and at least exists for all processors.
> 
> I'm basically assuming that any entry mapped in a shared page table is
> exploitable on any CPU regardless of where we logically *want* it to be
> used.
> 
> 

Ah, I see what you are saying. A virtual address mapped on one
processor is visible on the other processors as well, and one process
could communicate that va to the other process in some way so that it
could be exploited from the other process. This va is exploitable
only between the kmap and the matching kunmap, but the window exists.
I am trying to understand your scenario so I can address it correctly.

--
Khalid
Khalid Aziz Jan. 11, 2019, 11:25 p.m. UTC | #11
On 1/11/19 2:06 PM, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:42 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>>>> The second process could easily have the page's old TLB entry.  It could
>>>> abuse that entry as long as that CPU doesn't context switch
>>>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>>>
>>> That is an interesting scenario. Working through this scenario, physmap
>>> TLB entry for a page is flushed on the local processor when the page is
>>> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
>>> page back into kernel, that page is mapped into kernel space using a va
>>> from kmap pool in xpfo_kmap() which can be different for each new
>>> mapping of the same page. The physical page is unmapped from kernel on
>>> the way back from kernel to userspace by xpfo_kunmap(). So two processes
>>> on different CPUs sharing same physical page might not be seeing the
>>> same virtual address for that page while they are in the kernel, as long
>>> as it is an address from kmap pool. ret2dir attack relies upon being
>>> able to craft a predictable virtual address in the kernel physmap for a
>>> physical page and redirect execution to that address. Does that sound right?
>>
>> All processes share one set of kernel page tables.  Or, did your patches
>> change that somehow that I missed?
>>
>> Since they share the page tables, they implicitly share kmap*()
>> mappings.  kmap_atomic() is not *used* by more than one CPU, but the
>> mapping is accessible and at least exists for all processors.
>>
>> I'm basically assuming that any entry mapped in a shared page table is
>> exploitable on any CPU regardless of where we logically *want* it to be
>> used.
>>
>>
> 
> We can, very easily, have kernel mappings that are private to a given
> mm.  Maybe this is useful here.
> 

That sounds like an interesting idea. kmap mappings would be a good
candidate for that. Those are temporary mappings and should only be
valid for one process.

--
Khalid
Laura Abbott Jan. 16, 2019, 1:28 a.m. UTC | #12
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> [ cover letter and diffstat snipped ]

So this seems to blow up immediately on my arm64 box with a config
based on Fedora:

[   11.008243] Unable to handle kernel paging request at virtual address ffff8003f8602f9b
[   11.016133] Mem abort info:
[   11.018926]   ESR = 0x96000007
[   11.021967]   Exception class = DABT (current EL), IL = 32 bits
[   11.027858]   SET = 0, FnV = 0
[   11.030904]   EA = 0, S1PTW = 0
[   11.034030] Data abort info:
[   11.036896]   ISV = 0, ISS = 0x00000007
[   11.040718]   CM = 0, WnR = 0
[   11.043672] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
[   11.050523] [ffff8003f8602f9b] pgd=00000043ffff7803, pud=00000043fe113803, pmd=00000043fc376803, pte=00e80043f8602f13
[   11.061094] Internal error: Oops: 96000007 [#3] SMP
[   11.065948] Modules linked in: xfs libcrc32c sdhci_of_arasan sdhci_pltfm sdhci i2c_xgene_slimpro cqhci gpio_dwapb xhci_plat_hcd gpio_xgene_sb gpio_keys
[   11.079454] CPU: 3 PID: 577 Comm: systemd-getty-g Tainted: G      D           4.20.0-xpfo+ #9
[   11.087936] Hardware name: www.apm.com American Megatrends/American Megatrends, BIOS 3.07.06 20/03/2015
[   11.097285] pstate: 00400005 (nzcv daif +PAN -UAO)
[   11.102057] pc : __memcpy+0x20/0x180
[   11.105616] lr : __access_remote_vm+0x7c/0x1f0
[   11.110036] sp : ffff000011cb3c20
[   11.113333] x29: ffff000011cb3c20 x28: ffff8003f8602000
[   11.118619] x27: 0000000000000f9b x26: 0000000000001000
[   11.123904] x25: 000083ffffffffff x24: cccccccccccccccd
[   11.129189] x23: ffff8003d7c53000 x22: 0000000000000044
[   11.134474] x21: 0000fffff0591f9b x20: 0000000000000044
[   11.139759] x19: 0000000000000044 x18: 0000000000000000
[   11.145044] x17: 0000000000000002 x16: 0000000000000000
[   11.150329] x15: 0000000000000000 x14: 0000000000000000
[   11.155614] x13: 0000000000000000 x12: 0000000000000000
[   11.160899] x11: 0000000000000000 x10: 0000000000000000
[   11.166184] x9 : 0000000000000000 x8 : 0000000000000000
[   11.171469] x7 : 0000000000000000 x6 : ffff8003d7c53000
[   11.176754] x5 : 00e00043f8602fd3 x4 : 0000000000000005
[   11.182038] x3 : 00000003f8602000 x2 : 000000000000003f
[   11.187323] x1 : ffff8003f8602f9b x0 : ffff8003d7c53000
[   11.192609] Process systemd-getty-g (pid: 577, stack limit = 0x(____ptrval____))
[   11.199967] Call trace:
[   11.202400]  __memcpy+0x20/0x180
[   11.205611]  access_remote_vm+0x4c/0x60
[   11.209428]  environ_read+0x12c/0x260
[   11.213071]  __vfs_read+0x48/0x158
[   11.216454]  vfs_read+0x94/0x150
[   11.219665]  ksys_read+0x54/0xb0
[   11.222875]  __arm64_sys_read+0x24/0x30
[   11.226691]  el0_svc_handler+0x94/0x110
[   11.230508]  el0_svc+0x8/0xc
[   11.233375] Code: f2400c84 540001c0 cb040042 36000064 (38401423)
[   11.239439] ---[ end trace 4132d3416fb70591 ]---

I'll see if I get some time tomorrow to dig into this unless
someone spots a problem sooner.

Thanks,
Laura
Julian Stecklina Jan. 16, 2019, 2:56 p.m. UTC | #13
Khalid Aziz <khalid.aziz@oracle.com> writes:

> I am continuing to build on the work Juerg, Tycho and Julian have done
> on XPFO.

Awesome!

> A rogue process can launch a ret2dir attack only from a CPU that has
> dual mapping for its pages in physmap in its TLB. We can hence defer
> TLB flush on a CPU until a process that would have caused a TLB flush
> is scheduled on that CPU.

Assuming the attacker already has the ability to execute arbitrary code
in userspace, they can just create a second process and thus avoid the
TLB flush. Am I getting this wrong?

Julian
Khalid Aziz Jan. 16, 2019, 3:16 p.m. UTC | #14
On 1/16/19 7:56 AM, Julian Stecklina wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
> 
>> I am continuing to build on the work Juerg, Tycho and Julian have done
>> on XPFO.
> 
> Awesome!
> 
>> A rogue process can launch a ret2dir attack only from a CPU that has
>> dual mapping for its pages in physmap in its TLB. We can hence defer
>> TLB flush on a CPU until a process that would have caused a TLB flush
>> is scheduled on that CPU.
> 
> Assuming the attacker already has the ability to execute arbitrary code
> in userspace, they can just create a second process and thus avoid the
> TLB flush. Am I getting this wrong?

No, you got it right. The patch I wrote closes the security hole when
the attack is launched from the same process, but it still leaves a
window open when the attack is launched from another process. I am
working on figuring out how to close that hole while keeping
performance where it is now. A synchronous TLB flush across all cores
is the most secure, but the performance impact is horrendous.

--
Khalid
Laura Abbott Jan. 17, 2019, 11:38 p.m. UTC | #15
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> [ cover letter and diffstat snipped ]

Also gave this a boot on my X1 Carbon and I got some lockdep splat:

[   16.863110] ================================
[   16.863119] WARNING: inconsistent lock state
[   16.863128] 4.20.0-xpfo+ #6 Not tainted
[   16.863136] --------------------------------
[   16.863145] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[   16.863157] swapper/5/0 [HC1[1]:SC1[1]:HE0:SE0] takes:
[   16.863168] 00000000301e129a (&(&page->xpfo_lock)->rlock){?.+.}, at: xpfo_do_map+0x1b/0x90
[   16.863188] {HARDIRQ-ON-W} state was registered at:
[   16.863200]   _raw_spin_lock+0x30/0x70
[   16.863208]   xpfo_do_map+0x1b/0x90
[   16.863217]   simple_write_begin+0xc7/0x240
[   16.863227]   generic_perform_write+0xf7/0x1c0
[   16.863237]   __generic_file_write_iter+0xfa/0x1c0
[   16.863247]   generic_file_write_iter+0xab/0x150
[   16.863257]   __vfs_write+0x139/0x1a0
[   16.863264]   vfs_write+0xba/0x1c0
[   16.863272]   ksys_write+0x52/0xc0
[   16.863281]   xwrite+0x29/0x5a
[   16.863288]   do_copy+0x2b/0xc8
[   16.863296]   write_buffer+0x2a/0x3a
[   16.863304]   unpack_to_rootfs+0x107/0x2c8
[   16.863312]   populate_rootfs+0x5d/0x10a
[   16.863322]   do_one_initcall+0x5d/0x2be
[   16.863541]   kernel_init_freeable+0x21b/0x2c9
[   16.863764]   kernel_init+0xa/0x109
[   16.863988]   ret_from_fork+0x3a/0x50
[   16.864220] irq event stamp: 337503
[   16.864456] hardirqs last  enabled at (337502): [<ffffffff8ce000a7>] __do_softirq+0xa7/0x47c
[   16.864715] hardirqs last disabled at (337503): [<ffffffff8c0037e8>] trace_hardirqs_off_thunk+0x1a/0x1c
[   16.864985] softirqs last  enabled at (337500): [<ffffffff8c0c6d88>] irq_enter+0x68/0x70
[   16.865263] softirqs last disabled at (337501): [<ffffffff8c0c6ea9>] irq_exit+0x119/0x120
[   16.865546]
                other info that might help us debug this:
[   16.866128]  Possible unsafe locking scenario:

[   16.866733]        CPU0
[   16.867039]        ----
[   16.867370]   lock(&(&page->xpfo_lock)->rlock);
[   16.867693]   <Interrupt>
[   16.868019]     lock(&(&page->xpfo_lock)->rlock);
[   16.868354]
                 *** DEADLOCK ***

[   16.869373] 1 lock held by swapper/5/0:
[   16.869727]  #0: 00000000800b2c51 (&(&ctx->completion_lock)->rlock){-.-.}, at: aio_complete+0x3c/0x460
[   16.870106]
                stack backtrace:
[   16.870868] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-xpfo+ #6
[   16.871270] Hardware name: LENOVO 20KGS23S00/20KGS23S00, BIOS N23ET40W (1.15 ) 04/13/2018
[   16.871686] Call Trace:
[   16.872106]  <IRQ>
[   16.872531]  dump_stack+0x85/0xc0
[   16.872962]  print_usage_bug.cold.60+0x1a8/0x1e2
[   16.873407]  ? print_shortest_lock_dependencies+0x40/0x40
[   16.873856]  mark_lock+0x502/0x600
[   16.874308]  ? check_usage_backwards+0x120/0x120
[   16.874769]  __lock_acquire+0x6e2/0x1650
[   16.875236]  ? find_held_lock+0x34/0xa0
[   16.875710]  ? sched_clock_cpu+0xc/0xb0
[   16.876185]  lock_acquire+0x9e/0x180
[   16.876668]  ? xpfo_do_map+0x1b/0x90
[   16.877154]  _raw_spin_lock+0x30/0x70
[   16.877649]  ? xpfo_do_map+0x1b/0x90
[   16.878144]  xpfo_do_map+0x1b/0x90
[   16.878647]  aio_complete+0xb2/0x460
[   16.879154]  blkdev_bio_end_io+0x71/0x150
[   16.879665]  blk_update_request+0xd7/0x2e0
[   16.880170]  blk_mq_end_request+0x1a/0x100
[   16.880669]  blk_mq_complete_request+0x98/0x120
[   16.881175]  nvme_irq+0x192/0x210 [nvme]
[   16.881675]  __handle_irq_event_percpu+0x46/0x2a0
[   16.882174]  handle_irq_event_percpu+0x30/0x80
[   16.882670]  handle_irq_event+0x34/0x51
[   16.883252]  handle_edge_irq+0x7b/0x190
[   16.883772]  handle_irq+0xbf/0x100
[   16.883774]  do_IRQ+0x5f/0x120
[   16.883776]  common_interrupt+0xf/0xf
[   16.885469] RIP: 0010:__do_softirq+0xae/0x47c
[   16.885470] Code: 0c 00 00 01 c7 44 24 24 0a 00 00 00 44 89 7c 24 04 48 c7 c0 c0 1e 1e 00 65 66 c7 00 00 00 e8 69 3d 3e ff fb 66 0f 1f 44 00 00 <48> c7 44 24 08 80 51 60 8d b8 ff ff ff ff 0f bc 44 24 04 83 c0 01
[   16.885471] RSP: 0018:ffff8bde5e003f68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdd
[   16.887291] RAX: ffff8bde5b303740 RBX: ffff8bde5b303740 RCX: 0000000000000000
[   16.887291] RDX: ffff8bde5b303740 RSI: 0000000000000000 RDI: ffff8bde5b303740
[   16.887292] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   16.887293] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   16.887294] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000202
[   16.887296]  ? common_interrupt+0xa/0xf
[   16.890885]  ? __do_softirq+0xa7/0x47c
[   16.890887]  ? hrtimer_interrupt+0x12e/0x220
[   16.890889]  irq_exit+0x119/0x120
[   16.890920]  smp_apic_timer_interrupt+0xa2/0x230
[   16.890921]  apic_timer_interrupt+0xf/0x20
[   16.890922]  </IRQ>
[   16.890955] RIP: 0010:cpuidle_enter_state+0xbe/0x350
[   16.890956] Code: 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 6d 02 00 00 31 ff e8 8e 61 91 ff e8 19 77 98 ff fb 66 0f 1f 44 00 00 <85> ed 0f 88 36 02 00 00 48 b8 ff ff ff ff f3 01 00 00 48 2b 1c 24
[   16.890957] RSP: 0018:ffffa91a41997ea0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[   16.891025] RAX: ffff8bde5b303740 RBX: 00000003ed1dca4d RCX: 0000000000000000
[   16.891026] RDX: ffff8bde5b303740 RSI: 0000000000000001 RDI: ffff8bde5b303740
[   16.891027] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[   16.891028] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8d7f8898
[   16.891028] R13: ffffc91a3f800a00 R14: 0000000000000004 R15: 0000000000000000
[   16.891032]  do_idle+0x23e/0x280
[   16.891119]  cpu_startup_entry+0x19/0x20
[   16.891122]  start_secondary+0x1b3/0x200
[   16.891124]  secondary_startup_64+0xa4/0xb0
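
(For reference, the pattern the report is flagging, reduced to a
sketch that borrows the lock name from the trace; this is not a
proposed patch. xpfo_do_map() is reached both from process context
with interrupts enabled and from hard-IRQ context via aio_complete(),
so the lock would need the IRQ-safe locking variants:)

static void xpfo_do_map_irqsafe_sketch(struct page *page)
{
	unsigned long flags;

	/* Disables local interrupts so the IRQ path cannot deadlock us. */
	spin_lock_irqsave(&page->xpfo_lock, flags);
	/* ... re-establish the kernel mapping for the page ... */
	spin_unlock_irqrestore(&page->xpfo_lock, flags);
}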

This was 4.20 + this series. The config was based on what Fedora has.

Thanks,
Laura