
[RFC,v9,12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

Message ID: 4495dda4bfc4a06b3312cc4063915b306ecfaecb.1554248002.git.khalid.aziz@oracle.com
State: New, archived
Series: Add support for eXclusive Page Frame Ownership

Commit Message

Khalid Aziz April 3, 2019, 5:34 p.m. UTC
XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace, and it does so synchronously not only on the current
CPU but on all other CPUs as well. Processes allocating pages on each
core cause a flood of IPI messages to all other cores to flush TLB
entries. Many of these messages ask the receiving core to flush its
entire TLB, because the number of entries being flushed from the
local core exceeds tlb_single_page_flush_ceiling. The cost of the TLB
flushes caused by unmapping pages from the physmap goes up
dramatically on machines with high core counts.

This patch flushes the relevant TLB entries for the current process,
or the entire TLB, depending upon the number of entries to flush on
the current CPU, and posts a pending TLB flush on all other CPUs when
a page is unmapped from kernel space and mapped into userspace. Each
core checks its pending TLB flush flag on every context switch,
flushes its TLB if the flag is set, and clears the flag. This
potentially aggregates multiple TLB flushes into one, which has a
very significant impact, especially on machines with large core
counts. To illustrate this, the kernel was compiled with "make -j" on
two classes of machines - a server with a high core count and a large
amount of memory, and a desktop-class machine with more modest specs.
System times for a vanilla 4.20 kernel, for 4.20 with the XPFO
patches before this patch, and for 4.20 with the XPFO patches
including this patch are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                            950.966s
4.20+XPFO                       25073.169s      26.366x
4.20+XPFO+Deferred flush        1372.874s        1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                            607.671s
4.20+XPFO                       1588.646s       2.614x
4.20+XPFO+Deferred flush        803.989s        1.32x

Once finalized, the same scheme should be implemented for other
architectures as well.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 52 +++++++++++++++++++++++++++++++++
 arch/x86/mm/xpfo.c              |  2 +-
 3 files changed, 54 insertions(+), 1 deletion(-)
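
In short, the change amounts to the condensed sketch below. It is not
a drop-in implementation; names follow the full diff at the bottom of
this page, do_local_range_flush() is a stand-in for the
do_kernel_range_flush() call used there, and the context-switch hook
lives in switch_mm_irqs_off().

static cpumask_t pending_xpfo_flush;

/* Unmap path: flush the local TLB now, mark all other CPUs "flush later". */
void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	struct cpumask tmp_mask;

	/* Local flush: ranged or full, balanced like a user space flush. */
	if (end == TLB_FLUSH_ALL ||
	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT)
		__flush_tlb_all();
	else
		do_local_range_flush(start, end);	/* stand-in */

	/* Post a deferred full flush on every other CPU. */
	cpumask_setall(&tmp_mask);
	__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
}

/* Context-switch path (in switch_mm_irqs_off()): pay the deferred flush. */
	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
		__flush_tlb_all();
	}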

Comments

Andy Lutomirski April 4, 2019, 4:10 a.m. UTC | #1
On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> XPFO flushes kernel space TLB entries for pages that are now mapped
> in userspace on not only the current CPU but also all other CPUs
> synchronously. Processes on each core allocating pages causes a
> flood of IPI messages to all other cores to flush TLB entries.
> Many of these messages are to flush the entire TLB on the core if
> the number of entries being flushed from local core exceeds
> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
> unmapping pages from physmap goes up dramatically on machines with
> high core count.
>
> This patch flushes relevant TLB entries for current process or
> entire TLB depending upon number of entries for the current CPU
> and posts a pending TLB flush on all other CPUs when a page is
> unmapped from kernel space and mapped in userspace. Each core
> checks the pending TLB flush flag for itself on every context
> switch, flushes its TLB if the flag is set and clears it.
> This patch potentially aggregates multiple TLB flushes into one.
> This has very significant impact especially on machines with large
> core counts.

Why is this a reasonable strategy?

> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +       struct cpumask tmp_mask;
> +
> +       /*
> +        * Balance as user space task's flush, a bit conservative.
> +        * Do a local flush immediately and post a pending flush on all
> +        * other CPUs. Local flush can be a range flush or full flush
> +        * depending upon the number of entries to be flushed. Remote
> +        * flushes will be done by individual processors at the time of
> +        * context switch and this allows multiple flush requests from
> +        * other CPUs to be batched together.
> +        */

I don't like this function at all.  A core function like this is a
contract of sorts between the caller and the implementation.  There is
no such thing as an "xpfo" flush, and this function's behavior isn't
at all well defined.  For flush_tlb_kernel_range(), I can tell you
exactly what that function does, and the implementation is either
correct or incorrect.  With this function, I have no idea what is
actually required, and I can't possibly tell whether it's correct.

As far as I can see, xpfo_flush_tlb_kernel_range() actually means
"flush this range on this CPU right now, and flush it on remote CPUs
eventually".  It would be valid, but probably silly, to flush locally
and to never flush at all on remote CPUs.  This makes me wonder what
the point is.
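
Spelled out, the contract being described above would read something
like the doc comment below (suggested wording only; it is not part of
the patch):

/*
 * xpfo_flush_tlb_kernel_range - flush a kernel VA range on the local
 * CPU immediately and post a deferred full flush on all other CPUs,
 * to be performed no later than their next context switch.  Until a
 * remote CPU context-switches, it may still hold stale TLB entries
 * for the range.
 */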
Peter Zijlstra April 4, 2019, 8:18 a.m. UTC | #2
On Wed, Apr 03, 2019 at 11:34:13AM -0600, Khalid Aziz wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 999d6d8f0bef..cc806a01a0eb 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -37,6 +37,20 @@
>   */
>  #define LAST_USER_MM_IBPB	0x1UL
>  
> +/*
> + * A TLB flush may be needed to flush stale TLB entries
> + * for pages that have been mapped into userspace and unmapped
> + * from kernel space. This TLB flush needs to be propagated to
> + * all CPUs. Asynchronous flush requests to all CPUs can cause
> + * significant performance impact. Queue a pending flush for
> + * a CPU instead. Multiple of these requests can then be handled
> + * by a CPU at a less disruptive time, like context switch, in
> + * one go and reduce performance impact significantly. Following
> + * data structure is used to keep track of CPUs with pending full
> + * TLB flush forced by xpfo.
> + */
> +static cpumask_t pending_xpfo_flush;
> +
>  /*
>   * We get here when we do something requiring a TLB invalidation
>   * but could not go invalidate all of the contexts.  We do the
> @@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		__flush_tlb_all();
>  	}
>  #endif
> +
> +	/*
> +	 * If there is a pending TLB flush for this CPU due to XPFO
> +	 * flush, do it now.
> +	 */
> +	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
> +		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> +		__flush_tlb_all();
> +	}

That really should be:

	if (cpumask_test_cpu(cpu, &pending_xpfo_flush)) {
		cpumask_clear_cpu(cpu, &pending_xpfo_flush);
		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
		__flush_tlb_all();
	}

test_and_clear is an unconditional RmW and can cause cacheline
contention between adjacent CPUs even if none of the bits are set.

> +
>  	this_cpu_write(cpu_tlbstate.is_lazy, false);
>  
>  	/*
> @@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>  	}
>  }
>  
> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +	struct cpumask tmp_mask;
> +
> +	/*
> +	 * Balance as user space task's flush, a bit conservative.
> +	 * Do a local flush immediately and post a pending flush on all
> +	 * other CPUs. Local flush can be a range flush or full flush
> +	 * depending upon the number of entries to be flushed. Remote
> +	 * flushes will be done by individual processors at the time of
> +	 * context switch and this allows multiple flush requests from
> +	 * other CPUs to be batched together.
> +	 */
> +	if (end == TLB_FLUSH_ALL ||
> +	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
> +		do_flush_tlb_all(NULL);
> +	} else {
> +		struct flush_tlb_info info;
> +
> +		info.start = start;
> +		info.end = end;
> +		do_kernel_range_flush(&info);
> +	}
> +	cpumask_setall(&tmp_mask);
> +	__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
> +	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
> +}
> +
>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  {
>  	struct flush_tlb_info info = {
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> index b42513347865..638eee5b1f09 100644
> --- a/arch/x86/mm/xpfo.c
> +++ b/arch/x86/mm/xpfo.c
> @@ -118,7 +118,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>  		return;
>  	}
>  
> -	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +	xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>  }
>  EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);

So this patch is the one that makes it 'work', but I'm with Andy on
hating it something fierce.

Up until this point x86_64 is completely buggered in this series, after
this it sorta works but *urgh* what crap.

All in all, your changelog is complete and utter garbage; this is _NOT_ a
performance issue. It is very much a correctness issue.

Also, I distinctly dislike the inconsistent TLB states this generates.
It makes it very hard to argue for its correctness.
Thomas Gleixner April 5, 2019, 7:17 a.m. UTC | #3
On Thu, 4 Apr 2019, Khalid Aziz wrote:
> When xpfo unmaps a page from physmap only (after mapping the page in
> userspace in response to an allocation request from userspace) on one
> processor, there is a small window of opportunity for ret2dir attack on
> other cpus until the TLB entry in physmap for the unmapped pages on
> other cpus is cleared. Forcing that to happen synchronously is the
> expensive part. A multiple of these requests can come in over a very
> short time across multiple processors resulting in every cpu asking
> every other cpu to flush TLB just to close this small window of
> vulnerability in the kernel. If each request is processed synchronously,
> each CPU will do multiple TLB flushes in short order. If we could
> consolidate these TLB flush requests instead and do one TLB flush on
> each cpu at the time of context switch, we can reduce the performance
> impact significantly. This bears out in real life measuring the system
> time when doing a parallel kernel build on a large server. Without this,
> system time on 96-core server when doing "make -j60 all" went up 26x.
> After this optimization, impact went down to 1.44x.
> 
> The trade-off with this strategy is, the kernel on a cpu is vulnerable
> for a short time if the current running processor is the malicious

The "short" time to next context switch on the other CPUs is how short
exactly? Anything from 1us to seconds - think NOHZ_FULL - and even w/o
that, 10ms on a HZ=100 kernel is plenty of time to launch an attack.

> process. Is that an acceptable trade-off?

You are not seriously asking whether creating a user controllable ret2dir
attack window is an acceptable trade-off? April 1st was a few days ago.

Thanks,

	tglx
Dave Hansen April 5, 2019, 2:44 p.m. UTC | #4
On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>> process. Is that an acceptable trade-off?
> You are not seriously asking whether creating a user controllable ret2dir
> attack window is a acceptable trade-off? April 1st was a few days ago.

Well, let's not forget that this set at least takes us from "always
vulnerable to ret2dir" to a choice between:

1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
2. slow and "mitigated against ret2dir"

Sounds like we need a mechanism that will do the deferred XPFO TLB
flushes whenever the kernel is entered, and not _just_ at context switch
time.  This permits an app to run in userspace with stale kernel TLB
entries as long as it wants... that's harmless.

But, if it enters the kernel, we could process the deferred flush there.
 The concern would be catching all the kernel entry paths (PTI does this
today obviously, but I don't think we want to muck with the PTI code for
this).  The other concern would be that the code between kernel entry
and the flush would be vulnerable.  But, that seems like a reasonable
trade-off to me.
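
For illustration only, such an entry-time check could take roughly the
shape below, reusing the pending_xpfo_flush mask from this patch.
xpfo_flush_on_kernel_entry() is an invented name, and wiring it into
the actual entry paths (and PTI's CR3 handling) is the hard part:

/* Hypothetical sketch, not part of the patch. */
static inline void xpfo_flush_on_kernel_entry(void)
{
	int cpu = raw_smp_processor_id();

	/* Test first so the common "nothing pending" case stays read-only. */
	if (cpumask_test_cpu(cpu, &pending_xpfo_flush)) {
		cpumask_clear_cpu(cpu, &pending_xpfo_flush);
		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
		__flush_tlb_all();
	}
}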
Andy Lutomirski April 5, 2019, 3:24 p.m. UTC | #5
>>> On Apr 4, 2019, at 4:55 PM, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> 
>>> On 4/3/19 10:10 PM, Andy Lutomirski wrote:
>>> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>> 
>>> XPFO flushes kernel space TLB entries for pages that are now mapped
>>> in userspace on not only the current CPU but also all other CPUs
>>> synchronously. Processes on each core allocating pages causes a
>>> flood of IPI messages to all other cores to flush TLB entries.
>>> Many of these messages are to flush the entire TLB on the core if
>>> the number of entries being flushed from local core exceeds
>>> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
>>> unmapping pages from physmap goes up dramatically on machines with
>>> high core count.
>>> 
>>> This patch flushes relevant TLB entries for current process or
>>> entire TLB depending upon number of entries for the current CPU
>>> and posts a pending TLB flush on all other CPUs when a page is
>>> unmapped from kernel space and mapped in userspace. Each core
>>> checks the pending TLB flush flag for itself on every context
>>> switch, flushes its TLB if the flag is set and clears it.
>>> This patch potentially aggregates multiple TLB flushes into one.
>>> This has very significant impact especially on machines with large
>>> core counts.
>> 
>> Why is this a reasonable strategy?
> 
> Ideally when pages are unmapped from physmap, all CPUs would be sent IPI
> synchronously to flush TLB entry for those pages immediately. This may
> be ideal from correctness and consistency point of view, but it also
> results in IPI storm and repeated TLB flushes on all processors. Any
> time a page is allocated to userspace, we are going to go through this
> and it is very expensive. On a 96-core server, performance degradation
> is 26x!!

Indeed. XPFO is expensive.

> 
> When xpfo unmaps a page from physmap only (after mapping the page in
> userspace in response to an allocation request from userspace) on one
> processor, there is a small window of opportunity for ret2dir attack on
> other cpus until the TLB entry in physmap for the unmapped pages on
> other cpus is cleared.

Why do you think this window is small? Intervals of seconds to months between context switches aren’t unheard of.

And why is a small window like this even helpful?  For a ret2dir attack, you just need to get CPU A to allocate a page and write the ret2dir payload and then get CPU B to return to it before context switching.  This should be doable quite reliably.

So I don’t really have a suggestion, but I think that a 44% regression to get a weak defense like this doesn’t seem worthwhile.  I bet that any of a number of CFI techniques (RAP-like or otherwise) will be cheaper and protect against ret2dir better.  And they’ll also protect against using other kernel memory as a stack buffer.  There are plenty of those — think pipe buffers, network buffers, any page cache not covered by XPFO, XMM/YMM saved state, etc.
Andy Lutomirski April 5, 2019, 3:24 p.m. UTC | #6
> On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>> process. Is that an acceptable trade-off?
>> You are not seriously asking whether creating a user controllable ret2dir
>> attack window is a acceptable trade-off? April 1st was a few days ago.
> 
> Well, let's not forget that this set at least takes us from "always
> vulnerable to ret2dir" to a choice between:
> 
> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> 2. slow and "mitigated against ret2dir"
> 
> Sounds like we need a mechanism that will do the deferred XPFO TLB
> flushes whenever the kernel is entered, and not _just_ at context switch
> time.  This permits an app to run in userspace with stale kernel TLB
> entries as long as it wants... that's harmless.

I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).

I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.

(Actually, checking the bits is suboptimal. Do:

unsigned long offset = *rsp - rsp;
offset >>= THREAD_SHIFT;
if (unlikely(offset))
BUG();
POP RSP;

This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)

In other words, I think that ret2dir is an insufficient justification for XPFO.
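
To make that RSP check concrete, one hedged C rendering is below
(purely illustrative; new_rsp stands for the value about to be popped
into RSP, and a real version would be emitted by the compiler or a
plugin around each pop-into-RSP site):

/*
 * Illustrative only: reject a stack pivot whose target is below the
 * current RSP, or THREAD_SIZE or more above it.
 */
static __always_inline void check_stack_pivot(unsigned long new_rsp,
					      unsigned long cur_rsp)
{
	unsigned long offset = new_rsp - cur_rsp;

	if (unlikely(offset >> THREAD_SHIFT))
		BUG();
}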
Khalid Aziz April 5, 2019, 3:44 p.m. UTC | #7
On 4/5/19 8:44 AM, Dave Hansen wrote:
> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>> process. Is that an acceptable trade-off?
>> You are not seriously asking whether creating a user controllable ret2dir
>> attack window is a acceptable trade-off? April 1st was a few days ago.
> 
> Well, let's not forget that this set at least takes us from "always
> vulnerable to ret2dir" to a choice between:
> 
> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> 2. slow and "mitigated against ret2dir"
> 
> Sounds like we need a mechanism that will do the deferred XPFO TLB
> flushes whenever the kernel is entered, and not _just_ at context switch
> time.  This permits an app to run in userspace with stale kernel TLB
> entries as long as it wants... that's harmless.

That sounds like a good idea. This TLB flush only needs to be done at
kernel entry when there is a pending flush posted for that cpu. What
this does is move the cost of the TLB flush from the next context
switch to kernel entry, without adding any more flushes than we are
already doing with these xpfo patches. So the overall performance
impact might not change much. It seems worth coding up.

Thanks,
Khalid
Tycho Andersen April 5, 2019, 3:56 p.m. UTC | #8
On Fri, Apr 05, 2019 at 09:24:50AM -0600, Andy Lutomirski wrote:
> 
> 
> > On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> > On 4/5/19 12:17 AM, Thomas Gleixner wrote:
> >>> process. Is that an acceptable trade-off?
> >> You are not seriously asking whether creating a user controllable ret2dir
> >> attack window is a acceptable trade-off? April 1st was a few days ago.
> > 
> > Well, let's not forget that this set at least takes us from "always
> > vulnerable to ret2dir" to a choice between:
> > 
> > 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
> > 2. slow and "mitigated against ret2dir"
> > 
> > Sounds like we need a mechanism that will do the deferred XPFO TLB
> > flushes whenever the kernel is entered, and not _just_ at context switch
> > time.  This permits an app to run in userspace with stale kernel TLB
> > entries as long as it wants... that's harmless.
> 
> I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).
> 
> I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.
> 
> (Actually, checking the bits is suboptimal. Do:
> 
> unsigned long offset = *rsp - rsp;
> offset >>= THREAD_SHIFT;
> if (unlikely(offset))
> BUG();
> POP RSP;

This is a neat trick, and definitely prevents going random places in
the heap. But,

> This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)

Why is this true? All you're checking is that you can't shift the
"location" of the stack. If you can inject stuff into a stack buffer,
can't you just inject the right frame to return to your code as well,
so you don't have to shift locations?

Thanks!

Tycho
Khalid Aziz April 5, 2019, 3:56 p.m. UTC | #9
On 4/5/19 9:24 AM, Andy Lutomirski wrote:
> 
> 
>> On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>>> process. Is that an acceptable trade-off?
>>> You are not seriously asking whether creating a user controllable ret2dir
>>> attack window is a acceptable trade-off? April 1st was a few days ago.
>>
>> Well, let's not forget that this set at least takes us from "always
>> vulnerable to ret2dir" to a choice between:
>>
>> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
>> 2. slow and "mitigated against ret2dir"
>>
>> Sounds like we need a mechanism that will do the deferred XPFO TLB
>> flushes whenever the kernel is entered, and not _just_ at context switch
>> time.  This permits an app to run in userspace with stale kernel TLB
>> entries as long as it wants... that's harmless.
> 
> I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).

Performance impact might not be all that much from a flush at kernel
entry. This flush will happen only if there is a pending flush posted
to the processor, and it will be done in lieu of the flush at the next
context switch. So we are not looking at adding more TLB flushes,
rather changing where they happen. That can still result in some
performance impact, and measuring it with real code will be the only
way to get that number.

> 
> I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.
> 
> (Actually, checking the bits is suboptimal. Do:
> 
> unsigned long offset = *rsp - rsp;
> offset >>= THREAD_SHIFT;
> if (unlikely(offset))
> BUG();
> POP RSP;
> 
> This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)
> 
> In other words, I think that ret2dir is an insufficient justification for XPFO.
> 

That is something we may want to explore further. Closing down
/proc/<pid>/pagemap has already helped reduce one way to mount a
ret2dir attack. The physmap spraying technique still remains viable.
The XPFO implementation is expensive. Can we do something different to
mitigate physmap spraying?

Thanks,
Khalid
Dave Hansen April 5, 2019, 4:01 p.m. UTC | #10
On 4/5/19 8:24 AM, Andy Lutomirski wrote:
>> Sounds like we need a mechanism that will do the deferred XPFO TLB 
>> flushes whenever the kernel is entered, and not _just_ at context
>> switch time.  This permits an app to run in userspace with stale
>> kernel TLB entries as long as it wants... that's harmless.
...
> I suppose we could do the flush at context switch *and*
> entry.  I bet that performance still utterly sucks, though — on many
> workloads, this turns every entry into a full flush, and we already
> know exactly how much that sucks — it’s identical to KPTI without
> PCID.  (And yes, if we go this route, we need to merge this logic
> together — we shouldn’t write CR3 twice on entry).

Yeah, probably true.

Just eyeballing this, it would mean mapping the "cpu needs deferred
flush" variable into the cpu_entry_area, which doesn't seem too awful.

I think the basic overall concern is that the deferred flush leaves too
many holes and by the time we close them sufficiently, performance will
suck again.  Seems like a totally valid concern, but my crystal ball is
hazy on whether it will be worth it in the end to many folks

...
> In other words, I think that ret2dir is an insufficient justification
> for XPFO.

Yeah, other things that it is good for have kinda been lost in the
noise.  I think I first started looking at this long before Meltdown and
L1TF were public.

There are hypervisors out there that simply don't (persistently) map
user data.  They can't leak user data because they don't even have
access to it in their virtual address space.  Those hypervisors had a
much easier time with L1TF mitigation than we did.  Basically, they
could flush the L1 after user data was accessible instead of before
untrusted guest code runs.

My hope is that XPFO could provide us similar protection.  But,
somebody's got to poke at it for a while to see how far they can push it.

IMNHO, XPFO is *always* going to be painful for kernel compiles.  But,
cloud providers aren't doing a lot of kernel compiles on their KVM
hosts, and they deeply care about leaking their users' data.
Andy Lutomirski April 5, 2019, 4:27 p.m. UTC | #11
> On Apr 5, 2019, at 10:01 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 4/5/19 8:24 AM, Andy Lutomirski wrote:
>>> Sounds like we need a mechanism that will do the deferred XPFO TLB 
>>> flushes whenever the kernel is entered, and not _just_ at context
>>> switch time.  This permits an app to run in userspace with stale
>>> kernel TLB entries as long as it wants... that's harmless.
> ...
>> I suppose we could do the flush at context switch *and*
>> entry.  I bet that performance still utterly sucks, though — on many
>> workloads, this turns every entry into a full flush, and we already
>> know exactly how much that sucks — it’s identical to KPTI without
>> PCID.  (And yes, if we go this route, we need to merge this logic
>> together — we shouldn’t write CR3 twice on entry).
> 
> Yeah, probably true.
> 
> Just eyeballing this, it would mean mapping the "cpu needs deferred
> flush" variable into the cpu_entry_area, which doesn't seem too awful.
> 
> I think the basic overall concern is that the deferred flush leaves too
> many holes and by the time we close them sufficiently, performance will
> suck again.  Seems like a totally valid concern, but my crystal ball is
> hazy on whether it will be worth it in the end to many folks
> 
> ...
>> In other words, I think that ret2dir is an insufficient justification
>> for XPFO.
> 
> Yeah, other things that it is good for have kinda been lost in the
> noise.  I think I first started looking at this long before Meltdown and
> L1TF were public.
> 
> There are hypervisors out there that simply don't (persistently) map
> user data.  They can't leak user data because they don't even have
> access to it in their virtual address space.  Those hypervisors had a
> much easier time with L1TF mitigation than we did.  Basically, they
> could flush the L1 after user data was accessible instead of before
> untrusted guest code runs.
> 
> My hope is that XPFO could provide us similar protection.  But,
> somebody's got to poke at it for a while to see how far they can push it.
> 
> IMNHO, XPFO is *always* going to be painful for kernel compiles.  But,
> cloud providers aren't doing a lot of kernel compiles on their KVM
> hosts, and they deeply care about leaking their users' data.

At the risk of asking stupid questions: we already have a mechanism for this: highmem.  Can we enable highmem on x86_64, maybe with some heuristics to make it work well?
Andy Lutomirski April 5, 2019, 4:32 p.m. UTC | #12
> On Apr 5, 2019, at 9:56 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> 
>> On Fri, Apr 05, 2019 at 09:24:50AM -0600, Andy Lutomirski wrote:
>> 
>> 
>>> On Apr 5, 2019, at 8:44 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>> 
>>> On 4/5/19 12:17 AM, Thomas Gleixner wrote:
>>>>> process. Is that an acceptable trade-off?
>>>> You are not seriously asking whether creating a user controllable ret2dir
>>>> attack window is a acceptable trade-off? April 1st was a few days ago.
>>> 
>>> Well, let's not forget that this set at least takes us from "always
>>> vulnerable to ret2dir" to a choice between:
>>> 
>>> 1. fast-ish and "vulnerable to ret2dir for a user-controllable window"
>>> 2. slow and "mitigated against ret2dir"
>>> 
>>> Sounds like we need a mechanism that will do the deferred XPFO TLB
>>> flushes whenever the kernel is entered, and not _just_ at context switch
>>> time.  This permits an app to run in userspace with stale kernel TLB
>>> entries as long as it wants... that's harmless.
>> 
>> I don’t think this is good enough. The bad guys can enter the kernel and arrange for the kernel to wait, *in kernel*, for long enough to set up the attack.  userfaultfd is the most obvious way, but there are plenty. I suppose we could do the flush at context switch *and* entry.  I bet that performance still utterly sucks, though — on many workloads, this turns every entry into a full flush, and we already know exactly how much that sucks — it’s identical to KPTI without PCID.  (And yes, if we go this route, we need to merge this logic together — we shouldn’t write CR3 twice on entry).
>> 
>> I feel like this whole approach is misguided. ret2dir is not such a game changer that fixing it is worth huge slowdowns. I think all this effort should be spent on some kind of sensible CFI. For example, we should be able to mostly squash ret2anything by inserting a check that the high bits of RSP match the value on the top of the stack before any code that pops RSP.  On an FPO build, there aren’t all that many hot POP RSP instructions, I think.
>> 
>> (Actually, checking the bits is suboptimal. Do:
>> 
>> unsigned long offset = *rsp - rsp;
>> offset >>= THREAD_SHIFT;
>> if (unlikely(offset))
>> BUG();
>> POP RSP;
> 
> This is a neat trick, and definitely prevents going random places in
> the heap. But,
> 
>> This means that it’s also impossible to trick a function to return into a buffer that is on that function’s stack.)
> 
> Why is this true? All you're checking is that you can't shift the
> "location" of the stack. If you can inject stuff into a stack buffer,
> can't you just inject the right frame to return to your code as well,
> so you don't have to shift locations?
> 
> 

But the injected ROP payload will be *below* RSP, so you’ll need a gadget that can decrement RSP.  This makes the attack a good deal harder.

Something like RAP on top, or CET, will make this even harder.

> 
> Tycho
Peter Zijlstra April 5, 2019, 4:41 p.m. UTC | #13
On Fri, Apr 05, 2019 at 10:27:05AM -0600, Andy Lutomirski wrote:
> At the risk of asking stupid questions: we already have a mechanism
> for this: highmem.  Can we enable highmem on x86_64, maybe with some
> heuristics to make it work well?

That's what I said; but note that I'm still not convinced fixmap/highmem
is actually correct TLB wise.
Khalid Aziz April 5, 2019, 5:35 p.m. UTC | #14
On 4/5/19 10:27 AM, Andy Lutomirski wrote:
> At the risk of asking stupid questions: we already have a mechanism for this: highmem.  Can we enable highmem on x86_64, maybe with some heuristics to make it work well?
> 

I proposed using highmem infrastructure for xpfo in my cover letter as
well as in my earlier replies discussing redesigning
xpfo_kmap/xpfo_kunmap. So that sounds like a reasonable question to me
:) Looks like we might be getting to an agreement that highmem
infrastructure is a good thing to use here.

--
Khalid

Patch

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@  extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 999d6d8f0bef..cc806a01a0eb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -37,6 +37,20 @@ 
  */
 #define LAST_USER_MM_IBPB	0x1UL
 
+/*
+ * A TLB flush may be needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space. This TLB flush needs to be propagated to
+ * all CPUs. Asynchronous flush requests to all CPUs can cause
+ * significant performance impact. Queue a pending flush for
+ * a CPU instead. Multiple of these requests can then be handled
+ * by a CPU at a less disruptive time, like context switch, in
+ * one go and reduce performance impact significantly. Following
+ * data structure is used to keep track of CPUs with pending full
+ * TLB flush forced by xpfo.
+ */
+static cpumask_t pending_xpfo_flush;
+
 /*
  * We get here when we do something requiring a TLB invalidation
  * but could not go invalidate all of the contexts.  We do the
@@ -321,6 +335,16 @@  void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
+
+	/*
+	 * If there is a pending TLB flush for this CPU due to XPFO
+	 * flush, do it now.
+	 */
+	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
+		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+		__flush_tlb_all();
+	}
+
 	this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	/*
@@ -803,6 +827,34 @@  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	struct cpumask tmp_mask;
+
+	/*
+	 * Balance as user space task's flush, a bit conservative.
+	 * Do a local flush immediately and post a pending flush on all
+	 * other CPUs. Local flush can be a range flush or full flush
+	 * depending upon the number of entries to be flushed. Remote
+	 * flushes will be done by individual processors at the time of
+	 * context switch and this allows multiple flush requests from
+	 * other CPUs to be batched together.
+	 */
+	if (end == TLB_FLUSH_ALL ||
+	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
+		do_flush_tlb_all(NULL);
+	} else {
+		struct flush_tlb_info info;
+
+		info.start = start;
+		info.end = end;
+		do_kernel_range_flush(&info);
+	}
+	cpumask_setall(&tmp_mask);
+	__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
+	cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info info = {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index b42513347865..638eee5b1f09 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -118,7 +118,7 @@  inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 		return;
 	}
 
-	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+	xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
 EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);