
x86: entry: flush the cache if syscall error

Message ID 20181011185458.10186-1-kristen@linux.intel.com (mailing list archive)
State New, archived
Series: x86: entry: flush the cache if syscall error

Commit Message

Kristen Carlson Accardi Oct. 11, 2018, 6:54 p.m. UTC
This patch aims to make it harder to perform cache timing attacks on data
left behind by system calls. If we have an error returned from a syscall,
flush the L1 cache.

It's important to note that this patch is not addressing any specific
exploit, nor is it intended to be a complete defense against anything.
It is intended to be a low cost way of eliminating some of the side effects
of a failed system call.

A performance test using sysbench on one hyperthread and a script which
attempts to repeatedly access files it does not have permission to access
on the other hyperthread found no significant performance impact.

Suggested-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 arch/x86/Kconfig        |  9 +++++++++
 arch/x86/entry/common.c | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)

Comments

Andy Lutomirski Oct. 11, 2018, 7:25 p.m. UTC | #1
On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
<kristen@linux.intel.com> wrote:
>
> This patch aims to make it harder to perform cache timing attacks on data
> left behind by system calls. If we have an error returned from a syscall,
> flush the L1 cache.
>
> It's important to note that this patch is not addressing any specific
> exploit, nor is it intended to be a complete defense against anything.
> It is intended to be a low cost way of eliminating some of the side effects
> of a failed system call.
>
> A performance test using sysbench on one hyperthread and a script which
> attempts to repeatedly access files it does not have permission to access
> on the other hyperthread found no significant performance impact.
>

> +__visible inline void l1_cache_flush(struct pt_regs *regs)
> +{
> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
> +                       return;
> +
> +               wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +       }
> +}

Ugh.

What exactly is this trying to protect against?  And how many cycles
should we expect L1D_FLUSH to take?

ISTM that, if we have a situation where the L1D can be read by user
code, we lose, via hyperthreading, successful syscalls, /dev/random,
and many other vectors.  This seems like a small mitigation at a rather
large cost.
Kristen Carlson Accardi Oct. 11, 2018, 8:15 p.m. UTC | #2
On Thu, 2018-10-11 at 12:25 -0700, Andy Lutomirski wrote:
> On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> <kristen@linux.intel.com> wrote:
> > 
> > This patch aims to make it harder to perform cache timing attacks
> > on data
> > left behind by system calls. If we have an error returned from a
> > syscall,
> > flush the L1 cache.
> > 
> > It's important to note that this patch is not addressing any
> > specific
> > exploit, nor is it intended to be a complete defense against
> > anything.
> > It is intended to be a low cost way of eliminating some of the side
> > effects
> > of a failed system call.
> > 
> > A performance test using sysbench on one hyperthread and a script
> > which
> > attempts to repeatedly access files it does not have permission to
> > access
> > on the other hyperthread found no significant performance impact.
> > 
> > +__visible inline void l1_cache_flush(struct pt_regs *regs)
> > +{
> > +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> > +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> > +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> > +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> > +                   regs->ax == -ENOTCONN || regs->ax ==
> > -EINPROGRESS)
> > +                       return;
> > +
> > +               wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > +       }
> > +}
> 
> Ugh.
> 
> What exactly is this trying to protect against?  And how many cycles
> should we expect L1D_FLUSH to take?

As I mentioned in the commit message, this is not addressing any
specific exploit. It is removing any side effects from a failed system
call in the L1 cache.

> 
> ISTM that, if we have a situation where the L1D can be read by user
> code, we lose, via hyperthreading, successful syscalls, /dev/random,
> and many other vectors.  This seems like a small mitigation at a
> rather
> large cost.

I pinned an evil task to one hyperthread that just caused L1 flushes by
issuing failed system calls. On the other hyperthread, I ran a
performance benchmark (sysbench). I did not see any difference between
the baseline and the kernel with the patch applied. Is there a more
appropriate test you'd be interested in seeing the results of? I'd be
happy to design a different test.
Alan Cox Oct. 11, 2018, 8:25 p.m. UTC | #3
> Ugh.
> 
> What exactly is this trying to protect against?  And how many cycles

Most attacks by speculation rely upon leaving footprints in the L1 cache.
They also almost inevitably resolve non-speculatively to errors. If you
look through all the 'yet another potential spectre case' patches people
have found, they would have been rendered close to useless by this change.

It's a way to deal with the ones we don't know about, all the ones the
tools won't find, and it has pretty much zero cost.

(If you are bored, strace an entire day's desktop session, bang it through
a script or two to extract the number of triggering error returns and do
the maths...)

> should we expect L1D_FLUSH to take?

More to the point you pretty much never trigger it. Errors are not the
normal path in real code. The original version of this code emptied the
L1 the hard way - and even then it was in the noise for real workloads we
tried.

You can argue that the other thread could be some evil task that
deliberately triggers flushes, but it can already thrash the L1 on
processors that share L1 between threads using perfectly normal memory
instructions.

Alan
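
To illustrate that last point, a sibling hyperthread can displace a shared
L1D with nothing but ordinary loads. A minimal userspace sketch (the 64K
working-set size and the buffer name are assumptions; anything larger than
the shared L1D works):

/*
 * Hypothetical L1D-thrashing loop: touch one byte per cache line of a
 * buffer larger than L1D, forever, so the sibling thread's lines keep
 * getting evicted. No privileged flush primitive is needed.
 */
#include <stddef.h>
#include <stdint.h>

#define WORKING_SET	(64 * 1024)	/* assumed > L1D size */
#define LINE_SIZE	64

static volatile uint8_t buf[WORKING_SET];

int main(void)
{
	for (;;) {
		for (size_t i = 0; i < WORKING_SET; i += LINE_SIZE)
			(void)buf[i];
	}
}
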
Andy Lutomirski Oct. 11, 2018, 8:47 p.m. UTC | #4
On Thu, Oct 11, 2018 at 1:25 PM Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>
> > Ugh.
> >
> > What exactly is this trying to protect against?  And how many cycles
>
> Most attacks by speculation rely upon leaving footprints in the L1 cache.
> They also almost inevitably resolve non-speculatively to errors. If you
> look through all the 'yet another potential spectre case' patches people
> have found, they would have been rendered close to useless by this change.

Can you give an example?  AFAIK the interesting Meltdown-like attacks
are unaffected because Meltdown doesn't actually need the target data
to be in L1D.  And most of the Spectre-style attacks would have been
blocked by doing LFENCE on the error cases (and somehow making sure
that the CPU doesn't speculate around the LFENCE without noticing it).
But this patch is doing an L1D flush, which, as far as I've heard,
isn't actually relevant.

>
> It's a way to deal with the ones we don't know about, all the ones the
> tools won't find, and it has pretty much zero cost.
>
> (If you are bored, strace an entire day's desktop session, bang it through
> a script or two to extract the number of triggering error returns and do
> the maths...)
>
> > should we expect L1D_FLUSH to take?
>
> More to the point you pretty much never trigger it. Errors are not the
> normal path in real code. The original version of this code emptied the
> L1 the hard way - and even then it was in the noise for real workloads we
> tried.
>
> You can argue that the other thread could be some evil task that
> deliberately triggers flushes, but it can already thrash the L1 on
> processors that share L1 between threads using perfectly normal memory
> instructions.
>

That's not what I meant.  I meant that, if an attacker can run code on
*both* logical threads on the same CPU, then they can run their attack
code on the other logical thread before the L1D_FLUSH command takes
effect.

I care about the performance of single-threaded workloads, though.
How slow is this thing?  No one cares about syscall performance on
regular desktop sessions except for gaming.   But people do care about
syscall performance on all kinds of crazy server, database, etc.
workloads.  And compilation.  And HPC stuff, although that mostly
doesn't involve syscalls.  So: benchmarks, please.  And estimated
cycle counts, please, on at least a couple of relevant CPU
generations.

On Meltdown-affected CPUs, we're doing a CR3 write anyway, which is
fully serializing, so it's slow.  But AFAIK that *already* blocks most
of these attacks except L1TF, and L1TF has (hopefully!) been fixed
anyway on Linux.
Andy Lutomirski Oct. 11, 2018, 8:48 p.m. UTC | #5
On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
<kristen@linux.intel.com> wrote:
>
> This patch aims to make it harder to perform cache timing attacks on data
> left behind by system calls. If we have an error returned from a syscall,
> flush the L1 cache.
>
> It's important to note that this patch is not addressing any specific
> exploit, nor is it intended to be a complete defense against anything.
> It is intended to be a low cost way of eliminating some of side effects
> of a failed system call.
>
> A performance test using sysbench on one hyperthread and a script which
> attempts to repeatedly access files it does not have permission to access
> on the other hyperthread found no significant performance impact.
>
> Suggested-by: Alan Cox <alan@linux.intel.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> ---
>  arch/x86/Kconfig        |  9 +++++++++
>  arch/x86/entry/common.c | 18 ++++++++++++++++++
>  2 files changed, 27 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1a0be022f91d..bde978eb3b4e 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -445,6 +445,15 @@ config RETPOLINE
>           code are eliminated. Since this includes the syscall entry path,
>           it is not entirely pointless.
>
> +config SYSCALL_FLUSH
> +       bool "Clear L1 Cache on syscall errors"
> +       default n
> +       help
> +         Selecting 'y' allows the L1 cache to be cleared upon return of
> +         an error code from a syscall if the CPU supports "flush_l1d".
> +         This may reduce the likelihood of speculative execution style
> +         attacks on syscalls.
> +
>  config INTEL_RDT
>         bool "Intel Resource Director Technology support"
>         default n
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 3b2490b81918..26de8ea71293 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
>         prepare_exit_to_usermode(regs);
>  }
>
> +__visible inline void l1_cache_flush(struct pt_regs *regs)
> +{
> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)

What about ax > 0?  (Or more generally, any ax outside the range of -1
.. -4095 or whatever the error range is.)  As it stands, it looks like
you'll flush on successful read(), write(), recv(), etc, and that
could seriously hurt performance on real workloads.

> +                       return;
> +
> +               wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> +       }
> +}
> +
>  #ifdef CONFIG_X86_64
>  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  {
> @@ -290,6 +304,8 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>                 regs->ax = sys_call_table[nr](regs);
>         }
>
> +       l1_cache_flush(regs);
> +
>         syscall_return_slowpath(regs);
>  }
>  #endif
> @@ -338,6 +354,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
>  #endif /* CONFIG_IA32_EMULATION */
>         }
>
> +       l1_cache_flush(regs);
> +
>         syscall_return_slowpath(regs);
>  }
>
> --
> 2.14.4
>
Kees Cook Oct. 11, 2018, 8:55 p.m. UTC | #6
On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> <kristen@linux.intel.com> wrote:
>>
>> This patch aims to make it harder to perform cache timing attacks on data
>> left behind by system calls. If we have an error returned from a syscall,
>> flush the L1 cache.
>>
>> It's important to note that this patch is not addressing any specific
>> exploit, nor is it intended to be a complete defense against anything.
>> It is intended to be a low cost way of eliminating some of side effects
>> of a failed system call.
>>
>> A performance test using sysbench on one hyperthread and a script which
>> attempts to repeatedly access files it does not have permission to access
>> on the other hyperthread found no significant performance impact.
>>
>> Suggested-by: Alan Cox <alan@linux.intel.com>
>> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
>> ---
>>  arch/x86/Kconfig        |  9 +++++++++
>>  arch/x86/entry/common.c | 18 ++++++++++++++++++
>>  2 files changed, 27 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 1a0be022f91d..bde978eb3b4e 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -445,6 +445,15 @@ config RETPOLINE
>>           code are eliminated. Since this includes the syscall entry path,
>>           it is not entirely pointless.
>>
>> +config SYSCALL_FLUSH
>> +       bool "Clear L1 Cache on syscall errors"
>> +       default n
>> +       help
>> +         Selecting 'y' allows the L1 cache to be cleared upon return of
>> +         an error code from a syscall if the CPU supports "flush_l1d".
>> +         This may reduce the likelihood of speculative execution style
>> +         attacks on syscalls.
>> +
>>  config INTEL_RDT
>>         bool "Intel Resource Director Technology support"
>>         default n
>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>> index 3b2490b81918..26de8ea71293 100644
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
>>         prepare_exit_to_usermode(regs);
>>  }
>>
>> +__visible inline void l1_cache_flush(struct pt_regs *regs)
>> +{
>> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
>> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
>> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
>> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
>> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
>> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
>
> What about ax > 0?  (Or more generally, any ax outside the range of -1
> .. -4095 or whatever the error range is.)  As it stands, it looks like
> you'll flush on successful read(), write(), recv(), etc, and that
> could seriously hurt performance on real workloads.

Seems like just changing this "ax == 0" into "ax >= 0" would solve that?

I think this looks like a good idea. It might be worth adding a
comment about the checks to explain why those errors are whitelisted.
It's a cheap and effective mitigation for "unknown future problems"
that doesn't degrade normal workloads.

>> +                       return;
>> +
>> +               wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);

What about CPUs without FLUSH_L1D? Could it be done manually with a
memcpy or something?

-Kees

>> +       }
>> +}
>> +
>>  #ifdef CONFIG_X86_64
>>  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>>  {
>> @@ -290,6 +304,8 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>>                 regs->ax = sys_call_table[nr](regs);
>>         }
>>
>> +       l1_cache_flush(regs);
>> +
>>         syscall_return_slowpath(regs);
>>  }
>>  #endif
>> @@ -338,6 +354,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
>>  #endif /* CONFIG_IA32_EMULATION */
>>         }
>>
>> +       l1_cache_flush(regs);
>> +
>>         syscall_return_slowpath(regs);
>>  }
>>
>> --
>> 2.14.4
>>
Andy Lutomirski Oct. 11, 2018, 9:17 p.m. UTC | #7
On Thu, Oct 11, 2018 at 1:55 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> > <kristen@linux.intel.com> wrote:
> >>
> >> This patch aims to make it harder to perform cache timing attacks on data
> >> left behind by system calls. If we have an error returned from a syscall,
> >> flush the L1 cache.
> >>
> >> It's important to note that this patch is not addressing any specific
> >> exploit, nor is it intended to be a complete defense against anything.
> >> It is intended to be a low cost way of eliminating some of the side effects
> >> of a failed system call.
> >>
> >> A performance test using sysbench on one hyperthread and a script which
> >> attempts to repeatedly access files it does not have permission to access
> >> on the other hyperthread found no significant performance impact.
> >>
> >> Suggested-by: Alan Cox <alan@linux.intel.com>
> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> ---
> >>  arch/x86/Kconfig        |  9 +++++++++
> >>  arch/x86/entry/common.c | 18 ++++++++++++++++++
> >>  2 files changed, 27 insertions(+)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index 1a0be022f91d..bde978eb3b4e 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -445,6 +445,15 @@ config RETPOLINE
> >>           code are eliminated. Since this includes the syscall entry path,
> >>           it is not entirely pointless.
> >>
> >> +config SYSCALL_FLUSH
> >> +       bool "Clear L1 Cache on syscall errors"
> >> +       default n
> >> +       help
> >> +         Selecting 'y' allows the L1 cache to be cleared upon return of
> >> +         an error code from a syscall if the CPU supports "flush_l1d".
> >> +         This may reduce the likelihood of speculative execution style
> >> +         attacks on syscalls.
> >> +
> >>  config INTEL_RDT
> >>         bool "Intel Resource Director Technology support"
> >>         default n
> >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> >> index 3b2490b81918..26de8ea71293 100644
> >> --- a/arch/x86/entry/common.c
> >> +++ b/arch/x86/entry/common.c
> >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
> >>         prepare_exit_to_usermode(regs);
> >>  }
> >>
> >> +__visible inline void l1_cache_flush(struct pt_regs *regs)
> >> +{
> >> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> >> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> >> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> >> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> >> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> >> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
> >
> > What about ax > 0?  (Or more generally, any ax outside the range of -1
> > .. -4095 or whatever the error range is.)  As it stands, it looks like
> > you'll flush on successful read(), write(), recv(), etc, and that
> > could seriously hurt performance on real workloads.
>
> Seems like just changing this "ax == 0" into "ax >= 0" would solve that?

I can easily imagine that there are other errors for which performance
matters.  EBUSY comes to mind.

>
> I think this looks like a good idea. It might be worth adding a
> comment about the checks to explain why those errors are whitelisted.
> It's a cheap and effective mitigation for "unknown future problems"
> that doesn't degrade normal workloads.

I still want to see two pieces of information before I ack a patch like this:

 - How long does L1D_FLUSH take, roughly?

 - An example of a type of attack that would be mitigated.

For the latter, I assume that the goal is to mitigate against attacks
where a syscall speculatively loads something sensitive and then
fails.  But, before it fails, it leaks the information it
speculatively loaded, and that leak ended up in L1D but *not* in other
cache levels.  And somehow the L1D can't be probed quickly enough in a
parallel thread to meaningfully get the information out of L1D.  Or
maybe it's trying to mitigate against the use of failing syscalls to
get some sensitive data into L1D and then using something like L1TF to
read it out.

But this really needs to be clarified.  Alan said that a bunch of the
"yet another Spectre variant" attacks would have been mitigated by
this patch.  An explanation of *how* would be in order.

And we should seriously consider putting this kind of thing under
CONFIG_SPECULATIVE_MITIGATIONS or similar.  The idea being that it
doesn't mitigate a clear known attack family.
Kristen Carlson Accardi Oct. 11, 2018, 9:23 p.m. UTC | #8
On Thu, 2018-10-11 at 13:55 -0700, Kees Cook wrote:
> On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org>
> wrote:
> > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> > <kristen@linux.intel.com> wrote:
> > > 
> > > This patch aims to make it harder to perform cache timing attacks
> > > on data
> > > left behind by system calls. If we have an error returned from a
> > > syscall,
> > > flush the L1 cache.
> > > 
> > > It's important to note that this patch is not addressing any
> > > specific
> > > exploit, nor is it intended to be a complete defense against
> > > anything.
> > > It is intended to be a low cost way of eliminating some of the side
> > > effects
> > > of a failed system call.
> > > 
> > > A performance test using sysbench on one hyperthread and a script
> > > which
> > > attempts to repeatedly access files it does not have permission
> > > to access
> > > on the other hyperthread found no significant performance impact.
> > > 
> > > Suggested-by: Alan Cox <alan@linux.intel.com>
> > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > > ---
> > >  arch/x86/Kconfig        |  9 +++++++++
> > >  arch/x86/entry/common.c | 18 ++++++++++++++++++
> > >  2 files changed, 27 insertions(+)
> > > 
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 1a0be022f91d..bde978eb3b4e 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -445,6 +445,15 @@ config RETPOLINE
> > >           code are eliminated. Since this includes the syscall
> > > entry path,
> > >           it is not entirely pointless.
> > > 
> > > +config SYSCALL_FLUSH
> > > +       bool "Clear L1 Cache on syscall errors"
> > > +       default n
> > > +       help
> > > +         Selecting 'y' allows the L1 cache to be cleared upon
> > > return of
> > > +         an error code from a syscall if the CPU supports
> > > "flush_l1d".
> > > +         This may reduce the likelihood of speculative execution
> > > style
> > > +         attacks on syscalls.
> > > +
> > >  config INTEL_RDT
> > >         bool "Intel Resource Director Technology support"
> > >         default n
> > > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > > index 3b2490b81918..26de8ea71293 100644
> > > --- a/arch/x86/entry/common.c
> > > +++ b/arch/x86/entry/common.c
> > > @@ -268,6 +268,20 @@ __visible inline void
> > > syscall_return_slowpath(struct pt_regs *regs)
> > >         prepare_exit_to_usermode(regs);
> > >  }
> > > 
> > > +__visible inline void l1_cache_flush(struct pt_regs *regs)
> > > +{
> > > +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> > > +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > > +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> > > +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> > > +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT
> > > ||
> > > +                   regs->ax == -ENOTCONN || regs->ax ==
> > > -EINPROGRESS)
> > 
> > What about ax > 0?  (Or more generally, any ax outside the range of
> > -1
> > .. -4095 or whatever the error range is.)  As it stands, it looks
> > like
> > you'll flush on successful read(), write(), recv(), etc, and that
> > could seriously hurt performance on real workloads.
> 
> Seems like just changing this "ax == 0" into "ax >= 0" would
> solve that?

thanks, will do.

> 
> I think this looks like a good idea. It might be worth adding a
> comment about the checks to explain why those errors are whitelisted.
> It's a cheap and effective mitigation for "unknown future problems"
> that doesn't degrade normal workloads.
> 
> > > +                       return;
> > > +
> > > +               wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> 
> What about CPUs without FLUSH_L1D? Could it be done manually with a
> memcpy or something?

It could - my original implementation (pre l1d_flush msr) did, but it
did come with some additional cost in that I allocated per-cpu memory
to keep a 32K buffer around that I could memcpy. It also sacrificed
completeness for simplicity by not taking into account cases where L1
was not 32K. As far as I know this msr is pretty widely deployed, even
on older hardware.
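
A rough sketch of the kind of software fallback described above
(a hypothetical reconstruction, not the original code: the 32K size, the
buffer, and the function name are all assumptions, and a real version
would use per-cpu memory as described):

#include <linux/compiler.h>
#include <linux/types.h>

#define L1D_ASSUMED_SIZE	(32 * 1024)	/* the simplifying assumption */

static u8 l1d_flush_buf[L1D_ASSUMED_SIZE];

static void l1d_flush_software(void)
{
	int i;

	/*
	 * One write per cache line pulls each line of the buffer into
	 * L1D, displacing whatever previously occupied those sets.
	 */
	for (i = 0; i < L1D_ASSUMED_SIZE; i += 64)
		WRITE_ONCE(l1d_flush_buf[i], 0);
}

As noted further down the thread, KVM's vmx_l1d_flush() implements both
the MSR path and a software fill of roughly this shape.
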
Jann Horn Oct. 11, 2018, 9:42 p.m. UTC | #9
On Thu, Oct 11, 2018 at 10:56 PM Kees Cook <keescook@chromium.org> wrote:
> On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> > <kristen@linux.intel.com> wrote:
> >>
> >> This patch aims to make it harder to perform cache timing attacks on data
> >> left behind by system calls. If we have an error returned from a syscall,
> >> flush the L1 cache.
> >>
> >> It's important to note that this patch is not addressing any specific
> >> exploit, nor is it intended to be a complete defense against anything.
> >> It is intended to be a low cost way of eliminating some of the side effects
> >> of a failed system call.
> >>
> >> A performance test using sysbench on one hyperthread and a script which
> >> attempts to repeatedly access files it does not have permission to access
> >> on the other hyperthread found no significant performance impact.
> >>
> >> Suggested-by: Alan Cox <alan@linux.intel.com>
> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> >> ---
> >>  arch/x86/Kconfig        |  9 +++++++++
> >>  arch/x86/entry/common.c | 18 ++++++++++++++++++
> >>  2 files changed, 27 insertions(+)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index 1a0be022f91d..bde978eb3b4e 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -445,6 +445,15 @@ config RETPOLINE
> >>           code are eliminated. Since this includes the syscall entry path,
> >>           it is not entirely pointless.
> >>
> >> +config SYSCALL_FLUSH
> >> +       bool "Clear L1 Cache on syscall errors"
> >> +       default n
> >> +       help
> >> +         Selecting 'y' allows the L1 cache to be cleared upon return of
> >> +         an error code from a syscall if the CPU supports "flush_l1d".
> >> +         This may reduce the likelihood of speculative execution style
> >> +         attacks on syscalls.
> >> +
> >>  config INTEL_RDT
> >>         bool "Intel Resource Director Technology support"
> >>         default n
> >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> >> index 3b2490b81918..26de8ea71293 100644
> >> --- a/arch/x86/entry/common.c
> >> +++ b/arch/x86/entry/common.c
> >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
> >>         prepare_exit_to_usermode(regs);
> >>  }
> >>
> >> +__visible inline void l1_cache_flush(struct pt_regs *regs)
> >> +{
> >> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> >> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> >> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> >> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> >> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> >> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
> >
> > What about ax > 0?  (Or more generally, any ax outside the range of -1
> > .. -4095 or whatever the error range is.)  As it stands, it looks like
> > you'll flush on successful read(), write(), recv(), etc, and that
> > could seriously hurt performance on real workloads.
>
> Seems like just changing this "ax == 0" into "ax >= 0" would solve that?

As spender points out on twitter
(https://twitter.com/grsecurity/status/1050497259937370118 - thanks,
spender!), struct pt_regs stores register values as "unsigned long",
and so you'll need to use something like IS_ERR_VALUE().
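
Concretely, a minimal sketch of how the type-correct check could slot into
the posted function (IS_ERR_VALUE() is the existing helper from
include/linux/err.h; everything else is as in the patch, with the
whitelist elided):

#include <linux/err.h>

__visible inline void l1_cache_flush(struct pt_regs *regs)
{
	if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
	    static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
		/*
		 * regs->ax is unsigned long, so a signed test like
		 * "regs->ax >= 0" is always true. IS_ERR_VALUE() treats
		 * only -MAX_ERRNO..-1 as errors, so successful and
		 * positive return values skip the flush.
		 */
		if (!IS_ERR_VALUE(regs->ax))
			return;

		/* ... the per-errno whitelist (EAGAIN, ENOENT, ...) ... */

		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
	}
}
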
Jann Horn Oct. 11, 2018, 10:11 p.m. UTC | #10
On Thu, Oct 11, 2018 at 11:17 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Oct 11, 2018 at 1:55 PM Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> > > <kristen@linux.intel.com> wrote:
> > >>
> > >> This patch aims to make it harder to perform cache timing attacks on data
> > >> left behind by system calls. If we have an error returned from a syscall,
> > >> flush the L1 cache.
> > >>
> > >> It's important to note that this patch is not addressing any specific
> > >> exploit, nor is it intended to be a complete defense against anything.
> > >> It is intended to be a low cost way of eliminating some of the side effects
> > >> of a failed system call.
> > >>
> > >> A performance test using sysbench on one hyperthread and a script which
> > >> attempts to repeatedly access files it does not have permission to access
> > >> on the other hyperthread found no significant performance impact.
> > >>
> > >> Suggested-by: Alan Cox <alan@linux.intel.com>
> > >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
> > >> ---
> > >>  arch/x86/Kconfig        |  9 +++++++++
> > >>  arch/x86/entry/common.c | 18 ++++++++++++++++++
> > >>  2 files changed, 27 insertions(+)
> > >>
> > >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > >> index 1a0be022f91d..bde978eb3b4e 100644
> > >> --- a/arch/x86/Kconfig
> > >> +++ b/arch/x86/Kconfig
> > >> @@ -445,6 +445,15 @@ config RETPOLINE
> > >>           code are eliminated. Since this includes the syscall entry path,
> > >>           it is not entirely pointless.
> > >>
> > >> +config SYSCALL_FLUSH
> > >> +       bool "Clear L1 Cache on syscall errors"
> > >> +       default n
> > >> +       help
> > >> +         Selecting 'y' allows the L1 cache to be cleared upon return of
> > >> +         an error code from a syscall if the CPU supports "flush_l1d".
> > >> +         This may reduce the likelihood of speculative execution style
> > >> +         attacks on syscalls.
> > >> +
> > >>  config INTEL_RDT
> > >>         bool "Intel Resource Director Technology support"
> > >>         default n
> > >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > >> index 3b2490b81918..26de8ea71293 100644
> > >> --- a/arch/x86/entry/common.c
> > >> +++ b/arch/x86/entry/common.c
> > >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
> > >>         prepare_exit_to_usermode(regs);
> > >>  }
> > >>
> > >> +__visible inline void l1_cache_flush(struct pt_regs *regs)
> > >> +{
> > >> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> > >> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> > >> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> > >> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> > >> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> > >> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
> > >
> > > What about ax > 0?  (Or more generally, any ax outside the range of -1
> > > .. -4095 or whatever the error range is.)  As it stands, it looks like
> > > you'll flush on successful read(), write(), recv(), etc, and that
> > > could seriously hurt performance on real workloads.
> >
> > Seems like just changing this "ax == 0" into "ax >= 0" would solve that?
>
> I can easily imagine that there are other errors for which performance
> matters.  EBUSY comes to mind.
>
> >
> > I think this looks like a good idea. It might be worth adding a
> > comment about the checks to explain why those errors are whitelisted.
> > It's a cheap and effective mitigation for "unknown future problems"
> > that doesn't degrade normal workloads.
>
> I still want to see two pieces of information before I ack a patch like this:
>
>  - How long does L1D_FLUSH take, roughly?

(Especially if L1D is dirty and you can't just wipe it all.)

>  - An example of a type of attack that would be mitigated.
>
> For the latter, I assume that the goal is to mitigate against attacks
> where a syscall speculatively loads something sensitive and then
> fails.  But, before it fails, it leaks the information it
> speculatively loaded, and that leak ended up in L1D but *not* in other
> cache levels.  And somehow the L1D can't be probed quickly enough in a
> parallel thread to meaningfully get the information out of L1D.

And the attacker can't delay the syscall return somehow, e.g. on a
fully-preemptible kernel by preempting the syscall, or by tarpitting a
userspace memory access in the error path.

> Or
> maybe it's trying to mitigate against the use of failing syscalls to
> get some sensitive data into L1D and then using something like L1TF to
> read it out.
>
> But this really needs to be clarified.  Alan said that a bunch of the
> "yet another Spectre variant" attacks would have been mitigated by
> this patch.  An explanation of *how* would be in order.
>
> And we should seriously consider putting this kind of thing under
> CONFIG_SPECULATIVE_MITIGATIONS or similar.  The idea being that it
> doesn't mitigate a clear known attack family.
Thomas Gleixner Oct. 11, 2018, 10:33 p.m. UTC | #11
On Thu, 11 Oct 2018, Kees Cook wrote:
> On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi
> >> +__visible inline void l1_cache_flush(struct pt_regs *regs)
> >> +{
> >> +       if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
> >> +           static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
> >> +               if (regs->ax == 0 || regs->ax == -EAGAIN ||
> >> +                   regs->ax == -EEXIST || regs->ax == -ENOENT ||
> >> +                   regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
> >> +                   regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
> >
> > What about ax > 0?  (Or more generally, any ax outside the range of -1
> > .. -4095 or whatever the error range is.)  As it stands, it looks like
> > you'll flush on successful read(), write(), recv(), etc, and that
> > could seriously hurt performance on real workloads.
> 
> Seems like just changing this "ax == 0" into "ax >= 0" would solve that?
> 
> I think this looks like a good idea. It might be worth adding a
> comment about the checks to explain why those errors are whitelisted.
> It's a cheap and effective mitigation for "unknown future problems"
> that doesn't degrade normal workloads.

pt_regs->ax is unsigned long, so you want to check this with IS_ERR_VALUE()
first.

               if (!IS_ERR_VALUE(regs->ax))
                       return;

and then you really want to have something smarter than a gazillion
whitelisted error value checks, which effectively compile into a gazillion
conditional branches.

Thanks,

	tglx
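
One way to avoid the branch chain (a hypothetical sketch, not a posted
revision: the helper names, the init hook, and the exact exemption set are
all assumptions for illustration) is a bitmap indexed by errno:

#include <linux/bitmap.h>
#include <linux/err.h>
#include <linux/init.h>

static DECLARE_BITMAP(l1d_flush_exempt, MAX_ERRNO + 1);

static int __init l1d_flush_setup(void)
{
	/* Errors considered hot enough to be exempted from the flush. */
	__set_bit(EAGAIN, l1d_flush_exempt);
	__set_bit(EEXIST, l1d_flush_exempt);
	__set_bit(ENOENT, l1d_flush_exempt);
	__set_bit(EXDEV, l1d_flush_exempt);
	__set_bit(ETIMEDOUT, l1d_flush_exempt);
	__set_bit(ENOTCONN, l1d_flush_exempt);
	__set_bit(EINPROGRESS, l1d_flush_exempt);
	return 0;
}
core_initcall(l1d_flush_setup);

static bool l1d_flush_wanted(unsigned long ax)
{
	if (!IS_ERR_VALUE(ax))
		return false;	/* success or positive return value */

	/* One table lookup instead of a chain of conditional branches. */
	return !test_bit(-(long)ax, l1d_flush_exempt);
}

A dense switch or a sorted-array lookup would serve as well; the point is
a single predictable test on the error path.
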
Thomas Gleixner Oct. 11, 2018, 11:15 p.m. UTC | #12
On Thu, 11 Oct 2018, Jann Horn wrote:
> On Thu, Oct 11, 2018 at 10:56 PM Kees Cook <keescook@chromium.org> wrote:
> > Seems like just changing this "ax == 0" into "ax >= 0" would solve that?
> 
> As spender points out on twitter
> (https://twitter.com/grsecurity/status/1050497259937370118 - thanks,
> spender!), struct pt_regs stores register values as "unsigned long",
> and so you'll need to use something like IS_ERR_VALUE().

Ha, so it's not only me who noticed, but then I'm not reading twitter...

Thanks,

	tglx
Thomas Gleixner Oct. 11, 2018, 11:43 p.m. UTC | #13
On Thu, 11 Oct 2018, Kristen C Accardi wrote:
> On Thu, 2018-10-11 at 13:55 -0700, Kees Cook wrote:
> > I think this looks like a good idea. It might be worth adding a
> > comment about the checks to explain why those errors are whitelisted.
> > It's a cheap and effective mitigation for "unknown future problems"
> > that doesn't degrade normal workloads.
> > 
> > > > +                       return;
> > > > +
> > > > +               wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> > 
> > What about CPUs without FLUSH_L1D? Could it be done manually with a
> > memcpy or something?
> 
> It could - my original implementation (pre l1d_flush msr) did, but it
> did come with some additional cost in that I allocated per-cpu memory
> to keep a 32K buffer around that I could memcpy. It also sacrificed
> completeness for simplicity by not taking into account cases where L1
> was not 32K. As far as I know this msr is pretty widely deployed, even
> on older hardware.

You don't need that per cpu thing, really. We have both the MSR flush and
the software fallback in KVM vmx_l1d_flush(). The performance impact is
pretty much the same between the MSR flush and the software fallback.

Vs. deployment of the MSR. I'm not sure how widely this is supported on older
CPUs as there have been limitations on microcode patch space, but I can't
find the support matrix right now. Thanks Intel for reshuffling the webpage
every other week!

Thanks,

	tglx
Samuel Neves Oct. 12, 2018, 9:20 a.m. UTC | #14
On Thu, Oct 11, 2018 at 8:25 PM Andy Lutomirski <luto@kernel.org> wrote:
> What exactly is this trying to protect against?  And how many cycles
> should we expect L1D_FLUSH to take?

As far as I could measure, I got 1660 cycles per wrmsr 0x10b, 0x1 on a
Skylake chip, and 1220 cycles on a Skylake-SP.
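
For context on how such a number might be obtained, a crude in-kernel
measurement sketch (an illustration only, not the harness used above;
serialization and warm-up details will shift the result):

#include <linux/math64.h>
#include <asm/msr.h>

static u64 l1d_flush_cycles(void)
{
	const int iters = 1 << 10;
	u64 start, end;
	int i;

	/*
	 * Average TSC cycles over back-to-back flushes. Nothing refills
	 * the cache between iterations, so this measures the "L1D mostly
	 * empty" case discussed below.
	 */
	start = rdtsc_ordered();
	for (i = 0; i < iters; i++)
		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
	end = rdtsc_ordered();

	return div_u64(end - start, iters);
}
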
Jann Horn Oct. 12, 2018, 1:25 p.m. UTC | #15
On Fri, Oct 12, 2018 at 11:41 AM Samuel Neves <sneves@dei.uc.pt> wrote:
>
> On Thu, Oct 11, 2018 at 8:25 PM Andy Lutomirski <luto@kernel.org> wrote:
> > What exactly is this trying to protect against?  And how many cycles
> > should we expect L1D_FLUSH to take?
>
> As far as I could measure, I got 1660 cycles per wrmsr 0x10b, 0x1 on a
> Skylake chip, and 1220 cycles on a Skylake-SP.

Is that with L1D mostly empty, with L1D mostly full with clean lines,
or with L1D full of dirty lines that need to be written back?
Alan Cox Oct. 12, 2018, 2:25 p.m. UTC | #16
> But this really needs to be clarified.  Alan said that a bunch of the
> "yet another Spectre variant" attacks would have been mitigated by
> this patch.  An explanation of *how* would be in order.

Today you have the situation where something creates a speculative
disclosure gadget. So we run around and we try and guess where to fix
them all with lfence. If you miss one then it leaves a trace in the L1D
cache, which is what you measure.

In almost every case we have looked at when you leave a footprint in the
L1D you resolve to an error path so the syscall errors.

In other words every time we fail to find a

                if (foo < limit) {
                        gadget(array[foo]);
                } else
                        return -EINVAL;

we turn that from being an easy to use gadget into something really
tricky because by the time the code flow has gotten back to the caller
the breadcrumbs have been eaten by the L1D flush.

The current process of trying to find them all with smatch and the like
is a game of whack-a-mole that will go on for a long long time. In the
meantime (and until the tools get better) it's nice to have an option
that takes a totally non-hot path (the fast path change is a single test
for >= 0) and provides additional defence in depth.

They are not hot paths: when I started playing with this idea I did
indeed strace my entire desktop for a day. There are a couple of other
cases I would add to the pass list (EAGAIN, EWOULDBLOCK). Tracing
other stuff you see the same - real-world workloads simply don't generate
vast spews of erroring syscalls.

Look at just how many examples we are still committing, like
de916736aaaadddbd6061472969f667b14204aa9
7b6924d94a60c6b8c1279ca003e8744e6cd9e8b1
46feb6b495f7628a6dbf36c4e6d80faf378372d4
55690c07b44a82cc3359ce0c233f4ba7d80ba145


This approach turns the ones we miss into

                if (type >= ...

                [speculatively create breadcrumb]

                condition resolves, return -EINVAL

                L1D flush

                No breadcrumbs

At best you have a microscopic window to attack it on the SMT pair.

Alan
Samuel Neves Oct. 12, 2018, 2:28 p.m. UTC | #17
On Fri, Oct 12, 2018 at 2:26 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Oct 12, 2018 at 11:41 AM Samuel Neves <sneves@dei.uc.pt> wrote:
> >
> > On Thu, Oct 11, 2018 at 8:25 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > What exactly is this trying to protect against?  And how many cycles
> > > should we expect L1D_FLUSH to take?
> >
> > As far as I could measure, I got 1660 cycles per wrmsr 0x10b, 0x1 on a
> > Skylake chip, and 1220 cycles on a Skylake-SP.
>
> Is that with L1D mostly empty, with L1D mostly full with clean lines,
> or with L1D full of dirty lines that need to be written back?

Mostly empty, as this is flushing repeatedly without bothering to
refill L1d with anything.

On Skylake the (averaged) uops breakdown is something like

port 0: 255
port 1: 143
port 2: 176
port 3: 177
port 4: 524
port 5: 273
port 6: 616
port 7: 182

The number of port 4 dispatches is very close to the number of cache
lines, suggesting one write per line (with respective 176+177+182 port
{2, 3, 7} address generations).

Furthermore, I suspect it also clears L1i cache. For 2^20 wrmsr
executions, we have around 2^20 frontend_retired_l1i_miss events, but
a negligible number of frontend_retired_l2_miss ones.
Andy Lutomirski Oct. 12, 2018, 2:43 p.m. UTC | #18
On Oct 12, 2018, at 7:25 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:

>> But this really needs to be clarified.  Alan said that a bunch of the
>> "yet another Spectre variant" attacks would have been mitigated by
>> this patch.  An explanation of *how* would be in order.
> 
> Today you have the situation where something creates a speculative
> disclosure gadget. So we run around and we try and guess where to fix
> them all with lfence. If you miss one then it leaves a trace in the L1D
> cache, which is what you measure.
> 
> In almost every case we have looked at when you leave a footprint in the
> L1D you resolve to an error path so the syscall errors.
> 
> In other words every time we fail to find a
> 
>                if (foo < limit) {
>                        gadget(array[foo]);
>                } else
>                        return -EINVAL;
> 
> we turn that from being an easy to use gadget into something really
> tricky because by the time the code flow has gotten back to the caller
> the breadcrumbs have been eaten by the L1D flush.

My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work.

Am I wrong?

> 
> 
> At best you have a microscopic window to attack it on the SMT pair.

So only the extra clever attackers will pull it off. This isn’t all that reassuring.

If I had the time to try to break this, I would set it up so that the cache lines that get probed are cached remotely, and I’d spin, waiting until one of them gets stolen. The spin loop would be much faster this way.

Or I would get a fancy new CPU and use UMONITOR and, unless UMONITOR is much cleverer than I suspect it is, the gig is up.  The time window for the attack could be as small as you want, and UMONITOR will catch it.

(Have I mentioned that I think that Intel needs to remove UMONITOR, PCOMMIT-style?  That instruction is simply the wrong solution to whatever problem it’s trying to solve.)

> 
> Alan
Alan Cox Oct. 12, 2018, 3:02 p.m. UTC | #19
> My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work.

Flush, not write back. The L1D is empty (or full of other stuff, the way
the prototype I tested did it, as x86 lacked a true L1 flushing primitive).
 
> > At best you have a microscopic window to attack it on the SMT pair.  
> 
> So only the extra clever attackers will pull it off. This isn’t all that reassuring.

It's going to be very hard for them to do that. You don't really have the
timing control in userspace to do that sort of thing easily.

At the end of the day it's additional hardening, not a fix. It's
designed to provide extra protection for the cases we don't know about
until we find them and lfence them. All of this (and all sidechannel of
any kind) is merely about reducing the bandwidth an attacker can achieve.

The idea is that it's cheap enough that it's worth doing.

> Or I would get a fancy new CPU and use UMONITOR and, unless UMONITOR is much cleverer than I suspect it is, the gig is up.  The time window for the attack could be as small as you want, and UMONITOR will catch it.

That would be an interesting line of attack. You would still have to
navigate a fair bit of noise.

Alan
Jann Horn Oct. 12, 2018, 3:41 p.m. UTC | #20
On Fri, Oct 12, 2018 at 5:03 PM Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>
> > My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work.
>
> Flush, not write back. The L1D is empty (or full of other stuff, the way
> the prototype I tested did it, as x86 lacked a true L1 flushing primitive).

What about the TLB? Once KPTI is unnecessary, are you going to need an
extra TLB flush here to prevent PRIME+PROBE on the TLB?

Also: Are you assuming that the attacker would only be able to do
PRIME+PROBE, not FLUSH+RELOAD or other attacks that rely on shared
memory (like FLUSH+FLUSH, or the MESI-based stuff that detects when a
line stops being in modified state on another core)?

> > > At best you have a microscopic window to attack it on the SMT pair.
> >
> > So only the extra clever attackers will pull it off. This isn’t all that reassuring.
>
> It's going to be very hard for them to do that. You don't really have the
> timing control in userspace to do that sort of thing easily.
>
> At the end of the day it's additional hardening, not a fix. It's
> designed to provide extra protection for the cases we don't know about
> until we find them and lfence them. All of this (and all sidechannel of
> any kind) is merely about reducing the bandwidth an attacker can achieve.
>
> The idea is that it's cheap enough that it's worth doing.
>
> > Or I would get a fancy new CPU and use UMONITOR and, unless UMONITOR is much cleverer than I suspect it is, the gig is up.  The time window for the attack could be as small as you want, and UMONITOR will catch it.
>
> That would be an interesting line of attack. You would still have to
> navigate a fair bit of noise.
>
> Alan
Andy Lutomirski Oct. 12, 2018, 4:07 p.m. UTC | #21
On Fri, Oct 12, 2018 at 8:02 AM Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>
> > My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work.
>
> Flush, not write back. The L1D is empty (or full of other stuff, the way
> the prototype I tested did it, as x86 lacked a true L1 flushing primitive).

I'm not sure I follow what you're saying.  If an attacker is learning
some information based on whether a given cacheline is in L1D, I'm
asking why the attacker can't learn exactly the same information based
on whether the cache line is in L2.  Or using any of the other
variants that Jann is talking about.

Adding ~1600 cycles plus the slowdown due to the fact that the cache
got flushed to a code path that we hope isn't hot to mitigate one
particular means of exploiting potential bugs seems a bit dubious to
me.

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..bde978eb3b4e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -445,6 +445,15 @@  config RETPOLINE
 	  code are eliminated. Since this includes the syscall entry path,
 	  it is not entirely pointless.
 
+config SYSCALL_FLUSH
+	bool "Clear L1 Cache on syscall errors"
+	default n
+	help
+	  Selecting 'y' allows the L1 cache to be cleared upon return of
+	  an error code from a syscall if the CPU supports "flush_l1d".
+	  This may reduce the likelihood of speculative execution style
+	  attacks on syscalls.
+
 config INTEL_RDT
 	bool "Intel Resource Director Technology support"
 	default n
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..26de8ea71293 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -268,6 +268,20 @@  __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 	prepare_exit_to_usermode(regs);
 }
 
+__visible inline void l1_cache_flush(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
+	    static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		if (regs->ax == 0 || regs->ax == -EAGAIN ||
+		    regs->ax == -EEXIST || regs->ax == -ENOENT ||
+		    regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
+		    regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
+			return;
+
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+	}
+}
+
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -290,6 +304,8 @@  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 		regs->ax = sys_call_table[nr](regs);
 	}
 
+	l1_cache_flush(regs);
+
 	syscall_return_slowpath(regs);
 }
 #endif
@@ -338,6 +354,8 @@  static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 #endif /* CONFIG_IA32_EMULATION */
 	}
 
+	l1_cache_flush(regs);
+
 	syscall_return_slowpath(regs);
 }