Message ID: 20181011185458.10186-1-kristen@linux.intel.com (mailing list archive)
State: New, archived
Series: x86: entry: flush the cache if syscall error
On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi <kristen@linux.intel.com> wrote: > > This patch aims to make it harder to perform cache timing attacks on data > left behind by system calls. If we have an error returned from a syscall, > flush the L1 cache. > > It's important to note that this patch is not addressing any specific > exploit, nor is it intended to be a complete defense against anything. > It is intended to be a low cost way of eliminating some of side effects > of a failed system call. > > A performance test using sysbench on one hyperthread and a script which > attempts to repeatedly access files it does not have permission to access > on the other hyperthread found no significant performance impact. > > +__visible inline void l1_cache_flush(struct pt_regs *regs) > +{ > + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > + if (regs->ax == 0 || regs->ax == -EAGAIN || > + regs->ax == -EEXIST || regs->ax == -ENOENT || > + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) > + return; > + > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); > + } > +} Ugh. What exactly is this trying to protect against? And how many cycles should we expect L1D_FLUSH to take? ISTM that, if we have a situation where the L1D can be read by user code, we lose, via hyperthreading, successful syscalls, /dev/random, and many other vectors. This seems like a small mitigation at a rather large cost.
On Thu, 2018-10-11 at 12:25 -0700, Andy Lutomirski wrote: > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > <kristen@linux.intel.com> wrote: > > > > This patch aims to make it harder to perform cache timing attacks > > on data > > left behind by system calls. If we have an error returned from a > > syscall, > > flush the L1 cache. > > > > It's important to note that this patch is not addressing any > > specific > > exploit, nor is it intended to be a complete defense against > > anything. > > It is intended to be a low cost way of eliminating some of side > > effects > > of a failed system call. > > > > A performance test using sysbench on one hyperthread and a script > > which > > attempts to repeatedly access files it does not have permission to > > access > > on the other hyperthread found no significant performance impact. > > > > +__visible inline void l1_cache_flush(struct pt_regs *regs) > > +{ > > + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > > + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > > + if (regs->ax == 0 || regs->ax == -EAGAIN || > > + regs->ax == -EEXIST || regs->ax == -ENOENT || > > + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > > + regs->ax == -ENOTCONN || regs->ax == > > -EINPROGRESS) > > + return; > > + > > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); > > + } > > +} > > Ugh. > > What exactly is this trying to protect against? And how many cycles > should we expect L1D_FLUSH to take? As I mentioned in the commit message, this is not addressing any specific exploit. It is removing any side effects from a failed system call in the L1 cache. > > ISTM that, if we have a situation where the L1D can be read by user > code, we lose, via hyperthreading, successful syscalls, /dev/random, > and may other vectors. This seems like a small mitigation at a > rather > large cost. I pinned an evil task to one hyperthread that just caused L1 flushes by issuing failed system calls. On the other hyperthread, I ran a performance benchmark (sysbench). 
I did not see any difference between the baseline and the kernel with the patch applied. Is there a more appropriate test you'd be interested in seeing the results of? I'd be happy to design a different test.
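A minimal userspace sketch of the kind of load generator described above may make the test concrete (the helper name and target path are illustrative, not taken from the patch): an unprivileged process repeatedly opens a file it cannot read, so every call fails with EACCES, an errno outside the patch's whitelist, and would therefore trigger an L1D flush on every iteration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Issue one deliberately failing syscall; returns 0 on success or
 * -errno on failure.  Pinned to one hyperthread (e.g. via taskset)
 * and called in a tight loop against a root-only file such as
 * /etc/shadow, each EACCES would trip the proposed flush while the
 * benchmark runs on the sibling thread. */
static int failing_open(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}
```

The sibling-thread benchmark (sysbench in the test above) would then be pinned to the other logical CPU of the same core so any flush cost shows up in its numbers.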
> Ugh. > > What exactly is this trying to protect against? And how many cycles Most attacks by speculation rely upon leaving footprints in the L1 cache. They also almost inevitably resolve non speculatively to errors. If you look through all the 'yet another potential spectre case' patches people have found they would have been rendered close to useless by this change. It's a way to deal with the ones we don't know about, all the ones the tools won't find and it has pretty much zero cost (If you are bored strace an entire day's desktop session, bang it through a script or two to extract the number of triggering error returns and do the maths...) > should we expect L1D_FLUSH to take? More to the point you pretty much never trigger it. Errors are not the normal path in real code. The original version of this code emptied the L1 the hard way - and even then it was in the noise for real workloads we tried. You can argue that the other thread could be some evil task that deliberately triggers flushes, but it can already thrash the L1 on processors that share L1 between threads using perfectly normal memory instructions. Alan
On Thu, Oct 11, 2018 at 1:25 PM Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: > > > Ugh. > > > > What exactly is this trying to protect against? And how many cycles > > Most attacks by speculation rely upon leaving footprints in the L1 cache. > They also almost inevitably resolve non speculatively to errors. If you > look through all the 'yet another potential spectre case' patches people > have found they would have been rendered close to useless by this change. Can you give an example? AFAIK the interesting Meltdown-like attacks are unaffected because Meltdown doesn't actually need the target data to be in L1D. And most of the Spectre-style attacks would have been blocked by doing LFENCE on the error cases (and somehow making sure that the CPU doesn't speculate around the LFENCE without noticing it). But this patch is doing an L1D flush, which, as far as I've heard, isn't actually relevant. > > It's a way to deal with the ones we don't know about, all the ones the > tools won't find and it has pretty much zero cost > > (If you are bored strace an entire day's desktop session, bang it through > a script or two to extract the number of triggering error returns and do > the maths...) > > > should we expect L1D_FLUSH to take? > > More to the point you pretty much never trigger it. Errors are not the > normal path in real code. The original version of this code emptied the > L1 the hard way - and even then it was in the noise for real workloads we > tried. > > You can argue that the other thread could be some evil task that > deliberately triggers flushes, but it can already thrash the L1 on > processors that share L1 between threads using perfectly normal memory > instructions. > That's not what I meant. I meant that, if an attacker can run code on *both* logical threads on the same CPU, then they can run their attack code on the other logical thread before the L1D_FLUSH command takes effect. I care about the performance of single-threaded workloads, though.
How slow is this thing? No one cares about syscall performance on regular desktop sessions except for gaming. But people do care about syscall performance on all kinds of crazy server, database, etc workloads. And compilation. And HPC stuff, although that mostly doesn't involve syscalls. So: benchmarks, please. And estimated cycle counts, please, on at least a couple of relevant CPU generations. On Meltdown-affected CPUs, we're doing a CR3 write anyway, which is fully serializing, so it's slow. But AFAIK that *already* blocks most of these attacks except L1TF, and L1TF has (hopefully!) been fixed anyway on Linux.
On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi <kristen@linux.intel.com> wrote: > > This patch aims to make it harder to perform cache timing attacks on data > left behind by system calls. If we have an error returned from a syscall, > flush the L1 cache. > > It's important to note that this patch is not addressing any specific > exploit, nor is it intended to be a complete defense against anything. > It is intended to be a low cost way of eliminating some of side effects > of a failed system call. > > A performance test using sysbench on one hyperthread and a script which > attempts to repeatedly access files it does not have permission to access > on the other hyperthread found no significant performance impact. > > Suggested-by: Alan Cox <alan@linux.intel.com> > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > --- > arch/x86/Kconfig | 9 +++++++++ > arch/x86/entry/common.c | 18 ++++++++++++++++++ > 2 files changed, 27 insertions(+) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 1a0be022f91d..bde978eb3b4e 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -445,6 +445,15 @@ config RETPOLINE > code are eliminated. Since this includes the syscall entry path, > it is not entirely pointless. > > +config SYSCALL_FLUSH > + bool "Clear L1 Cache on syscall errors" > + default n > + help > + Selecting 'y' allows the L1 cache to be cleared upon return of > + an error code from a syscall if the CPU supports "flush_l1d". > + This may reduce the likelyhood of speculative execution style > + attacks on syscalls. 
> + > config INTEL_RDT > bool "Intel Resource Director Technology support" > default n > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > index 3b2490b81918..26de8ea71293 100644 > --- a/arch/x86/entry/common.c > +++ b/arch/x86/entry/common.c > @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs) > prepare_exit_to_usermode(regs); > } > > +__visible inline void l1_cache_flush(struct pt_regs *regs) > +{ > + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > + if (regs->ax == 0 || regs->ax == -EAGAIN || > + regs->ax == -EEXIST || regs->ax == -ENOENT || > + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) What about ax > 0? (Or more generally, any ax outside the range of -1 .. -4095 or whatever the error range is.) As it stands, it looks like you'll flush on successful read(), write(), recv(), etc, and that could seriously hurt performance on real workloads. > + return; > + > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); > + } > +} > + > #ifdef CONFIG_X86_64 > __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) > { > @@ -290,6 +304,8 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) > regs->ax = sys_call_table[nr](regs); > } > > + l1_cache_flush(regs); > + > syscall_return_slowpath(regs); > } > #endif > @@ -338,6 +354,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs) > #endif /* CONFIG_IA32_EMULATION */ > } > > + l1_cache_flush(regs); > + > syscall_return_slowpath(regs); > } > > -- > 2.14.4 >
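The range Andy describes can be tested with a single comparison rather than by special-casing individual values. This is a hedged sketch, not kernel code: the function name is made up, and MAX_ERRNO mirrors the kernel convention that only -4095..-1 denote syscall errors.

```c
#include <stdbool.h>

#define MAX_ERRNO 4095UL

/* pt_regs->ax is an unsigned long, so error values -4095..-1 appear as
 * the topmost 4095 unsigned values.  Anything else - zero, positive
 * byte counts from read()/write(), pointer-like returns from mmap() -
 * is a success and should never trigger a flush. */
static bool syscall_returned_error(unsigned long ax)
{
	return ax >= -MAX_ERRNO;
}
```

This is essentially what the kernel's IS_ERR_VALUE() macro computes, which later replies in this thread point at directly.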
On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > <kristen@linux.intel.com> wrote: >> >> This patch aims to make it harder to perform cache timing attacks on data >> left behind by system calls. If we have an error returned from a syscall, >> flush the L1 cache. >> >> It's important to note that this patch is not addressing any specific >> exploit, nor is it intended to be a complete defense against anything. >> It is intended to be a low cost way of eliminating some of side effects >> of a failed system call. >> >> A performance test using sysbench on one hyperthread and a script which >> attempts to repeatedly access files it does not have permission to access >> on the other hyperthread found no significant performance impact. >> >> Suggested-by: Alan Cox <alan@linux.intel.com> >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> >> --- >> arch/x86/Kconfig | 9 +++++++++ >> arch/x86/entry/common.c | 18 ++++++++++++++++++ >> 2 files changed, 27 insertions(+) >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index 1a0be022f91d..bde978eb3b4e 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -445,6 +445,15 @@ config RETPOLINE >> code are eliminated. Since this includes the syscall entry path, >> it is not entirely pointless. >> >> +config SYSCALL_FLUSH >> + bool "Clear L1 Cache on syscall errors" >> + default n >> + help >> + Selecting 'y' allows the L1 cache to be cleared upon return of >> + an error code from a syscall if the CPU supports "flush_l1d". >> + This may reduce the likelyhood of speculative execution style >> + attacks on syscalls. 
>> + >> config INTEL_RDT >> bool "Intel Resource Director Technology support" >> default n >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >> index 3b2490b81918..26de8ea71293 100644 >> --- a/arch/x86/entry/common.c >> +++ b/arch/x86/entry/common.c >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs) >> prepare_exit_to_usermode(regs); >> } >> >> +__visible inline void l1_cache_flush(struct pt_regs *regs) >> +{ >> + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && >> + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { >> + if (regs->ax == 0 || regs->ax == -EAGAIN || >> + regs->ax == -EEXIST || regs->ax == -ENOENT || >> + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || >> + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) > > What about ax > 0? (Or more generally, any ax outside the range of -1 > .. -4095 or whatever the error range is.) As it stands, it looks like > you'll flush on successful read(), write(), recv(), etc, and that > could seriously hurt performance on real workloads. Seems like just changing this with "ax == 0" into "ax >= 0" would solve that? I think this looks like a good idea. It might be worth adding a comment about the checks to explain why those errors are whitelisted. It's a cheap and effective mitigation for "unknown future problems" that doesn't degrade normal workloads. >> + return; >> + >> + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); What about CPUs without FLUSH_L1D? Could it be done manually with a memcpy or something? 
-Kees >> + } >> +} >> + >> #ifdef CONFIG_X86_64 >> __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) >> { >> @@ -290,6 +304,8 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) >> regs->ax = sys_call_table[nr](regs); >> } >> >> + l1_cache_flush(regs); >> + >> syscall_return_slowpath(regs); >> } >> #endif >> @@ -338,6 +354,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs) >> #endif /* CONFIG_IA32_EMULATION */ >> } >> >> + l1_cache_flush(regs); >> + >> syscall_return_slowpath(regs); >> } >> >> -- >> 2.14.4 >>
On Thu, Oct 11, 2018 at 1:55 PM Kees Cook <keescook@chromium.org> wrote: > > On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote: > > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > > <kristen@linux.intel.com> wrote: > >> > >> This patch aims to make it harder to perform cache timing attacks on data > >> left behind by system calls. If we have an error returned from a syscall, > >> flush the L1 cache. > >> > >> It's important to note that this patch is not addressing any specific > >> exploit, nor is it intended to be a complete defense against anything. > >> It is intended to be a low cost way of eliminating some of side effects > >> of a failed system call. > >> > >> A performance test using sysbench on one hyperthread and a script which > >> attempts to repeatedly access files it does not have permission to access > >> on the other hyperthread found no significant performance impact. > >> > >> Suggested-by: Alan Cox <alan@linux.intel.com> > >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > >> --- > >> arch/x86/Kconfig | 9 +++++++++ > >> arch/x86/entry/common.c | 18 ++++++++++++++++++ > >> 2 files changed, 27 insertions(+) > >> > >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > >> index 1a0be022f91d..bde978eb3b4e 100644 > >> --- a/arch/x86/Kconfig > >> +++ b/arch/x86/Kconfig > >> @@ -445,6 +445,15 @@ config RETPOLINE > >> code are eliminated. Since this includes the syscall entry path, > >> it is not entirely pointless. > >> > >> +config SYSCALL_FLUSH > >> + bool "Clear L1 Cache on syscall errors" > >> + default n > >> + help > >> + Selecting 'y' allows the L1 cache to be cleared upon return of > >> + an error code from a syscall if the CPU supports "flush_l1d". > >> + This may reduce the likelyhood of speculative execution style > >> + attacks on syscalls. 
> >> + > >> config INTEL_RDT > >> bool "Intel Resource Director Technology support" > >> default n > >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > >> index 3b2490b81918..26de8ea71293 100644 > >> --- a/arch/x86/entry/common.c > >> +++ b/arch/x86/entry/common.c > >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs) > >> prepare_exit_to_usermode(regs); > >> } > >> > >> +__visible inline void l1_cache_flush(struct pt_regs *regs) > >> +{ > >> + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > >> + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > >> + if (regs->ax == 0 || regs->ax == -EAGAIN || > >> + regs->ax == -EEXIST || regs->ax == -ENOENT || > >> + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > >> + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) > > > > What about ax > 0? (Or more generally, any ax outside the range of -1 > > .. -4095 or whatever the error range is.) As it stands, it looks like > > you'll flush on successful read(), write(), recv(), etc, and that > > could seriously hurt performance on real workloads. > > Seems like just changing this with "ax == 0" into "ax >= 0" would solve that? I can easily imagine that there are other errors for which performance matters. EBUSY comes to mind. > > I think this looks like a good idea. It might be worth adding a > comment about the checks to explain why those errors are whitelisted. > It's a cheap and effective mitigation for "unknown future problems" > that doesn't degrade normal workloads. I still want to see two pieces of information before I ack a patch like this: - How long does L1D_FLUSH take, roughly? - An example of a type of attack that would be mitigated. For the latter, I assume that the goal is to mitigate against attacks where a syscall speculatively loads something sensitive and then fails. But, before it fails, it leaks the information it speculatively loaded, and that leak ended up in L1D but *not* in other cache levels. 
And somehow the L1D can't be probed quickly enough in a parallel thread to meaningfully get the information out of L1D. Or maybe it's trying to mitigate against the use of failing syscalls to get some sensitive data into L1D and then using something like L1TF to read it out. But this really needs to be clarified. Alan said that a bunch of the "yet another Spectre variant" attacks would have been mitigated by this patch. An explanation of *how* would be in order. And we should seriously consider putting this kind of thing under CONFIG_SPECULATIVE_MITIGATIONS or similar. The idea being that it doesn't mitigate a clear known attack family.
On Thu, 2018-10-11 at 13:55 -0700, Kees Cook wrote: > On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> > wrote: > > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > > <kristen@linux.intel.com> wrote: > > > > > > This patch aims to make it harder to perform cache timing attacks > > > on data > > > left behind by system calls. If we have an error returned from a > > > syscall, > > > flush the L1 cache. > > > > > > It's important to note that this patch is not addressing any > > > specific > > > exploit, nor is it intended to be a complete defense against > > > anything. > > > It is intended to be a low cost way of eliminating some of side > > > effects > > > of a failed system call. > > > > > > A performance test using sysbench on one hyperthread and a script > > > which > > > attempts to repeatedly access files it does not have permission > > > to access > > > on the other hyperthread found no significant performance impact. > > > > > > Suggested-by: Alan Cox <alan@linux.intel.com> > > > Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > > > --- > > > arch/x86/Kconfig | 9 +++++++++ > > > arch/x86/entry/common.c | 18 ++++++++++++++++++ > > > 2 files changed, 27 insertions(+) > > > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > > index 1a0be022f91d..bde978eb3b4e 100644 > > > --- a/arch/x86/Kconfig > > > +++ b/arch/x86/Kconfig > > > @@ -445,6 +445,15 @@ config RETPOLINE > > > code are eliminated. Since this includes the syscall > > > entry path, > > > it is not entirely pointless. > > > > > > +config SYSCALL_FLUSH > > > + bool "Clear L1 Cache on syscall errors" > > > + default n > > > + help > > > + Selecting 'y' allows the L1 cache to be cleared upon > > > return of > > > + an error code from a syscall if the CPU supports > > > "flush_l1d". > > > + This may reduce the likelyhood of speculative execution > > > style > > > + attacks on syscalls. 
> > > + > > > config INTEL_RDT > > > bool "Intel Resource Director Technology support" > > > default n > > > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > > > index 3b2490b81918..26de8ea71293 100644 > > > --- a/arch/x86/entry/common.c > > > +++ b/arch/x86/entry/common.c > > > @@ -268,6 +268,20 @@ __visible inline void > > > syscall_return_slowpath(struct pt_regs *regs) > > > prepare_exit_to_usermode(regs); > > > } > > > > > > +__visible inline void l1_cache_flush(struct pt_regs *regs) > > > +{ > > > + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > > > + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > > > + if (regs->ax == 0 || regs->ax == -EAGAIN || > > > + regs->ax == -EEXIST || regs->ax == -ENOENT || > > > + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT > > > || > > > + regs->ax == -ENOTCONN || regs->ax == > > > -EINPROGRESS) > > > > What about ax > 0? (Or more generally, any ax outside the range of > > -1 > > .. -4095 or whatever the error range is.) As it stands, it looks > > like > > you'll flush on successful read(), write(), recv(), etc, and that > > could seriously hurt performance on real workloads. > > Seems like just changing this with "ax == 0" into "ax >= 0" would > solve that? thanks, will do. > > I think this looks like a good idea. It might be worth adding a > comment about the checks to explain why those errors are whitelisted. > It's a cheap and effective mitigation for "unknown future problems" > that doesn't degrade normal workloads. > > > > + return; > > > + > > > + wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH); > > What about CPUs without FLUSH_L1D? Could it be done manually with a > memcpy or something? It could - my original implementation (pre l1d_flush msr) did, but it did come with some additional cost in that I allocated per-cpu memory to keep a 32K buffer around that I could memcpy. It also sacrificed completeness for simplicity by not taking into account cases where L1 was not 32K. 
As far as I know this msr is pretty widely deployed, even on older hardware.
On Thu, Oct 11, 2018 at 10:56 PM Kees Cook <keescook@chromium.org> wrote: > On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote: > > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > > <kristen@linux.intel.com> wrote: > >> > >> This patch aims to make it harder to perform cache timing attacks on data > >> left behind by system calls. If we have an error returned from a syscall, > >> flush the L1 cache. > >> > >> It's important to note that this patch is not addressing any specific > >> exploit, nor is it intended to be a complete defense against anything. > >> It is intended to be a low cost way of eliminating some of side effects > >> of a failed system call. > >> > >> A performance test using sysbench on one hyperthread and a script which > >> attempts to repeatedly access files it does not have permission to access > >> on the other hyperthread found no significant performance impact. > >> > >> Suggested-by: Alan Cox <alan@linux.intel.com> > >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > >> --- > >> arch/x86/Kconfig | 9 +++++++++ > >> arch/x86/entry/common.c | 18 ++++++++++++++++++ > >> 2 files changed, 27 insertions(+) > >> > >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > >> index 1a0be022f91d..bde978eb3b4e 100644 > >> --- a/arch/x86/Kconfig > >> +++ b/arch/x86/Kconfig > >> @@ -445,6 +445,15 @@ config RETPOLINE > >> code are eliminated. Since this includes the syscall entry path, > >> it is not entirely pointless. > >> > >> +config SYSCALL_FLUSH > >> + bool "Clear L1 Cache on syscall errors" > >> + default n > >> + help > >> + Selecting 'y' allows the L1 cache to be cleared upon return of > >> + an error code from a syscall if the CPU supports "flush_l1d". > >> + This may reduce the likelyhood of speculative execution style > >> + attacks on syscalls. 
> >> + > >> config INTEL_RDT > >> bool "Intel Resource Director Technology support" > >> default n > >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > >> index 3b2490b81918..26de8ea71293 100644 > >> --- a/arch/x86/entry/common.c > >> +++ b/arch/x86/entry/common.c > >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs) > >> prepare_exit_to_usermode(regs); > >> } > >> > >> +__visible inline void l1_cache_flush(struct pt_regs *regs) > >> +{ > >> + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > >> + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > >> + if (regs->ax == 0 || regs->ax == -EAGAIN || > >> + regs->ax == -EEXIST || regs->ax == -ENOENT || > >> + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > >> + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) > > > > What about ax > 0? (Or more generally, any ax outside the range of -1 > > .. -4095 or whatever the error range is.) As it stands, it looks like > > you'll flush on successful read(), write(), recv(), etc, and that > > could seriously hurt performance on real workloads. > > Seems like just changing this with "ax == 0" into "ax >= 0" would solve that? As spender points out on twitter (https://twitter.com/grsecurity/status/1050497259937370118 - thanks, spender!), struct pt_regs stores register values as "unsigned long", and so you'll need to use something like IS_ERR_VALUE().
On Thu, Oct 11, 2018 at 11:17 PM Andy Lutomirski <luto@kernel.org> wrote: > On Thu, Oct 11, 2018 at 1:55 PM Kees Cook <keescook@chromium.org> wrote: > > On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote: > > > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > > > <kristen@linux.intel.com> wrote: > > >> > > >> This patch aims to make it harder to perform cache timing attacks on data > > >> left behind by system calls. If we have an error returned from a syscall, > > >> flush the L1 cache. > > >> > > >> It's important to note that this patch is not addressing any specific > > >> exploit, nor is it intended to be a complete defense against anything. > > >> It is intended to be a low cost way of eliminating some of side effects > > >> of a failed system call. > > >> > > >> A performance test using sysbench on one hyperthread and a script which > > >> attempts to repeatedly access files it does not have permission to access > > >> on the other hyperthread found no significant performance impact. > > >> > > >> Suggested-by: Alan Cox <alan@linux.intel.com> > > >> Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> > > >> --- > > >> arch/x86/Kconfig | 9 +++++++++ > > >> arch/x86/entry/common.c | 18 ++++++++++++++++++ > > >> 2 files changed, 27 insertions(+) > > >> > > >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > >> index 1a0be022f91d..bde978eb3b4e 100644 > > >> --- a/arch/x86/Kconfig > > >> +++ b/arch/x86/Kconfig > > >> @@ -445,6 +445,15 @@ config RETPOLINE > > >> code are eliminated. Since this includes the syscall entry path, > > >> it is not entirely pointless. > > >> > > >> +config SYSCALL_FLUSH > > >> + bool "Clear L1 Cache on syscall errors" > > >> + default n > > >> + help > > >> + Selecting 'y' allows the L1 cache to be cleared upon return of > > >> + an error code from a syscall if the CPU supports "flush_l1d". 
> > >> + This may reduce the likelyhood of speculative execution style > > >> + attacks on syscalls. > > >> + > > >> config INTEL_RDT > > >> bool "Intel Resource Director Technology support" > > >> default n > > >> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > > >> index 3b2490b81918..26de8ea71293 100644 > > >> --- a/arch/x86/entry/common.c > > >> +++ b/arch/x86/entry/common.c > > >> @@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs) > > >> prepare_exit_to_usermode(regs); > > >> } > > >> > > >> +__visible inline void l1_cache_flush(struct pt_regs *regs) > > >> +{ > > >> + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > > >> + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > > >> + if (regs->ax == 0 || regs->ax == -EAGAIN || > > >> + regs->ax == -EEXIST || regs->ax == -ENOENT || > > >> + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > > >> + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) > > > > > > What about ax > 0? (Or more generally, any ax outside the range of -1 > > > .. -4095 or whatever the error range is.) As it stands, it looks like > > > you'll flush on successful read(), write(), recv(), etc, and that > > > could seriously hurt performance on real workloads. > > > > Seems like just changing this with "ax == 0" into "ax >= 0" would solve that? > > I can easily imagine that there are other errors for which performance > matters. EBUSY comes to mind. > > > > > I think this looks like a good idea. It might be worth adding a > > comment about the checks to explain why those errors are whitelisted. > > It's a cheap and effective mitigation for "unknown future problems" > > that doesn't degrade normal workloads. > > I still want to see two pieces of information before I ack a patch like this: > > - How long does L1D_FLUSH take, roughly? (Especially if L1D is dirty and you can't just wipe it all.) > - An example of a type of attack that would be mitigated. 
> > For the latter, I assume that the goal is to mitigate against attacks > where a syscall speculatively loads something sensitive and then > fails. But, before it fails, it leaks the information it > speculatively loaded, and that leak ended up in L1D but *not* in other > cache levels. And somehow the L1D can't be probed quickly enough in a > parallel thread to meaningfully get the information out of L1D. And the attacker can't delay the syscall return somehow, e.g. on a fully-preemptible kernel by preempting the syscall, or by tarpitting a userspace memory access in the error path. > Or > maybe it's trying to mitigate against the use of failing syscalls to > get some sensitive data into L1D and then using something like L1TF to > read it out. > > But this really needs to be clarified. Alan said that a bunch of the > "yet another Spectre variant" attacks would have been mitigated by > this patch. An explanation of *how* would be in order. > > And we should seriously consider putting this kind of thing under > CONFIG_SPECULATIVE_MITIGATIONS or similar. The idea being that it > doesn't mitigate a clear known attack family.
On Thu, 11 Oct 2018, Kees Cook wrote: > On Thu, Oct 11, 2018 at 1:48 PM, Andy Lutomirski <luto@kernel.org> wrote: > > On Thu, Oct 11, 2018 at 11:55 AM Kristen Carlson Accardi > >> +__visible inline void l1_cache_flush(struct pt_regs *regs) > >> +{ > >> + if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) && > >> + static_cpu_has(X86_FEATURE_FLUSH_L1D)) { > >> + if (regs->ax == 0 || regs->ax == -EAGAIN || > >> + regs->ax == -EEXIST || regs->ax == -ENOENT || > >> + regs->ax == -EXDEV || regs->ax == -ETIMEDOUT || > >> + regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS) > > > > What about ax > 0? (Or more generally, any ax outside the range of -1 > > .. -4095 or whatever the error range is.) As it stands, it looks like > > you'll flush on successful read(), write(), recv(), etc, and that > > could seriously hurt performance on real workloads. > > Seems like just changing this with "ax == 0" into "ax >= 0" would solve that? > > I think this looks like a good idea. It might be worth adding a > comment about the checks to explain why those errors are whitelisted. > It's a cheap and effective mitigation for "unknown future problems" > that doesn't degrade normal workloads. pt_regs->ax is unsigned long, so you want to check this with IS_ERR_VALUE() first. if (!IS_ERR_VALUE(regs->ax)) return; and then you really want to have something smarter than a gazillion of whitelisted error value checks, which effectively compile into a gazillion conditonal branches. Thanks, tglx
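One shape the "something smarter" could take - offered here as a hedged sketch, not as what tglx necessarily had in mind - is a 4096-bit table indexed by positive errno, so the whitelist test compiles to a shift, a load, and a mask instead of a chain of conditional branches. A real kernel version would presumably use DECLARE_BITMAP() with set_bit()/test_bit(); the names below are illustrative.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_ERRNO     4095UL
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per errno 1..4095; set bits mark errors that skip the flush. */
static unsigned long flush_whitelist[MAX_ERRNO / BITS_PER_LONG + 1];

static void whitelist_errno(unsigned int err)
{
	flush_whitelist[err / BITS_PER_LONG] |= 1UL << (err % BITS_PER_LONG);
}

static bool l1d_flush_needed(unsigned long ax)
{
	unsigned int err;

	if (ax < -MAX_ERRNO)		/* success: IS_ERR_VALUE() is false */
		return false;
	err = (unsigned int)-ax;	/* back to a positive errno, 1..4095 */
	return !(flush_whitelist[err / BITS_PER_LONG] &
		 (1UL << (err % BITS_PER_LONG)));
}
```

The table would be populated once at boot with the errors deemed harmless (EAGAIN, ENOENT, and so on), keeping the per-syscall cost flat regardless of how long the whitelist grows.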
On Thu, 11 Oct 2018, Jann Horn wrote: > On Thu, Oct 11, 2018 at 10:56 PM Kees Cook <keescook@chromium.org> wrote: > > Seems like just changing this with "ax == 0" into "ax >= 0" would solve that? > > As spender points out on twitter > (https://twitter.com/grsecurity/status/1050497259937370118 - thanks, > spender!), struct pt_regs stores register values as "unsigned long", > and so you'll need to use something like IS_ERR_VALUE(). Ha, so it's not only me who noticed, but then I'm not reading twitter... Thanks, tglx
On Thu, 11 Oct 2018, Kristen C Accardi wrote:
> On Thu, 2018-10-11 at 13:55 -0700, Kees Cook wrote:
> > I think this looks like a good idea. It might be worth adding a
> > comment about the checks to explain why those errors are whitelisted.
> > It's a cheap and effective mitigation for "unknown future problems"
> > that doesn't degrade normal workloads.
> >
> > > > +			return;
> > > > +
> > > > +		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
> >
> > What about CPUs without FLUSH_L1D? Could it be done manually with a
> > memcpy or something?
>
> It could - my original implementation (pre l1d_flush msr) did, but it
> did come with some additional cost in that I allocated per-cpu memory
> to keep a 32K buffer around that I could memcpy. It also sacrificed
> completeness for simplicity by not taking into account cases where L1
> was not 32K. As far as I know this msr is pretty widely deployed, even
> on older hardware.

You don't need that per cpu thing, really. We have both the MSR flush and
the software fallback in KVM vmx_l1d_flush(). The performance impact is
pretty much the same between the MSR flush and the software fallback.

Vs. deployment of the MSR. I'm not sure how widely this is supported on
older CPUs, as there have been limitations on microcode patch space, but
I can't find the support matrix right now. Thanks Intel for reshuffling
the webpage every other week!

Thanks,

	tglx
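For reference, the software fallback tglx points to works by touching a buffer large enough to displace every L1D line. A userspace sketch of the same idea — the 32 KiB size, 64-byte line, and 2x overfill factor are assumptions about the hardware; KVM's actual vmx_l1d_flush() differs in detail:

```c
#include <stdint.h>
#include <stddef.h>

#define L1D_SIZE	(32 * 1024)	/* assumed 32 KiB L1D */
#define LINE_SIZE	64		/* assumed 64-byte cache lines */
/* Read 2x the L1D size to better defeat adaptive replacement policies. */
#define FLUSH_SIZE	(2 * L1D_SIZE)

static uint8_t flush_buf[FLUSH_SIZE] __attribute__((aligned(4096)));

static void l1d_flush_sw(void)
{
	volatile uint8_t sink;
	size_t i;

	/* First pass: populate the buffer so the reads below hit real lines. */
	for (i = 0; i < FLUSH_SIZE; i += LINE_SIZE)
		flush_buf[i] = (uint8_t)i;

	/* Second pass: touch one byte per line, evicting whatever was there. */
	for (i = 0; i < FLUSH_SIZE; i += LINE_SIZE)
		sink = flush_buf[i];
	(void)sink;
}
```

Unlike the MSR, this only evicts lines probabilistically (it depends on the replacement policy) and of course cannot flush what it cannot address, which is part of why the architectural L1D_FLUSH command exists.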
On Thu, Oct 11, 2018 at 8:25 PM Andy Lutomirski <luto@kernel.org> wrote: > What exactly is this trying to protect against? And how many cycles > should we expect L1D_FLUSH to take? As far as I could measure, I got 1660 cycles per wrmsr 0x10b, 0x1 on a Skylake chip, and 1220 cycles on a Skylake-SP.
On Fri, Oct 12, 2018 at 11:41 AM Samuel Neves <sneves@dei.uc.pt> wrote: > > On Thu, Oct 11, 2018 at 8:25 PM Andy Lutomirski <luto@kernel.org> wrote: > > What exactly is this trying to protect against? And how many cycles > > should we expect L1D_FLUSH to take? > > As far as I could measure, I got 1660 cycles per wrmsr 0x10b, 0x1 on a > Skylake chip, and 1220 cycles on a Skylake-SP. Is that with L1D mostly empty, with L1D mostly full with clean lines, or with L1D full of dirty lines that need to be written back?
> But this really needs to be clarified. Alan said that a bunch of the
> "yet another Spectre variant" attacks would have been mitigated by
> this patch. An explanation of *how* would be in order.

Today you have the situation where something creates a speculative
disclosure gadget. So we run around and we try and guess where to fix
them all with lfence. If you miss one then it leaves a trace in the L1D
cache, which is what you measure.

In almost every case we have looked at, when you leave a footprint in the
L1D you resolve to an error path, so the syscall errors.

In other words, every time we fail to find a

	if (foo < limit) {
		gadget(array[foo]);
	} else
		return -EINVAL;

we turn that from being an easy-to-use gadget into something really
tricky, because by the time the code flow has gotten back to the caller
the breadcrumbs have been eaten by the L1D flush.

The current process of trying to find them all with smatch and the like
is a game of whack-a-mole that will go on for a long long time. In the
meantime (and until the tools get better) it's nice to have an option
that takes a totally non-hot path (the fast path change is a single test
for >= 0) and provides additional defence in depth.

They are not hot paths: when I started playing with this idea I did
indeed strace my entire desktop for a day. There are a couple of other
cases I would add to the pass list (EAGAIN, EWOULDBLOCK). Tracing other
stuff you see the same - real world workloads simply don't generate vast
spews of erroring syscalls.

Look at just how many examples we are still committing, like

	de916736aaaadddbd6061472969f667b14204aa9
	7b6924d94a60c6b8c1279ca003e8744e6cd9e8b1
	46feb6b495f7628a6dbf36c4e6d80faf378372d4
	55690c07b44a82cc3359ce0c233f4ba7d80ba145

This approach turns the ones we miss into

	if (type >= limit)
		[speculatively create breadcrumb]
	condition resolves, return -EINVAL
	L1D flush
	No breadcrumbs

At best you have a microscopic window to attack it on the SMT pair.

Alan
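Alan's gadget pattern, written out in full: the out-of-bounds load is architecturally dead on the error path, but may still execute speculatively and leave a secret-dependent line in L1D — the "breadcrumb" an L1D flush on the error return would erase. The per-site fix he describes is a speculation barrier after the bounds check. Array names and sizes below are illustrative, not taken from any real syscall:

```c
#include <stddef.h>
#include <errno.h>

#define LIMIT 16
static int array[LIMIT];
static unsigned char probe[256 * 64];	/* attacker-observable cache lines */

long lookup(size_t foo)
{
	if (foo >= LIMIT)
		return -EINVAL;	/* architecturally we stop here... */

	/*
	 * ...but the CPU may speculate past the branch and execute this
	 * load with an out-of-bounds foo, fetching a line of probe[]
	 * chosen by array[foo] into L1D before the branch resolves.
	 */
	return probe[array[foo] * 64];
}

long lookup_fenced(size_t foo)
{
	if (foo >= LIMIT)
		return -EINVAL;
#if defined(__x86_64__) || defined(__i386__)
	/* The per-site fix: serialize so the load cannot run speculatively. */
	__asm__ __volatile__("lfence" ::: "memory");
#endif
	return probe[array[foo] * 64];
}
```

The whack-a-mole Alan describes is finding every `lookup()` in the kernel and turning it into `lookup_fenced()`; the L1D flush is the backstop for the ones that get missed.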
On Fri, Oct 12, 2018 at 2:26 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Oct 12, 2018 at 11:41 AM Samuel Neves <sneves@dei.uc.pt> wrote:
> >
> > On Thu, Oct 11, 2018 at 8:25 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > What exactly is this trying to protect against? And how many cycles
> > > should we expect L1D_FLUSH to take?
> >
> > As far as I could measure, I got 1660 cycles per wrmsr 0x10b, 0x1 on a
> > Skylake chip, and 1220 cycles on a Skylake-SP.
>
> Is that with L1D mostly empty, with L1D mostly full with clean lines,
> or with L1D full of dirty lines that need to be written back?

Mostly empty, as this is flushing repeatedly without bothering to refill
L1D with anything. On Skylake the (averaged) uops breakdown is something
like

	port 0: 255
	port 1: 143
	port 2: 176
	port 3: 177
	port 4: 524
	port 5: 273
	port 6: 616
	port 7: 182

The number of port 4 dispatches is very close to the number of cache
lines, suggesting one write per line (with the respective 176+177+182
port {2, 3, 7} address generations).

Furthermore, I suspect it also clears the L1i cache. For 2^20 wrmsr
executions, we have around 2^20 frontend_retired_l1i_miss events, but a
negligible number of frontend_retired_l2_miss ones.
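The port 4 (store) figure can be checked against the cache geometry Samuel is assuming — a 32 KiB Skylake-client L1D with 64-byte lines holds 512 lines, close to the ~524 stores observed, which is what "one write per line" predicts:

```c
/* Assumed Skylake-client L1D geometry: 32 KiB total, 64-byte lines. */
enum { L1D_BYTES = 32 * 1024, LINE_BYTES = 64 };

static int l1d_line_count(void)
{
	return L1D_BYTES / LINE_BYTES;	/* lines the flush must write */
}
```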
On Oct 12, 2018, at 7:25 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: >> But this really needs to be clarified. Alan said that a bunch of the >> "yet another Spectre variant" attacks would have been mitigated by >> this patch. An explanation of *how* would be in order. > > Today you have the situation where something creates a speculative > disclosure gadget. So we run around and we try and guess where to fix > them all with lfence. If you miss one then it leaves a trace in the L1D > cache, which is what you measure. > > In almost every case we have looked at when you leave a footprint in the > L1D you resolve to an error path so the syscall errors. > > In other words every time we fail to find a > > if (foo < limit) { > gadget(array[foo]); > } else > return -EINVAL; > > we turn that from being an easy to use gadget into something really > tricky because by the time the code flow has gotten back to the caller > the breadcrumbs have been eaten by the L1D flush. My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work. Am I wrong? > > > At best you have a microscopic window to attack it on the SMT pair. So only the extra clever attackers will pull it off. This isn’t all that reassuring. If I had the time to try to break this, I would set it up so that the cache lines that get probed are cached remotely, and I’d spin, waiting until one of them gets stolen. The spin loop would be much faster this way. Or I would get a fancy new CPU and use UMONITOR and, unless UMONITOR is much cleverer than I suspect it is, the gig is up. The time window for the attack could be as small as you want, and UMONITOR will catch it. (Have I mentioned that I think that Intel needs to remove UMONITOR, PCOMMIT-style? 
That instruction is simply the wrong solution to whatever problem it’s trying to solve.) > > Alan
> My understanding is that the standard “breadcrumb” is that a cache line
> is fetched into L1D, and that the cacheline in question will go into
> L1D even if it was previously not cached at all. So flushing L1D will
> cause the timing from a probe to be different, but the breadcrumb is
> still there, and the attack will still work.

Flush, not write back. The L1D is empty (or full of other stuff, the way
the prototype I tested did it, as x86 lacked a true L1 flushing
primitive)

> > At best you have a microscopic window to attack it on the SMT pair.
>
> So only the extra clever attackers will pull it off. This isn’t all
> that reassuring.

It's going to be very hard for them to do that. You don't really have the
timing control in userspace to do that sort of thing easily.

At the end of the day it's additional hardening, not a fix. It's designed
to provide extra protection for the cases we don't know about until we
find them and lfence them. All of this (and all sidechannel of any kind)
is merely about reducing the bandwidth an attacker can achieve.

The idea is that it's cheap enough that it's worth doing.

> Or I would get a fancy new CPU and use UMONITOR and, unless UMONITOR is
> much cleverer than I suspect it is, the gig is up. The time window for
> the attack could be as small as you want, and UMONITOR will catch it.

That would be an interesting line of attack. You would still have to
navigate a fair bit of noise.

Alan
On Fri, Oct 12, 2018 at 5:03 PM Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: > > > My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work. > > Flush not write back. The L1D is empty (or full of other stuff the way > the prototype I tested did it as x86 lacked a true L1 flushing primitive) What about the TLB? Once KPTI is unnecessary, are you going to need an extra TLB flush here to prevent PRIME+PROBE on the TLB? Also: Are you assuming that the attacker would only be able to do PRIME+PROBE, not FLUSH+RELOAD or other attacks that rely on shared memory (like FLUSH+FLUSH, or the MESI-based stuff that detects when a line stops being in modified state on another core)? > > > At best you have a microscopic window to attack it on the SMT pair. > > > > So only the extra clever attackers will pull it off. This isn’t all that reassuring. > > It's going to be very hard for them to do that. You don't really have the > timing control in userspace to do that sort of thing easily. > > At the end of the day it's an additional hardenening not a fix. It's > designed to provide extra protection for the cases we don't know about > until we find them and lfence them. All of this (and all sidechannel of > any kind) is merely about reducing the bandwidth an attacker can achieve. > > The idea is that it's cheap enough that it's worth doing. > > > Or I would get a fancy new CPU and use UMONITOR and, unless UMONITOR is much cleverer than I suspect it is, the gig is up. The time window for the attack could be as small as you want, and UMONITOR will catch it. > > That would be an interesting line of attack. You would still have to > navigate a fair bit of noise. > > Alan
On Fri, Oct 12, 2018 at 8:02 AM Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: > > > My understanding is that the standard “breadcrumb” is that a cache line is fetched into L1D, and that the cacheline in question will go into L1D even if it was previously not cached at all. So flushing L1D will cause the timing from a probe to be different, but the breadcrumb is still there, and the attack will still work. > > Flush not write back. The L1D is empty (or full of other stuff the way > the prototype I tested did it as x86 lacked a true L1 flushing primitive) I'm not sure I follow what you're saying. If an attacker is learning some information based on whether a given cacheline is in L1D, I'm asking why the attacker can't learn exactly the same information based on whether the cache line is in L2. Or using any of the other variants that Jann is talking about. Adding ~1600 cycles plus the slowdown due to the fact that the cache got flushed to a code path that we hope isn't hot to mitigate one particular means of exploiting potential bugs seems a bit dubious to me.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1a0be022f91d..bde978eb3b4e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -445,6 +445,15 @@ config RETPOLINE
 	  code are eliminated. Since this includes the syscall entry path,
 	  it is not entirely pointless.

+config SYSCALL_FLUSH
+	bool "Clear L1 Cache on syscall errors"
+	default n
+	help
+	  Selecting 'y' allows the L1 cache to be cleared upon return of
+	  an error code from a syscall if the CPU supports "flush_l1d".
+	  This may reduce the likelihood of speculative execution style
+	  attacks on syscalls.
+
 config INTEL_RDT
 	bool "Intel Resource Director Technology support"
 	default n
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..26de8ea71293 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -268,6 +268,20 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
 	prepare_exit_to_usermode(regs);
 }

+__visible inline void l1_cache_flush(struct pt_regs *regs)
+{
+	if (IS_ENABLED(CONFIG_SYSCALL_FLUSH) &&
+	    static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
+		if (regs->ax == 0 || regs->ax == -EAGAIN ||
+		    regs->ax == -EEXIST || regs->ax == -ENOENT ||
+		    regs->ax == -EXDEV || regs->ax == -ETIMEDOUT ||
+		    regs->ax == -ENOTCONN || regs->ax == -EINPROGRESS)
+			return;
+
+		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
+	}
+}
+
 #ifdef CONFIG_X86_64
 __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
@@ -290,6 +304,8 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 		regs->ax = sys_call_table[nr](regs);
 	}

+	l1_cache_flush(regs);
+
 	syscall_return_slowpath(regs);
 }
 #endif
@@ -338,6 +354,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
 #endif /* CONFIG_IA32_EMULATION */
 	}

+	l1_cache_flush(regs);
+
 	syscall_return_slowpath(regs);
 }
This patch aims to make it harder to perform cache timing attacks on data
left behind by system calls. If we have an error returned from a syscall,
flush the L1 cache.

It's important to note that this patch is not addressing any specific
exploit, nor is it intended to be a complete defense against anything.
It is intended to be a low-cost way of eliminating some of the side
effects of a failed system call.

A performance test using sysbench on one hyperthread and a script which
attempts to repeatedly access files it does not have permission to access
on the other hyperthread found no significant performance impact.

Suggested-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
---
 arch/x86/Kconfig        |  9 +++++++++
 arch/x86/entry/common.c | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+)