
[v4.1,07/10] x86: narrow out of bounds syscalls to sys_read under speculation

Message ID 151648239535.34747.4422108674633222531.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams Jan. 20, 2018, 9:06 p.m. UTC
The syscall table base is a user-controlled function pointer in kernel
space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out of bounds
speculation. While retpoline prevents speculating into the
user-controlled target, it does not stop the pointer dereference; the
concern is leaking memory relative to the syscall table base.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/entry_64.S   |    2 ++
 arch/x86/include/asm/smap.h |    9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Andy Lutomirski Jan. 21, 2018, 7:16 p.m. UTC | #1
> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> The syscall table base is a user-controlled function pointer in kernel
> space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out of bounds
> speculation. While retpoline prevents speculating into the
> user-controlled target, it does not stop the pointer dereference; the
> concern is leaking memory relative to the syscall table base.
> [..]
>
> @@ -260,6 +261,7 @@ entry_SYSCALL_64_fastpath:
>    cmpl    $__NR_syscall_max, %eax
> #endif
>    ja    1f                /* return -ENOSYS (already in pt_regs->ax) */
> +    MASK_NOSPEC %r11 %rax            /* sanitize syscall_nr wrt speculation */

What's the threat you're protecting against?  No one does any computation or data-dependent loads based on the value being read.  Or are you worried about ASLR leaks due to merely speculatively loading from a bogus address?
Andy Lutomirski Jan. 21, 2018, 7:25 p.m. UTC | #2
> On Jan 21, 2018, at 11:16 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> 
>> [..]
>> +    MASK_NOSPEC %r11 %rax            /* sanitize syscall_nr wrt speculation */
> 
> What's the threat you're protecting against?  No one does any computation or data-dependent loads based on the value being read.  Or are you worried about ASLR leaks due to merely speculatively loading from a bogus address?

Also, the actual guard you're using is unfortunate.  At the very least, it's a maintenance disaster.  Get rid of the macro and open code it.  It's two instructions and it has crazy clobber requirements *and* a flag dependency.

But I think it's also just a bad way to do it.  You're making the dependency chain longer in a place where it matters, and the retpoline prevents the CPU from hiding that latency.  Just use an immediate mask with andq.  Realistically, you should be fine rounding the table size up to a power of two without even extending the table, since I think this, at best, protects against ASLR bypass.
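A minimal sketch of that immediate-mask variant, assuming the syscall table is padded out to a power-of-two number of entries (512 entries and the matching 0x1ff mask are illustrative values, not the real table size):

	cmpl	$__NR_syscall_max, %eax
	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
#ifdef CONFIG_RETPOLINE
	andq	$0x1ff, %rax			/* clamp the index into the padded table; no
						 * extra input dependency beyond %rax itself */
#endif
	call	*sys_call_table(, %rax, 8)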
Jann Horn Jan. 21, 2018, 7:31 p.m. UTC | #3
On Sun, Jan 21, 2018 at 8:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> [..]
>> +    MASK_NOSPEC %r11 %rax            /* sanitize syscall_nr wrt speculation */
>
> What's the threat you're protecting against?  No one does any computation or data-dependent loads based on the value being read.  Or are you worried about ASLR leaks due to merely speculatively loading from a bogus address?

AFAIU the threat is that you could potentially use variant 1 to get
speculative RIP control using the indirect call through the syscall
table. (I haven't tested myself whether that'd work, and I don't know
how well that would work especially when the limit for the comparison
is coming from an immediate, but apparently Dan thinks it's a
potential problem?)
In other words, this would again be an attack against the indirect
call "call *sys_call_table(, %rax, 8)", but this time, the attack
would rely on the indirect branch being resolved *properly* based on
the address read from memory instead of relying on a misprediction;
but using an index into the syscall array that is out of bounds, after
mispredicting the conditional jump for verifying that the syscall
number is in bounds. This could still work even with retpolines if the
CPU can correct mispredictions within speculative execution.
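Annotated, the sequence in question is the existing fastpath, not new code; the two speculation points are marked:

	cmpl	$__NR_syscall_max, %eax		/* bounds check on the user-supplied nr */
	ja	1f				/* (1) attacker trains this branch to predict
						 *     not-taken for an out-of-bounds nr */
	call	*sys_call_table(, %rax, 8)	/* (2) speculative load of table[nr] from out of
						 *     bounds; if the CPU then resolves the branch
						 *     target from that bogus pointer, the attacker
						 *     gets speculative RIP control */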
Andy Lutomirski Jan. 22, 2018, 1:38 a.m. UTC | #4
On Sun, Jan 21, 2018 at 11:31 AM, Jann Horn <jannh@google.com> wrote:
> On Sun, Jan 21, 2018 at 8:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> [..]
>>> +    MASK_NOSPEC %r11 %rax            /* sanitize syscall_nr wrt speculation */
>>
>> What's the threat you're protecting against?  No one does any computation or data-dependent loads based on the value being read.  Or are you worried about ASLR leaks due to merely speculatively loading from a bogus address?
>
> AFAIU the threat is that you could potentially use variant 1 to get
> speculative RIP control using the indirect call through the syscall
> table. (I haven't tested myself whether that'd work, and I don't know
> how well that would work especially when the limit for the comparison
> is coming from an immediate, but apparently Dan thinks it's a
> potential problem?)
> In other words, this would again be an attack against the indirect
> call "call *sys_call_table(, %rax, 8)", but this time, the attack
> would rely on the indirect branch being resolved *properly* based on
> the address read from memory instead of relying on a misprediction;
> but using an index into the syscall array that is out of bounds, after
> mispredicting the conditional jump for verifying that the syscall
> number is in bounds. This could still work even with retpolines if the
> CPU can correct mispredictions within speculative execution.

Fair enough.  That being said:

1. Intel needs to document this crap.  I don't want a situation where
we keep adding random mitigations in the hopes that we got them all.

2. Why on Earth is MASK_NOSPEC in smap.h?

3. What's with sbb; and?  I can see two sane ways to do this.  One is
cmovaq [something safe], %rax, *in the #ifdef CONFIG_RETPOLINE block*.
(Or is cmov subject to incorrect speculation, too?  And, if cmov is,
why wouldn't sbb be just as bad?  At least using cmov here is two
whole ALU ops shorter.)  The other is andq $(syscall table size), %rax
(again inside the ifdef block) and to round the syscall table to one
less than a power of two.

In my mental model of how a CPU works, the cmov approach makes more
sense, since we're about to jump somewhere that depends on the syscall
nr, and *the whole point* of the mitigation here is to avoid
jumping anywhere until the CPU has fully caught up with reality wrt
the syscall number.
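A sketch of the cmov variant (with the caveat above that it's unclear whether cmov itself speculates safely); the safe value has to be set up before the cmp because the xor clobbers flags:

	xorl	%r11d, %r11d			/* safe index: 0 (sys_read); zeroing clobbers
						 * flags, so it must precede the cmp */
	cmpl	$__NR_syscall_max, %eax
	ja	1f				/* return -ENOSYS */
	cmoval	%r11d, %eax			/* out of range: force the safe index; resolved
						 * from the real flags, not from a prediction */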
Linus Torvalds Jan. 22, 2018, 2:04 a.m. UTC | #5
On Sun, Jan 21, 2018 at 5:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> 3. What's with sbb; and?  I can see two sane ways to do this.  One is
> cmovaq [something safe], %rax,

Heh. I think it's partly about being old-fashioned. sbb has always
been around, and is the traditional trick for 0/-1.

Also, my original suggested thing did the *access* too, and masked the
result with the same mask.

But I guess we could use cmov instead. It has very similar performance
(ie it was relatively slow on P4, but so was sbb).

However, I suspect it actually has a slightly higher register
pressure, since you'd need to have that zero register (zero being the
"safe" value), and the only good way to get a zero value is the xor
thing, which affects flags and thus needs to be before the cmp.

In contrast, the sbb trick has no early inputs needed.

So on the whole, 'cmov' may be more natural on a conceptual level, but
the sbb trick really is a very "traditional x86 thing" to do.

               Linus
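For reference, the sbb trick as a standalone sketch, given an unsigned limit check that leaves CF set exactly for in-bounds values:

	cmp	%rdx, %rax		/* unsigned: CF=1 iff %rax < %rdx (the limit) */
	sbb	%rcx, %rcx		/* %rcx - %rcx - CF: all-ones if in bounds, else 0 */
	and	%rcx, %rax		/* keep %rax when in bounds, clamp to 0 otherwise */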
Andy Lutomirski Jan. 22, 2018, 2:23 a.m. UTC | #6
On Sun, Jan 21, 2018 at 6:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Jan 21, 2018 at 5:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> 3. What's with sbb; and?  I can see two sane ways to do this.  One is
>> cmovaq [something safe], %rax,
>
> Heh. I think it's partly about being old-fashioned. sbb has always
> been around, and is the traditional trick for 0/-1.
>
> Also, my original suggested thing did the *access* too, and masked the
> result with the same mask.
>
> But I guess we could use cmov instead. It has very similar performance
> (ie it was relatively slow on P4, but so was sbb).
>
> However, I suspect it actually has a slightly higher register
> pressure, since you'd need to have that zero register (zero being the
> "safe" value), and the only good way to get a zero value is the xor
> thing, which affects flags and thus needs to be before the cmp.
>
> In contrast, the sbb trick has no early inputs needed.
>
> So on the whole, 'cmov' may be more natural on a conceptual level, but
> the sbb trick really is a very "traditional x86 thing" to do.

Fair enough.

That being said, what I *actually* want to do is to nuke this thing
entirely.  I just wrote a patch to turn off the SYSCALL64 fast path
entirely when retpolines are on.  Then this issue can be dealt with in
C.  I assume someone has a brilliant way to make gcc automatically do
something intelligent about guarded array access in C. </snicker>
Seriously, though, the retpolined fast path is actually slower than
the slow path on a "minimal" retpoline kernel (which is what I'm using
because Fedora hasn't pushed out a retpoline compiler yet), and I
doubt it's more than the tiniest speedup on a full retpoline kernel.

I've read a bunch of emails flying around saying that retpolines
aren't that bad.  In very informal benchmarking, a single mispredicted
ret (which is what a retpoline is) seems to take 20-30 cycles on
Skylake.  That's pretty far from "not bad".  Is IBRS somehow doing
something that adversely affects code that doesn't use indirect
branches?  Because I'm having a bit of a hard time imagining IBRS
hurting indirect branches worse than retpolines do.
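For context, the retpoline being measured replaces "call *%rax" with a call to a thunk of roughly this shape; the deliberately mispredicted ret is where those 20-30 cycles go:

__x86_indirect_thunk_rax:
	call	1f			/* push the address of the speculation trap (2:) */
2:	pause				/* speculation lands here via the RSB prediction... */
	lfence
	jmp	2b			/* ...and spins harmlessly until the ret retires */
1:	mov	%rax, (%rsp)		/* overwrite the return address with the real target */
	ret				/* architecturally jumps to *%rax */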
Samuel Neves Jan. 22, 2018, 11:59 a.m. UTC | #7
On Mon, Jan 22, 2018 at 2:04 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> However, I suspect it actually has a slightly higher register
> pressure, since you'd need to have that zero register (zero being the
> "safe" value), and the only good way to get a zero value is the xor
> thing, which affects flags and thus needs to be before the cmp.
>
> In contrast, the sbb trick has no early inputs needed.

On the flipside, sbb carries a false dependency [*] on the destination
register. Imagine something like

divq %rcx
...
cmpq %rdi, %rsi
sbbq %rax,%rax

sbb needs to wait not only for the comparison, but also for the
earlier unrelated slow division. On the other hand, zeroing with mov
or xor breaks dependencies on the destination register, making the
computation independent of what came before.

[*] Recent AMD chips are smart enough to understand the sbb r,r idiom
and ignore the value of r, but as far as I know none of the Intel
chips do.
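A sketch of a mask construction that sidesteps the false dependency, using the dependency-breaking xor idiom at the cost of two extra instructions:

	xorl	%eax, %eax		/* zeroing idiom: no input dependencies at all */
	cmpq	%rdi, %rsi
	setb	%al			/* %rax = CF; waits only on the cmp */
	negq	%rax			/* 0 or all-ones, the same mask sbb would give */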

Patch

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 63f4320602a3..584f6d2236b3 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -35,6 +35,7 @@ 
 #include <asm/asm.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
+#include <asm/smap.h>
 #include <asm/export.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
@@ -260,6 +261,7 @@  entry_SYSCALL_64_fastpath:
 	cmpl	$__NR_syscall_max, %eax
 #endif
 	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
+	MASK_NOSPEC %r11 %rax			/* sanitize syscall_nr wrt speculation */
 	movq	%r10, %rcx
 
 	/*
diff --git a/arch/x86/include/asm/smap.h b/arch/x86/include/asm/smap.h
index 2b4ad4c6a226..3b5b2cf58dc6 100644
--- a/arch/x86/include/asm/smap.h
+++ b/arch/x86/include/asm/smap.h
@@ -35,7 +35,14 @@ 
  * this directs the cpu to speculate with a NULL ptr rather than
  * something targeting kernel memory.
  *
- * assumes CF is set from a previous 'cmp TASK_addr_limit, %ptr'
+ * In the syscall entry path it is possible to speculate past the
+ * validation of the system call number. Use MASK_NOSPEC to sanitize the
+ * syscall array index to zero (sys_read) rather than an arbitrary
+ * target.
+ *
+ * assumes CF is set from a previous 'cmp' i.e.:
+ *     cmp TASK_addr_limit, %ptr
+ *     cmp __NR_syscall_max, %idx
  */
 .macro MASK_NOSPEC mask val
 	sbb \mask, \mask
 	and \mask, \val
 .endm