Message ID | 151648239535.34747.4422108674633222531.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
---|---
State | New, archived
> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> The syscall table base is a user controlled function pointer in kernel
> space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out-of-bounds
> speculation. While retpoline prevents speculating into the user
> controlled target, it does not stop the pointer dereference; the concern
> is leaking memory relative to the syscall table base.
>
[..]
> @@ -260,6 +261,7 @@ entry_SYSCALL_64_fastpath:
> 	cmpl	$__NR_syscall_max, %eax
> #endif
> 	ja	1f			/* return -ENOSYS (already in pt_regs->ax) */
> +	MASK_NOSPEC %r11 %rax		/* sanitize syscall_nr wrt speculation */

What's the threat you're protecting against? No one does any computation
or data dependent loads based on the value being read. Or are you worried
about ASLR leaks due to merely speculatively loading from a bogus address?
> On Jan 21, 2018, at 11:16 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> The syscall table base is a user controlled function pointer in kernel
>> space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out-of-bounds
>> speculation. While retpoline prevents speculating into the user
>> controlled target, it does not stop the pointer dereference; the concern
>> is leaking memory relative to the syscall table base.
>>
[..]
>> 	ja	1f			/* return -ENOSYS (already in pt_regs->ax) */
>> +	MASK_NOSPEC %r11 %rax		/* sanitize syscall_nr wrt speculation */
>
> What's the threat you're protecting against? No one does any computation
> or data dependent loads based on the value being read. Or are you worried
> about ASLR leaks due to merely speculatively loading from a bogus address?

Also, the actual guard you're using is unfortunate. At the very least,
it's a maintenance disaster. Get rid of the macro and open code it. It's
two instructions, and it has crazy clobber requirements *and* a flag
dependency.

But I think it's also just a bad way to do it. You're making the
dependency chain longer in a place where it matters, and the retpoline
prevents the CPU from hiding that latency. Just use an immediate mask
with andq. Realistically, you should be fine rounding the table size up
to a power of two without even extending the table, since I think this,
at best, protects against ASLR bypass.
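As a concrete illustration of that immediate-mask idea, here is a minimal
userspace C sketch; the 512-entry rounding is a hypothetical value, not
the real table size:

/*
 * Bound the index with a plain immediate AND: no extra register, no
 * flags input, one ALU op on the dependency chain. Assumes the table
 * region is rounded up to a power of two (512 here is a hypothetical
 * rounding of __NR_syscall_max + 1).
 */
#include <stdint.h>

#define SYSCALL_TABLE_POW2	512

static inline uint64_t clamp_andq(uint64_t nr)
{
	return nr & (SYSCALL_TABLE_POW2 - 1);	/* compiles to andq $0x1ff, %reg */
}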
On Sun, Jan 21, 2018 at 8:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> The syscall table base is a user controlled function pointer in kernel
>> space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out-of-bounds
>> speculation. While retpoline prevents speculating into the user
>> controlled target, it does not stop the pointer dereference; the concern
>> is leaking memory relative to the syscall table base.
>>
[..]
>> 	ja	1f			/* return -ENOSYS (already in pt_regs->ax) */
>> +	MASK_NOSPEC %r11 %rax		/* sanitize syscall_nr wrt speculation */
>
> What's the threat you're protecting against? No one does any computation
> or data dependent loads based on the value being read. Or are you worried
> about ASLR leaks due to merely speculatively loading from a bogus address?

AFAIU the threat is that you could potentially use variant 1 to get
speculative RIP control using the indirect call through the syscall
table. (I haven't tested myself whether that'd work, and I don't know how
well it would work, especially when the limit for the comparison is
coming from an immediate, but apparently Dan thinks it's a potential
problem?)

In other words, this would again be an attack against the indirect call
"call *sys_call_table(, %rax, 8)", but this time the attack would rely on
the indirect branch being resolved *properly*, based on the address read
from memory, instead of relying on a misprediction; but using an index
into the syscall array that is out of bounds, after mispredicting the
conditional jump that verifies that the syscall number is in bounds. This
could still work even with retpolines if the CPU can correct
mispredictions within speculative execution.
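For readers following along, a minimal userspace C sketch of the pattern
being described; sys_call_table, NR_syscall_max, sys_ni() and dispatch()
are illustrative stand-ins for the real entry code:

#include <errno.h>

typedef long (*syscall_fn_t)(void);

#define NR_syscall_max 331	/* illustrative bound */

static long sys_ni(void) { return -ENOSYS; }
static syscall_fn_t sys_call_table[NR_syscall_max + 1] = {
	[0 ... NR_syscall_max] = sys_ni,	/* gcc range initializer */
};

long dispatch(unsigned long nr)
{
	if (nr > NR_syscall_max)	/* may be predicted not-taken... */
		return -ENOSYS;
	/*
	 * ...so the load and indirect call below can run speculatively
	 * with an out-of-bounds nr: the branch target then resolves
	 * "properly" from whatever pointer-sized value sits at
	 * sys_call_table + nr * 8, i.e. speculative RIP control.
	 */
	return sys_call_table[nr]();
}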
On Sun, Jan 21, 2018 at 11:31 AM, Jann Horn <jannh@google.com> wrote:
> On Sun, Jan 21, 2018 at 8:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Jan 20, 2018, at 1:06 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>
>>> The syscall table base is a user controlled function pointer in kernel
>>> space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out-of-bounds
>>> speculation. While retpoline prevents speculating into the user
>>> controlled target, it does not stop the pointer dereference; the concern
>>> is leaking memory relative to the syscall table base.
>>>
[..]
>>> 	ja	1f			/* return -ENOSYS (already in pt_regs->ax) */
>>> +	MASK_NOSPEC %r11 %rax		/* sanitize syscall_nr wrt speculation */
>>
>> What's the threat you're protecting against? No one does any computation
>> or data dependent loads based on the value being read. Or are you worried
>> about ASLR leaks due to merely speculatively loading from a bogus address?
>
> AFAIU the threat is that you could potentially use variant 1 to get
> speculative RIP control using the indirect call through the syscall
> table. (I haven't tested myself whether that'd work, and I don't know how
> well it would work, especially when the limit for the comparison is
> coming from an immediate, but apparently Dan thinks it's a potential
> problem?)
>
> In other words, this would again be an attack against the indirect call
> "call *sys_call_table(, %rax, 8)", but this time the attack would rely on
> the indirect branch being resolved *properly*, based on the address read
> from memory, instead of relying on a misprediction; but using an index
> into the syscall array that is out of bounds, after mispredicting the
> conditional jump that verifies that the syscall number is in bounds. This
> could still work even with retpolines if the CPU can correct
> mispredictions within speculative execution.

Fair enough. That being said:

1. Intel needs to document this crap. I don't want a situation where we
keep adding random mitigations in the hopes that we got them all.

2. Why on Earth is MASK_NOSPEC in smap.h?

3. What's with sbb; and? I can see two sane ways to do this. One is
cmovaq [something safe], %rax, *in the #ifdef CONFIG_RETPOLINE block*.
(Or is cmov subject to incorrect speculation, too? And, if cmov is, why
wouldn't sbb be just as bad? At least using cmov here is two whole ALU
ops shorter.) The other is andq $(syscall table size), %rax (again inside
the ifdef block) and to round the syscall table to one less than a power
of two.
In my mental model of how a CPU works, the cmov approach makes more
sense, since we're about to jump somewhere that depends on the syscall
nr, and the *whole point* of the mitigation here is to avoid jumping
anywhere until the CPU has fully caught up with reality wrt the syscall
number.
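A sketch of what the cmov variant might look like open-coded, as
userspace inline asm; the constraints and the zeroed "safe" register are
illustrative assumptions:

#include <stdint.h>

/*
 * cmova replaces an out-of-range index with a safe value (0) as a data
 * dependency rather than a predicted branch; cmova fires when nr > max
 * (CF=0 and ZF=0 after the cmp).
 */
static inline uint64_t clamp_cmova(uint64_t nr, uint64_t max)
{
	uint64_t safe = 0;	/* the "safe" value needs its own register */

	asm ("cmp %2, %0\n\t"	/* flags from nr - max */
	     "cmova %1, %0"	/* nr = (nr > max) ? safe : nr */
	     : "+r" (nr)
	     : "r" (safe), "r" (max)
	     : "cc");
	return nr;
}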
On Sun, Jan 21, 2018 at 5:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> 3. What's with sbb; and? I can see two sane ways to do this. One is
> cmovaq [something safe], %rax,

Heh. I think it's partly about being old-fashioned. sbb has always been
around, and is the traditional trick for 0/-1.

Also, my original suggested thing did the *access* too, and masked the
result with the same mask.

But I guess we could use cmov instead. It has very similar performance
(ie it was relatively slow on P4, but so was sbb).

However, I suspect it actually has a slightly higher register pressure,
since you'd need to have that zero register (zero being the "safe"
value), and the only good way to get a zero value is the xor thing, which
affects flags and thus needs to be before the cmp.

In contrast, the sbb trick has no early inputs needed.

So on the whole, 'cmov' may be more natural on a conceptual level, but
the sbb trick really is a very "traditional x86 thing" to do.

                Linus
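The sbb trick in isolation, as a minimal C sketch (the constraints are
illustrative; the instruction pair is the 0/-1 idiom described above):

#include <stdint.h>

/*
 * mask = (index < size) ? ~0UL : 0, computed without a branch: the cmp
 * sets CF on borrow (index < size), and sbb r,r yields 0 - CF. Note the
 * sequence needs no pre-zeroed register -- "no early inputs".
 */
static inline uint64_t mask_nospec(uint64_t index, uint64_t size)
{
	uint64_t mask;

	asm ("cmp %1, %2\n\t"	/* flags from index - size */
	     "sbb %0, %0"	/* mask = CF ? ~0 : 0 */
	     : "=r" (mask)
	     : "g" (size), "r" (index)
	     : "cc");
	return mask;
}

A caller would then do "index &= mask_nospec(index, size);" before the
table access, so an out-of-bounds index is forced to 0 even under
misspeculation.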
On Sun, Jan 21, 2018 at 6:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Jan 21, 2018 at 5:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> 3. What's with sbb; and? I can see two sane ways to do this. One is
>> cmovaq [something safe], %rax,
>
> Heh. I think it's partly about being old-fashioned. sbb has always been
> around, and is the traditional trick for 0/-1.
>
[..]
>
> So on the whole, 'cmov' may be more natural on a conceptual level, but
> the sbb trick really is a very "traditional x86 thing" to do.

Fair enough. That being said, what I *actually* want to do is to nuke
this thing entirely. I just wrote a patch to turn off the SYSCALL64 fast
path entirely when retpolines are on. Then this issue can be dealt with
in C.

I assume someone has a brilliant way to make gcc automatically do
something intelligent about guarded array access in C. </snicker>

Seriously, though, the retpolined fast path is actually slower than the
slow path on a "minimal" retpoline kernel (which is what I'm using
because Fedora hasn't pushed out a retpoline compiler yet), and I doubt
it's more than the tiniest speedup on a full retpoline kernel.

I've read a bunch of emails flying around saying that retpolines aren't
that bad. In very informal benchmarking, a single mispredicted ret (which
is what a retpoline is) seems to take 20-30 cycles on Skylake. That's
pretty far from "not bad".

Is IBRS somehow doing something that adversely affects code that doesn't
use indirect branches? Because I'm having a bit of a hard time imagining
IBRS hurting indirect branches worse than retpolines do.
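For context on why a retpoline is "a mispredicted ret": a rough,
hand-written approximation of a thunk for an indirect call through %rax,
not the exact in-tree __x86_indirect_thunk_rax:

/* File-scope asm in a C translation unit; x86-64 / gas syntax. */
asm (
	".text\n"
	".globl retpoline_rax_sketch\n"
	"retpoline_rax_sketch:\n"
	"	call 1f\n"		/* pushes &2f as the return address */
	"2:	pause\n"		/* ret speculation lands here... */
	"	lfence\n"		/* ...and spins harmlessly until resolved */
	"	jmp 2b\n"
	"1:	mov %rax, (%rsp)\n"	/* replace return address with real target */
	"	ret\n"			/* RSB predicts 2b; actual target is *%rax */
);

The ret is architecturally a jump to *%rax, but the return stack buffer
predicts a return to the pause/lfence loop, so the mispredict penalty is
paid on every indirect call.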
On Mon, Jan 22, 2018 at 2:04 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> However, I suspect it actually has a slightly higher register pressure,
> since you'd need to have that zero register (zero being the "safe"
> value), and the only good way to get a zero value is the xor thing, which
> affects flags and thus needs to be before the cmp.
>
> In contrast, the sbb trick has no early inputs needed.

On the flipside, sbb carries a false dependency [*] on the destination
register. Imagine something like

	divq	%rcx
	...
	cmpq	%rdi, %rsi
	sbbq	%rax, %rax

sbb needs to wait not only for the comparison, but also for the earlier
unrelated slow division. On the other hand, zeroing with mov or xor
breaks dependencies on the destination register, making the computation
independent of what came before.

[*] Recent AMD chips are smart enough to understand the sbb r,r idiom
and ignore the value of r, but as far as I know none of the Intel chips
do.
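A sketch of the dependency-breaking combination; per the point quoted
above, the xor has to come before the cmp because it clobbers flags
(constraints are illustrative):

#include <stdint.h>

/*
 * Same 0/-1 mask, but with the false dependency broken: the xor hands
 * sbb a freshly-renamed zero instead of whatever an earlier slow op
 * (e.g. a divide) left in the register.
 */
static inline uint64_t mask_nospec_nodep(uint64_t index, uint64_t size)
{
	uint64_t mask;

	asm ("xor %0, %0\n\t"	/* dependency-breaking zero idiom */
	     "cmp %1, %2\n\t"	/* CF = (index < size), unsigned */
	     "sbb %0, %0"	/* mask = CF ? ~0 : 0 */
	     : "=&r" (mask)	/* early clobber: must not alias the inputs */
	     : "g" (size), "r" (index)
	     : "cc");
	return mask;
}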
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 63f4320602a3..584f6d2236b3 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -35,6 +35,7 @@
 #include <asm/asm.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
+#include <asm/smap.h>
 #include <asm/export.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
@@ -260,6 +261,7 @@ entry_SYSCALL_64_fastpath:
 	cmpl	$__NR_syscall_max, %eax
 #endif
 	ja	1f			/* return -ENOSYS (already in pt_regs->ax) */
+	MASK_NOSPEC %r11 %rax		/* sanitize syscall_nr wrt speculation */
 	movq	%r10, %rcx

 	/*
diff --git a/arch/x86/include/asm/smap.h b/arch/x86/include/asm/smap.h
index 2b4ad4c6a226..3b5b2cf58dc6 100644
--- a/arch/x86/include/asm/smap.h
+++ b/arch/x86/include/asm/smap.h
@@ -35,7 +35,14 @@
  * this directs the cpu to speculate with a NULL ptr rather than
  * something targeting kernel memory.
  *
- * assumes CF is set from a previous 'cmp TASK_addr_limit, %ptr'
+ * In the syscall entry path it is possible to speculate past the
+ * validation of the system call number. Use MASK_NOSPEC to sanitize the
+ * syscall array index to zero (sys_read) rather than an arbitrary
+ * target.
+ *
+ * assumes CF is set from a previous 'cmp' i.e.:
+ * cmp TASK_addr_limit, %ptr
+ * cmp __NR_syscall_max, %idx
  */
 .macro MASK_NOSPEC mask val
 	sbb \mask, \mask
The syscall table base is a user controlled function pointer in kernel
space. Like 'get_user', use 'MASK_NOSPEC' to prevent any out-of-bounds
speculation. While retpoline prevents speculating into the user
controlled target, it does not stop the pointer dereference; the concern
is leaking memory relative to the syscall table base.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/entry_64.S   | 2 ++
 arch/x86/include/asm/smap.h | 9 ++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)
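At the C level, the effect of the added MASK_NOSPEC line on the dispatch
path is roughly the following sketch (names and the bound are
illustrative; note an out-of-bounds index is clamped to 0, i.e. sys_read,
as the smap.h comment says):

#include <errno.h>
#include <stdint.h>

typedef long (*syscall_fn_t)(void);

#define NR_syscall_max 331			/* illustrative bound */
extern syscall_fn_t sys_call_table[];		/* hypothetical table */

long dispatch_masked(uint64_t nr)
{
	uint64_t mask;

	if (nr > NR_syscall_max)
		return -ENOSYS;
	asm ("cmp %1, %2\n\t"	/* CF = (nr < bound), unsigned */
	     "sbb %0, %0"	/* mask = CF ? ~0 : 0 */
	     : "=r" (mask)
	     : "g" ((uint64_t)NR_syscall_max + 1), "r" (nr)
	     : "cc");
	nr &= mask;	/* data dependency: even if the bounds check above
			 * is mispredicted, the index cannot pass the table */
	return sys_call_table[nr]();
}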