Message ID | 20190214101429.GD32494@hirez.programming.kicks-ass.net (mailing list archive)
---|---
State | New, archived
Series | sched/x86: Save [ER]FLAGS on context switch
On Thu, Feb 14, 2019 at 5:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 13, 2019 at 02:49:47PM -0800, Andy Lutomirski wrote:
>
> > Do we need to backport this thing?
>
> Possibly, just to be safe.
>
> > The problem can’t be too widespread or we would have heard of it before.
>
> Yes, so far we've been lucky.
>
> ---
> Subject: sched/x86: Save [ER]FLAGS on context switch
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Thu Feb 14 10:30:52 CET 2019
>
> Effectively revert commit:
>
>   2c7577a75837 ("sched/x86_64: Don't save flags on context switch")
>
> Specifically because SMAP uses FLAGS.AC which invalidates the claim
> that the kernel has clean flags.
>
> In particular; while preemption from interrupt return is fine (the
> IRET frame on the exception stack contains FLAGS) it breaks any code
> that does synchronous scheduling, including preempt_enable().
>
> This has become a significant issue ever since commit:
>
>   5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")
>
> provided a means of having 'normal' C code between STAC / CLAC,
> exposing the FLAGS.AC state. So far this hasn't led to trouble,
> however fix it before it comes apart.
>
> Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")
> Acked-by: Andy Lutomirski <luto@amacapital.net>
> Reported-by: Julien Thierry <julien.thierry@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/entry/entry_32.S        | 2 ++
>  arch/x86/entry/entry_64.S        | 2 ++
>  arch/x86/include/asm/switch_to.h | 1 +
>  3 files changed, 5 insertions(+)
>
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -650,6 +650,7 @@ ENTRY(__switch_to_asm)
>  	pushl	%ebx
>  	pushl	%edi
>  	pushl	%esi
> +	pushfl
>
>  	/* switch stack */
>  	movl	%esp, TASK_threadsp(%eax)
> @@ -672,6 +673,7 @@ ENTRY(__switch_to_asm)
>  #endif
>
>  	/* restore callee-saved registers */
> +	popfl
>  	popl	%esi
>  	popl	%edi
>  	popl	%ebx
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -291,6 +291,7 @@ ENTRY(__switch_to_asm)
>  	pushq	%r13
>  	pushq	%r14
>  	pushq	%r15
> +	pushfq
>
>  	/* switch stack */
>  	movq	%rsp, TASK_threadsp(%rdi)
> @@ -313,6 +314,7 @@ ENTRY(__switch_to_asm)
>  #endif
>
>  	/* restore callee-saved registers */
> +	popfq
>  	popq	%r15
>  	popq	%r14
>  	popq	%r13
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
>   * order of the fields must match the code in __switch_to_asm().
>   */
>  struct inactive_task_frame {
> +	unsigned long flags;
>  #ifdef CONFIG_X86_64
>  	unsigned long r15;
>  	unsigned long r14;

flags should be initialized in copy_thread_tls().  I think the new
stack is zeroed already, but it would be better to explicitly set it.

--
Brian Gerst
On Thu, Feb 14, 2019 at 11:18:01AM -0500, Brian Gerst wrote: > > --- a/arch/x86/include/asm/switch_to.h > > +++ b/arch/x86/include/asm/switch_to.h > > @@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void); > > * order of the fields must match the code in __switch_to_asm(). > > */ > > struct inactive_task_frame { > > + unsigned long flags; > > #ifdef CONFIG_X86_64 > > unsigned long r15; > > unsigned long r14; > > flags should be initialized in copy_thread_tls(). I think the new > stack is zeroed already, but it would be better to explicitly set it. Ah indeed. I somehow misread that code and thought we got initialized with a copy of current. Something like the below, right? --- --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -127,6 +127,7 @@ int copy_thread_tls(unsigned long clone_ struct task_struct *tsk; int err; + frame->flags = 0; frame->bp = 0; frame->ret_addr = (unsigned long) ret_from_fork; p->thread.sp = (unsigned long) fork_frame; --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -392,6 +392,7 @@ int copy_thread_tls(unsigned long clone_ childregs = task_pt_regs(p); fork_frame = container_of(childregs, struct fork_frame, regs); frame = &fork_frame->frame; + frame->flags = 0; frame->bp = 0; frame->ret_addr = (unsigned long) ret_from_fork; p->thread.sp = (unsigned long) fork_frame;
On Thu, Feb 14, 2019 at 2:34 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Feb 14, 2019 at 11:18:01AM -0500, Brian Gerst wrote: > > > > --- a/arch/x86/include/asm/switch_to.h > > > +++ b/arch/x86/include/asm/switch_to.h > > > @@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void); > > > * order of the fields must match the code in __switch_to_asm(). > > > */ > > > struct inactive_task_frame { > > > + unsigned long flags; > > > #ifdef CONFIG_X86_64 > > > unsigned long r15; > > > unsigned long r14; > > > > flags should be initialized in copy_thread_tls(). I think the new > > stack is zeroed already, but it would be better to explicitly set it. > > Ah indeed. I somehow misread that code and thought we got initialized > with a copy of current. > > Something like the below, right? > > --- > > --- a/arch/x86/kernel/process_32.c > +++ b/arch/x86/kernel/process_32.c > @@ -127,6 +127,7 @@ int copy_thread_tls(unsigned long clone_ > struct task_struct *tsk; > int err; > > + frame->flags = 0; > frame->bp = 0; > frame->ret_addr = (unsigned long) ret_from_fork; > p->thread.sp = (unsigned long) fork_frame; > --- a/arch/x86/kernel/process_64.c > +++ b/arch/x86/kernel/process_64.c > @@ -392,6 +392,7 @@ int copy_thread_tls(unsigned long clone_ > childregs = task_pt_regs(p); > fork_frame = container_of(childregs, struct fork_frame, regs); > frame = &fork_frame->frame; > + frame->flags = 0; > frame->bp = 0; > frame->ret_addr = (unsigned long) ret_from_fork; > p->thread.sp = (unsigned long) fork_frame; Yes, this looks good. -- Brian Gerst
On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > Something like the below, right? > > + frame->flags = 0; > + frame->flags = 0; Those are not valid flag values. Can you popf them? Yes. Do they make sense? No. It has the IF flag clear, for example. Is that intentional? If it is, it should likely be documented. Because IF clear means "interrupts disabled". Are the places that load flags in irq disabled regions? Maybe they are, maybe they aren't, but shouldn't this be something that is made explicit, rather than "oh, let's initialize to zero like all the other registers we don't care about". Because a zero initializer for eflags really is odd. Linus
On Fri, Feb 15, 2019 at 09:18:00AM -0800, Linus Torvalds wrote: > On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > Something like the below, right? > > > > + frame->flags = 0; > > + frame->flags = 0; > > Those are not valid flag values. > > Can you popf them? Yes. > > Do they make sense? No. > > It has the IF flag clear, for example. Is that intentional? If it is, Uhmm. yeah, that's bonkers. We should have interrupts disabled here. I'll go read up on the eflags and figure out what they _should_ be right about there.
> On Feb 15, 2019, at 9:40 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> On Fri, Feb 15, 2019 at 09:18:00AM -0800, Linus Torvalds wrote: >>> On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote: >>> >>> Something like the below, right? >>> >>> + frame->flags = 0; >>> + frame->flags = 0; >> >> Those are not valid flag values. >> >> Can you popf them? Yes. >> >> Do they make sense? No. >> >> It has the IF flag clear, for example. Is that intentional? If it is, > > Uhmm. yeah, that's bonkers. We should have interrupts disabled here. > I'll go read up on the eflags and figure out what they _should_ be right > about there. And probably add a comment near the POPF explaining that it will keep IRQs off :)
On Fri, Feb 15, 2019 at 06:40:34PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 15, 2019 at 09:18:00AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Something like the below, right?
> > >
> > > +	frame->flags = 0;
> > > +	frame->flags = 0;
> >
> > Those are not valid flag values.
> >
> > Can you popf them? Yes.
> >
> > Do they make sense? No.
> >
> > It has the IF flag clear, for example. Is that intentional? If it is,
>
> Uhmm. yeah, that's bonkers. We should have interrupts disabled here.
> I'll go read up on the eflags and figure out what they _should_ be right
> about there.

I misread (I'm forever confused about which way around IF goes), but you
said it right; IF=0 is interrupts disabled, and we very much have that in
the middle of a context switch.

(just for giggles I set IF for the initial flags value; and it comes
unstuck _real_ quick)

Now, EFLAGS bit 1 is supposedly always 1, but it really doesn't seem to
matter for POPF.

I went through the other flags, and aside from VIP/VIF (I've no clue),
they look like 0 should be just fine.
On Fri, Feb 15, 2019 at 3:34 PM Peter Zijlstra <peterz@infradead.org> wrote: > > Now, EFLAGS bit 1 is supposedly always 1, but it really doesn't seem to > matter for POPF. Correct, it's "read as 1", you can try to write it and it doesn't matter. > I went through the other flags, and aside from VIP/VIF (I've no clue), > they looks like 0 should be just fine. So 0 is a perfectly valid initializer in the sense that it _works_, I just want it to be something that was thought about, not just a random "initialize to zero" without thinking. Even just a comment about it would be fine. But it might also be good to show that it's an explicit eflags value and just use X86_EFLAGS_FIXED as the initializer. Linus
On February 14, 2019 2:14:29 AM PST, Peter Zijlstra <peterz@infradead.org> wrote: >On Wed, Feb 13, 2019 at 02:49:47PM -0800, Andy Lutomirski wrote: > >> Do we need to backport this thing? > >Possibly, just to be safe. > >> The problem can’t be too widespread or we would have heard of it >before. > >Yes, so far we've been lucky. > >--- >Subject: sched/x86: Save [ER]FLAGS on context switch >From: Peter Zijlstra <peterz@infradead.org> >Date: Thu Feb 14 10:30:52 CET 2019 > >Effectively revert commit: > > 2c7577a75837 ("sched/x86_64: Don't save flags on context switch") > >Specifically because SMAP uses FLAGS.AC which invalidates the claim >that the kernel has clean flags. > >In particular; while preemption from interrupt return is fine (the >IRET frame on the exception stack contains FLAGS) it breaks any code >that does synchonous scheduling, including preempt_enable(). > >This has become a significant issue ever since commit: > >5b24a7a2aa20 ("Add 'unsafe' user access functions for batched >accesses") > >provided for means of having 'normal' C code between STAC / CLAC, >exposing the FLAGS.AC state. So far this hasn't led to trouble, >however fix it before it comes apart. > >Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched >accesses") >Acked-by: Andy Lutomirski <luto@amacapital.net> >Reported-by: Julien Thierry <julien.thierry@arm.com> >Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> >--- > arch/x86/entry/entry_32.S | 2 ++ > arch/x86/entry/entry_64.S | 2 ++ > arch/x86/include/asm/switch_to.h | 1 + > 3 files changed, 5 insertions(+) > >--- a/arch/x86/entry/entry_32.S >+++ b/arch/x86/entry/entry_32.S >@@ -650,6 +650,7 @@ ENTRY(__switch_to_asm) > pushl %ebx > pushl %edi > pushl %esi >+ pushfl > > /* switch stack */ > movl %esp, TASK_threadsp(%eax) >@@ -672,6 +673,7 @@ ENTRY(__switch_to_asm) > #endif > > /* restore callee-saved registers */ >+ popfl > popl %esi > popl %edi > popl %ebx >--- a/arch/x86/entry/entry_64.S >+++ b/arch/x86/entry/entry_64.S >@@ -291,6 +291,7 @@ ENTRY(__switch_to_asm) > pushq %r13 > pushq %r14 > pushq %r15 >+ pushfq > > /* switch stack */ > movq %rsp, TASK_threadsp(%rdi) >@@ -313,6 +314,7 @@ ENTRY(__switch_to_asm) > #endif > > /* restore callee-saved registers */ >+ popfq > popq %r15 > popq %r14 > popq %r13 >--- a/arch/x86/include/asm/switch_to.h >+++ b/arch/x86/include/asm/switch_to.h >@@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void); > * order of the fields must match the code in __switch_to_asm(). > */ > struct inactive_task_frame { >+ unsigned long flags; > #ifdef CONFIG_X86_64 > unsigned long r15; > unsigned long r14; This implies we invoke schedule -- a restricted operation (consider may_sleep) during execution of STAC-enabled code, but *not* as an exception or interrupt, since those preserve the flags. I have serious concerns about this. This is more or less saying that we have left an unlimited gap and have had AC escape. Does *anyone* see a need to allow this? I got a question at LPC from someone about this, and what they were trying to do once all the layers had been unwound was so far down the wrong track for a root problem that actually has a very simple solution.
On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote: > This implies we invoke schedule -- a restricted operation (consider > may_sleep) during execution of STAC-enabled code, but *not* as an > exception or interrupt, since those preserve the flags. Meet preempt_enable(). > I have serious concerns about this. This is more or less saying that > we have left an unlimited gap and have had AC escape. Yes; by allowing normal C in between STAC and CLAC this is so. > Does *anyone* see a need to allow this? I got a question at LPC from > someone about this, and what they were trying to do once all the > layers had been unwound was so far down the wrong track for a root > problem that actually has a very simple solution. Have you read the rest of the thread? All it takes for this to explode is a call to a C function that has tracing on it in between the user_access_begin() and user_access_end() things. That is a _very_ easy thing to do. Heck, GCC is allowed to generate that broken code compiling lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp. with CONFIG_OPTIMIZE_INLINING), and making that a function call would get us fentry hooks and ftrace and *BOOM*. (Now, of course, its a static function with a single caller, and GCC isn't _THAT_ stupid, but it could, if it wanted to) Since the bar is _that_ low for mistakes, I figure we'd better fix it.
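To make the hazard Peter describes concrete, here is a minimal sketch (illustrative only; the helper names are invented) of how ordinary-looking C between user_access_begin() and user_access_end() can end up calling into common kernel code, and hence the scheduler, with EFLAGS.AC still set:

```c
/*
 * Illustrative sketch, not kernel code: shows the failure mode where a
 * real CALL is emitted inside an STAC/CLAC region.
 */
static int read_user_field(u32 __user *uptr, u32 *val)
{
	if (!user_access_begin(uptr, sizeof(*uptr)))	/* STAC: AC=1 */
		return -EFAULT;

	unsafe_get_user(*val, uptr, efault);

	/*
	 * Anything here that turns into a real call -- an out-of-lined
	 * helper with an fentry/ftrace hook, a preempt_enable() that
	 * reschedules, etc. -- now runs with AC set; and without saving
	 * FLAGS in __switch_to_asm(), a context switch leaks AC into the
	 * next task.
	 */
	audit_user_value(*val);		/* hypothetical traced helper */

	user_access_end();		/* CLAC: AC=0 */
	return 0;

efault:
	user_access_end();
	return -EFAULT;
}
```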
On Fri, Feb 15, 2019 at 04:21:46PM -0800, Linus Torvalds wrote: > Even just a comment about it would be fine. But it might also be good > to show that it's an explicit eflags value and just use > X86_EFLAGS_FIXED as the initializer. That is in fact what I have now; I'll repost on Monday or so.
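Presumably the repost initializes the new field along the lines Linus suggests; a sketch of the expected copy_thread_tls() change (X86_EFLAGS_FIXED is the architecturally reserved, always-set bit 1, so IF stays clear and interrupts stay disabled across the first switch):

```c
	/*
	 * Sketch of the expected v2 initialization in copy_thread_tls(),
	 * for both process_32.c and process_64.c: give the saved flags a
	 * well-defined value instead of 0.
	 */
	frame->flags = X86_EFLAGS_FIXED;	/* bit 1 set; IF (and AC) clear */
	frame->bp = 0;
	frame->ret_addr = (unsigned long) ret_from_fork;
	p->thread.sp = (unsigned long) fork_frame;
```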
On 2/16/19 2:30 AM, Peter Zijlstra wrote: > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote: >> This implies we invoke schedule -- a restricted operation (consider >> may_sleep) during execution of STAC-enabled code, but *not* as an >> exception or interrupt, since those preserve the flags. > > Meet preempt_enable(). I believe this falls under "doctor, it hurts when I do that." And it hurts for very good reasons. See below. >> I have serious concerns about this. This is more or less saying that >> we have left an unlimited gap and have had AC escape. > > Yes; by allowing normal C in between STAC and CLAC this is so. > >> Does *anyone* see a need to allow this? I got a question at LPC from >> someone about this, and what they were trying to do once all the >> layers had been unwound was so far down the wrong track for a root >> problem that actually has a very simple solution. > > Have you read the rest of the thread? > > All it takes for this to explode is a call to a C function that has > tracing on it in between the user_access_begin() and user_access_end() > things. That is a _very_ easy thing to do. > > Heck, GCC is allowed to generate that broken code compiling > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp. > with CONFIG_OPTIMIZE_INLINING), and making that a function call would > get us fentry hooks and ftrace and *BOOM*. > > (Now, of course, its a static function with a single caller, and GCC > isn't _THAT_ stupid, but it could, if it wanted to) > > Since the bar is _that_ low for mistakes, I figure we'd better fix it. > The question is what "fix it" means. I'm really concerned about AC escapes, and everyone else should be, too. For an issue specific to tracing, it would be more appropriate to do the appropriate things in the tracing entry/exit than in schedule. Otherwise, we don't want to silently paper over mistakes which could mean that we run a large portion of the kernel without protection we thought we had. In that sense, calling schedule() with AC set is in fact a great place to have a WARN() or even BUG(), because such an event really could imply that the kernel has been security compromised. Does that make more sense? -hpa
On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: > > The question is what "fix it" means. I'm really concerned about AC escapes, > and everyone else should be, too. I do think that it might be the right thing to do to add some kind of WARN_ON_ONCE() for AC being set in various can-reschedule situations. We'd just have to abstract it sanely. I'm sure arm64 has the exact same issue with PAN - maybe it saves properly, but the same "we wouldn't want to go through the scheduler with PAN clear". On x86, we might as well check DF at the same time as AC. Linus
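A sketch of what such a check could look like on x86 (illustrative; the helper name and call site are invented, while native_save_fl() and the flag constants are existing kernel APIs):

```c
#include <linux/bug.h>			/* WARN_ON_ONCE() */
#include <asm/irqflags.h>		/* native_save_fl() */
#include <asm/processor-flags.h>	/* X86_EFLAGS_AC, X86_EFLAGS_DF */

/*
 * Hypothetical debug assertion for "about to reschedule" paths
 * (e.g. schedule_debug()): the kernel should never get here with the
 * SMAP override (AC) or the direction flag (DF) still set.  An
 * arch-neutral wrapper would let arm64 check PSTATE.PAN the same way.
 */
static inline void assert_flags_clean_on_resched(void)
{
	unsigned long flags = native_save_fl();

	WARN_ON_ONCE(flags & (X86_EFLAGS_AC | X86_EFLAGS_DF));
}
```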
> On Feb 18, 2019, at 4:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: >> >> The question is what "fix it" means. I'm really concerned about AC escapes, >> and everyone else should be, too. > > I do think that it might be the right thing to do to add some kind of > WARN_ON_ONCE() for AC being set in various can-reschedule situations. > > We'd just have to abstract it sanely. I'm sure arm64 has the exact > same issue with PAN - maybe it saves properly, but the same "we > wouldn't want to go through the scheduler with PAN clear". > > On x86, we might as well check DF at the same time as AC. > > hpa is right, though — calling into tracing code with AC set is not really so good. And calling schedule() (via preempt_enable() or whatever) is also bad because it runs all the scheduler code with AC on. Admittedly, the scheduler is not *that* interesting of an attack surface.
On 2/18/19 6:20 PM, Andy Lutomirski wrote: > > >> On Feb 18, 2019, at 4:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >>> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: >>> >>> The question is what "fix it" means. I'm really concerned about AC escapes, >>> and everyone else should be, too. >> >> I do think that it might be the right thing to do to add some kind of >> WARN_ON_ONCE() for AC being set in various can-reschedule situations. >> >> We'd just have to abstract it sanely. I'm sure arm64 has the exact >> same issue with PAN - maybe it saves properly, but the same "we >> wouldn't want to go through the scheduler with PAN clear". >> >> On x86, we might as well check DF at the same time as AC. >> > > hpa is right, though — calling into tracing code with AC set is not really so good. And calling schedule() (via preempt_enable() or whatever) is also bad because it runs all the scheduler code with AC on. Admittedly, the scheduler is not *that* interesting of an attack surface. > Not just that, but the other question is just how much code we are running with AC open. It really should only be done in some very small regions. -hpa
On 19/02/2019 00:24, Linus Torvalds wrote: > On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: >> >> The question is what "fix it" means. I'm really concerned about AC escapes, >> and everyone else should be, too. > > I do think that it might be the right thing to do to add some kind of > WARN_ON_ONCE() for AC being set in various can-reschedule situations. > > We'd just have to abstract it sanely. I'm sure arm64 has the exact > same issue with PAN - maybe it saves properly, but the same "we > wouldn't want to go through the scheduler with PAN clear". > As of right now, we have the same issue on arm64 as on x86. We don't currently save the PAN bit on task switch, but I have a patch to do that. Unless we decide to go down the route of warning against uses of schedule() inside. As for the abstraction, I had this patch[1] that added another primitive for the user_access API (although this might not be suited for x86 if you also want to check DF). However, an issue that appears is where to perform the check to cover enough ground. Checking inside the schedule() code you only cover cases where things have already gone wrong, and not the use of functions that are unsafe to call inside a user_access region. [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2019-January/625385.html Cheers,
On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote: > On 2/16/19 2:30 AM, Peter Zijlstra wrote: > > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote: > >> This implies we invoke schedule -- a restricted operation (consider > >> may_sleep) during execution of STAC-enabled code, but *not* as an > >> exception or interrupt, since those preserve the flags. > > > > Meet preempt_enable(). > > I believe this falls under "doctor, it hurts when I do that." And it hurts for > very good reasons. See below. I disagree; the basic rule is that if you're preemptible you must also be able to schedule and vice-versa. These AC regions violate this. And, like illustrated, it is _very_ easy to get all sorts inside that AC crap. > >> I have serious concerns about this. This is more or less saying that > >> we have left an unlimited gap and have had AC escape. > > > > Yes; by allowing normal C in between STAC and CLAC this is so. > > > >> Does *anyone* see a need to allow this? I got a question at LPC from > >> someone about this, and what they were trying to do once all the > >> layers had been unwound was so far down the wrong track for a root > >> problem that actually has a very simple solution. > > > > Have you read the rest of the thread? > > > > All it takes for this to explode is a call to a C function that has > > tracing on it in between the user_access_begin() and user_access_end() > > things. That is a _very_ easy thing to do. > > > > Heck, GCC is allowed to generate that broken code compiling > > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp. > > with CONFIG_OPTIMIZE_INLINING), and making that a function call would > > get us fentry hooks and ftrace and *BOOM*. > > > > (Now, of course, its a static function with a single caller, and GCC > > isn't _THAT_ stupid, but it could, if it wanted to) > > > > Since the bar is _that_ low for mistakes, I figure we'd better fix it. > > > > The question is what "fix it" means. It means that when we do schedule, the next task doesn't have AC set, and when we schedule back, we'll restore our AC when we had it. Unlike the current situation, where the next task will run with AC set. IOW, it confines AC to the task context where it was set. > I'm really concerned about AC escapes, > and everyone else should be, too. Then _that_ should be asserted. > For an issue specific to tracing, it would be more appropriate to do the > appropriate things in the tracing entry/exit than in schedule. Otherwise, we > don't want to silently paper over mistakes which could mean that we run a > large portion of the kernel without protection we thought we had. > > In that sense, calling schedule() with AC set is in fact a great place to have > a WARN() or even BUG(), because such an event really could imply that the > kernel has been security compromised. It is not specific to tracing, tracing is just one of the most trivial and non-obvious ways to make it go splat. There's lot of fairly innocent stuff that does preempt_disable() / preempt_enable(). And just a warning in schedule() isn't sufficient, you'd have to actually trigger a reschedule before you know your code is bad. And like I said; the invariant is: if you're preemptible you can schedule and v.v. Now, clearly you don't want to mark these whole regions as !preemptible, because then you can also not take faults, but somehow you're not worried about the whole fault handler, but you are worried about the scheduler ?!? How does that work? The fault handler can touch a _ton_ more code. 
Heck, the fault handler can schedule. So either pre-fault, and run the whole AC crap with preemption disabled and retry, or accept that we have to schedule. > Does that make more sense? It appears to me you're going about it backwards.
On 19/02/2019 02:46, H. Peter Anvin wrote: > On 2/18/19 6:20 PM, Andy Lutomirski wrote: >> >> >>> On Feb 18, 2019, at 4:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> >>>> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: >>>> >>>> The question is what "fix it" means. I'm really concerned about AC escapes, >>>> and everyone else should be, too. >>> >>> I do think that it might be the right thing to do to add some kind of >>> WARN_ON_ONCE() for AC being set in various can-reschedule situations. >>> >>> We'd just have to abstract it sanely. I'm sure arm64 has the exact >>> same issue with PAN - maybe it saves properly, but the same "we >>> wouldn't want to go through the scheduler with PAN clear". >>> >>> On x86, we might as well check DF at the same time as AC. >>> >> >> hpa is right, though — calling into tracing code with AC set is not really so good. And calling schedule() (via preempt_enable() or whatever) is also bad because it runs all the scheduler code with AC on. Admittedly, the scheduler is not *that* interesting of an attack surface. >> > > Not just that, but the other question is just how much code we are running > with AC open. It really should only be done in some very small regions. Yes, but we don't really have a way to enforce that, as far as I'm aware. The user_access_begin/end() is generic API, meaning any arch is free to implement it. If they don't have the same hardware behaviour as x86/arm64, it might be that their interrupt/exception entry code will run with user_access open until they reach the entry code that closes it (and entry code could potentially be a more interesting attack surface than the scheduler). This could be the case of software emulated PAN on arm/arm64 (although currently arm, non-64bit, doesn't have user_access_begin/end() at the time). So the whole "very small region" restriction sounds a bit loose/arbitrary to me... Thanks,
On Mon, Feb 18, 2019 at 04:24:30PM -0800, Linus Torvalds wrote: > On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: > > > > The question is what "fix it" means. I'm really concerned about AC escapes, > > and everyone else should be, too. > > I do think that it might be the right thing to do to add some kind of > WARN_ON_ONCE() for AC being set in various can-reschedule situations. So I disagree. Either we set AC with preempt disabled, and then we don't need an extra warning, because we already have a warning about scheduling with preemption disabled, or we accept that the fault handler can run.
On Tue, Feb 19, 2019 at 10:15:25AM +0100, Peter Zijlstra wrote: > On Mon, Feb 18, 2019 at 04:24:30PM -0800, Linus Torvalds wrote: > > On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote: > > > > > > The question is what "fix it" means. I'm really concerned about AC escapes, > > > and everyone else should be, too. > > > > I do think that it might be the right thing to do to add some kind of > > WARN_ON_ONCE() for AC being set in various can-reschedule situations. > > So I disagree. > > Either we set AC with preempt disabled, and then we don't need an extra > warning, because we already have a warning about scheduling with > preemption disabled, or we accept that the fault handler can run. n/m about the faults, forgot the obvious :/ I still really dislike wrecking the preemption model over this.
On February 19, 2019 1:04:09 AM PST, Peter Zijlstra <peterz@infradead.org> wrote: >On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote: >> On 2/16/19 2:30 AM, Peter Zijlstra wrote: >> > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote: >> >> This implies we invoke schedule -- a restricted operation >(consider >> >> may_sleep) during execution of STAC-enabled code, but *not* as an >> >> exception or interrupt, since those preserve the flags. >> > >> > Meet preempt_enable(). >> >> I believe this falls under "doctor, it hurts when I do that." And it >hurts for >> very good reasons. See below. > >I disagree; the basic rule is that if you're preemptible you must also >be able to schedule and vice-versa. These AC regions violate this. > >And, like illustrated, it is _very_ easy to get all sorts inside that >AC >crap. > >> >> I have serious concerns about this. This is more or less saying >that >> >> we have left an unlimited gap and have had AC escape. >> > >> > Yes; by allowing normal C in between STAC and CLAC this is so. >> > >> >> Does *anyone* see a need to allow this? I got a question at LPC >from >> >> someone about this, and what they were trying to do once all the >> >> layers had been unwound was so far down the wrong track for a root >> >> problem that actually has a very simple solution. >> > >> > Have you read the rest of the thread? >> > >> > All it takes for this to explode is a call to a C function that has >> > tracing on it in between the user_access_begin() and >user_access_end() >> > things. That is a _very_ easy thing to do. >> > >> > Heck, GCC is allowed to generate that broken code compiling >> > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user >(esp. >> > with CONFIG_OPTIMIZE_INLINING), and making that a function call >would >> > get us fentry hooks and ftrace and *BOOM*. >> > >> > (Now, of course, its a static function with a single caller, and >GCC >> > isn't _THAT_ stupid, but it could, if it wanted to) >> > >> > Since the bar is _that_ low for mistakes, I figure we'd better fix >it. >> > >> >> The question is what "fix it" means. > >It means that when we do schedule, the next task doesn't have AC set, >and when we schedule back, we'll restore our AC when we had it. Unlike >the current situation, where the next task will run with AC set. > >IOW, it confines AC to the task context where it was set. > >> I'm really concerned about AC escapes, >> and everyone else should be, too. > >Then _that_ should be asserted. > >> For an issue specific to tracing, it would be more appropriate to do >the >> appropriate things in the tracing entry/exit than in schedule. >Otherwise, we >> don't want to silently paper over mistakes which could mean that we >run a >> large portion of the kernel without protection we thought we had. >> >> In that sense, calling schedule() with AC set is in fact a great >place to have >> a WARN() or even BUG(), because such an event really could imply that >the >> kernel has been security compromised. > >It is not specific to tracing, tracing is just one of the most trivial >and non-obvious ways to make it go splat. > >There's lot of fairly innocent stuff that does preempt_disable() / >preempt_enable(). And just a warning in schedule() isn't sufficient, >you'd have to actually trigger a reschedule before you know your code >is >bad. > >And like I said; the invariant is: if you're preemptible you can >schedule and v.v. 
> >Now, clearly you don't want to mark these whole regions as >!preemptible, >because then you can also not take faults, but somehow you're not >worried about the whole fault handler, but you are worried about the >scheduler ?!? How does that work? The fault handler can touch a _ton_ >more code. Heck, the fault handler can schedule. > >So either pre-fault, and run the whole AC crap with preemption disabled >and retry, or accept that we have to schedule. > >> Does that make more sense? > >It appears to me you're going about it backwards. I'm not worried about the fault handler, because AC is always presented/disabled on exception entry; otherwise I would of course be very concerned. To be clear: I'm not worried about the scheduler itself. I'm worried about how much code we have gone through to get there. Perhaps the scheduler itself is not the right point to probe for it, but we do need to catch things that have gone wrong, rather than just leaving the door wide open. I would personally be far more comfortable saying you have to disable preemption in user access regions. It may be an unnecessary constraint for x86 and ARM64, but it is safe. And Julien, yes, it is "somewhat arbitrary", but so are many engineering tradeoffs. Not everything has a very sharply definable line.
On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote: > > Does that make more sense? > > It appears to me you're going about it backwards. So how about you do a GCC plugin that verifies limits on code-gen between user_access_begin/user_access_end() ? - No CALL/RET - implies user_access_end() happens - implies no fentry hooks - No __preempt_count frobbing - No tracepoints - ... That way you put the burden on the special code, not on the rest of the kernel.
On Tue, 19 Feb 2019, Peter Zijlstra wrote: > On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote: > > > Does that make more sense? > > > > It appears to me you're going about it backwards. > > So how about you do a GCC plugin that verifies limits on code-gen > between user_access_begin/user_access_end() ? > > - No CALL/RET > - implies user_access_end() happens > - implies no fentry hooks > - No __preempt_count frobbing > - No tracepoints > - ... > > That way you put the burden on the special code, not on the rest of the > kernel. And then you have kprobes ....
On Tue, Feb 19, 2019 at 12:38:42PM +0100, Thomas Gleixner wrote: > On Tue, 19 Feb 2019, Peter Zijlstra wrote: > > > On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote: > > > > Does that make more sense? > > > > > > It appears to me you're going about it backwards. > > > > So how about you do a GCC plugin that verifies limits on code-gen > > between user_access_begin/user_access_end() ? > > > > - No CALL/RET > > - implies user_access_end() happens > > - implies no fentry hooks > > - No __preempt_count frobbing > > - No tracepoints > > - ... > > > > That way you put the burden on the special code, not on the rest of the > > kernel. > > And then you have kprobes .... They prod the INT3 byte and then take an exception, and exceptions are 'fine'.
On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote: > On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote: > > On 2/16/19 2:30 AM, Peter Zijlstra wrote: > > > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote: > > >> This implies we invoke schedule -- a restricted operation (consider > > >> may_sleep) during execution of STAC-enabled code, but *not* as an > > >> exception or interrupt, since those preserve the flags. > > > > > > Meet preempt_enable(). > > > > I believe this falls under "doctor, it hurts when I do that." And it hurts for > > very good reasons. See below. > > I disagree; the basic rule is that if you're preemptible you must also > be able to schedule and vice-versa. These AC regions violate this. > > And, like illustrated, it is _very_ easy to get all sorts inside that AC > crap. > > > >> I have serious concerns about this. This is more or less saying that > > >> we have left an unlimited gap and have had AC escape. > > > > > > Yes; by allowing normal C in between STAC and CLAC this is so. > > > > > >> Does *anyone* see a need to allow this? I got a question at LPC from > > >> someone about this, and what they were trying to do once all the > > >> layers had been unwound was so far down the wrong track for a root > > >> problem that actually has a very simple solution. > > > > > > Have you read the rest of the thread? > > > > > > All it takes for this to explode is a call to a C function that has > > > tracing on it in between the user_access_begin() and user_access_end() > > > things. That is a _very_ easy thing to do. > > > > > > Heck, GCC is allowed to generate that broken code compiling > > > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp. > > > with CONFIG_OPTIMIZE_INLINING), and making that a function call would > > > get us fentry hooks and ftrace and *BOOM*. > > > > > > (Now, of course, its a static function with a single caller, and GCC > > > isn't _THAT_ stupid, but it could, if it wanted to) > > > > > > Since the bar is _that_ low for mistakes, I figure we'd better fix it. > > > > > > > The question is what "fix it" means. > > It means that when we do schedule, the next task doesn't have AC set, > and when we schedule back, we'll restore our AC when we had it. Unlike > the current situation, where the next task will run with AC set. > > IOW, it confines AC to the task context where it was set. > > > I'm really concerned about AC escapes, > > and everyone else should be, too. > > Then _that_ should be asserted. > > > For an issue specific to tracing, it would be more appropriate to do the > > appropriate things in the tracing entry/exit than in schedule. Otherwise, we > > don't want to silently paper over mistakes which could mean that we run a > > large portion of the kernel without protection we thought we had. > > > > In that sense, calling schedule() with AC set is in fact a great place to have > > a WARN() or even BUG(), because such an event really could imply that the > > kernel has been security compromised. > > It is not specific to tracing, tracing is just one of the most trivial > and non-obvious ways to make it go splat. > > There's lot of fairly innocent stuff that does preempt_disable() / > preempt_enable(). And just a warning in schedule() isn't sufficient, > you'd have to actually trigger a reschedule before you know your code is > bad. > > And like I said; the invariant is: if you're preemptible you can > schedule and v.v. 
> > Now, clearly you don't want to mark these whole regions as !preemptible, > because then you can also not take faults, but somehow you're not > worried about the whole fault handler, but you are worried about the > scheduler ?!? How does that work? The fault handler can touch a _ton_ > more code. Heck, the fault handler can schedule. > > So either pre-fault, and run the whole AC crap with preemption disabled > and retry, or accept that we have to schedule. I think you'll still hate this, but could we not disable preemption during the uaccess-enabled region, re-enabling it on the fault path after we've toggled uaccess off and disable it again when we return back to the uaccess-enabled region? Doesn't help with tracing, but it should at least handle the common case. Will
On 2/19/19 4:48 AM, Will Deacon wrote: > > I think you'll still hate this, but could we not disable preemption during > the uaccess-enabled region, re-enabling it on the fault path after we've > toggled uaccess off and disable it again when we return back to the > uaccess-enabled region? Doesn't help with tracing, but it should at least > handle the common case. > There is a worse problem with this, I still realize: this would mean blocking preemption across what could possibly be a *very* large copy_from_user(), for example. Exceptions *have* to handle this; there is no way around it. Perhaps the scheduler isn't the right place to put these kinds of asserts, either. Now, __fentry__ is kind of a special beast; in some ways it is an "exception implemented as a function call"; on x86 one could even consider using an INT instruction in order to reduce the NOP footprint in the unarmed case. Nor is __fentry__ a C function; it has far more of an exception-like ABI. *Regardless* of what else we do, I believe __fentry__ ought to save/disable/restore AC, just like an exception does. The idea of using s/a gcc plugin/objtool/ for this isn't really a bad idea. Obviously the general problem is undecidable :) but the enforcement of some simple, fairly draconian rules ("as tight as possible, but no tighter") shouldn't be a huge problem. An actual gcc plugin -- which would probably be quite complex -- could make gcc itself aware of user space accesses and be able to rearrange them to minimize STAC/CLAC and avoid kernel-space accesses inside those brackets. Finally, of course, there is the option of simply outlawing this practice as a matter of policy and require that all structures be accessed through a limited set of APIs. As I recall, the number of places where there were performance-critical regions which could not use the normal accessors are fairly small (signal frames being the main one.) Doing bulk copy to/from kernel memory and then accessing them from there would have some performance cost, but would eliminate the need for this complexity entirely. -hpa
On 20/02/2019 22:55, H. Peter Anvin wrote: > On 2/19/19 4:48 AM, Will Deacon wrote: >> >> I think you'll still hate this, but could we not disable preemption during >> the uaccess-enabled region, re-enabling it on the fault path after we've >> toggled uaccess off and disable it again when we return back to the >> uaccess-enabled region? Doesn't help with tracing, but it should at least >> handle the common case. >> > > There is a worse problem with this, I still realize: this would mean blocking > preemption across what could possibly be a *very* large copy_from_user(), for > example. > Yes, that's a good point. And I guess a userspace program could forcibly trigger the kernel into a large copy_from/to_user(), knowingly disabling preemption. I don't know how bad this could get. > Exceptions *have* to handle this; there is no way around it. Perhaps the > scheduler isn't the right place to put these kinds of asserts, either. > Maybe I'm misunderstanding what you mean with "Exceptions *have* to handle this". Isn't the fact that the AC/PAN flags gets saved on exception entry and set back to "user accesses disabled" already handling it? Or are you referring to something else? So far my understanding is that the exception/interrupt case is fine. The worrying case is what gets called in the user access regions (schedule(), preempt_enable(), tracing...). Did I get lost somewhere? > Now, __fentry__ is kind of a special beast; in some ways it is an "exception > implemented as a function call"; on x86 one could even consider using an INT > instruction in order to reduce the NOP footprint in the unarmed case. Nor is > __fentry__ a C function; it has far more of an exception-like ABI. > *Regardless* of what else we do, I believe __fentry__ ought to > save/disable/restore AC, just like an exception does. > That does make sense to me. However it doesn't solve the issue of calling (or preventing to call) some function that rescheds. > The idea of using s/a gcc plugin/objtool/ for this isn't really a bad idea. > Obviously the general problem is undecidable :) but the enforcement of some > simple, fairly draconian rules ("as tight as possible, but no tighter") > shouldn't be a huge problem. > > An actual gcc plugin -- which would probably be quite complex -- could make > gcc itself aware of user space accesses and be able to rearrange them to > minimize STAC/CLAC and avoid kernel-space accesses inside those brackets. > Not sure I get this. The data that is retrieved from/stored user space is generally obtained from/saved into kernel-space address. And when you start the user_access_begin() it means you plan on doing a bunch of user access, so I wouldn't expect to be able to batch everything into registers. Cheers,
On Wed, Feb 20, 2019 at 02:55:59PM -0800, H. Peter Anvin wrote:
> On 2/19/19 4:48 AM, Will Deacon wrote:
> >
> > I think you'll still hate this, but could we not disable preemption during
> > the uaccess-enabled region, re-enabling it on the fault path after we've
> > toggled uaccess off and disable it again when we return back to the
> > uaccess-enabled region? Doesn't help with tracing, but it should at least
> > handle the common case.
> >
>
> There is a worse problem with this, I still realize: this would mean blocking
> preemption across what could possibly be a *very* large copy_from_user(), for
> example.

I don't think it's legitimate to call copy_{to,from}_user() inside a
user_access_{begin,end} region. You'd need to add some unsafe variants,
which could periodically disable uaccess and call cond_resched() inside
the loop to avoid the problem you're alluding to.

For existing callers of copy_{to,from}_user(), there's no issue as they
don't call into the scheduler during the copy operation. Exceptions are
handled fine by the code in mainline today.

GCC plugins are a cool idea, but I'm just a bit nervous about relying on
them for things like this.

Will
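A rough sketch of the kind of chunked helper Will describes (purely illustrative: unsafe_copy_from_user() does not exist at this point, the chunk size is arbitrary, and error handling is elided):

```c
/*
 * Illustrative only: copy in bounded chunks and keep the uaccess window
 * closed around cond_resched(), so we never schedule with AC set (or
 * PAN cleared).
 */
static unsigned long
copy_from_user_resched(void *dst, const void __user *src, unsigned long len)
{
	while (len) {
		unsigned long chunk = min(len, PAGE_SIZE);

		if (!user_access_begin(src, chunk))
			return len;

		unsafe_copy_from_user(dst, src, chunk);	/* hypothetical helper */

		user_access_end();

		dst += chunk;
		src += chunk;
		len -= chunk;

		cond_resched();		/* safe: the uaccess window is closed here */
	}

	return 0;
}
```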
On Thu, 21 Feb 2019, Julien Thierry wrote: > On 20/02/2019 22:55, H. Peter Anvin wrote: > > Now, __fentry__ is kind of a special beast; in some ways it is an "exception > > implemented as a function call"; on x86 one could even consider using an INT > > instruction in order to reduce the NOP footprint in the unarmed case. Nor is > > __fentry__ a C function; it has far more of an exception-like ABI. > > *Regardless* of what else we do, I believe __fentry__ ought to > > save/disable/restore AC, just like an exception does. > > > > That does make sense to me. However it doesn't solve the issue of > calling (or preventing to call) some function that rescheds. IMNSHO any call inside a AC region is a bug lurking round the corner. The only thing which is tolerable is an exception of some sort. Enforce that with objtool. End of story. Thanks, tglx
> On Feb 21, 2019, at 4:46 AM, Will Deacon <will.deacon@arm.com> wrote: > >> On Wed, Feb 20, 2019 at 02:55:59PM -0800, H. Peter Anvin wrote: >>> On 2/19/19 4:48 AM, Will Deacon wrote: >>> >>> I think you'll still hate this, but could we not disable preemption during >>> the uaccess-enabled region, re-enabling it on the fault path after we've >>> toggled uaccess off and disable it again when we return back to the >>> uaccess-enabled region? Doesn't help with tracing, but it should at least >>> handle the common case. >>> >> >> There is a worse problem with this, I still realize: this would mean blocking >> preemption across what could possibly be a *very* large copy_from_user(), for >> example. > > I don't think it's legitimate to call copy_{to,from}_user() inside a > user_access_{begin,end} region. You'd need to add some unsafe variants, > which could periodically disable uaccess and call cond_resched() inside > the loop to avoid the problem you're eluding to. > Definitely not safe. On x86 they do CLAC and everything breaks.
On Thu, Feb 21, 2019 at 1:35 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > IMNSHO any call inside a AC region is a bug lurking round the corner. The > only thing which is tolerable is an exception of some sort. > > Enforce that with objtool. End of story. Not quite that simple. There are some use cases where you really do want a call - albeit to special functions - inside the AC region. The prime example of this is likely "filldir()" in fs/readdir.c, which is actually somewhat important for some loads (eg samba). And the whole AC thing made it horribly horribly slow because readdir is one of those things that doesn't just copy one value (or one structure) to user space, but writes several different things. Kind of like signal frame setup does. You may not realize that right now signal frame setup *does* actually do a call with AC set. See ia32_setup_rt_frame() that does put_user_try { ... compat_save_altstack_ex(&frame->uc.uc_stack, regs->sp); ... } put_user_catch(err); is a macro, but inside that macro is a call to sas_ss_reset(). Now, that is an inline function, so it hopefully doesn't actually cause a call. But it's "only" an inline function, not "always_inline". We did already have (and I rejected, for now) patches that made those inlines not very strong... [ Side note: currently signal frame setup uses the absolutely *disgusting* put_user_ex() thing, but that's actually what unsafe_put_user() was designed for. I just never got around to getting rid of the nasty hot mess of put_user_ex() ] Anyway, while signal frame setup just writes individual values, "filldir()" does *both* several individual values _and_ does a copy_to_user(). And it's that "copy_to_user()" that I want to eventually change to a "unsafe_copy_to_user()", along with making the individual values be written with unsafe_put_user(). Right now "filldir()" messes with AC a total of *six* times. It sets four field values, and then does a "copy_to_user()", all of which set/clear AC right now. It's wasting hundreds of cycles on this, because AC is so slow to set/clear. If AC was cheap, this wouldn't be much of an issue. But AC is really really expensive. I think it's microcode and serializes the pipeline or something. Anyway. We already have a possible call site, and there are good reasons for future ones too. But they are hopefully all very controlled. Linus
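For reference, the shape Linus is describing for filldir() — one STAC/CLAC pair around all of the stores — would look roughly like this (a sketch only, using struct linux_dirent64 field names; unsafe_copy_to_user() is assumed and does not exist at this point in the thread):

```c
	/*
	 * Sketch of a filldir64()-style body with a single uaccess window:
	 * the four fixed fields, the name copy and the terminating NUL are
	 * all done with unsafe_*() accessors between one STAC and one CLAC.
	 */
	if (!user_access_begin(dirent, reclen))
		return -EFAULT;

	unsafe_put_user(ino, &dirent->d_ino, efault);
	unsafe_put_user(offset, &dirent->d_off, efault);
	unsafe_put_user(reclen, &dirent->d_reclen, efault);
	unsafe_put_user(d_type, &dirent->d_type, efault);
	unsafe_copy_to_user(dirent->d_name, name, namlen, efault); /* assumed */
	unsafe_put_user(0, dirent->d_name + namlen, efault);
	user_access_end();
	return 0;

efault:
	user_access_end();
	return -EFAULT;
```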
On Thu, Feb 21, 2019 at 02:08:11PM -0800, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 1:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > IMNSHO any call inside a AC region is a bug lurking round the corner. The
> > only thing which is tolerable is an exception of some sort.
> >
> > Enforce that with objtool. End of story.
>
> Not quite that simple.
>
> There are some use cases where you really do want a call - albeit to
> special functions - inside the AC region.
>
> The prime example of this is likely "filldir()" in fs/readdir.c, which
> is actually somewhat important for some loads (eg samba).
>
> And the whole AC thing made it horribly horribly slow because readdir
> is one of those things that doesn't just copy one value (or one
> structure) to user space, but writes several different things.
>
> Kind of like signal frame setup does.
>
> You may not realize that right now signal frame setup *does* actually
> do a call with AC set. See ia32_setup_rt_frame() that does
>
>         put_user_try {
>                 ...
>                 compat_save_altstack_ex(&frame->uc.uc_stack, regs->sp);
>                 ...
>         } put_user_catch(err);
>
> is a macro, but inside that macro is a call to sas_ss_reset().
>
> Now, that is an inline function, so it hopefully doesn't actually
> cause a call. But it's "only" an inline function, not "always_inline".
> We did already have (and I rejected, for now) patches that made those
> inlines not very strong...

That one does indeed not generate a call, but:

  arch/x86/ia32/ia32_signal.o: warning: objtool: ia32_restore_sigcontext()+0x38: call to native_load_gs_index() with AC set

I've yet to look at detecting __preempt_count mucking.

---

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 43c029cdc3fe..cd31e4433f4c 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -5,6 +5,7 @@
 
 #ifdef __KERNEL__
 
+#include <linux/frame.h>
 #include <asm/nops.h>
 
 /*
@@ -135,6 +136,7 @@ static inline void native_wbinvd(void)
 }
 
 extern asmlinkage void native_load_gs_index(unsigned);
+AC_SAFE(native_load_gs_index);
 
 static inline unsigned long __read_cr4(void)
 {
diff --git a/include/linux/frame.h b/include/linux/frame.h
index 02d3ca2d9598..644662d8b8e8 100644
--- a/include/linux/frame.h
+++ b/include/linux/frame.h
@@ -15,9 +15,14 @@
 	static void __used __section(.discard.func_stack_frame_non_standard) \
 	*__func_stack_frame_non_standard_##func = func
 
+#define AC_SAFE(func) \
+	static void __used __section(.discard.ac_safe) \
+	*__func_ac_safe_##func = func
+
 #else /* !CONFIG_STACK_VALIDATION */
 
 #define STACK_FRAME_NON_STANDARD(func)
+#define AC_SAFE(func)
 
 #endif /* CONFIG_STACK_VALIDATION */
 
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index c9d038f91af6..5a64e5fb9d7a 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -31,7 +31,7 @@ INCLUDES := -I$(srctree)/tools/include \
 	    -I$(srctree)/tools/arch/$(HOSTARCH)/include/uapi \
 	    -I$(srctree)/tools/objtool/arch/$(ARCH)/include
 WARNINGS := $(EXTRA_WARNINGS) -Wno-switch-default -Wno-switch-enum -Wno-packed
-CFLAGS   += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES)
+CFLAGS   += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES) -ggdb3
 LDFLAGS  += -lelf $(LIBSUBCMD) $(KBUILD_HOSTLDFLAGS)
 
 # Allow old libelf to be used:
diff --git a/tools/objtool/arch.h b/tools/objtool/arch.h
index b0d7dc3d71b5..48327099466d 100644
--- a/tools/objtool/arch.h
+++ b/tools/objtool/arch.h
@@ -33,7 +33,9 @@
 #define INSN_STACK		8
 #define INSN_BUG		9
 #define INSN_NOP		10
-#define INSN_OTHER		11
+#define INSN_STAC		11
+#define INSN_CLAC		12
+#define INSN_OTHER		13
 #define INSN_LAST		INSN_OTHER
 
 enum op_dest_type {
diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 540a209b78ab..d1e99d1460a5 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -369,7 +369,19 @@ int arch_decode_instruction(struct elf *elf, struct section *sec,
 
 	case 0x0f:
 
-		if (op2 >= 0x80 && op2 <= 0x8f) {
+		if (op2 == 0x01) {
+
+			if (modrm == 0xca) {
+
+				*type = INSN_CLAC;
+
+			} else if (modrm == 0xcb) {
+
+				*type = INSN_STAC;
+
+			}
+
+		} else if (op2 >= 0x80 && op2 <= 0x8f) {
 
 			*type = INSN_JUMP_CONDITIONAL;
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0414a0d52262..6fef157e244b 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -127,6 +127,24 @@ static bool ignore_func(struct objtool_file *file, struct symbol *func)
 	return false;
 }
 
+static bool ac_safe_func(struct objtool_file *file, struct symbol *func)
+{
+	struct rela *rela;
+
+	/* check for AC_SAFE */
+	if (file->ac_safe && file->ac_safe->rela)
+		list_for_each_entry(rela, &file->ac_safe->rela->rela_list, list) {
+			if (rela->sym->type == STT_SECTION &&
+			    rela->sym->sec == func->sec &&
+			    rela->addend == func->offset)
+				return true;
+			if (/* rela->sym->type == STT_FUNC && */ rela->sym == func)
+				return true;
+		}
+
+	return false;
+}
+
 /*
  * This checks to see if the given function is a "noreturn" function.
  *
@@ -439,6 +457,8 @@ static void add_ignores(struct objtool_file *file)
 
 	for_each_sec(file, sec) {
 		list_for_each_entry(func, &sec->symbol_list, list) {
+			func->ac_safe = ac_safe_func(file, func);
+
 			if (func->type != STT_FUNC)
 				continue;
 
@@ -1897,11 +1917,16 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 			if (ret)
 				return 1;
 		}
-	}
+	} else printf("ponies\n");
 
 	switch (insn->type) {
 
 	case INSN_RETURN:
+		if (state.ac) {
+			WARN_FUNC("return with AC set", sec, insn->offset);
+			return 1;
+		}
+
 		if (func && has_modified_stack_frame(&state)) {
 			WARN_FUNC("return with modified stack frame",
 				  sec, insn->offset);
@@ -1917,6 +1942,12 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 			return 0;
 
 	case INSN_CALL:
+		if (state.ac && !insn->call_dest->ac_safe) {
+			WARN_FUNC("call to %s() with AC set", sec, insn->offset,
+				  insn->call_dest->name);
+			return 1;
+		}
+
 		if (is_fentry_call(insn))
 			break;
 
@@ -1928,6 +1959,11 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 
 		/* fallthrough */
 	case INSN_CALL_DYNAMIC:
+		if (state.ac && !insn->call_dest->ac_safe) {
+			WARN_FUNC("call to %s() with AC set", sec, insn->offset,
+				  insn->call_dest->name);
+			return 1;
+		}
 		if (!no_fp && func && !has_valid_stack_frame(&state)) {
 			WARN_FUNC("call without frame pointer save/setup",
 				  sec, insn->offset);
@@ -1980,6 +2016,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 
 			break;
 
+		case INSN_STAC:
+			state.ac = true;
+			break;
+
+		case INSN_CLAC:
+			state.ac = false;
+			break;
+
 		default:
 			break;
 		}
@@ -2198,6 +2242,7 @@ int check(const char *_objname, bool orc)
 	INIT_LIST_HEAD(&file.insn_list);
 	hash_init(file.insn_hash);
 	file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
+	file.ac_safe = find_section_by_name(file.elf, ".discard.ac_safe");
 	file.c_file = find_section_by_name(file.elf, ".comment");
 	file.ignore_unreachables = no_unreachable;
 	file.hints = false;
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index e6e8a655b556..d4ae59454fff 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -31,7 +31,7 @@ struct insn_state {
 	int stack_size;
 	unsigned char type;
 	bool bp_scratch;
-	bool drap, end;
+	bool drap, end, ac;
 	int drap_reg, drap_offset;
 	struct cfi_reg vals[CFI_NUM_REGS];
 };
@@ -61,6 +61,7 @@ struct objtool_file {
 	struct list_head insn_list;
 	DECLARE_HASHTABLE(insn_hash, 16);
 	struct section *whitelist;
+	struct section *ac_safe;
 	bool ignore_unreachables, c_file, hints, rodata;
 };
 
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index bc97ed86b9cd..064c3df31e40 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -62,6 +62,7 @@ struct symbol {
 	unsigned long offset;
 	unsigned int len;
 	struct symbol *pfunc, *cfunc;
+	bool ac_safe;
 };
 
 struct rela {
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 50af4e1274b3..72084cae8f35 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -166,8 +166,8 @@ int special_get_alts(struct elf *elf, struct list_head *alts)
 			continue;
 
 		if (sec->len % entry->size != 0) {
-			WARN("%s size not a multiple of %d",
-			     sec->name, entry->size);
+			WARN("%s size (%d) not a multiple of %d",
+			     sec->name, sec->len, entry->size);
 			return -1;
 		}
 
On Thu, 21 Feb 2019, Linus Torvalds wrote: > On Thu, Feb 21, 2019 at 1:35 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > IMNSHO any call inside a AC region is a bug lurking round the corner. The > > only thing which is tolerable is an exception of some sort. > > > > Enforce that with objtool. End of story. > > Not quite that simple. But correct :) > Right now "filldir()" messes with AC a total of *six* times. It sets > four field values, and then does a "copy_to_user()", all of which > set/clear AC right now. It's wasting hundreds of cycles on this, > because AC is so slow to set/clear. > > If AC was cheap, this wouldn't be much of an issue. But AC is really > really expensive. I think it's microcode and serializes the pipeline > or something. > > Anyway. We already have a possible call site, and there are good > reasons for future ones too. But they are hopefully all very > controlled. I agree, that a function which is doing the actual copy should be callable, but random other functions? NO! The problem is that once you open that can of worms the people who think their problem is special will come around and do begin() copy_to_user_unsafe(uptr, collect_data()) end() just because they can. That's the stuff, I'm worried about, not the well defined copy_to/from_user() invocation. We can deal with that and make sure that it's safe even with tracing and annotate with some special magic. It's simpler to find and check a new '__safe_inside_ac' annotation than chasing randomly added code within a gazillion of begin/end sections. Thanks, tglx
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -650,6 +650,7 @@ ENTRY(__switch_to_asm)
 	pushl	%ebx
 	pushl	%edi
 	pushl	%esi
+	pushfl
 
 	/* switch stack */
 	movl	%esp, TASK_threadsp(%eax)
@@ -672,6 +673,7 @@ ENTRY(__switch_to_asm)
 #endif
 
 	/* restore callee-saved registers */
+	popfl
 	popl	%esi
 	popl	%edi
 	popl	%ebx
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -291,6 +291,7 @@ ENTRY(__switch_to_asm)
 	pushq	%r13
 	pushq	%r14
 	pushq	%r15
+	pushfq
 
 	/* switch stack */
 	movq	%rsp, TASK_threadsp(%rdi)
@@ -313,6 +314,7 @@ ENTRY(__switch_to_asm)
 #endif
 
 	/* restore callee-saved registers */
+	popfq
 	popq	%r15
 	popq	%r14
 	popq	%r13
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
  * order of the fields must match the code in __switch_to_asm().
  */
 struct inactive_task_frame {
+	unsigned long flags;
 #ifdef CONFIG_X86_64
 	unsigned long r15;
 	unsigned long r14;