sched/x86: Save [ER]FLAGS on context switch

Message ID 20190214101429.GD32494@hirez.programming.kicks-ass.net

Commit Message

Peter Zijlstra Feb. 14, 2019, 10:14 a.m. UTC
On Wed, Feb 13, 2019 at 02:49:47PM -0800, Andy Lutomirski wrote:

> Do we need to backport this thing?

Possibly, just to be safe.

> The problem can’t be too widespread or we would have heard of it before.

Yes, so far we've been lucky.

---
Subject: sched/x86: Save [ER]FLAGS on context switch
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Feb 14 10:30:52 CET 2019

Effectively revert commit:

  2c7577a75837 ("sched/x86_64: Don't save flags on context switch")

Specifically because SMAP uses FLAGS.AC which invalidates the claim
that the kernel has clean flags.

In particular; while preemption from interrupt return is fine (the
IRET frame on the exception stack contains FLAGS) it breaks any code
that does synchronous scheduling, including preempt_enable().

This has become a significant issue ever since commit:

  5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")

provided the means of having 'normal' C code between STAC / CLAC,
exposing the FLAGS.AC state. So far this hasn't led to trouble;
however, fix it before it comes apart.

Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/entry/entry_32.S        |    2 ++
 arch/x86/entry/entry_64.S        |    2 ++
 arch/x86/include/asm/switch_to.h |    1 +
 3 files changed, 5 insertions(+)

Comments

Brian Gerst Feb. 14, 2019, 4:18 p.m. UTC | #1
On Thu, Feb 14, 2019 at 5:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
>   * order of the fields must match the code in __switch_to_asm().
>   */
>  struct inactive_task_frame {
> +       unsigned long flags;
>  #ifdef CONFIG_X86_64
>         unsigned long r15;
>         unsigned long r14;

flags should be initialized in copy_thread_tls().  I think the new
stack is zeroed already, but it would be better to explicitly set it.

--
Brian Gerst
Peter Zijlstra Feb. 14, 2019, 7:34 p.m. UTC | #2
On Thu, Feb 14, 2019 at 11:18:01AM -0500, Brian Gerst wrote:

> > --- a/arch/x86/include/asm/switch_to.h
> > +++ b/arch/x86/include/asm/switch_to.h
> > @@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
> >   * order of the fields must match the code in __switch_to_asm().
> >   */
> >  struct inactive_task_frame {
> > +       unsigned long flags;
> >  #ifdef CONFIG_X86_64
> >         unsigned long r15;
> >         unsigned long r14;
> 
> flags should be initialized in copy_thread_tls().  I think the new
> stack is zeroed already, but it would be better to explicitly set it.

Ah indeed. I somehow misread that code and thought we got initialized
with a copy of current.

Something like the below, right?

---

--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -127,6 +127,7 @@ int copy_thread_tls(unsigned long clone_
 	struct task_struct *tsk;
 	int err;
 
+	frame->flags = 0;
 	frame->bp = 0;
 	frame->ret_addr = (unsigned long) ret_from_fork;
 	p->thread.sp = (unsigned long) fork_frame;
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -392,6 +392,7 @@ int copy_thread_tls(unsigned long clone_
 	childregs = task_pt_regs(p);
 	fork_frame = container_of(childregs, struct fork_frame, regs);
 	frame = &fork_frame->frame;
+	frame->flags = 0;
 	frame->bp = 0;
 	frame->ret_addr = (unsigned long) ret_from_fork;
 	p->thread.sp = (unsigned long) fork_frame;
Brian Gerst Feb. 15, 2019, 2:34 p.m. UTC | #3
On Thu, Feb 14, 2019 at 2:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Feb 14, 2019 at 11:18:01AM -0500, Brian Gerst wrote:
>
> > > --- a/arch/x86/include/asm/switch_to.h
> > > +++ b/arch/x86/include/asm/switch_to.h
> > > @@ -40,6 +40,7 @@ asmlinkage void ret_from_fork(void);
> > >   * order of the fields must match the code in __switch_to_asm().
> > >   */
> > >  struct inactive_task_frame {
> > > +       unsigned long flags;
> > >  #ifdef CONFIG_X86_64
> > >         unsigned long r15;
> > >         unsigned long r14;
> >
> > flags should be initialized in copy_thread_tls().  I think the new
> > stack is zeroed already, but it would be better to explicitly set it.
>
> Ah indeed. I somehow misread that code and thought we got initialized
> with a copy of current.
>
> Something like the below, right?
>
> ---
>
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -127,6 +127,7 @@ int copy_thread_tls(unsigned long clone_
>         struct task_struct *tsk;
>         int err;
>
> +       frame->flags = 0;
>         frame->bp = 0;
>         frame->ret_addr = (unsigned long) ret_from_fork;
>         p->thread.sp = (unsigned long) fork_frame;
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -392,6 +392,7 @@ int copy_thread_tls(unsigned long clone_
>         childregs = task_pt_regs(p);
>         fork_frame = container_of(childregs, struct fork_frame, regs);
>         frame = &fork_frame->frame;
> +       frame->flags = 0;
>         frame->bp = 0;
>         frame->ret_addr = (unsigned long) ret_from_fork;
>         p->thread.sp = (unsigned long) fork_frame;

Yes, this looks good.

--
Brian Gerst
Linus Torvalds Feb. 15, 2019, 5:18 p.m. UTC | #4
On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Something like the below, right?
>
> +       frame->flags = 0;
> +       frame->flags = 0;

Those are not valid flag values.

Can you popf them? Yes.

Do they make sense? No.

It has the IF flag clear, for example. Is that intentional? If it is,
it should likely be documented. Because IF clear means "interrupts
disabled". Are the places that load flags in irq disabled regions?
Maybe they are, maybe they aren't, but shouldn't this be something
that is made explicit, rather than "oh, let's initialize to zero like
all the other registers we don't care about".

Because a zero initializer for eflags really is odd.

             Linus
Peter Zijlstra Feb. 15, 2019, 5:40 p.m. UTC | #5
On Fri, Feb 15, 2019 at 09:18:00AM -0800, Linus Torvalds wrote:
> On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Something like the below, right?
> >
> > +       frame->flags = 0;
> > +       frame->flags = 0;
> 
> Those are not valid flag values.
> 
> Can you popf them? Yes.
> 
> Do they make sense? No.
> 
> It has the IF flag clear, for example. Is that intentional? If it is,

Uhmm. yeah, that's bonkers. We should have interrupts disabled here.
I'll go read up on the eflags and figure out what they _should_ be right
about there.
Andy Lutomirski Feb. 15, 2019, 6:28 p.m. UTC | #6
> On Feb 15, 2019, at 9:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Fri, Feb 15, 2019 at 09:18:00AM -0800, Linus Torvalds wrote:
>>> On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> Something like the below, right?
>>> 
>>> +       frame->flags = 0;
>>> +       frame->flags = 0;
>> 
>> Those are not valid flag values.
>> 
>> Can you popf them? Yes.
>> 
>> Do they make sense? No.
>> 
>> It has the IF flag clear, for example. Is that intentional? If it is,
> 
> Uhmm. yeah, that's bonkers. We should have interrupts disabled here.
> I'll go read up on the eflags and figure out what they _should_ be right
> about there.

And probably add a comment near the POPF explaining that it will keep IRQs off :)
Peter Zijlstra Feb. 15, 2019, 11:34 p.m. UTC | #7
On Fri, Feb 15, 2019 at 06:40:34PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 15, 2019 at 09:18:00AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 14, 2019 at 11:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Something like the below, right?
> > >
> > > +       frame->flags = 0;
> > > +       frame->flags = 0;
> > 
> > Those are not valid flag values.
> > 
> > Can you popf them? Yes.
> > 
> > Do they make sense? No.
> > 
> > It has the IF flag clear, for example. Is that intentional? If it is,
> 
> Uhmm. yeah, that's bonkers. We should have interrupts disabled here.
> I'll go read up on the eflags and figure out what they _should_ be right
> about there.

I misread (I'm forever confused about what way around IF goes), but you
said it right; IF=0 is interrupts disabled and we very much have that in
the middle of context switch.

(just for giggles I set IF for the initial flags value; and it comes
unstuck _real_ quick)

Now, EFLAGS bit 1 is supposedly always 1, but it really doesn't seem to
matter for POPF.

I went through the other flags, and aside from VIP/VIF (I've no clue),
they look like 0 should be just fine.
Linus Torvalds Feb. 16, 2019, 12:21 a.m. UTC | #8
On Fri, Feb 15, 2019 at 3:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Now, EFLAGS bit 1 is supposedly always 1, but it really doesn't seem to
> matter for POPF.

Correct, it's "read as 1", you can try to write it and it doesn't matter.

> I went through the other flags, and aside from VIP/VIF (I've no clue),
> they look like 0 should be just fine.

So 0 is a perfectly valid initializer in the sense that it _works_, I
just want it to be something that was thought about, not just a random
"initialize to zero" without thinking.

Even just a comment about it would be fine. But it might also be good
to show that it's an explicit eflags value and just use
X86_EFLAGS_FIXED as the initializer.

            Linus
H. Peter Anvin Feb. 16, 2019, 4:06 a.m. UTC | #9
On February 14, 2019 2:14:29 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>[...]

This implies we invoke schedule -- a restricted operation (consider might_sleep()) during execution of STAC-enabled code, but *not* as an exception or interrupt, since those preserve the flags.

I have serious concerns about this. This is more or less saying that we have left an unlimited gap and have had AC escape.

Does *anyone* see a need to allow this? I got a question at LPC from someone about this, and what they were trying to do once all the layers had been unwound was so far down the wrong track for a root problem that actually has a very simple solution.
Peter Zijlstra Feb. 16, 2019, 10:30 a.m. UTC | #10
On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote:
> This implies we invoke schedule -- a restricted operation (consider
> may_sleep) during execution of STAC-enabled code, but *not* as an
> exception or interrupt, since those preserve the flags.

Meet preempt_enable().

> I have serious concerns about this. This is more or less saying that
> we have left an unlimited gap and have had AC escape.

Yes; by allowing normal C in between STAC and CLAC this is so.

> Does *anyone* see a need to allow this? I got a question at LPC from
> someone about this, and what they were trying to do once all the
> layers had been unwound was so far down the wrong track for a root
> problem that actually has a very simple solution.

Have you read the rest of the thread?

All it takes for this to explode is a call to a C function that has
tracing on it in between the user_access_begin() and user_access_end()
things. That is a _very_ easy thing to do.

Heck, GCC is allowed to generate that broken code compiling
lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp.
with CONFIG_OPTIMIZE_INLINING), and making that a function call would
get us fentry hooks and ftrace and *BOOM*.

(Now, of course, its a static function with a single caller, and GCC
isn't _THAT_ stupid, but it could, if it wanted to)

Since the bar is _that_ low for mistakes, I figure we'd better fix it.
Peter Zijlstra Feb. 16, 2019, 10:32 a.m. UTC | #11
On Fri, Feb 15, 2019 at 04:21:46PM -0800, Linus Torvalds wrote:
> Even just a comment about it would be fine. But it might also be good
> to show that it's an explicit eflags value and just use
> X86_EFLAGS_FIXED as the initializer.

That is in fact what I have now; I'll repost on Monday or so.
H. Peter Anvin Feb. 18, 2019, 10:30 p.m. UTC | #12
On 2/16/19 2:30 AM, Peter Zijlstra wrote:
> On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote:
>> This implies we invoke schedule -- a restricted operation (consider
>> may_sleep) during execution of STAC-enabled code, but *not* as an
>> exception or interrupt, since those preserve the flags.
> 
> Meet preempt_enable().

I believe this falls under "doctor, it hurts when I do that." And it hurts for
very good reasons. See below.

>> I have serious concerns about this. This is more or less saying that
>> we have left an unlimited gap and have had AC escape.
> 
> Yes; by allowing normal C in between STAC and CLAC this is so.
> 
>> Does *anyone* see a need to allow this? I got a question at LPC from
>> someone about this, and what they were trying to do once all the
>> layers had been unwound was so far down the wrong track for a root
>> problem that actually has a very simple solution.
> 
> Have you read the rest of the thread?
> 
> All it takes for this to explode is a call to a C function that has
> tracing on it in between the user_access_begin() and user_access_end()
> things. That is a _very_ easy thing to do.
> 
> Heck, GCC is allowed to generate that broken code compiling
> lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp.
> with CONFIG_OPTIMIZE_INLINING), and making that a function call would
> get us fentry hooks and ftrace and *BOOM*.
> 
> (Now, of course, its a static function with a single caller, and GCC
> isn't _THAT_ stupid, but it could, if it wanted to)
> 
> Since the bar is _that_ low for mistakes, I figure we'd better fix it.
> 

The question is what "fix it" means. I'm really concerned about AC escapes,
and everyone else should be, too.

For an issue specific to tracing, it would be more appropriate to do the
appropriate things in the tracing entry/exit than in schedule. Otherwise, we
don't want to silently paper over mistakes which could mean that we run a
large portion of the kernel without protection we thought we had.

In that sense, calling schedule() with AC set is in fact a great place to have
a WARN() or even BUG(), because such an event really could imply that the
kernel has been security compromised.

Does that make more sense?

	-hpa
Linus Torvalds Feb. 19, 2019, 12:24 a.m. UTC | #13
On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> The question is what "fix it" means. I'm really concerned about AC escapes,
> and everyone else should be, too.

I do think that it might be the right thing to do to add some kind of
WARN_ON_ONCE() for AC being set in various can-reschedule situations.

We'd just have to abstract it sanely. I'm sure arm64 has the exact
same issue with PAN - maybe it saves properly, but the same "we
wouldn't want to go through the scheduler with PAN clear".

On x86, we might as well check DF at the same time as AC.

              Linus
Andy Lutomirski Feb. 19, 2019, 2:20 a.m. UTC | #14
> On Feb 18, 2019, at 4:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote:
>> 
>> The question is what "fix it" means. I'm really concerned about AC escapes,
>> and everyone else should be, too.
> 
> I do think that it might be the right thing to do to add some kind of
> WARN_ON_ONCE() for AC being set in various can-reschedule situations.
> 
> We'd just have to abstract it sanely. I'm sure arm64 has the exact
> same issue with PAN - maybe it saves properly, but the same "we
> wouldn't want to go through the scheduler with PAN clear".
> 
> On x86, we might as well check DF at the same time as AC.
> 
> 

hpa is right, though — calling into tracing code with AC set is not really so good.  And calling schedule() (via preempt_enable() or whatever) is also bad because it runs all the scheduler code with AC on.  Admittedly, the scheduler is not *that* interesting of an attack surface.
H. Peter Anvin Feb. 19, 2019, 2:46 a.m. UTC | #15
On 2/18/19 6:20 PM, Andy Lutomirski wrote:
> 
> 
>> On Feb 18, 2019, at 4:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> The question is what "fix it" means. I'm really concerned about AC escapes,
>>> and everyone else should be, too.
>>
>> I do think that it might be the right thing to do to add some kind of
>> WARN_ON_ONCE() for AC being set in various can-reschedule situations.
>>
>> We'd just have to abstract it sanely. I'm sure arm64 has the exact
>> same issue with PAN - maybe it saves properly, but the same "we
>> wouldn't want to go through the scheduler with PAN clear".
>>
>> On x86, we might as well check DF at the same time as AC.
>>
> 
> hpa is right, though — calling into tracing code with AC set is not really so good.  And calling schedule() (via preempt_enable() or whatever) is also bad because it runs all the scheduler code with AC on.  Admittedly, the scheduler is not *that* interesting of an attack surface.
> 

Not just that, but the other question is just how much code we are running
with AC open. It really should only be done in some very small regions.

	-hpa
Julien Thierry Feb. 19, 2019, 8:53 a.m. UTC | #16
On 19/02/2019 00:24, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> The question is what "fix it" means. I'm really concerned about AC escapes,
>> and everyone else should be, too.
> 
> I do think that it might be the right thing to do to add some kind of
> WARN_ON_ONCE() for AC being set in various can-reschedule situations.
> 
> We'd just have to abstract it sanely. I'm sure arm64 has the exact
> same issue with PAN - maybe it saves properly, but the same "we
> wouldn't want to go through the scheduler with PAN clear".
> 

As of right now, we have the same issue on arm64 as on x86. We don't
currently save the PAN bit on task switch, but I have a patch to do that.

Unless we decide to go down the route of warning against uses of
schedule() inside user_access regions.

As for the abstraction, I had this patch[1] that added another primitive
for the user_access API (although this might not be suited for x86 if
you also want to check DF). However, an issue that appears is where to
perform the check to cover enough ground.

Checking inside the schedule() code you only cover cases where things
have already gone wrong, and not the use of functions that are unsafe to
call inside a user_access region.

[1]
http://lists.infradead.org/pipermail/linux-arm-kernel/2019-January/625385.html

Cheers,
Peter Zijlstra Feb. 19, 2019, 9:04 a.m. UTC | #17
On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote:
> On 2/16/19 2:30 AM, Peter Zijlstra wrote:
> > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote:
> >> This implies we invoke schedule -- a restricted operation (consider
> >> may_sleep) during execution of STAC-enabled code, but *not* as an
> >> exception or interrupt, since those preserve the flags.
> > 
> > Meet preempt_enable().
> 
> I believe this falls under "doctor, it hurts when I do that." And it hurts for
> very good reasons. See below.

I disagree; the basic rule is that if you're preemptible you must also
be able to schedule and vice-versa. These AC regions violate this.

And, like illustrated, it is _very_ easy to get all sorts inside that AC
crap.

> >> I have serious concerns about this. This is more or less saying that
> >> we have left an unlimited gap and have had AC escape.
> > 
> > Yes; by allowing normal C in between STAC and CLAC this is so.
> > 
> >> Does *anyone* see a need to allow this? I got a question at LPC from
> >> someone about this, and what they were trying to do once all the
> >> layers had been unwound was so far down the wrong track for a root
> >> problem that actually has a very simple solution.
> > 
> > Have you read the rest of the thread?
> > 
> > All it takes for this to explode is a call to a C function that has
> > tracing on it in between the user_access_begin() and user_access_end()
> > things. That is a _very_ easy thing to do.
> > 
> > Heck, GCC is allowed to generate that broken code compiling
> > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp.
> > with CONFIG_OPTIMIZE_INLINING), and making that a function call would
> > get us fentry hooks and ftrace and *BOOM*.
> > 
> > (Now, of course, its a static function with a single caller, and GCC
> > isn't _THAT_ stupid, but it could, if it wanted to)
> > 
> > Since the bar is _that_ low for mistakes, I figure we'd better fix it.
> > 
> 
> The question is what "fix it" means. 

It means that when we do schedule, the next task doesn't have AC set,
and when we schedule back, we'll restore our AC when we had it. Unlike
the current situation, where the next task will run with AC set.

IOW, it confines AC to the task context where it was set.

> I'm really concerned about AC escapes,
> and everyone else should be, too.

Then _that_ should be asserted.

> For an issue specific to tracing, it would be more appropriate to do the
> appropriate things in the tracing entry/exit than in schedule. Otherwise, we
> don't want to silently paper over mistakes which could mean that we run a
> large portion of the kernel without protection we thought we had.
> 
> In that sense, calling schedule() with AC set is in fact a great place to have
> a WARN() or even BUG(), because such an event really could imply that the
> kernel has been security compromised.

It is not specific to tracing, tracing is just one of the most trivial
and non-obvious ways to make it go splat.

There's lot of fairly innocent stuff that does preempt_disable() /
preempt_enable(). And just a warning in schedule() isn't sufficient,
you'd have to actually trigger a reschedule before you know your code is
bad.

And like I said; the invariant is: if you're preemptible you can
schedule and v.v.

Now, clearly you don't want to mark these whole regions as !preemptible,
because then you can also not take faults, but somehow you're not
worried about the whole fault handler, but you are worried about the
scheduler ?!? How does that work? The fault handler can touch a _ton_
more code. Heck, the fault handler can schedule.

So either pre-fault, and run the whole AC crap with preemption disabled
and retry, or accept that we have to schedule.

> Does that make more sense?

It appears to me you're going about it backwards.
Julien Thierry Feb. 19, 2019, 9:07 a.m. UTC | #18
On 19/02/2019 02:46, H. Peter Anvin wrote:
> On 2/18/19 6:20 PM, Andy Lutomirski wrote:
>>
>> [...]
>>
>> hpa is right, though — calling into tracing code with AC set is not really so good.  And calling schedule() (via preempt_enable() or whatever) is also bad because it runs all the scheduler code with AC on.  Admittedly, the scheduler is not *that* interesting of an attack surface.
>>
> 
> Not just that, but the other question is just how much code we are running
> with AC open. It really should only be done in some very small regions.

Yes, but we don't really have a way to enforce that, as far as I'm aware.

The user_access_begin/end() is generic API, meaning any arch is free to
implement it. If they don't have the same hardware behaviour as
x86/arm64, it might be that their interrupt/exception entry code will
run with user_access open until they reach the entry code that closes it
(and entry code could potentially be a more interesting attack surface
than the scheduler). This could be the case for software-emulated PAN on
arm/arm64 (although 32-bit arm doesn't currently implement
user_access_begin/end()).

So the whole "very small region" restriction sounds a bit
loose/arbitrary to me...

Thanks,
Peter Zijlstra Feb. 19, 2019, 9:15 a.m. UTC | #19
On Mon, Feb 18, 2019 at 04:24:30PM -0800, Linus Torvalds wrote:
> On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > The question is what "fix it" means. I'm really concerned about AC escapes,
> > and everyone else should be, too.
> 
> I do think that it might be the right thing to do to add some kind of
> WARN_ON_ONCE() for AC being set in various can-reschedule situations.

So I disagree.

Either we set AC with preempt disabled, and then we don't need an extra
warning, because we already have a warning about scheduling with
preemption disabled, or we accept that the fault handler can run.
Peter Zijlstra Feb. 19, 2019, 9:19 a.m. UTC | #20
On Tue, Feb 19, 2019 at 10:15:25AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 18, 2019 at 04:24:30PM -0800, Linus Torvalds wrote:
> > On Mon, Feb 18, 2019 at 2:31 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > >
> > > The question is what "fix it" means. I'm really concerned about AC escapes,
> > > and everyone else should be, too.
> > 
> > I do think that it might be the right thing to do to add some kind of
> > WARN_ON_ONCE() for AC being set in various can-reschedule situations.
> 
> So I disagree.
> 
> Either we set AC with preempt disabled, and then we don't need an extra
> warning, because we already have a warning about scheduling with
> preemption disabled, or we accept that the fault handler can run.

n/m about the faults, forgot the obvious :/

I still really dislike wrecking the preemption model over this.
H. Peter Anvin Feb. 19, 2019, 9:21 a.m. UTC | #21
On February 19, 2019 1:04:09 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>[...]
>
>> 
>> The question is what "fix it" means. 
>
>It means that when we do schedule, the next task doesn't have AC set,
>and when we schedule back, we'll restore our AC when we had it. Unlike
>the current situation, where the next task will run with AC set.
>
>IOW, it confines AC to the task context where it was set.
>
>> I'm really concerned about AC escapes,
>> and everyone else should be, too.
>
>Then _that_ should be asserted.
>
>> For an issue specific to tracing, it would be more appropriate to do
>the
>> appropriate things in the tracing entry/exit than in schedule.
>Otherwise, we
>> don't want to silently paper over mistakes which could mean that we
>run a
>> large portion of the kernel without protection we thought we had.
>> 
>> In that sense, calling schedule() with AC set is in fact a great
>place to have
>> a WARN() or even BUG(), because such an event really could imply that
>the
>> kernel has been security compromised.
>
>It is not specific to tracing, tracing is just one of the most trivial
>and non-obvious ways to make it go splat.
>
>There's lot of fairly innocent stuff that does preempt_disable() /
>preempt_enable(). And just a warning in schedule() isn't sufficient,
>you'd have to actually trigger a reschedule before you know your code
>is
>bad.
>
>And like I said; the invariant is: if you're preemptible you can
>schedule and v.v.
>
>Now, clearly you don't want to mark these whole regions as
>!preemptible,
>because then you can also not take faults, but somehow you're not
>worried about the whole fault handler, but you are worried about the
>scheduler ?!? How does that work? The fault handler can touch a _ton_
>more code. Heck, the fault handler can schedule.
>
>So either pre-fault, and run the whole AC crap with preemption disabled
>and retry, or accept that we have to schedule.
>
>> Does that make more sense?
>
>It appears to me you're going about it backwards.

I'm not worried about the fault handler, because AC is always preserved/disabled on exception entry; otherwise I would of course be very concerned.

To be clear: I'm not worried about the scheduler itself. I'm worried about how much code we have gone through to get there. Perhaps the scheduler itself is not the right point to probe for it, but we do need to catch things that have gone wrong, rather than just leaving the door wide open.

I would personally be far more comfortable saying you have to disable preemption in user access regions. It may be an unnecessary constraint for x86 and ARM64, but it is safe.

And Julien, yes, it is "somewhat arbitrary", but so are many engineering tradeoffs. Not everything has a very sharply definable line.
Peter Zijlstra Feb. 19, 2019, 9:44 a.m. UTC | #22
On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote:
> > Does that make more sense?
> 
> It appears to me you're going about it backwards.

So how about you do a GCC plugin that verifies limits on code-gen
between user_access_begin/user_access_end() ?

 - No CALL/RET
   - implies user_access_end() happens
   - implies no fentry hooks
 - No __preempt_count frobbing
 - No tracepoints
 - ...

That way you put the burden on the special code, not on the rest of the
kernel.
Thomas Gleixner Feb. 19, 2019, 11:38 a.m. UTC | #23
On Tue, 19 Feb 2019, Peter Zijlstra wrote:

> On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote:
> > > Does that make more sense?
> > 
> > It appears to me you're going about it backwards.
> 
> So how about you do a GCC plugin that verifies limits on code-gen
> between user_access_begin/user_access_end() ?
> 
>  - No CALL/RET
>    - implies user_access_end() happens
>    - implies no fentry hooks
>  - No __preempt_count frobbing
>  - No tracepoints
>  - ...
> 
> That way you put the burden on the special code, not on the rest of the
> kernel.

And then you have kprobes ....
Peter Zijlstra Feb. 19, 2019, 11:58 a.m. UTC | #24
On Tue, Feb 19, 2019 at 12:38:42PM +0100, Thomas Gleixner wrote:
> On Tue, 19 Feb 2019, Peter Zijlstra wrote:
> 
> > On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote:
> > > > Does that make more sense?
> > > 
> > > It appears to me you're going about it backwards.
> > 
> > So how about you do a GCC plugin that verifies limits on code-gen
> > between user_access_begin/user_access_end() ?
> > 
> >  - No CALL/RET
> >    - implies user_access_end() happens
> >    - implies no fentry hooks
> >  - No __preempt_count frobbing
> >  - No tracepoints
> >  - ...
> > 
> > That way you put the burden on the special code, not on the rest of the
> > kernel.
> 
> And then you have kprobes ....

They prod the INT3 byte and then take an exception, and exceptions are
'fine'.
Will Deacon Feb. 19, 2019, 12:48 p.m. UTC | #25
On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote:
> > On 2/16/19 2:30 AM, Peter Zijlstra wrote:
> > > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@zytor.com wrote:
> > >> This implies we invoke schedule -- a restricted operation (consider
> > >> may_sleep) during execution of STAC-enabled code, but *not* as an
> > >> exception or interrupt, since those preserve the flags.
> > > 
> > > Meet preempt_enable().
> > 
> > I believe this falls under "doctor, it hurts when I do that." And it hurts for
> > very good reasons. See below.
> 
> I disagree; the basic rule is that if you're preemptible you must also
> be able to schedule and vice-versa. These AC regions violate this.
> 
> And, like illustrated, it is _very_ easy to get all sorts inside that AC
> crap.
> 
> > >> I have serious concerns about this. This is more or less saying that
> > >> we have left an unlimited gap and have had AC escape.
> > > 
> > > Yes; by allowing normal C in between STAC and CLAC this is so.
> > > 
> > >> Does *anyone* see a need to allow this? I got a question at LPC from
> > >> someone about this, and what they were trying to do once all the
> > >> layers had been unwound was so far down the wrong track for a root
> > >> problem that actually has a very simple solution.
> > > 
> > > Have you read the rest of the thread?
> > > 
> > > All it takes for this to explode is a call to a C function that has
> > > tracing on it in between the user_access_begin() and user_access_end()
> > > things. That is a _very_ easy thing to do.
> > > 
> > > Heck, GCC is allowed to generate that broken code compiling
> > > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp.
> > > with CONFIG_OPTIMIZE_INLINING), and making that a function call would
> > > get us fentry hooks and ftrace and *BOOM*.
> > > 
> > > (Now, of course, its a static function with a single caller, and GCC
> > > isn't _THAT_ stupid, but it could, if it wanted to)
> > > 
> > > Since the bar is _that_ low for mistakes, I figure we'd better fix it.
> > > 
> > 
> > The question is what "fix it" means. 
> 
> It means that when we do schedule, the next task doesn't have AC set,
> and when we schedule back, we'll restore our AC when we had it. Unlike
> the current situation, where the next task will run with AC set.
> 
> IOW, it confines AC to the task context where it was set.
> 
> > I'm really concerned about AC escapes,
> > and everyone else should be, too.
> 
> Then _that_ should be asserted.
> 
> > For an issue specific to tracing, it would be more appropriate to do the
> > appropriate things in the tracing entry/exit than in schedule. Otherwise, we
> > don't want to silently paper over mistakes which could mean that we run a
> > large portion of the kernel without protection we thought we had.
> > 
> > In that sense, calling schedule() with AC set is in fact a great place to have
> > a WARN() or even BUG(), because such an event really could imply that the
> > kernel has been security compromised.
> 
> It is not specific to tracing, tracing is just one of the most trivial
> and non-obvious ways to make it go splat.
> 
> There's lot of fairly innocent stuff that does preempt_disable() /
> preempt_enable(). And just a warning in schedule() isn't sufficient,
> you'd have to actually trigger a reschedule before you know your code is
> bad.
> 
> And like I said; the invariant is: if you're preemptible you can
> schedule and v.v.
> 
> Now, clearly you don't want to mark these whole regions as !preemptible,
> because then you can also not take faults, but somehow you're not
> worried about the whole fault handler, but you are worried about the
> scheduler ?!? How does that work? The fault handler can touch a _ton_
> more code. Heck, the fault handler can schedule.
> 
> So either pre-fault, and run the whole AC crap with preemption disabled
> and retry, or accept that we have to schedule.

I think you'll still hate this, but could we not disable preemption during
the uaccess-enabled region, re-enabling it on the fault path after we've
toggled uaccess off and disable it again when we return back to the
uaccess-enabled region? Doesn't help with tracing, but it should at least
handle the common case.

Will
H. Peter Anvin Feb. 20, 2019, 10:55 p.m. UTC | #26
On 2/19/19 4:48 AM, Will Deacon wrote:
> 
> I think you'll still hate this, but could we not disable preemption during
> the uaccess-enabled region, re-enabling it on the fault path after we've
> toggled uaccess off and disable it again when we return back to the
> uaccess-enabled region? Doesn't help with tracing, but it should at least
> handle the common case.
> 

There is a worse problem with this, I still realize: this would mean blocking
preemption across what could possibly be a *very* large copy_from_user(), for
example.

Exceptions *have* to handle this; there is no way around it. Perhaps the
scheduler isn't the right place to put these kinds of asserts, either.

Now, __fentry__ is kind of a special beast; in some ways it is an "exception
implemented as a function call"; on x86 one could even consider using an INT
instruction in order to reduce the NOP footprint in the unarmed case.  Nor is
__fentry__ a C function; it has far more of an exception-like ABI.
*Regardless* of what else we do, I believe __fentry__ ought to
save/disable/restore AC, just like an exception does.

The idea of using s/a gcc plugin/objtool/ for this isn't really a bad idea.
Obviously the general problem is undecidable :) but the enforcement of some
simple, fairly draconian rules ("as tight as possible, but no tighter")
shouldn't be a huge problem.

An actual gcc plugin -- which would probably be quite complex -- could make
gcc itself aware of user space accesses and be able to rearrange them to
minimize STAC/CLAC and avoid kernel-space accesses inside those brackets.

Finally, of course, there is the option of simply outlawing this practice as a
matter of policy and requiring that all structures be accessed through a limited
set of APIs. As I recall, the number of places where there were
performance-critical regions which could not use the normal accessors is
fairly small (signal frames being the main one.) Doing bulk copy to/from
kernel memory and then accessing them from there would have some performance
cost, but would eliminate the need for this complexity entirely.

	-hpa
Julien Thierry Feb. 21, 2019, 12:06 p.m. UTC | #27
On 20/02/2019 22:55, H. Peter Anvin wrote:
> On 2/19/19 4:48 AM, Will Deacon wrote:
>>
>> I think you'll still hate this, but could we not disable preemption during
>> the uaccess-enabled region, re-enabling it on the fault path after we've
>> toggled uaccess off and disable it again when we return back to the
>> uaccess-enabled region? Doesn't help with tracing, but it should at least
>> handle the common case.
>>
> 
> There is a worse problem with this, I still realize: this would mean blocking
> preemption across what could possibly be a *very* large copy_from_user(), for
> example.
> 

Yes, that's a good point. And I guess a userspace program could forcibly
trigger the kernel into a large copy_from/to_user(), knowing that it
disables preemption.

I don't know how bad this could get.

> Exceptions *have* to handle this; there is no way around it. Perhaps the
> scheduler isn't the right place to put these kinds of asserts, either.
> 

Maybe I'm misunderstanding what you mean by "Exceptions *have* to
handle this". Isn't the fact that the AC/PAN flags get saved on
exception entry and set back to "user accesses disabled" already
handling it? Or are you referring to something else?

So far my understanding is that the exception/interrupt case is fine.
The worrying case is what gets called in the user access regions
(schedule(), preempt_enable(), tracing...).

Did I get lost somewhere?

> Now, __fentry__ is kind of a special beast; in some ways it is an "exception
> implemented as a function call"; on x86 one could even consider using an INT
> instruction in order to reduce the NOP footprint in the unarmed case.  Nor is
> __fentry__ a C function; it has far more of an exception-like ABI.
> *Regardless* of what else we do, I believe __fentry__ ought to
> save/disable/restore AC, just like an exception does.
> 

That does make sense to me. However it doesn't solve the issue of
calling (or preventing to call) some function that rescheds.


> The idea of using s/a gcc plugin/objtool/ for this isn't really a bad idea.
> Obviously the general problem is undecidable :) but the enforcement of some
> simple, fairly draconian rules ("as tight as possible, but no tighter")
> shouldn't be a huge problem.
> 
> An actual gcc plugin -- which would probably be quite complex -- could make
> gcc itself aware of user space accesses and be able to rearrange them to
> minimize STAC/CLAC and avoid kernel-space accesses inside those brackets.
> 

Not sure I get this. The data that is retrieved from/stored to user space
is generally obtained from/saved into a kernel-space address. And when you
start the user_access_begin() it means you plan on doing a bunch of user
accesses, so I wouldn't expect to be able to batch everything into registers.

Cheers,
Will Deacon Feb. 21, 2019, 12:46 p.m. UTC | #28
On Wed, Feb 20, 2019 at 02:55:59PM -0800, H. Peter Anvin wrote:
> On 2/19/19 4:48 AM, Will Deacon wrote:
> > 
> > I think you'll still hate this, but could we not disable preemption during
> > the uaccess-enabled region, re-enabling it on the fault path after we've
> > toggled uaccess off and disable it again when we return back to the
> > uaccess-enabled region? Doesn't help with tracing, but it should at least
> > handle the common case.
> > 
> 
> There is a worse problem with this, I still realize: this would mean blocking
> preemption across what could possibly be a *very* large copy_from_user(), for
> example.

I don't think it's legitimate to call copy_{to,from}_user() inside a
user_access_{begin,end} region. You'd need to add some unsafe variants,
which could periodically disable uaccess and call cond_resched() inside
the loop to avoid the problem you're alluding to.

For existing callers of copy_{to,from}_user(), there's no issue as they
don't call into the scheduler during the copy operation. Exceptions are
handled fine by the code in mainline today.

GCC plugins are a cool idea, but I'm just a bit nervous about relying on
them for things like this.

Will
Thomas Gleixner Feb. 21, 2019, 9:35 p.m. UTC | #29
On Thu, 21 Feb 2019, Julien Thierry wrote:
> On 20/02/2019 22:55, H. Peter Anvin wrote:
> > Now, __fentry__ is kind of a special beast; in some ways it is an "exception
> > implemented as a function call"; on x86 one could even consider using an INT
> > instruction in order to reduce the NOP footprint in the unarmed case.  Nor is
> > __fentry__ a C function; it has far more of an exception-like ABI.
> > *Regardless* of what else we do, I believe __fentry__ ought to
> > save/disable/restore AC, just like an exception does.
> > 
> 
> That does make sense to me. However it doesn't solve the issue of
> calling (or preventing to call) some function that rescheds.

IMNSHO any call inside an AC region is a bug lurking round the corner. The
only thing which is tolerable is an exception of some sort.

Enforce that with objtool. End of story.

Thanks,

	tglx
Andy Lutomirski Feb. 21, 2019, 10:06 p.m. UTC | #30
> On Feb 21, 2019, at 4:46 AM, Will Deacon <will.deacon@arm.com> wrote:
> 
>> On Wed, Feb 20, 2019 at 02:55:59PM -0800, H. Peter Anvin wrote:
>>> On 2/19/19 4:48 AM, Will Deacon wrote:
>>> 
>>> I think you'll still hate this, but could we not disable preemption during
>>> the uaccess-enabled region, re-enabling it on the fault path after we've
>>> toggled uaccess off and disable it again when we return back to the
>>> uaccess-enabled region? Doesn't help with tracing, but it should at least
>>> handle the common case.
>>> 
>> 
>> There is a worse problem with this, I still realize: this would mean blocking
>> preemption across what could possibly be a *very* large copy_from_user(), for
>> example.
> 
> I don't think it's legitimate to call copy_{to,from}_user() inside a
> user_access_{begin,end} region. You'd need to add some unsafe variants,
> which could periodically disable uaccess and call cond_resched() inside
> the loop to avoid the problem you're eluding to.
> 

Definitely not safe. On x86 they do CLAC and everything breaks.
Linus Torvalds Feb. 21, 2019, 10:08 p.m. UTC | #31
On Thu, Feb 21, 2019 at 1:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> IMNSHO any call inside an AC region is a bug lurking round the corner. The
> only thing which is tolerable is an exception of some sort.
>
> Enforce that with objtool. End of story.

Not quite that simple.

There are some use cases where you really do want a call - albeit to
special functions - inside the AC region.

The prime example of this is likely "filldir()" in fs/readdir.c, which
is actually somewhat important for some loads (eg samba).

And the whole AC thing made it horribly horribly slow because readdir
is one of those things that doesn't just copy one value (or one
structure) to user space, but writes several different things.

Kind of like signal frame setup does.

You may not realize that right now signal frame setup *does* actually
do a call with AC set. See ia32_setup_rt_frame() that does

        put_user_try {
...
                compat_save_altstack_ex(&frame->uc.uc_stack, regs->sp);
...
        } put_user_catch(err);

is a macro, but inside that macro is a call to sas_ss_reset().

Now, that is an inline function, so it hopefully doesn't actually
cause a call. But it's "only" an inline function, not "always_inline".
We did already have (and I rejected, for now) patches that made those
inlines not very strong...

[ Side note: currently signal frame setup uses the absolutely
*disgusting* put_user_ex() thing, but that's actually what
unsafe_put_user() was designed for. I just never got around to getting
rid of the nasty hot mess of put_user_ex() ]

Anyway, while signal frame setup just writes individual values,
"filldir()" does *both* several individual values _and_ does a
copy_to_user().

And it's that "copy_to_user()" that I want to eventually change to a
"unsafe_copy_to_user()", along with making the individual values be
written with unsafe_put_user().

Right now "filldir()" messes with AC a total of *six* times. It sets
four field values, and then does a "copy_to_user()", all of which
set/clear AC right now. It's wasting hundreds of cycles on this,
because AC is so slow to set/clear.

If AC was cheap, this wouldn't be much of an issue. But AC is really
really expensive. I think it's microcode and serializes the pipeline
or something.

Anyway. We already have a possible call site, and there are good
reasons for future ones too. But they are hopefully all very
controlled.

                 Linus
Peter Zijlstra Feb. 22, 2019, 12:58 p.m. UTC | #32
On Thu, Feb 21, 2019 at 02:08:11PM -0800, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 1:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > IMNSHO any call inside an AC region is a bug lurking round the corner. The
> > only thing which is tolerable is an exception of some sort.
> >
> > Enforce that with objtool. End of story.
> 
> Not quite that simple.
> 
> There are some use cases where you really do want a call - albeit to
> special functions - inside the AC region.
> 
> The prime example of this is likely "filldir()" in fs/readdir.c, which
> is actually somewhat important for some loads (eg samba).
> 
> And the whole AC thing made it horribly horribly slow because readdir
> is one of those things that doesn't just copy one value (or one
> structure) to user space, but writes several different things.
> 
> Kind of like signal frame setup does.
> 
> You may not realize that right now signal frame setup *does* actually
> do a call with AC set. See ia32_setup_rt_frame() that does
> 
>         put_user_try {
> ...
>                 compat_save_altstack_ex(&frame->uc.uc_stack, regs->sp);
> ...
>         } put_user_catch(err);
> 
> is a macro, but inside that macro is a call to sas_ss_reset().
> 
> Now, that is an inline function, so it hopefully doesn't actually
> cause a call. But it's "only" an inline function, not "always_inline".
> We did already have (and I rejected, for now) patches that made those
> inlines not very strong...

That one does indeed not generate a call, but:

arch/x86/ia32/ia32_signal.o: warning: objtool: ia32_restore_sigcontext()+0x38: call to native_load_gs_index() with AC set

I've yet to look at detecting __preempt_count mucking.

---
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 43c029cdc3fe..cd31e4433f4c 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -5,6 +5,7 @@
 
 #ifdef __KERNEL__
 
+#include <linux/frame.h>
 #include <asm/nops.h>
 
 /*
@@ -135,6 +136,7 @@ static inline void native_wbinvd(void)
 }
 
 extern asmlinkage void native_load_gs_index(unsigned);
+AC_SAFE(native_load_gs_index);
 
 static inline unsigned long __read_cr4(void)
 {
diff --git a/include/linux/frame.h b/include/linux/frame.h
index 02d3ca2d9598..644662d8b8e8 100644
--- a/include/linux/frame.h
+++ b/include/linux/frame.h
@@ -15,9 +15,14 @@
 	static void __used __section(.discard.func_stack_frame_non_standard) \
 		*__func_stack_frame_non_standard_##func = func
 
+#define AC_SAFE(func) \
+	static void __used __section(.discard.ac_safe) \
+		*__func_ac_safe_##func = func
+
 #else /* !CONFIG_STACK_VALIDATION */
 
 #define STACK_FRAME_NON_STANDARD(func)
+#define AC_SAFE(func)
 
 #endif /* CONFIG_STACK_VALIDATION */
 
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index c9d038f91af6..5a64e5fb9d7a 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -31,7 +31,7 @@ INCLUDES := -I$(srctree)/tools/include \
 	    -I$(srctree)/tools/arch/$(HOSTARCH)/include/uapi \
 	    -I$(srctree)/tools/objtool/arch/$(ARCH)/include
 WARNINGS := $(EXTRA_WARNINGS) -Wno-switch-default -Wno-switch-enum -Wno-packed
-CFLAGS   += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES)
+CFLAGS   += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES) -ggdb3
 LDFLAGS  += -lelf $(LIBSUBCMD) $(KBUILD_HOSTLDFLAGS)
 
 # Allow old libelf to be used:
diff --git a/tools/objtool/arch.h b/tools/objtool/arch.h
index b0d7dc3d71b5..48327099466d 100644
--- a/tools/objtool/arch.h
+++ b/tools/objtool/arch.h
@@ -33,7 +33,9 @@
 #define INSN_STACK		8
 #define INSN_BUG		9
 #define INSN_NOP		10
-#define INSN_OTHER		11
+#define INSN_STAC		11
+#define INSN_CLAC		12
+#define INSN_OTHER		13
 #define INSN_LAST		INSN_OTHER
 
 enum op_dest_type {
diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 540a209b78ab..d1e99d1460a5 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -369,7 +369,19 @@ int arch_decode_instruction(struct elf *elf, struct section *sec,
 
 	case 0x0f:
 
-		if (op2 >= 0x80 && op2 <= 0x8f) {
+		if (op2 == 0x01) {
+
+			if (modrm == 0xca) {
+
+				*type = INSN_CLAC;
+
+			} else if (modrm == 0xcb) {
+
+				*type = INSN_STAC;
+
+			}
+
+		} else if (op2 >= 0x80 && op2 <= 0x8f) {
 
 			*type = INSN_JUMP_CONDITIONAL;
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0414a0d52262..6fef157e244b 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -127,6 +127,24 @@ static bool ignore_func(struct objtool_file *file, struct symbol *func)
 	return false;
 }
 
+static bool ac_safe_func(struct objtool_file *file, struct symbol *func)
+{
+	struct rela *rela;
+
+	/* check for AC_SAFE */
+	if (file->ac_safe && file->ac_safe->rela)
+		list_for_each_entry(rela, &file->ac_safe->rela->rela_list, list) {
+			if (rela->sym->type == STT_SECTION &&
+			    rela->sym->sec == func->sec &&
+			    rela->addend == func->offset)
+				return true;
+			if (/* rela->sym->type == STT_FUNC && */ rela->sym == func)
+				return true;
+		}
+
+	return false;
+}
+
 /*
  * This checks to see if the given function is a "noreturn" function.
  *
@@ -439,6 +457,8 @@ static void add_ignores(struct objtool_file *file)
 
 	for_each_sec(file, sec) {
 		list_for_each_entry(func, &sec->symbol_list, list) {
+			func->ac_safe = ac_safe_func(file, func);
+
 			if (func->type != STT_FUNC)
 				continue;
 
@@ -1897,11 +1917,16 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 				if (ret)
 					return 1;
 			}
-		}
+		} else printf("ponies\n");
 
 		switch (insn->type) {
 
 		case INSN_RETURN:
+			if (state.ac) {
+				WARN_FUNC("return with AC set", sec, insn->offset);
+				return 1;
+			}
+
 			if (func && has_modified_stack_frame(&state)) {
 				WARN_FUNC("return with modified stack frame",
 					  sec, insn->offset);
@@ -1917,6 +1942,12 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 			return 0;
 
 		case INSN_CALL:
+			if (state.ac && !insn->call_dest->ac_safe) {
+				WARN_FUNC("call to %s() with AC set", sec, insn->offset,
+						insn->call_dest->name);
+				return 1;
+			}
+
 			if (is_fentry_call(insn))
 				break;
 
@@ -1928,6 +1959,11 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 
 			/* fallthrough */
 		case INSN_CALL_DYNAMIC:
+			if (state.ac && !insn->call_dest->ac_safe) {
+				WARN_FUNC("call to %s() with AC set", sec, insn->offset,
+						insn->call_dest->name);
+				return 1;
+			}
 			if (!no_fp && func && !has_valid_stack_frame(&state)) {
 				WARN_FUNC("call without frame pointer save/setup",
 					  sec, insn->offset);
@@ -1980,6 +2016,14 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
 
 			break;
 
+		case INSN_STAC:
+			state.ac = true;
+			break;
+
+		case INSN_CLAC:
+			state.ac = false;
+			break;
+
 		default:
 			break;
 		}
@@ -2198,6 +2242,7 @@ int check(const char *_objname, bool orc)
 	INIT_LIST_HEAD(&file.insn_list);
 	hash_init(file.insn_hash);
 	file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
+	file.ac_safe = find_section_by_name(file.elf, ".discard.ac_safe");
 	file.c_file = find_section_by_name(file.elf, ".comment");
 	file.ignore_unreachables = no_unreachable;
 	file.hints = false;
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index e6e8a655b556..d4ae59454fff 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -31,7 +31,7 @@ struct insn_state {
 	int stack_size;
 	unsigned char type;
 	bool bp_scratch;
-	bool drap, end;
+	bool drap, end, ac;
 	int drap_reg, drap_offset;
 	struct cfi_reg vals[CFI_NUM_REGS];
 };
@@ -61,6 +61,7 @@ struct objtool_file {
 	struct list_head insn_list;
 	DECLARE_HASHTABLE(insn_hash, 16);
 	struct section *whitelist;
+	struct section *ac_safe;
 	bool ignore_unreachables, c_file, hints, rodata;
 };
 
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index bc97ed86b9cd..064c3df31e40 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -62,6 +62,7 @@ struct symbol {
 	unsigned long offset;
 	unsigned int len;
 	struct symbol *pfunc, *cfunc;
+	bool ac_safe;
 };
 
 struct rela {
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index 50af4e1274b3..72084cae8f35 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -166,8 +166,8 @@ int special_get_alts(struct elf *elf, struct list_head *alts)
 			continue;
 
 		if (sec->len % entry->size != 0) {
-			WARN("%s size not a multiple of %d",
-			     sec->name, entry->size);
+			WARN("%s size (%d) not a multiple of %d",
+			     sec->name, sec->len, entry->size);
 			return -1;
 		}
Thomas Gleixner Feb. 22, 2019, 6:10 p.m. UTC | #33
On Thu, 21 Feb 2019, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 1:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > IMNSHO any call inside an AC region is a bug lurking round the corner. The
> > only thing which is tolerable is an exception of some sort.
> >
> > Enforce that with objtool. End of story.
> 
> Not quite that simple.

But correct :)

> Right now "filldir()" messes with AC a total of *six* times. It sets
> four field values, and then does a "copy_to_user()", all of which
> set/clear AC right now. It's wasting hundreds of cycles on this,
> because AC is so slow to set/clear.
> 
> If AC was cheap, this wouldn't be much of an issue. But AC is really
> really expensive. I think it's microcode and serializes the pipeline
> or something.
> 
> Anyway. We already have a possible call site, and there are good
> reasons for future ones too. But they are hopefully all very
> controlled.

I agree that a function which is doing the actual copy should be callable,
but random other functions? NO!

The problem is that once you open that can of worms the people who think
their problem is special will come around and do

      begin()
      copy_to_user_unsafe(uptr, collect_data())
      end()

just because they can. That's the stuff I'm worried about, not the well
defined copy_to/from_user() invocation. We can deal with that and make sure
that it's safe even with tracing and annotate with some special magic. It's
simpler to find and check a new '__safe_inside_ac' annotation than chasing
randomly added code within a gazillion of begin/end sections.

Thanks,

	tglx

Patch
diff mbox series

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -650,6 +650,7 @@  ENTRY(__switch_to_asm)
 	pushl	%ebx
 	pushl	%edi
 	pushl	%esi
+	pushfl
 
 	/* switch stack */
 	movl	%esp, TASK_threadsp(%eax)
@@ -672,6 +673,7 @@  ENTRY(__switch_to_asm)
 #endif
 
 	/* restore callee-saved registers */
+	popfl
 	popl	%esi
 	popl	%edi
 	popl	%ebx
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -291,6 +291,7 @@  ENTRY(__switch_to_asm)
 	pushq	%r13
 	pushq	%r14
 	pushq	%r15
+	pushfq
 
 	/* switch stack */
 	movq	%rsp, TASK_threadsp(%rdi)
@@ -313,6 +314,7 @@  ENTRY(__switch_to_asm)
 #endif
 
 	/* restore callee-saved registers */
+	popfq
 	popq	%r15
 	popq	%r14
 	popq	%r13
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -40,6 +40,7 @@  asmlinkage void ret_from_fork(void);
  * order of the fields must match the code in __switch_to_asm().
  */
 struct inactive_task_frame {
+	unsigned long flags;
 #ifdef CONFIG_X86_64
 	unsigned long r15;
 	unsigned long r14;