diff mbox series

[V3,01/13] entry: Provide generic syscall entry functionality

Message ID 20200716185424.011950288@linutronix.de (mailing list archive)
State New, archived
Series entry, x86, kvm: Generic entry/exit functionality for host and guest

Commit Message

Thomas Gleixner July 16, 2020, 6:22 p.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

On syscall entry certain work needs to be done:

   - Establish state (lockdep, context tracking, tracing)
   - Conditional work (ptrace, seccomp, audit...)

This code is needlessly duplicated and different in all
architectures.

Provide a generic version based on the x86 implementation which has all the
RCU and instrumentation bits right.

As interrupt/exception entry from user space needs parts of the same
functionality, provide a function for this as well.

syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() must be
called right after the low level ASM entry. The calling code must be
non-instrumentable. After the function returns, state is correct and the
subsequent functions can be instrumented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3: Adapt to noinstr changes. Add interrupt/exception entry function.

V2: Fix function documentation (Mike)
    Add comment about return value (Andy)
---
 arch/Kconfig                 |    3 
 include/linux/entry-common.h |  156 +++++++++++++++++++++++++++++++++++++++++++
 kernel/Makefile              |    1 
 kernel/entry/Makefile        |    3 
 kernel/entry/common.c        |   78 +++++++++++++++++++++
 5 files changed, 241 insertions(+)

Comments

Kees Cook July 16, 2020, 8:52 p.m. UTC | #1
On Thu, Jul 16, 2020 at 08:22:09PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> On syscall entry certain work needs to be done:
> 
>    - Establish state (lockdep, context tracking, tracing)
>    - Conditional work (ptrace, seccomp, audit...)
> 
> This code is needlessly duplicated and different in all
> architectures.
> 
> Provide a generic version based on the x86 implementation which has all the
> RCU and instrumentation bits right.

Ahh! You're reading my mind! I was just thinking about this while
reviewing the proposed syscall redirection series[1], and pondering the
lack of x86 TIF flags, and that nearly everything in the series (and for
seccomp and other things) didn't need to be arch-specific. And now that
series absolutely needs to be rebased and it'll magically work for every
arch that switches to the generic entry code. :)

Notes below...

[1] https://lore.kernel.org/lkml/20200716193141.4068476-2-krisman@collabora.com/

> +/*
> + * Define dummy _TIF work flags if not defined by the architecture or for
> + * disabled functionality.
> + */

When I was thinking about this last week I was pondering having a split
between the arch-agnostic TIF flags and the arch-specific TIF flags, and
that each arch could have a single "there is agnostic work to be done"
TIF in their thread_info, and the agnostic flags could live in
task_struct or something. Anyway, I'll keep reading...

> +/**
> + * syscall_enter_from_user_mode - Check and handle work before invoking
> + *				 a syscall
> + * @regs:	Pointer to current's pt_regs
> + * @syscall:	The syscall number
> + *
> + * Invoked from architecture specific syscall entry code with interrupts
> + * disabled. The calling code has to be non-instrumentable. When the
> + * function returns all state is correct and the subsequent functions can be
> + * instrumented.
> + *
> + * Returns: The original or a modified syscall number
> + *
> + * If the returned syscall number is -1 then the syscall should be
> + * skipped. In this case the caller may invoke syscall_set_error() or
> + * syscall_set_return_value() first.  If neither of those are called and -1
> + * is returned, then the syscall will fail with ENOSYS.

There's been some recent confusion over "has the syscall changed,
or did seccomp request it be skipped?" that was explored in arm64[2]
(though I see Will and Keno in CC already). There might need to be a
clearer way to distinguish between "wild userspace issued a -1 syscall"
and "seccomp or ptrace asked for the syscall to be skipped". The
difference is mostly about when ENOSYS gets set, with respect to calls
to syscall_set_return_value(), but if the syscall gets changed, the arch
may need to recheck the value and consider ENOSYS, etc. IIUC, what Will
ended up with[3] was having syscall_trace_enter() return the syscall return
value instead of the new syscall.

[2] https://lore.kernel.org/lkml/20200704125027.GB21185@willie-the-truck/
[3] https://lore.kernel.org/lkml/20200703083914.GA18516@willie-the-truck/

> +static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> +				unsigned long ti_work)
> +{
> +	long ret = 0;
> +
> +	/* Handle ptrace */
> +	if (ti_work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
> +		ret = arch_syscall_enter_tracehook(regs);
> +		if (ret || (ti_work & _TIF_SYSCALL_EMU))
> +			return -1L;
> +	}
> +
> +	/* Do seccomp after ptrace, to catch any tracer changes. */
> +	if (ti_work & _TIF_SECCOMP) {
> +		ret = arch_syscall_enter_seccomp(regs);
> +		if (ret == -1L)
> +			return ret;
> +	}
> +
> +	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
> +		trace_sys_enter(regs, syscall);
> +
> +	arch_syscall_enter_audit(regs);
> +
> +	return ret ? : syscall;
> +}

Modulo the notes about -1 vs syscall number above, this looks correct to
me for ptrace and seccomp.
Thomas Gleixner July 16, 2020, 9:55 p.m. UTC | #2
Kees Cook <keescook@chromium.org> writes:
> On Thu, Jul 16, 2020 at 08:22:09PM +0200, Thomas Gleixner wrote:
>> This code is needlessly duplicated and different in all
>> architectures.
>> 
>> Provide a generic version based on the x86 implementation which has all the
>> RCU and instrumentation bits right.
>
> Ahh! You're reading my mind!

I told you about that plan at the last conference over a beer :)

> I was just thinking about this while reviewing the proposed syscall
> redirection series[1], and pondering the lack of x86 TIF flags, and
> that nearly everything in the series (and for seccomp and other
> things) didn't need to be arch-specific. And now that series
> absolutely needs to be rebased and it'll magically work for every arch
> that switches to the generic entry code. :)

That's the plan. 

> Notes below...
>
> [1] https://lore.kernel.org/lkml/20200716193141.4068476-2-krisman@collabora.com/

Saw that fly by. *shudder*

>> +/*
>> + * Define dummy _TIF work flags if not defined by the architecture or for
>> + * disabled functionality.
>> + */
>
> When I was thinking about this last week I was pondering having a split
> between the arch-agnostic TIF flags and the arch-specific TIF flags, and
> that each arch could have a single "there is agnostic work to be done"
> TIF in their thread_info, and the agnostic flags could live in
> task_struct or something. Anyway, I'll keep reading...

That's going to be nasty. We rather go and expand the TIF storage to
64bit. And then do the following in a generic header:

#ifndef TIF_ARCH_SPECIFIC
# define TIF_ARCH_SPECIFIC
#endif

enum tif_bits {
	TIF_NEED_RESCHED = 0,
        TIF_...,
        TIF_LAST_GENERIC,
        TIF_ARCH_SPECIFIC,
};
        
and in the arch specific one:

#define TIF_ARCH_SPECIFIC	\
	TIF_ARCH_1,             \
        TIF_ARCH_2,

or something like that.

>> +/**
>> + * syscall_enter_from_user_mode - Check and handle work before invoking
>> + *				 a syscall
>> + * @regs:	Pointer to current's pt_regs
>> + * @syscall:	The syscall number
>> + *
>> + * Invoked from architecture specific syscall entry code with interrupts
>> + * disabled. The calling code has to be non-instrumentable. When the
>> + * function returns all state is correct and the subsequent functions can be
>> + * instrumented.
>> + *
>> + * Returns: The original or a modified syscall number
>> + *
>> + * If the returned syscall number is -1 then the syscall should be
>> + * skipped. In this case the caller may invoke syscall_set_error() or
>> + * syscall_set_return_value() first.  If neither of those are called and -1
>> + * is returned, then the syscall will fail with ENOSYS.
>
> There's been some recent confusion over "has the syscall changed,
> or did seccomp request it be skipped?" that was explored in arm64[2]
> (though I see Will and Keno in CC already). There might need to be a
> clearer way to distinguish between "wild userspace issued a -1 syscall"
> and "seccomp or ptrace asked for the syscall to be skipped". The
> difference is mostly about when ENOSYS gets set, with respect to calls
> to syscall_set_return_value(), but if the syscall gets changed, the arch
> may need to recheck the value and consider ENOSYS, etc. IIUC, what Will
> ended up with[3] was having syscall_trace_enter() return the syscall return
> value instead of the new syscall.

I was chatting with Will about that yesterday. IIRC he plans to fix the
immediate issue on arm64 first and then move arm64 over to the generic
variant. That's the reason why I reshuffled the patch series so the
generic parts are first, which allows me to provide Will a branch with
just those. If there are any changes needed we can just feed them back
into that branch and fixup the affected architecture trees.

IOW, that should not block progress on this stuff.

Thanks,

        tglx
Kees Cook July 17, 2020, 5:49 p.m. UTC | #3
On Thu, Jul 16, 2020 at 11:55:59PM +0200, Thomas Gleixner wrote:
> Kees Cook <keescook@chromium.org> writes:
> > On Thu, Jul 16, 2020 at 08:22:09PM +0200, Thomas Gleixner wrote:
> >> This code is needlessly duplicated and different in all
> >> architectures.
> >> 
> >> Provide a generic version based on the x86 implementation which has all the
> >> RCU and instrumentation bits right.
> >
> > Ahh! You're reading my mind!
> 
> I told you about that plan at the last conference over a beer :)

Thank you for incepting it in my head, then! ;)

> > [1] https://lore.kernel.org/lkml/20200716193141.4068476-2-krisman@collabora.com/
> 
> Saw that fly by. *shudder*

Aw, it's nice. Better emulation! :)

> 
> >> +/*
> >> + * Define dummy _TIF work flags if not defined by the architecture or for
> >> + * disabled functionality.
> >> + */
> >
> > When I was thinking about this last week I was pondering having a split
> > between the arch-agnostic TIF flags and the arch-specific TIF flags, and
> > that each arch could have a single "there is agnostic work to be done"
> > TIF in their thread_info, and the agnostic flags could live in
> > task_struct or something. Anyway, I'll keep reading...
> 
> That's going to be nasty. We rather go and expand the TIF storage to
> 64bit. And then do the following in a generic header:

I thought the point was to make the TIF_WORK check as fast as possible,
even on the 32-bit word systems. I mean it's not a huge performance hit,
but *shrug*

> 
> #ifndef TIF_ARCH_SPECIFIC
> # define TIF_ARCH_SPECIFIC
> #endif
> 
> enum tif_bits {
> 	TIF_NEED_RESCHED = 0,
>         TIF_...,
>         TIF_LAST_GENERIC,
>         TIF_ARCH_SPECIFIC,
> };
>         
> and in the arch specific one:
> 
> #define TIF_ARCH_SPECIFIC	\
> 	TIF_ARCH_1,             \
>         TIF_ARCH_2,
> 
> or something like that.

Okay, yeah, that can work.

> > There's been some recent confusion over "has the syscall changed,
> > or did seccomp request it be skipped?" that was explored in arm64[2]
> > (though I see Will and Keno in CC already). There might need to be a
> > clearer way to distinguish between "wild userspace issued a -1 syscall"
> > and "seccomp or ptrace asked for the syscall to be skipped". The
> > difference is mostly about when ENOSYS gets set, with respect to calls
> > to syscall_set_return_value(), but if the syscall gets changed, the arch
> > may need to recheck the value and consider ENOSYS, etc. IIUC, what Will
> > ended up with[3] was having syscall_trace_enter() return the syscall return
> > value instead of the new syscall.
> 
> I was chatting with Will about that yesterday. IIRC he plans to fix the
> immediate issue on arm64 first and then move arm64 over to the generic
> variant. That's the reason why I reshuffled the patch series so the
> generic parts are first which allows me to provide will a branch with
> just those. If there are any changes needed we can just feed them back
> into that branch and fixup the affected architecture trees.
> 
> IOW, that should not block progress on this stuff.

Ok, great! I just wanted to make sure that didn't surprise anyone. :)
Thomas Gleixner July 17, 2020, 7:29 p.m. UTC | #4
Kees Cook <keescook@chromium.org> writes:
> On Thu, Jul 16, 2020 at 11:55:59PM +0200, Thomas Gleixner wrote:
>> Kees Cook <keescook@chromium.org> writes:
>> >> +/*
>> >> + * Define dummy _TIF work flags if not defined by the architecture or for
>> >> + * disabled functionality.
>> >> + */
>> >
>> > When I was thinking about this last week I was pondering having a split
>> > between the arch-agnostic TIF flags and the arch-specific TIF flags, and
>> > that each arch could have a single "there is agnostic work to be done"
>> > TIF in their thread_info, and the agnostic flags could live in
>> > task_struct or something. Anyway, I'll keep reading...
>> 
>> That's going to be nasty. We rather go and expand the TIF storage to
>> 64bit. And then do the following in a generic header:
>
> I thought the point was to make the TIF_WORK check as fast as possible,
> even on the 32-bit word systems. I mean it's not a huge performance hit,
> but *shrug*

For 64bit it's definitely faster to have 64bit flags.

For 32bit it's debatable whether having to fiddle with two words and
taking care about ordering is faster or not. It's a pain on 32bit in any
case.

But what we can do is disentangle the flags into those which really need
to be grouped into a single 32bit TIF word and architecture specific
ones which can live in a different place.

Looking at x86 which has the most pressure on TIF bits:

Real TIF bits
TIF_SYSCALL_TRACE       Entry/exit
TIF_SYSCALL_TRACEPOINT	Entry/exit
TIF_NOTIFY_RESUME       Entry/exit
TIF_SIGPENDING          Entry/exit
TIF_NEED_RESCHED        Entry/exit
TIF_SINGLESTEP          Entry/exit
TIF_SYSCALL_EMU         Entry/exit
TIF_SYSCALL_AUDIT       Entry/exit
TIF_SECCOMP             Entry/exit
TIF_USER_RETURN_NOTIFY	Entry/exit      (x86 specific)
TIF_UPROBE		Entry/exit
TIF_PATCH_PENDING       Entry/exit
TIF_POLLING_NRFLAG      Scheduler (related to TIF_NEED_RESCHED)
TIF_MEMDIE              Historical, but not required to be a real one
TIF_FSCHECK		Historical, but not required to be a real one

Context switch group
TIF_NOCPUID             X86 Context switch
TIF_NOTSC               X86 Context switch
TIF_SLD                 X86 Context switch
TIF_BLOCKSTEP           X86 Context switch
TIF_IO_BITMAP           X86 Context switch + Entry/exit (see below)
TIF_NEED_FPU_LOAD       X86 Context switch + Entry/exit (see below)
TIF_SSBD                X86 Context switch
TIF_SPEC_IB             X86 Context switch
TIF_SPEC_FORCE_UPDATE   X86 Context switch

No group requirements
TIF_IA32                X86 random
TIF_FORCED_TF           X86 random
TIF_LAZY_MMU_UPDATES    X86 random
TIF_ADDR32              X86 random
TIF_X32                 X86 random

So the only interesting ones are TIF_IO_BITMAP and TIF_NEED_FPU_LOAD,
but those are really not required to be in the real TIF group because
they are independently evaluated _after_ the real TIF flags on exit to
user space and that requires a reread of the flags anyway. So if we put
the context switch and the random bits into a separate word right after
thread_info->flags then the second word is in the same cacheline and it
won't matter. That way we gain plenty of free bits on 32 bit and have no
dependency between the two words at all.

The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
TIF_X32 to free up bits for 32bit and make the flags field 64 bit on 64
bit kernels, but I prefer to do the above separation.

Thanks,

        tglx
Andy Lutomirski July 17, 2020, 9:56 p.m. UTC | #5
On Fri, Jul 17, 2020 at 12:29 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Kees Cook <keescook@chromium.org> writes:
> > On Thu, Jul 16, 2020 at 11:55:59PM +0200, Thomas Gleixner wrote:
> >> Kees Cook <keescook@chromium.org> writes:
> >> >> +/*
> >> >> + * Define dummy _TIF work flags if not defined by the architecture or for
> >> >> + * disabled functionality.
> >> >> + */
> >> >
> >> > When I was thinking about this last week I was pondering having a split
> >> > between the arch-agnostic TIF flags and the arch-specific TIF flags, and
> >> > that each arch could have a single "there is agnostic work to be done"
> >> > TIF in their thread_info, and the agnostic flags could live in
> >> > task_struct or something. Anyway, I'll keep reading...
> >>
> >> That's going to be nasty. We rather go and expand the TIF storage to
> >> 64bit. And then do the following in a generic header:
> >
> > I thought the point was to make the TIF_WORK check as fast as possible,
> > even on the 32-bit word systems. I mean it's not a huge performance hit,
> > but *shrug*
>
> For 64bit it's definitely faster to have 64bit flags.
>
> For 32bit it's debatable whether having to fiddle with two words and
> taking care about ordering is faster or not. It's a pain on 32bit in any
> case.
>
> But what we can do is disentangle the flags into those which really need
> to be grouped into a single 32bit TIF word and architecture specific
> ones which can live in a different place.
>
> Looking at x86 which has the most pressure on TIF bits:
>
> Real TIF bits
> TIF_SYSCALL_TRACE       Entry/exit
> TIF_SYSCALL_TRACEPOINT  Entry/exit
> TIF_NOTIFY_RESUME       Entry/exit
> TIF_SIGPENDING          Entry/exit
> TIF_NEED_RESCHED        Entry/exit
> TIF_SINGLESTEP          Entry/exit
> TIF_SYSCALL_EMU         Entry/exit
> TIF_SYSCALL_AUDIT       Entry/exit
> TIF_SECCOMP             Entry/exit
> TIF_USER_RETURN_NOTIFY  Entry/exit      (x86 specific)
> TIF_UPROBE              Entry/exit
> TIF_PATCH_PENDING       Entry/exit
> TIF_POLLING_NRFLAG      Scheduler (related to TIF_NEED_RESCHED)
> TIF_MEMDIE              Historical, but not required to be a real one
> TIF_FSCHECK             Historical, but not required to be a real one
>
> Context switch group
> TIF_NOCPUID             X86 Context switch
> TIF_NOTSC               X86 Context switch
> TIF_SLD                 X86 Context switch
> TIF_BLOCKSTEP           X86 Context switch
> TIF_IO_BITMAP           X86 Context switch + Entry/exit (see below)
> TIF_NEED_FPU_LOAD       X86 Context switch + Entry/exit (see below)
> TIF_SSBD                X86 Context switch
> TIF_SPEC_IB             X86 Context switch
> TIF_SPEC_FORCE_UPDATE   X86 Context switch
>
> No group requirements
> TIF_IA32                X86 random
> TIF_FORCED_TF           X86 random
> TIF_LAZY_MMU_UPDATES    X86 random
> TIF_ADDR32              X86 random
> TIF_X32                 X86 random
>
> So the only interesting ones are TIF_IO_BITMAP and TIF_NEED_FPU_LOAD,
> but those are really not required to be in the real TIF group because
> they are independently evaluated _after_ the real TIF flags on exit to
> user space and that requires a reread of the flags anyway. So if we put
> the context switch and the random bits into a separate word right after
> thread_info->flags then the second word is in the same cacheline and it
> won't matter. That way we gain plenty of free bits on 32 bit and have no
> dependency between the two words at all.
>
> The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
> TIF_X32 to free up bits for 32bit and make the flags field 64 bit on 64
> bit kernels, but I prefer to do the above separation.

I'm all for cleaning it up, but I don't think any nasty games would be
needed regardless.  IMO at least the following flags are nonsense and
don't belong in TIF_anything at all:

TIF_IA32, TIF_X32: can probably be deleted.  Someone would just need
to finish the work.
TIF_ADDR32: also probably removable, but I'm less confident.
TIF_FORCED_TF: This is purely a ptrace artifact and could easily go
somewhere else entirely.

So getting those five bits back would be straightforward.

FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
entry/exit word *and* a context switch word.  The latter is because
it's logically a per-cpu flag, not a per-task flag, and the context
switch code moves it around so it's always set on the running task.
TIF_NEED_RESCHED is sort of in this category, too.  We could introduce
a percpu entry_exit_work field to simplify this at some small
performance cost.  TIF_POLLING_NRFLAG would go along with it.  (The
latter does not, strictly speaking, belong as a TIF_ flag at all, but
it does need to be in the same atomic word as TIF_NEED_RESCHED.)
Making this change would arguably be a decent cleanup.

--Andy
Thomas Gleixner July 18, 2020, 2:16 p.m. UTC | #6
Andy Lutomirski <luto@kernel.org> writes:
> On Fri, Jul 17, 2020 at 12:29 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
>> TIF_X32 to free up bits for 32bit and make the flags field 64 bit on 64
>> bit kernels, but I prefer to do the above separation.
>
> I'm all for cleaning it up, but I don't think any nasty games would be
> needed regardless.  IMO at least the following flags are nonsense and
> don't belong in TIF_anything at all:
>
> TIF_IA32, TIF_X32: can probably be deleted.  Someone would just need
> to finish the work.
> TIF_ADDR32: also probably removable, but I'm less confident.
> TIF_FORCED_TF: This is purely a ptrace artifact and could easily go
> somewhere else entirely.
>
> So getting those five bits back would be straightforward.
>
> FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
> entry/exit word *and* a context switch word.  The latter is because
> it's logically a per-cpu flag, not a per-task flag, and the context
> switch code moves it around so it's always set on the running task.

Gah, I missed the context switch thing of that. That stuff is hideous.

Thanks,

       tglx
Andy Lutomirski July 18, 2020, 2:41 p.m. UTC | #7
On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Andy Lutomirski <luto@kernel.org> writes:
> > On Fri, Jul 17, 2020 at 12:29 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> The alternative is to play nasty games with TIF_IA32, TIF_ADDR32 and
> >> TIF_X32 to free up bits for 32bit and make the flags field 64 bit on 64
> >> bit kernels, but I prefer to do the above separation.
> >
> > I'm all for cleaning it up, but I don't think any nasty games would be
> > needed regardless.  IMO at least the following flags are nonsense and
> > don't belong in TIF_anything at all:
> >
> > TIF_IA32, TIF_X32: can probably be deleted.  Someone would just need
> > to finish the work.
> > TIF_ADDR32: also probably removable, but I'm less confident.
> > TIF_FORCED_TF: This is purely a ptrace artifact and could easily go
> > somewhere else entirely.
> >
> > So getting those five bits back would be straightforward.
> >
> > FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
> > entry/exit word *and* a context switch word.  The latter is because
> > it's logically a per-cpu flag, not a per-task flag, and the context
> > switch code moves it around so it's always set on the running task.
>
> Gah, I missed the context switch thing of that. That stuff is hideous.

It's also delightful because anything that screws up that dance (such
as failure to do the exit-to-usermode path exactly right) likely
results in an insta-root-hole.  If we fail to run user return
notifiers, we can run user code with incorrect syscall MSRs, etc.
Thomas Gleixner July 19, 2020, 10:17 a.m. UTC | #8
Andy Lutomirski <luto@kernel.org> writes:
> On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> Andy Lutomirski <luto@kernel.org> writes:
>> > FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
>> > entry/exit word *and* a context switch word.  The latter is because
>> > it's logically a per-cpu flag, not a per-task flag, and the context
>> > switch code moves it around so it's always set on the running task.
>>
>> Gah, I missed the context switch thing of that. That stuff is hideous.
>
> It's also delightful because anything that screws up that dance (such
> as failure to do the exit-to-usermode path exactly right) likely
> results in an insta-root-hole.  If we fail to run user return
> notifiers, we can run user code with incorrect syscall MSRs, etc.

Looking at it deeper, having that thing in the loop is a pointless
exercise. This really wants to be done _after_ the loop.

Thanks,

        tglx
Andy Lutomirski July 19, 2020, 3:25 p.m. UTC | #9
> On Jul 19, 2020, at 3:17 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Andy Lutomirski <luto@kernel.org> writes:
>>> On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> Andy Lutomirski <luto@kernel.org> writes:
>>>> FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
>>>> entry/exit word *and* a context switch word.  The latter is because
>>>> it's logically a per-cpu flag, not a per-task flag, and the context
>>>> switch code moves it around so it's always set on the running task.
>>> 
>>> Gah, I missed the context switch thing of that. That stuff is hideous.
>> 
>> It's also delightful because anything that screws up that dance (such
>> as failure to do the exit-to-usermode path exactly right) likely
>> results in an insta-root-hole.  If we fail to run user return
>> notifiers, we can run user code with incorrect syscall MSRs, etc.
> 
> Looking at it deeper, having that thing in the loop is a pointless
> exercise. This really wants to be done _after_ the loop.
> 

As long as we’re confident that nothing after the loop can set the flag again.

> Thanks,
> 
>        tglx
Thomas Gleixner July 20, 2020, 6:50 a.m. UTC | #10
Andy Lutomirski <luto@amacapital.net> writes:
>> On Jul 19, 2020, at 3:17 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> 
>> Andy Lutomirski <luto@kernel.org> writes:
>>>> On Sat, Jul 18, 2020 at 7:16 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> Andy Lutomirski <luto@kernel.org> writes:
>>>>> FWIW, TIF_USER_RETURN_NOTIFY is a bit of an odd duck: it's an
>>>>> entry/exit word *and* a context switch word.  The latter is because
>>>>> it's logically a per-cpu flag, not a per-task flag, and the context
>>>>> switch code moves it around so it's always set on the running task.
>>>> 
>>>> Gah, I missed the context switch thing of that. That stuff is hideous.
>>> 
>>> It's also delightful because anything that screws up that dance (such
>>> as failure to do the exit-to-usermode path exactly right) likely
>>> results in an insta-root-hole.  If we fail to run user return
>>> notifiers, we can run user code with incorrect syscall MSRs, etc.
>> 
>> Looking at it deeper, having that thing in the loop is a pointless
>> exercise. This really wants to be done _after_ the loop.
>> 
> As long as we’re confident that nothing after the loop can set the flag again.

Yes, because that's the direct way off to user space.

Thanks,

        tglx
Andy Lutomirski July 27, 2020, 10:28 p.m. UTC | #11
On Thu, Jul 16, 2020 at 12:50 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> From: Thomas Gleixner <tglx@linutronix.de>
>
> On syscall entry certain work needs to be done:
>
>    - Establish state (lockdep, context tracking, tracing)
>    - Conditional work (ptrace, seccomp, audit...)
>
> This code is needlessly duplicated and different in all
> architectures.
>
> +
> +/**
> + * arch_syscall_enter_seccomp - Architecture specific seccomp invocation
> + * @regs:      Pointer to current's pt_regs
> + *
> + * Returns: The original or a modified syscall number
> + *
> + * Invoked from syscall_enter_from_user_mode(). Can be replaced by
> + * architecture specific code.
> + */
> +static inline long arch_syscall_enter_seccomp(struct pt_regs *regs);

Ick.  I'd rather see arch_populate_seccomp_data() and kill this hook.
But we can clean this up later.

> +/**
> + * arch_syscall_enter_audit - Architecture specific audit invocation
> + * @regs:      Pointer to current's pt_regs
> + *
> + * Invoked from syscall_enter_from_user_mode(). Must be replaced by
> + * architecture specific code if the architecture supports audit.
> + */
> +static inline void arch_syscall_enter_audit(struct pt_regs *regs);
> +

Let's pass u32 arch here.

> +/**
> + * syscall_enter_from_user_mode - Check and handle work before invoking
> + *                              a syscall
> + * @regs:      Pointer to current's pt_regs
> + * @syscall:   The syscall number
> + *
> + * Invoked from architecture specific syscall entry code with interrupts
> + * disabled. The calling code has to be non-instrumentable. When the
> + * function returns all state is correct and the subsequent functions can be
> + * instrumented.
> + *
> + * Returns: The original or a modified syscall number
> + *
> + * If the returned syscall number is -1 then the syscall should be
> + * skipped. In this case the caller may invoke syscall_set_error() or
> + * syscall_set_return_value() first.  If neither of those are called and -1
> + * is returned, then the syscall will fail with ENOSYS.
> + *
> + * The following functionality is handled here:
> + *
> + *  1) Establish state (lockdep, RCU (context tracking), tracing)
> + *  2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
> + *     arch_syscall_enter_seccomp(), trace_sys_enter()
> + *  3) Invocation of arch_syscall_enter_audit()
> + */
> +long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);

This should IMO also take u32 arch.

I'm also uneasy about having this do the idtentry/irqentry stuff as
well as the syscall stuff.  Is there a particular reason you did it
this way instead of having callers do:

idtentry_enter();
instrumentation_begin();
syscall_enter_from_user_mode();

FWIW, I think we could make this even better -- couldn't this get
folded together with syscall *exit* and become:

idtentry_enter();
instrumentation_begin();
generic_syscall();
instrumentation_end();
idtentry_exit();

and generic_syscall() would call arch_dispatch_syscall(regs, arch, syscall_nr);



> +
> +/**
> + * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
> + * @regs:      Pointer to current's pt_regs
> + *
> + * Invoked from architecture specific entry code with interrupts disabled.
> + * Can only be called when the interrupt entry came from user mode. The
> + * calling code must be non-instrumentable.  When the function returns all
> + * state is correct and the subsequent functions can be instrumented.
> + *
> + * The function establishes state (lockdep, RCU (context tracking), tracing)
> + */
> +void irqentry_enter_from_user_mode(struct pt_regs *regs);

Unless the rest of the series works differently from what I expect, I
don't love this name.  How about normal_entry_from_user_mode() or
ordinary_entry_from_user_mode()?  After all, this seems to cover IRQ,
all the non-horrible exceptions, and (internally) syscalls.

--Andy

Patch

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -27,6 +27,9 @@  config HAVE_IMA_KEXEC
 config HOTPLUG_SMT
 	bool
 
+config GENERIC_ENTRY
+       bool
+
 config OPROFILE
 	tristate "OProfile system profiling"
 	depends on PROFILING
--- /dev/null
+++ b/include/linux/entry-common.h
@@ -0,0 +1,156 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_ENTRYCOMMON_H
+#define __LINUX_ENTRYCOMMON_H
+
+#include <linux/tracehook.h>
+#include <linux/syscalls.h>
+#include <linux/seccomp.h>
+#include <linux/sched.h>
+
+#include <asm/entry-common.h>
+
+/*
+ * Define dummy _TIF work flags if not defined by the architecture or for
+ * disabled functionality.
+ */
+#ifndef _TIF_SYSCALL_TRACE
+# define _TIF_SYSCALL_TRACE		(0)
+#endif
+
+#ifndef _TIF_SYSCALL_EMU
+# define _TIF_SYSCALL_EMU		(0)
+#endif
+
+#ifndef _TIF_SYSCALL_TRACEPOINT
+# define _TIF_SYSCALL_TRACEPOINT	(0)
+#endif
+
+#ifndef _TIF_SECCOMP
+# define _TIF_SECCOMP			(0)
+#endif
+
+#ifndef _TIF_SYSCALL_AUDIT
+# define _TIF_SYSCALL_AUDIT		(0)
+#endif
+
+/*
+ * TIF flags handled in syscall_enter_from_user_mode()
+ */
+#ifndef ARCH_SYSCALL_ENTER_WORK
+# define ARCH_SYSCALL_ENTER_WORK	(0)
+#endif
+
+#define SYSCALL_ENTER_WORK						\
+	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP |	\
+	 _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_EMU |			\
+	 ARCH_SYSCALL_ENTER_WORK)
+
+/**
+ * arch_check_user_regs - Architecture specific sanity check for user mode regs
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Defaults to an empty implementation. Can be replaced by architecture
+ * specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode() in the non-instrumentable
+ * section. Use __always_inline so the compiler cannot push it out of line
+ * and make it instrumentable.
+ */
+static __always_inline void arch_check_user_regs(struct pt_regs *regs);
+
+#ifndef arch_check_user_regs
+static __always_inline void arch_check_user_regs(struct pt_regs *regs) {}
+#endif
+
+/**
+ * arch_syscall_enter_tracehook - Wrapper around tracehook_report_syscall_entry()
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Returns: 0 on success or an error code to skip the syscall.
+ *
+ * Defaults to tracehook_report_syscall_entry(). Can be replaced by
+ * architecture specific code.
+ *
+ * Invoked from syscall_enter_from_user_mode()
+ */
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_tracehook
+static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs)
+{
+	return tracehook_report_syscall_entry(regs);
+}
+#endif
+
+/**
+ * arch_syscall_enter_seccomp - Architecture specific seccomp invocation
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Returns: The original or a modified syscall number
+ *
+ * Invoked from syscall_enter_from_user_mode(). Can be replaced by
+ * architecture specific code.
+ */
+static inline long arch_syscall_enter_seccomp(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_seccomp
+static inline long arch_syscall_enter_seccomp(struct pt_regs *regs)
+{
+	return secure_computing(NULL);
+}
+#endif
+
+/**
+ * arch_syscall_enter_audit - Architecture specific audit invocation
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from syscall_enter_from_user_mode(). Must be replaced by
+ * architecture specific code if the architecture supports audit.
+ */
+static inline void arch_syscall_enter_audit(struct pt_regs *regs);
+
+#ifndef arch_syscall_enter_audit
+static inline void arch_syscall_enter_audit(struct pt_regs *regs) { }
+#endif
+
+/**
+ * syscall_enter_from_user_mode - Check and handle work before invoking
+ *				 a syscall
+ * @regs:	Pointer to current's pt_regs
+ * @syscall:	The syscall number
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct and the subsequent functions can be
+ * instrumented.
+ *
+ * Returns: The original or a modified syscall number
+ *
+ * If the returned syscall number is -1 then the syscall should be
+ * skipped. In this case the caller may invoke syscall_set_error() or
+ * syscall_set_return_value() first.  If neither of those are called and -1
+ * is returned, then the syscall will fail with ENOSYS.
+ *
+ * The following functionality is handled here:
+ *
+ *  1) Establish state (lockdep, RCU (context tracking), tracing)
+ *  2) TIF flag dependent invocations of arch_syscall_enter_tracehook(),
+ *     arch_syscall_enter_seccomp(), trace_sys_enter()
+ *  3) Invocation of arch_syscall_enter_audit()
+ */
+long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall);
+
+/**
+ * irqentry_enter_from_user_mode - Establish state before invoking the irq handler
+ * @regs:	Pointer to current's pt_regs
+ *
+ * Invoked from architecture specific entry code with interrupts disabled.
+ * Can only be called when the interrupt entry came from user mode. The
+ * calling code must be non-instrumentable.  When the function returns all
+ * state is correct and the subsequent functions can be instrumented.
+ *
+ * The function establishes state (lockdep, RCU (context tracking), tracing)
+ */
+void irqentry_enter_from_user_mode(struct pt_regs *regs);
+
+#endif
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -48,6 +48,7 @@  obj-y += irq/
 obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
+obj-y += entry/
 
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
--- /dev/null
+++ b/kernel/entry/Makefile
@@ -0,0 +1,3 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_GENERIC_ENTRY) += common.o
--- /dev/null
+++ b/kernel/entry/common.c
@@ -0,0 +1,78 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/context_tracking.h>
+#include <linux/entry-common.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/syscalls.h>
+
+/**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall/interrupt entry disables interrupts, but user mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ */
+static __always_inline void enter_from_user_mode(struct pt_regs *regs)
+{
+	arch_check_user_regs(regs);
+	lockdep_hardirqs_off(CALLER_ADDR0);
+
+	CT_WARN_ON(ct_state() != CONTEXT_USER);
+	user_exit_irqoff();
+
+	instrumentation_begin();
+	trace_hardirqs_off_finish();
+	instrumentation_end();
+}
+
+static long syscall_trace_enter(struct pt_regs *regs, long syscall,
+				unsigned long ti_work)
+{
+	long ret = 0;
+
+	/* Handle ptrace */
+	if (ti_work & (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU)) {
+		ret = arch_syscall_enter_tracehook(regs);
+		if (ret || (ti_work & _TIF_SYSCALL_EMU))
+			return -1L;
+	}
+
+	/* Do seccomp after ptrace, to catch any tracer changes. */
+	if (ti_work & _TIF_SECCOMP) {
+		ret = arch_syscall_enter_seccomp(regs);
+		if (ret == -1L)
+			return ret;
+	}
+
+	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
+		trace_sys_enter(regs, syscall);
+
+	arch_syscall_enter_audit(regs);
+
+	return ret ? : syscall;
+}
+
+noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
+{
+	unsigned long ti_work;
+
+	enter_from_user_mode(regs);
+	instrumentation_begin();
+
+	local_irq_enable();
+	ti_work = READ_ONCE(current_thread_info()->flags);
+	if (ti_work & SYSCALL_ENTER_WORK)
+		syscall = syscall_trace_enter(regs, syscall, ti_work);
+	instrumentation_end();
+
+	return syscall;
+}
+
+noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+}