Re: [PATCH] x86/retpoline/entry: Disable the entire SYSCALL64 fast path with retpolines on

Message ID CA+55aFyONQeyHit6LcQz_JM5+C9pr61fcWFSpdn=7K_2mZN+oA@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Linus Torvalds Jan. 25, 2018, 6:48 p.m. UTC
On Mon, Jan 22, 2018 at 10:55 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Honestly, I'd rather get rid of the fast-path entirely. Compared to
> all the PTI mess, it's not even noticeable.

So I looked at how that would be.

Patch attached. Not really "tested", but I'm running the kernel with
this patch now, and 'strace' etc works, and honestly, it seems very
obvious.

Also, code generation for 'do_syscall_64()' does not look horrible. In
fact, it doesn't look all that much worse than the fast-path ever did.

So the biggest impact of this is the extra register saves
(SAVE_EXTRA_REGS) from setting up the full ptregs. And honestly, I
hate how that stupid macro still uses "movq reg,off(%rsp)" instead of
"pushq %reg".

Considering the diffstat:

 2 files changed, 2 insertions(+), 121 deletions(-)

and how those 100+ lines are nasty assembly code, I do think we should
just do it.

Comments?

              Linus
 arch/x86/entry/entry_64.S   | 116 --------------------------------------------
 arch/x86/entry/syscall_64.c |   7 +--
 2 files changed, 2 insertions(+), 121 deletions(-)

Comments

Linus Torvalds Jan. 25, 2018, 7:16 p.m. UTC | #1
On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the biggest impact of this is the extra register saves

Actually, the other noticeable part is the reloading of the argument
registers from ptregs. Together with just the extra level of
'call/ret' and the stack setup, I'm guessing we're talking maybe 20
cycles or so.

So there's the extra register saves, and simply the fact that the
fastpath had a flatter calling structure.

It still feels worth it. And if we do decide that we want to do the
register clearing on kernel entry for some paranoid mode, we'd pretty
much have to do this anyway.

                 Linus
Brian Gerst Jan. 25, 2018, 8:04 p.m. UTC | #2
On Thu, Jan 25, 2018 at 2:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So the biggest impact of this is the extra register saves
>
> Actually, the other noticeable part is the reloading of the argument
> registers from ptregs. Together with just the extra level of
> 'call/ret' and the stack setup, I'm guessing we're talking maybe 20
> cycles or so.
>
> So there's the extra register saves, and simply the fact that the
> fastpath had a flatter calling structure.
>
> It still feels worth it. And if we do decide that we want to do the
> register clearing on kernel entry for some paranoid mode, we'd pretty
> much have to do this anyway.
>
>                  Linus

Another extra step the slow path does is checking to see if ptregs is
safe for SYSRET.  I think that can be mitigated by moving the check to
the places that do modify ptregs (ptrace, sigreturn, and exec) which
would set a flag to force return with IRET if the modified regs do not
satisfy the criteria for SYSRET.
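
Roughly like this (completely untested), with a hypothetical
TS_SYSRET_UNSAFE bit -- the real name and home for that flag would
need some thought:

/*
 * Untested sketch: whenever something pokes at ptregs behind the
 * syscall's back (ptrace, sigreturn, exec), remember whether the
 * result still satisfies the SYSRET criteria; the exit path would
 * then test the hypothetical TS_SYSRET_UNSAFE bit and take IRET.
 */
static void check_sysret_criteria(struct task_struct *task,
                                  struct pt_regs *regs)
{
        /* SYSRET reloads RIP from RCX and RFLAGS from R11, and can't
         * be used with a non-canonical RIP, so anything else must
         * return via IRET. */
        if (regs->cx != regs->ip || regs->r11 != regs->flags ||
            regs->ip >= TASK_SIZE_MAX)
                task->thread.status |= TS_SYSRET_UNSAFE;
}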

--
Brian Gerst
Linus Torvalds Jan. 25, 2018, 8:54 p.m. UTC | #3
On Thu, Jan 25, 2018 at 12:04 PM, Brian Gerst <brgerst@gmail.com> wrote:
>
> Another extra step the slow path does is checking to see if ptregs is
> safe for SYSRET.  I think that can be mitigated by moving the check to
> the places that do modify ptregs (ptrace, sigreturn, and exec) which
> would set a flag to force return with IRET if the modified regs do not
> satisfy the criteria for SYSRET.

I tried to do some profiling, and none of that shows up for me.

That said, what _also_ doesn't show up is the actual page table switch
on entry. And that seems to be because the per-cpu trampoline code
isn't captured by perf (or at least not shown). Oh well.

What _does_ show up a bit is this in prepare_exit_to_usermode():

#ifdef CONFIG_COMPAT
        /*
         * Compat syscalls set TS_COMPAT.  Make sure we clear it before
         * returning to user mode.  We need to clear it *after* signal
         * handling, because syscall restart has a fixup for compat
         * syscalls.  The fixup is exercised by the ptrace_syscall_32
         * selftest.
         *
         * We also need to clear TS_REGS_POKED_I386: the 32-bit tracer
         * special case only applies after poking regs and before the
         * very next return to user mode.
         */
        current->thread.status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
#endif

and I think the problem there is that it is unnecessarily dirtying
that cacheline. Afaik, those bits are already clear 99.999% of the
time.

So things would be better if that 'status' were in the thread-info
(to keep cachelines close to the other stuff we already touch) and the
code should have something like

        if (unlikely(ti->status & (TS_COMPAT|TS_I386_REGS_POKED)))

or whatever.
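
IOW, something like this (untested, and with 'status' still in
thread_struct for now):

        /*
         * Untested: test before clearing, so the common case only
         * reads the word and never dirties the cacheline.
         */
        u32 status = current->thread.status;

        if (unlikely(status & (TS_COMPAT | TS_I386_REGS_POKED)))
                current->thread.status =
                        status & ~(TS_COMPAT | TS_I386_REGS_POKED);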

There might be other similar small tuning issues going on.

So there is room for improvement there in the slow path.

                Linus
Andy Lutomirski Jan. 25, 2018, 9:02 p.m. UTC | #4
On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 22, 2018 at 10:55 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Honestly, I'd rather get rid of the fast-path entirely. Compared to
>> all the PTI mess, it's not even noticeable.
>
> So I looked at how that would be.
>
> Patch attached. Not really "tested", but I'm running the kernel with
> this patch now, and 'strace' etc works, and honestly, it seems very
> obvious.
>
> Also, code generation for 'do_syscall_64()' does not look horrible. In
> fact, it doesn't look all that much worse than the fast-path ever did.
>
> So the biggest impact of this is the extra register saves
> (SAVE_EXTRA_REGS) from setting up the full ptregs. And honestly, I
> hate how that stupid macro still uses "movq reg,off(%rsp)" instead of
> "pushq %reg".
>
> Considering the diffstat:
>
>  2 files changed, 2 insertions(+), 121 deletions(-)
>
> and how those 100+ lines are nasty assembly code, I do think we should
> just do it.

Feel free to Acked-by: Andy Lutomirski <luto@kernel.org> that patch.

Or I can grab it and send it to -tip.

Re: the trampoline not showing up: if I find some time, I'll try to
wire it up correctly in kallsyms.

--Andy
Thomas Gleixner Jan. 25, 2018, 9:05 p.m. UTC | #5
On Thu, 25 Jan 2018, Andy Lutomirski wrote:
> On Thu, Jan 25, 2018 at 10:48 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, Jan 22, 2018 at 10:55 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> Honestly, I'd rather get rid of the fast-path entirely. Compared to
> >> all the PTI mess, it's not even noticeable.
> >
> > So I looked at how that would be.
> >
> > Patch attached. Not really "tested", but I'm running the kernel with
> > this patch now, and 'strace' etc works, and honestly, it seems very
> > obvious.
> >
> > Also, code generation for 'do_syscall_64()' does not look horrible. In
> > fact, it doesn't look all that much worse than the fast-path ever did.
> >
> > So the biggest impact of this is the extra register saves
> > (SAVE_EXTRA_REGS) from setting up the full ptregs. And honestly, I
> > hate how that stupid macro still uses "movq reg,off(%rsp)" instead of
> > "pushq %reg".
> >
> > Considering the diffstat:
> >
> >  2 files changed, 2 insertions(+), 121 deletions(-)
> >
> > and how those 100+ lines are nasty assembly code, I do think we should
> > just do it.
> 
> Feel free to Acked-by: Andy Lutomirski <luto@kernel.org> that patch.
> 
> Or I can grab it and send it to -tip.

That would be nice, so we can route it through x86/pti, which provides it
cleanly for backporting.

Thanks,

	tglx
Linus Torvalds Jan. 25, 2018, 9:06 p.m. UTC | #6
On Thu, Jan 25, 2018 at 1:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> Feel free to Acked-by: Andy Lutomirski <luto@kernel.org> that patch.
>
> Or I can grab it and send it to -tip.

I'm not going to apply it for 4.15, I just wanted to see how it
looked, and do some minimal profiling.

From the profiles, as mentioned, moving 'status' from thread_struct to
thread_info is probably worth doing. But I didn't look at the impact
of that at all.

So it should go through all the normal channels in -tip for 4.16.

I'll happily sign off on the patch, but it was really pretty mindless,
so I'm not sure I need the authorship either.

> Re: the trampoline not showing up: if I find some time, I'll try to
> wire it up correctly in kallsyms.

That would be lovely. Right now the system call exit shows up pretty
clearly in profiles, and most of it is (obviously) the cr3 write. So
the missing entry trampoline is not insignificant.

             Linus
Andy Lutomirski Jan. 25, 2018, 9:08 p.m. UTC | #7
On Thu, Jan 25, 2018 at 1:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 25, 2018 at 1:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Feel free to Acked-by: Andy Lutomirski <luto@kernel.org> that patch.
>>
>> Or I can grab it and send it to -tip.
>
> I'm not going to apply it for 4.15, I just wanted to see how it
> looked, and do some minimal profiling.
>
> From the profiles, as mentioned, moving 'status' from thread_struct to
> thread_info is probably worth doing. But I didn't look at the impact
> of that at all.
>
> So it should go through all the normal channels in -tip for 4.16.
>
> I'll happily sign off on the patch, but it was really pretty mindless,
> so I'm not sure I need the authorship either.
>
>> Re: the trampoline not showing up: if I find some time, I'll try to
>> wire it up correctly in kallsyms.
>
> That would be lovely. Right now the system call exit shows up pretty
> clearly in profiles, and most of it is (obviously) the cr3 write. So
> the missing entry trampoline is not insignificant.
>

With retpoline, the retpoline in the trampoline sucks.  I don't need
perf for that -- I've benchmarked it both ways.  It sucks.  I'll fix
it, but it'll be kind of complicated.

--Andy
Linus Torvalds Jan. 25, 2018, 9:20 p.m. UTC | #8
On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> With retpoline, the retpoline in the trampoline sucks.  I don't need
> perf for that -- I've benchmarked it both ways.  It sucks.  I'll fix
> it, but it'll be kind of complicated.

Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).

But yeah, that is fixable even if it does require a page per CPU. Or
did you have some clever scheme in mind?

But that's independent of the slow/fast path.

          Linus
Andy Lutomirski Jan. 25, 2018, 9:31 p.m. UTC | #9
On Thu, Jan 25, 2018 at 1:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> With retpoline, the retpoline in the trampoline sucks.  I don't need
>> perf for that -- I've benchmarked it both ways.  It sucks.  I'll fix
>> it, but it'll be kind of complicated.
>
> Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
>
> But yeah, that is fixable even if it does require a page per CPU. Or
> did you have some clever scheme in mind?

Nothing clever.  I was going to see if I could get actual
binutils-generated relocations to work in the trampoline.  We already
have code to parse ELF relocations and turn them into a simple table,
and it shouldn't be *that* hard to run a separate pass on the entry
trampoline.

Another potentially useful if rather minor optimization would be to
rejigger the SYSCALL_DEFINE macros a bit.  Currently we treat all
syscalls like this:

long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);

I wonder if we'd be better off doing:

long func(const struct pt_regs *regs);

and autogenerating:

static long SyS_read(const struct pt_regs *regs)
{
   return sys_read(regs->di, ...);
}
Dan Williams Jan. 25, 2018, 9:39 p.m. UTC | #10
On Thu, Jan 25, 2018 at 1:31 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Thu, Jan 25, 2018 at 1:20 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> With retpoline, the retpoline in the trampoline sucks.  I don't need
>>> perf for that -- I've benchmarked it both ways.  It sucks.  I'll fix
>>> it, but it'll be kind of complicated.
>>
>> Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
>>
>> But yeah, that is fixable even if it does require a page per CPU. Or
>> did you have some clever scheme in mind?
>
> Nothing clever.  I was going to see if I could get actual
> binutils-generated relocations to work in the trampoline.  We already
> have code to parse ELF relocations and turn them into a simple table,
> and it shouldn't be *that* hard to run a separate pass on the entry
> trampoline.
>
> Another potentially useful if rather minor optimization would be to
> rejigger the SYSCALL_DEFINE macros a bit.  Currently we treat all
> syscalls like this:
>
> long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
>
> I wonder if we'd be better off doing:
>
> long func(const struct pt_regs *regs);
>
> and autogenerating:
>
> static long SyS_read(const struct pt_regs *regs)
> {
>    return sys_read(regs->di, ...);
> }

If you're rejiggering, can we also put in a mechanism for detecting
which registers to clear so that userspace can't inject useful values
into speculation paths?

https://patchwork.kernel.org/patch/10153753/
Andy Lutomirski Jan. 25, 2018, 9:53 p.m. UTC | #11
On Thu, Jan 25, 2018 at 1:39 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Jan 25, 2018 at 1:31 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Thu, Jan 25, 2018 at 1:20 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Thu, Jan 25, 2018 at 1:08 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>> With retpoline, the retpoline in the trampoline sucks.  I don't need
>>>> perf for that -- I've benchmarked it both ways.  It sucks.  I'll fix
>>>> it, but it'll be kind of complicated.
>>>
>>> Ahh, I'd forgotten about that (and obviously didn't see it in the profiles).
>>>
>>> But yeah, that is fixable even if it does require a page per CPU. Or
>>> did you have some clever scheme in mind?
>>
>> Nothing clever.  I was going to see if I could get actual
>> binutils-generated relocations to work in the trampoline.  We already
>> have code to parse ELF relocations and turn them into a simple table,
>> and it shouldn't be *that* hard to run a separate pass on the entry
>> trampoline.
>>
>> Another potentially useful if rather minor optimization would be to
>> rejigger the SYSCALL_DEFINE macros a bit.  Currently we treat all
>> syscalls like this:
>>
>> long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
>>
>> I wonder if we'd be better off doing:
>>
>> long func(const struct pt_regs *regs);
>>
>> and autogenerating:
>>
>> static long SyS_read(const struct pt_regs *regs)
>> {
>>    return sys_read(regs->di, ...);
>> }
>
> If you're rejiggering, can we also put in a mechanism for detecting
> which registers to clear so that userspace can't inject useful values
> into speculation paths?
>
> https://patchwork.kernel.org/patch/10153753/

My SYSCALL_DEFINE rejigger suggestion up-thread does this for free as
a side effect.

That being said, I think this would be more accurately characterized
as "so that userspace has a somewhat harder time injecting useful
values into speculation paths".
Linus Torvalds Jan. 25, 2018, 9:53 p.m. UTC | #12
On Thu, Jan 25, 2018 at 1:39 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> If you're rejiggering, can we also put in a mechanism for detecting
> which registers to clear so that userspace can't inject useful values
> into speculation paths?

That actually becomes trivial with just the "no fastpath" patch I sent
out. You can just clear all of them.

Sure, then do_syscall_64() will reload the first six of them, but since
those are the argument registers anyway, and since they are
caller-clobbered, they have very short lifetimes. So it would
effectively not really be an issue.

But yes, SYSCALL_DEFINEx() rejiggery would close even that tiny hole.

                   Linus
David Laight Jan. 26, 2018, 11:17 a.m. UTC | #13
From: Andy Lutomirski
> Sent: 25 January 2018 21:31
...
> Another potentially useful if rather minor optimization would be to
> rejigger the SYSCALL_DEFINE macros a bit.  Currently we treat all
> syscalls like this:
>
> long func(long arg0, long arg1, long arg2, long arg3, long arg4, long arg5);
>
> I wonder if we'd be better off doing:
>
> long func(const struct pt_regs *regs);
>
> and autogenerating:
>
> static long SyS_read(const struct pt_regs *regs)
> {
>    return sys_read(regs->di, ...);
> }

Hmmm....
NetBSD (and the other BSD?) defines a structure for the arguments to each syscall.
On systems where all function arguments are passed on the stack, the user stack
area that contains the arguments is copied into a kernel buffer.
For amd64 I changed the register save area layout so that the arguments were in
the right order [1], then added an extra area for the extra arguments that had
to be read from the user stack.
Just passing a pointer into the save area has to be better than reading
all the values again.

[1] There was some horrid fallout from that :-(

	David
Alan Cox Jan. 26, 2018, 2:24 p.m. UTC | #14
> NetBSD (and the other BSD?) defines a structure for the arguments to
> each syscall.

That goes back to v7 or so, but they put the syscall arguments into the
uarea so that no pointers were needed (the uarea being a per-process
mapping at a fixed address), in order to also reduce pointer-dereferencing
costs (not that those matter much on modern processors).

Alan.
Andy Lutomirski Jan. 26, 2018, 3:57 p.m. UTC | #15
On Fri, Jan 26, 2018 at 6:24 AM, Alan Cox <alan@linux.intel.com> wrote:
>> NetBSD (and the other BSD?) defines a structure for the arguments to
>> each syscall.
>
> Goes back to v7 or so but they put the syscall arguments into the uarea
> so that no pointers were needed (uarea being a per process mapping at a
> fixed address) in order to also reduce pointer dereferencing costs (not
> that those matter much on modern processors)
>

I gave a rearrangement like this a try yesterday and it's a bit of a
mess.  Part of the problem is that there are a bunch of pieces of code
that expect sys_xyz() to be actual callable functions.  The best way
to deal with that is probably to switch to calling normal functions.
Linus Torvalds Jan. 26, 2018, 5:40 p.m. UTC | #16
On Fri, Jan 26, 2018 at 7:57 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> I gave a rearrangement like this a try yesterday and it's a bit of a
> mess.  Part of the problem is that there are a bunch of pieces of code
> that expect sys_xyz() to be actual callable functions.

That's not supposed to be a mess.

That's part of why we do that whole indirection through SYSC##xyz to
sys##_xyz: the asm-callable ones will do the full casting of
troublesome arguments (some architectures have C calling sequence
rules that have security issues, so we need to make sure that the
arguments actually follow the right rules and 'int' arguments are
properly sign-extended etc).

So that whole indirection could be made to *also* create another
version of the syscall that instead took the arguments from ptregs.

We already do exactly that for the tracing events: look how
FTRACE_SYSCALLS ends up creating that extra metadata.

The ptreg version should be done the same way: don't make 'sys_xyz()'
take a struct ptregs, instead make those SYSCALL_DEFINE*() macros
create a _new_ function called 'ptregs_xyz()' and then that function
does the argument unpacking.

Then the x86 system call table can just be switched over to call those
ptreg versions instead.
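
The table side is then trivial; something like this (untested) in
arch/x86/entry/syscall_64.c:

/*
 * Untested sketch: declare and use the autogenerated ptregs_*()
 * wrappers instead of the asm-callable sys_*() functions.
 */
#define __SYSCALL_64(nr, sym, qual) \
	extern asmlinkage long ptregs_##sym(const struct pt_regs *regs);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64

#define __SYSCALL_64(nr, sym, qual) [nr] = ptregs_##sym,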

                    Linus
Al Viro Jan. 26, 2018, 6:07 p.m. UTC | #17
On Fri, Jan 26, 2018 at 09:40:23AM -0800, Linus Torvalds wrote:
> On Fri, Jan 26, 2018 at 7:57 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I gave a rearrangement like this a try yesterday and it's a bit of a
> > mess.  Part of the problem is that there are a bunch of pieces of code
> > that expect sys_xyz() to be actual callable functions.
> 
> That's not supposed to be a mess.
> 
> That's part of why we do that whole indirection through SYSC##xyz to
> sys##_xyz: the asm-callable ones will do the full casting of
> troublesome arguments (some architectures have C calling sequence
> rules that have security issues, so we need to make sure that the
> arguments actually follow the right rules and 'int' arguments are
> properly sign-extended etc).
> 
> So that whole indirection could be made to *also* create another
> version of the syscall that instead took the arguments from ptregs.
> 
> We already do exactly that for the tracing events: look how
> FTRACE_SYSCALLS ends up creating that extra metadata.
> 
> The ptreg version should be done the same way: don't make 'sys_xyz()'
> take a struct ptregs, instead make those SYSCALL_DEFINE*() macros
> create a _new_ function called 'ptregs_xyz()' and then that function
> does the argument unpacking.
> 
> Then the x86 system call table can just be switched over to call those
> ptreg versions instead.

Umm...  What about other architectures?  Or do you want SYSCALL_DEFINE...
to be per-arch?  I wonder how much that "go through pt_regs" would hurt
on something like sparc...
Linus Torvalds Jan. 26, 2018, 6:13 p.m. UTC | #18
On Fri, Jan 26, 2018 at 10:07 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  What about other architectures?  Or do you want SYSCALL_DEFINE...
> to be per-arch?  I wonder how much would that "go through pt_regs" hurt
> on something like sparc...

No, but I just talked to Will Deacon about register clearing on entry,
and so I suspect that arm64 might want something similar too.

So I think some opt-in for letting architectures add their own
function would be good. Because it wouldn't be all architectures, but
it probably _would_ be more than just x86.

You need to add architecture-specific "load argX from ptregs" macros anyway.

             Linus
Andy Lutomirski Jan. 26, 2018, 6:23 p.m. UTC | #19
On Fri, Jan 26, 2018 at 10:13 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jan 26, 2018 at 10:07 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Umm...  What about other architectures?  Or do you want SYSCALL_DEFINE...
>> to be per-arch?  I wonder how much that "go through pt_regs" would hurt
>> on something like sparc...
>
> No, but I just talked to Will Deacon about register clearing on entry,
> and so I suspect that arm64 might want something similar too.
>
> So I think some opt-in for letting architectures add their own
> function would be good. Because it wouldn't be all architectures, but
> it probably _would_ be more than just x86.
>
> You need to add architecture-specific "load argX from ptregs" macros anyway.

I mocked that up, and it's straightforward.  I ended up with something like:

#define __ARCH_SYSCALL_ARGS(n, ...) (regs->di, ...)

(obviously modified so it actually compiles.)

The issue is that doing it this way gives us, effectively:

long sys_foo(int a, int b)
{
  body here;
}

long SyS_foo(const struct pt_regs *regs)
{
  return sys_foo(regs->di, regs->si);
}

whereas what we want is *static* long sys_foo(...).  So I could split
the macros into:

DEFINE_SYSCALL2(foo, ....)

and

DEFINE_EXTERN_SYSCALL2(foo, ...)

or I could just fix up all the code that expects calling sys_foo()
across files to work.

My mockup patch doesn't actually work because of compat crap, but
that's manageable.
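
FWIW, the split would look vaguely like this (untested, showing only
the two-argument case):

/* Untested sketch: one helper emits the SyS_ wrapper; the storage
 * class of the C-callable body is the only difference between the
 * two user-facing macros. */
#define __DEFINE_SYSCALL2(qual, name, t1, a1, t2, a2)		\
	qual long sys_##name(t1 a1, t2 a2);			\
	long SyS_##name(const struct pt_regs *regs)		\
	{							\
		return sys_##name((t1)regs->di, (t2)regs->si);	\
	}							\
	qual long sys_##name(t1 a1, t2 a2)

/* the common case: nothing outside the table calls sys_foo() */
#define DEFINE_SYSCALL2(name, ...) \
	__DEFINE_SYSCALL2(static, name, __VA_ARGS__)

/* for the files that still call sys_foo() directly */
#define DEFINE_EXTERN_SYSCALL2(name, ...) \
	__DEFINE_SYSCALL2(, name, __VA_ARGS__)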

--Andy
Linus Torvalds Jan. 26, 2018, 6:54 p.m. UTC | #20
On Fri, Jan 26, 2018 at 10:23 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> The issue is that doing it this way gives us, effectively:
>
> long sys_foo(int a, int b)
> {
>   body here;
> }
>
> long SyS_foo(const struct pt_regs *regs)
> {
>   return sys_foo(regs->di, regs->si);
> }
>
> whereas what we want is *static* long sys_foo(...).

How about just marking 'sys_foo()' as being always_inline (but still
not static)? Because the case that _matters_ is the SyS_foo() thing
when this is enabled.

Sure, you'll get two copies of the code (one in SyS_foo(), the other
being the callable-from C 'sys_foo()' that is exported and almost
never used). But that seems a fairly small price to pay. We could
leave it for later to try to get rid of the unused copies entirely.
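
I.e. something like this (untested; it leans on the gnu89 inline
semantics the kernel builds with, where a plain non-static inline
definition still emits the out-of-line symbol):

/* Untested: sys_foo stays a real, exported symbol for the few
 * remaining C callers, while SyS_foo() gets its own inlined copy. */
__always_inline long sys_foo(int a, int b)
{
	/* body here */
	return 0;
}

long SyS_foo(const struct pt_regs *regs)
{
	return sys_foo(regs->di, regs->si);
}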

             Linus
Andy Lutomirski Jan. 26, 2018, 7:02 p.m. UTC | #21
On Fri, Jan 26, 2018 at 10:54 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jan 26, 2018 at 10:23 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> The issue is that doing it this way gives us, effectively:
>>
>> long sys_foo(int a, int b)
>> {
>>   body here;
>> }
>>
>> long SyS_foo(const struct pt_regs *regs)
>> {
>>   return sys_foo(regs->di, regs->si);
>> }
>>
>> whereas what we want is *static* long sys_foo(...).
>
> How about just marking 'sys_foo()' as being always_inline (but still
> not static)? Because the case that _matters_ is the SyS_foo() thing
> when this is enabled.
>
> Sure, you'll get two copies of the code (one in SyS_foo(), the other
> being the callable-from-C 'sys_foo()' that is exported and almost
> never used). But that seems a fairly small price to pay. We could
> leave it for later to try to get rid of the unused copies entirely.
>

I could do that, but Josh Triplett will yell at me.

Anyway, I'll fiddle with it.  This isn't exactly high priority.
Will Deacon Jan. 29, 2018, 1:19 p.m. UTC | #22
Hi Andy,

On Fri, Jan 26, 2018 at 10:23:23AM -0800, Andy Lutomirski wrote:
> On Fri, Jan 26, 2018 at 10:13 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Jan 26, 2018 at 10:07 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >> Umm...  What about other architectures?  Or do you want SYSCALL_DEFINE...
> >> to be per-arch?  I wonder how much that "go through pt_regs" would hurt
> >> on something like sparc...
> >
> > No, but I just talked to Will Deacon about register clearing on entry,
> > and so I suspect that arm64 might want something similar too.
> >
> > So I think some opt-in for letting architectures add their own
> > function would be good. Because it wouldn't be all architectures, but
> > it probably _would_ be more than just x86.
> >
> > You need to add architecture-specific "load argX from ptregs" macros anyway.
> 
> I mocked that up, and it's straightforward.  I ended up with something like:
> 
> #define __ARCH_SYSCALL_ARGS(n, ...) (regs->di, ...)
> 
> (obviously modified so it actually compiles.)
> 
> The issue is that doing it this way gives us, effectively:
> 
> long sys_foo(int a, int b)
> {
>   body here;
> }
> 
> long SyS_foo(const struct pt_regs *regs)
> {
>   return sys_foo(regs->di, regs->si);
> }
> 
> whereas what we want is *static* long sys_foo(...).  So I could split
> the macros into:
> 
> DEFINE_SYSCALL2(foo, ....)
> 
> and
> 
> DEFINE_EXTERN_SYSCALL2(foo, ...)
> 
> or I could just fix up all the code that expects calling sys_foo()
> across files to work.

Another issue with this style of macro definition exists on architectures
where the calling convention needs you to carry state around depending on
how you packed the previous parameters. For example, on 32-bit ARM, 64-bit
values are passed in adjacent pairs of registers but the low numbered
register needs to be even. This is what stopped me from trying to use
existing helpers such as syscall_get_arguments to unpack the pt_regs
and it generally means that anything that says "get me argument n" is going
to require constructing arguments 0..n-1 first.

To do this properly I think we'll either need to pass back the size and
current register offset to the arch code, or just allow the thing to be
overridden per syscall (the case above isn't especially frequent).
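
To illustrate what "get me argument n" ends up needing (untested,
purely illustrative):

/*
 * Untested illustration: on 32-bit ARM EABI a 64-bit value occupies
 * an even/odd register pair, so the register slot of argument n can
 * only be found by walking the sizes of arguments 0..n-1.
 */
static int syscall_arg_reg(const u8 *arg_sizes, int n)
{
	int reg = 0, i;

	for (i = 0; i <= n; i++) {
		if (arg_sizes[i] == 8 && (reg & 1))
			reg++;			/* even-align 64-bit args */
		if (i < n)
			reg += (arg_sizes[i] == 8) ? 2 : 1;
	}
	return reg;
}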

Will
David Laight Jan. 29, 2018, 3:23 p.m. UTC | #23
From: Will Deacon
> Sent: 29 January 2018 13:20
...
> Another issue with this style of macro definition exists on architectures
> where the calling convention needs you to carry state around depending on
> how you packed the previous parameters. For example, on 32-bit ARM, 64-bit
> values are passed in adjacent pairs of registers but the low numbered
> register needs to be even. This is what stopped me from trying to use
> existing helpers such as syscall_get_arguments to unpack the pt_regs
> and it generally means that anything that says "get me argument n" is going
> to require constructing arguments 0..n-1 first.
> 
> To do this properly I think we'll either need to pass back the size and
> current register offset to the arch code, or just allow the thing to be
> overridden per syscall (the case above isn't especially frequent).

If you generate a structure from the argument list, that might work
'by magic'.
Certainly you can add explicit pads to any structure.
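
E.g. (untested, for a hypothetical sys_foo on the 32-bit ARM case
Will describes):

/*
 * Untested illustration: an argument structure whose layout matches
 * the EABI register assignment for a hypothetical
 * sys_foo(int fd, loff_t off, int whence).  The 64-bit 'off' must
 * start in an even register, so r1 is skipped; the compiler would
 * insert the same pad anyway to 8-byte-align 'off', but spelling it
 * out keeps the layout obvious.
 */
struct sys_foo_args {
	int		fd;	/* r0 */
	int		__pad0;	/* r1: skipped by the ABI */
	long long	off;	/* r2/r3 pair */
	int		whence;	/* r4 */
};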

	David

Patch

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ff6f8022612c..65f4d29be69b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -241,80 +241,6 @@  GLOBAL(entry_SYSCALL_64_after_hwframe)
 
 	TRACE_IRQS_OFF
 
-	/*
-	 * If we need to do entry work or if we guess we'll need to do
-	 * exit work, go straight to the slow path.
-	 */
-	movq	PER_CPU_VAR(current_task), %r11
-	testl	$_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
-	jnz	entry_SYSCALL64_slow_path
-
-entry_SYSCALL_64_fastpath:
-	/*
-	 * Easy case: enable interrupts and issue the syscall.  If the syscall
-	 * needs pt_regs, we'll call a stub that disables interrupts again
-	 * and jumps to the slow path.
-	 */
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#if __SYSCALL_MASK == ~0
-	cmpq	$__NR_syscall_max, %rax
-#else
-	andl	$__SYSCALL_MASK, %eax
-	cmpl	$__NR_syscall_max, %eax
-#endif
-	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
-	movq	%r10, %rcx
-
-	/*
-	 * This call instruction is handled specially in stub_ptregs_64.
-	 * It might end up jumping to the slow path.  If it jumps, RAX
-	 * and all argument registers are clobbered.
-	 */
-#ifdef CONFIG_RETPOLINE
-	movq	sys_call_table(, %rax, 8), %rax
-	call	__x86_indirect_thunk_rax
-#else
-	call	*sys_call_table(, %rax, 8)
-#endif
-.Lentry_SYSCALL_64_after_fastpath_call:
-
-	movq	%rax, RAX(%rsp)
-1:
-
-	/*
-	 * If we get here, then we know that pt_regs is clean for SYSRET64.
-	 * If we see that no exit work is required (which we are required
-	 * to check with IRQs off), then we can go straight to SYSRET64.
-	 */
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-	movq	PER_CPU_VAR(current_task), %r11
-	testl	$_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
-	jnz	1f
-
-	LOCKDEP_SYS_EXIT
-	TRACE_IRQS_ON		/* user mode is traced as IRQs on */
-	movq	RIP(%rsp), %rcx
-	movq	EFLAGS(%rsp), %r11
-	addq	$6*8, %rsp	/* skip extra regs -- they were preserved */
-	UNWIND_HINT_EMPTY
-	jmp	.Lpop_c_regs_except_rcx_r11_and_sysret
-
-1:
-	/*
-	 * The fast path looked good when we started, but something changed
-	 * along the way and we need to switch to the slow path.  Calling
-	 * raise(3) will trigger this, for example.  IRQs are off.
-	 */
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	SAVE_EXTRA_REGS
-	movq	%rsp, %rdi
-	call	syscall_return_slowpath	/* returns with IRQs disabled */
-	jmp	return_from_SYSCALL_64
-
-entry_SYSCALL64_slow_path:
 	/* IRQs are off. */
 	SAVE_EXTRA_REGS
 	movq	%rsp, %rdi
@@ -393,7 +319,6 @@  syscall_return_via_sysret:
 	/* rcx and r11 are already restored (see code above) */
 	UNWIND_HINT_EMPTY
 	POP_EXTRA_REGS
-.Lpop_c_regs_except_rcx_r11_and_sysret:
 	popq	%rsi	/* skip r11 */
 	popq	%r10
 	popq	%r9
@@ -424,47 +349,6 @@  syscall_return_via_sysret:
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
-ENTRY(stub_ptregs_64)
-	/*
-	 * Syscalls marked as needing ptregs land here.
-	 * If we are on the fast path, we need to save the extra regs,
-	 * which we achieve by trying again on the slow path.  If we are on
-	 * the slow path, the extra regs are already saved.
-	 *
-	 * RAX stores a pointer to the C function implementing the syscall.
-	 * IRQs are on.
-	 */
-	cmpq	$.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
-	jne	1f
-
-	/*
-	 * Called from fast path -- disable IRQs again, pop return address
-	 * and jump to slow path
-	 */
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-	popq	%rax
-	UNWIND_HINT_REGS extra=0
-	jmp	entry_SYSCALL64_slow_path
-
-1:
-	JMP_NOSPEC %rax				/* Called from C */
-END(stub_ptregs_64)
-
-.macro ptregs_stub func
-ENTRY(ptregs_\func)
-	UNWIND_HINT_FUNC
-	leaq	\func(%rip), %rax
-	jmp	stub_ptregs_64
-END(ptregs_\func)
-.endm
-
-/* Instantiate ptregs_stub for each ptregs-using syscall */
-#define __SYSCALL_64_QUAL_(sym)
-#define __SYSCALL_64_QUAL_ptregs(sym) ptregs_stub sym
-#define __SYSCALL_64(nr, sym, qual) __SYSCALL_64_QUAL_##qual(sym)
-#include <asm/syscalls_64.h>
-
 /*
  * %rdi: prev task
  * %rsi: next task
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 9c09775e589d..c176d2fab1da 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -7,14 +7,11 @@ 
 #include <asm/asm-offsets.h>
 #include <asm/syscall.h>
 
-#define __SYSCALL_64_QUAL_(sym) sym
-#define __SYSCALL_64_QUAL_ptregs(sym) ptregs_##sym
-
-#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long __SYSCALL_64_QUAL_##qual(sym)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
+#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
 #include <asm/syscalls_64.h>
 #undef __SYSCALL_64
 
-#define __SYSCALL_64(nr, sym, qual) [nr] = __SYSCALL_64_QUAL_##qual(sym),
+#define __SYSCALL_64(nr, sym, qual) [nr] = sym,
 
 extern long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);