[3/3] syscalls: Add a bit of documentation to __SYSCALL_DEFINE
diff mbox

Message ID 4dd5a4d0f6b694f17eaf64c80c36a2560993b9ea.1517164461.git.luto@kernel.org
State New
Headers show

Commit Message

Andy Lutomirski Jan. 28, 2018, 6:38 p.m. UTC
__SYSCALL_DEFINE is rather magical.  Add a bit of documentation.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 include/linux/syscalls.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Linus Torvalds Jan. 28, 2018, 7:15 p.m. UTC | #1
On Sun, Jan 28, 2018 at 10:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
> __SYSCALL_DEFINE is rather magical.  Add a bit of documentation.

Ack.

Is that "long long" part of the example on purpose? Because that's
likely the only really nasty part about any ptregs wrapper: some
arguments aren't _one_ register, they are two. And "long long" is the
simplest example, even though in practice the type is most often
"loff_t".

You won't see this on 64-bit architectures, but it's visible on 32-bit ones.

We may have to do wrappers for those, and error out for 'long long'.
We already do that for some cases, for compat system calls. See for
example

COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
                const struct compat_iovec __user *,vec,
                compat_ulong_t, vlen, u32, pos_low, u32, pos_high)
{
        loff_t pos = ((loff_t)pos_high << 32) | pos_low;

        return do_compat_preadv64(fd, vec, vlen, pos, 0);
}

where we have the issue of a 64-bit value being split over two
registers even on 64-bit, due to it being a compat interface for 32
bit.

But if we pick up the values by hand from ptregs in a wrapper, we'll
have this issue even for native calls on 32-bit.

             Linus
Ingo Molnar Jan. 28, 2018, 7:21 p.m. UTC | #2
* Andy Lutomirski <luto@kernel.org> wrote:

> __SYSCALL_DEFINE is rather magical.  Add a bit of documentation.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  include/linux/syscalls.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a78186d826d7..d3f244a447c5 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -207,6 +207,16 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
>  	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
>  
>  #define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
> +
> +/*
> + * For a syscall like long foo(void *a, long long b), this defines:
> + *
> + * static inline long SYSC_foo(void *a, long long b): the actual code
> + *
> + * asmlinkage long SyS_foo(long a, long long b): wrapper that calls SYSC_foo
> + *
> + * asmlinkage long sys_foo(void *a, long long b): alias of SyS_foo
> + */

Nit, would it be more readable to write this as pseudo-code, i.e. something like:

/*
 * For a syscall like long foo(void *a, long long b), this defines:
 *
 *  static inline long SYSC_foo(void *a, long long b) { /* the actual code following */ }
 *
 *  asmlinkage long SyS_foo(long a, long long b) { /* wrapper that calls SYSC_foo() */ }
 *  asmlinkage long sys_foo(void *a, long long b); /* as GCC alias of SyS_foo() */
 */

Also, I wanted to suggest to also document that in practice the three methods map 
to the very same function in the end, with the SYSC_ variant being eliminated due 
to inlining - but when double checking an x86-64 defconfig that does not appear to 
be so for all system calls:

 triton:~/tip> grep accept4 System.map 
 ffffffff8170d710 t SYSC_accept4
 ffffffff8170f940 T SyS_accept4
 ffffffff8170f940 T sys_accept4

While for others there's just 2:

 triton:~/tip> grep sched_getattr System.map 
 ffffffff8107b9a0 T SyS_sched_getattr
 ffffffff8107b9a0 T sys_sched_getattr

The only difference appears to be that accept4() is called internally within the 
kernel, by socketcall:

 SYSCALL_DEFINE3(accept, int, fd, struct sockaddr __user *, upeer_sockaddr,
                 int __user *, upeer_addrlen)
 {
         return sys_accept4(fd, upeer_sockaddr, upeer_addrlen, 0);
 }

But why does that result in SYSC_accept4() being a different symbol?

The difference between the two appears to be rather dumb as well:

 ffffffff8170f940 <SyS_accept4>:
 ffffffff8170f940:       e9 cb dd ff ff          jmpq   ffffffff8170d710 <SYSC_accept4>

Using GCC 7.2.0.

What am I missing?

Thanks,

	Ingo
Al Viro Jan. 28, 2018, 8:21 p.m. UTC | #3
On Sun, Jan 28, 2018 at 11:15:15AM -0800, Linus Torvalds wrote:
> Is that "long long" part of the example on purpose? Because that's
> likely the only really nasty part about any ptregs wrapper: some
> arguments aren't _one_ register, they are two. And "long long" is the
> simplest example, even though in practice the type is most often
> "loff_t".
> 
> You won't see this on 64-bit architectures, but it's visible on 32-bit ones.
> 
> We may have to do wrappers for those, and error out for 'long long'.
> We already do that for some cases, for compat system calls. See for
> example
> 
> COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
>                 const struct compat_iovec __user *,vec,
>                 compat_ulong_t, vlen, u32, pos_low, u32, pos_high)
> {
>         loff_t pos = ((loff_t)pos_high << 32) | pos_low;
> 
>         return do_compat_preadv64(fd, vec, vlen, pos, 0);
> }
> 
> where we have the issue of a 64-bit value being split over two
> registers even on 64-bit, due to it being a compat interface for 32
> bit.
> 
> But if we pick up the values by hand from ptregs in a wrapper, we'll
> have this issue even for native calls on 32-bit.

... and have more of that logics arch-dependent than one might expect;
it's *not* just "split each 64bit argument into a pair of 32bit ones,
combine in the body".  I tried to do something to reduce the amount of
remaining wrappers several years ago.  Trying to automate that gets
very ugly very fast - there are architectures like mips where you get
alignment requirements ($6/$7 can hold that, $5/$6 can't), creating
padding arguments, etc.

FWIW, that affects
	lookup_dcookie() (64,32,32)
	truncate64(), ftruncate64() (32,64)
	fadvise64_64(), sync_file_range() (32,64,64,32)
	readahead() (32,64,32)
	fadvise64() (32,64,32,32)
	fallocate(), sync_file_range2() (32,32,64,64)
	fanotify_mark() (32,32,64,32,32)
	pread64(), pwrite64() (32,32,32,64)

Giving each of those a compat wrapper of its own is already annoying
(looking at that again, we should at least unify pread64() and pwrite64()
compat wrappers - grep for sys_pread64 and you'll see), but giving
each an explicit wrapper for native ones?  Ouch.
Linus Torvalds Jan. 28, 2018, 8:42 p.m. UTC | #4
On Sun, Jan 28, 2018 at 12:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... and have more of that logics arch-dependent than one might expect;
> it's *not* just "split each 64bit argument into a pair of 32bit ones,
> combine in the body".

Right. The paired registers tend to have special arch-specific rules,
making this particularly painful.

I suspect that the best solution (and here "best" doesn't mean
"pretty", but doable) is to rely on some of the limits we have:

In particular, we don't want to convert all architectures to the
ptregs approach _anyway_, so we need to have a model where the
architecture sets up some config option.

End result: we don't have to handle the generic case, we just have to
have a way for an architecture to set up for this. And for a _first_
implementation, we can simply just limit things to x86-64 (and then
maybe arm64) which don't even have this issue.

Longer-term, the other thing that helps us is that we have already
limited our calling convention to a maximum of six registers (NOTE!
Not six _arguments_).

So even when we want to handle the 32-bit case, the combinatorial
space is at least fairly limited, and we can make the compiler notice
them (ie the ptregs accessor case can simply be made to error out if
you try to access a 64-bit type from one single register).

End result: we will need the per-architecture macros that access
arguments from the 'struct ptregs', so we'll have some macro for
"argument1(ptregs)".

The 64-bit argument for 32-bit case would end up having to have a few
more of those architecture-specific oddities. So not just
"argument1(ptregs)", but "argument2_32_64(ptregs)" or something that
says "get me argument 2, when the first argument is 32-bit and I want
a 64-bit one".

So _if_ a 32-bit architecture wants this, such an architecture would
probably have to implement 10-20 of those odd special cases. They'll
be ugly and nasty and have little sanity to them, but they'd be simple
one-liners.

So hacky. But not necessarily even needed (I really don't think the
kind of people who care about paranoid security are going to run
32-bit kernels), and if people want to, not _complicated_, just
annoying.

                    Linus
Al Viro Jan. 28, 2018, 10:50 p.m. UTC | #5
On Sun, Jan 28, 2018 at 12:42:24PM -0800, Linus Torvalds wrote:

> The 64-bit argument for 32-bit case would end up having to have a few
> more of those architecture-specific oddities. So not just
> "argument1(ptregs)", but "argument2_32_64(ptregs)" or something that
> says "get me argument 2, when the first argument is 32-bit and I want
> a 64-bit one".

Yeah, but...  You either get to give SYSCALL_DEFINE more than just
the number of arguments (SYSCALL_DEFINE_WDDW) or you need to go
for rather grotty macros to pick that information.  That's pretty
much what I'd tried; it hadn't been fun at all...
Al Viro Jan. 29, 2018, 6:22 a.m. UTC | #6
On Sun, Jan 28, 2018 at 10:50:31PM +0000, Al Viro wrote:
> On Sun, Jan 28, 2018 at 12:42:24PM -0800, Linus Torvalds wrote:
> 
> > The 64-bit argument for 32-bit case would end up having to have a few
> > more of those architecture-specific oddities. So not just
> > "argument1(ptregs)", but "argument2_32_64(ptregs)" or something that
> > says "get me argument 2, when the first argument is 32-bit and I want
> > a 64-bit one".
> 
> Yeah, but...  You either get to give SYSCALL_DEFINE more than just
> the number of arguments (SYSCALL_DEFINE_WDDW) or you need to go
> for rather grotty macros to pick that information.  That's pretty
> much what I'd tried; it hadn't been fun at all...

FWIW, going through the notes from back then (with some personal
comments censored - parts of that were definitely in the actionable
territory):

------

* s390 aside, the headache comes from combination of calling conventions
for 64bit arguments on 32bit and [speculation regarding libc maintainers
qualities]

* All architectures in question[1] treat syscall arguments as either
32bit (arith types <= 32bit, pointers) or 64bit.  Prototypical case
is f(int a1, int a2, ....); let L1, L2, ... be the sequence of locations
used to pass them (might be GPR, might be stack locations).  Anything
that doesn't involve long long uses the same sequence; in theory, we could
have different sets of registers for e.g. pointers and integers, but nothing
we care about seems to be doing that.

* wrt 64bit ints there are two variants: one is to simply treat them as pair
of 32bit ones (i.e. take the next two elements of the sequence), another is
to skip an element and then use the next two.  Some architectures always
go for the first variant; x86, s390 and, surprisingly, sparc are that way.
arm, mips, parisc, ppc and tile go for the second variant when odd number
of 32bit values had been passed so far.

* argument passing for syscalls is something almost, but not entirely different.
First of all, we don't want to pass them on stack (no surprise); mips o32
ABI is trouble in that respect, everything else manages to use registers
(so do other mips ABI variants).  Next, we are generally limited to 6 words.
No worries, right?  We don't have syscalls with more than 6 arguments, and
ones with 64bit args still fit into 32*6 bits.  Too fucking bad - akpm has
introduced int sys_fadvise64(int fd, loff_t offset, size_t len, int advice)
and then topped it with
long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
Note that this sucker already has 32*6 bits total *AND* 64bit argument in odd
position.  arm, mips and ppc folks were not amused (IIRC, rmk got downright
sarcastic at the time; not quite Patrician-level, what with the lack of
resources, but...)
That had been compounded by sync_file_range(2), with identical braind^WAPI.
The latter got a saner variant (sync_file_range2(2)) and newer architectures
should take that.  fadvise64_64(2) has not.  BTW, that had been a headache
for other 32bit architectures as well - at least c6x and metag are in the
same boat.  Different solutions with that one - some split those 64bit into
32bit on C level and repackage them into 64bit in stubs, some reorder the
arguments so that 64bit ones are at good offsets.

* for syscalls like pread64/pwrite64, the situation is less dire.  Some still
pass the misaligned 64bit arg as a pair of C-level 32bit ones, some accept
the padding.

* to make things even more interesting, libc (all of them) pass a bunch of
64bit args as explicit pairs.  Which creates no end of amusing situations -
will this argument of this syscall end up with LSB first?  Last?  According
to endianness?  Opposite?  Different rules for different architectures?
Onna stick.  Inna bun.  And that's cuttin' me own throat... [speculations
regarding various habits of libc maintainers]

* it might be possible to teach COMPAT_SYSCALL_DEFINE to insert padding and
combine the halves.  Cost: collecting the information about the number of
words passed so far; messy, but maybe I'm missing some clever solution.
However, that doesn't do a damn thing for libc-inflicted idiocies, so it
might or might not be worth it.  In some cases the mapping from what
libc is feeding to the kernel to actual long long passed from the glue
into C side of things is just too fucking irregular[2].

------

[1] i.e. 32bit side of biarch ones; I was interested in COMPAT_SYSCALL_DEFINE
back then.
[2] looking at that now, parisc has fanotify_mark() passing halves of 64bit
arg in the order opposite to that for truncate64().  On the same parisc.
From the same libc.  And then there's lookup_dcookie(), which might or
might not be correct there.

Patch
diff mbox

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d826d7..d3f244a447c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -207,6 +207,16 @@  static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
 	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
 
 #define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
+
+/*
+ * For a syscall like long foo(void *a, long long b), this defines:
+ *
+ * static inline long SYSC_foo(void *a, long long b): the actual code
+ *
+ * asmlinkage long SyS_foo(long a, long long b): wrapper that calls SYSC_foo
+ *
+ * asmlinkage long sys_foo(void *a, long long b): alias of SyS_foo
+ */
 #define __SYSCALL_DEFINEx(x, name, ...)					\
 	asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))	\
 		__attribute__((alias(__stringify(SyS##name))));		\