Message ID | 4dd5a4d0f6b694f17eaf64c80c36a2560993b9ea.1517164461.git.luto@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Jan 28, 2018 at 10:38 AM, Andy Lutomirski <luto@kernel.org> wrote:
> __SYSCALL_DEFINE is rather magical. Add a bit of documentation.
Ack.
Is that "long long" part of the example on purpose? Because that's
likely the only really nasty part about any ptregs wrapper: some
arguments aren't _one_ register, they are two. And "long long" is the
simplest example, even though in practice the type is most often
"loff_t".
You won't see this on 64-bit architectures, but it's visible on 32-bit ones.
We may have to do wrappers for those, and error out for 'long long'.
We already do that for some cases, for compat system calls. See for
example
COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
const struct compat_iovec __user *,vec,
compat_ulong_t, vlen, u32, pos_low, u32, pos_high)
{
loff_t pos = ((loff_t)pos_high << 32) | pos_low;
return do_compat_preadv64(fd, vec, vlen, pos, 0);
}
where we have the issue of a 64-bit value being split over two
registers even on 64-bit, due to it being a compat interface for 32
bit.
But if we pick up the values by hand from ptregs in a wrapper, we'll
have this issue even for native calls on 32-bit.
Linus
* Andy Lutomirski <luto@kernel.org> wrote: > __SYSCALL_DEFINE is rather magical. Add a bit of documentation. > > Signed-off-by: Andy Lutomirski <luto@kernel.org> > --- > include/linux/syscalls.h | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index a78186d826d7..d3f244a447c5 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -207,6 +207,16 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event) > __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) > > #define __PROTECT(...) asmlinkage_protect(__VA_ARGS__) > + > +/* > + * For a syscall like long foo(void *a, long long b), this defines: > + * > + * static inline long SYSC_foo(void *a, long long b): the actual code > + * > + * asmlinkage long SyS_foo(long a, long long b): wrapper that calls SYSC_foo > + * > + * asmlinkage long sys_foo(void *a, long long b): alias of SyS_foo > + */ Nit, would it be more readable to write this as pseudo-code, i.e. something like: /* * For a syscall like long foo(void *a, long long b), this defines: * * static inline long SYSC_foo(void *a, long long b) { /* the actual code following */ } * * asmlinkage long SyS_foo(long a, long long b) { /* wrapper that calls SYSC_foo() */ } * asmlinkage long sys_foo(void *a, long long b); /* as GCC alias of SyS_foo() */ */ Also, I wanted to suggest to also document that in practice the three methods map to the very same function in the end, with the SYSC_ variant being eliminated due to inlining - but when double checking an x86-64 defconfig that does not appear to be so for all system calls: triton:~/tip> grep accept4 System.map ffffffff8170d710 t SYSC_accept4 ffffffff8170f940 T SyS_accept4 ffffffff8170f940 T sys_accept4 While for others there's just 2: triton:~/tip> grep sched_getattr System.map ffffffff8107b9a0 T SyS_sched_getattr ffffffff8107b9a0 T sys_sched_getattr The only difference appears to be that accept4() is called internally within the kernel, by socketcall: SYSCALL_DEFINE3(accept, int, fd, struct sockaddr __user *, upeer_sockaddr, int __user *, upeer_addrlen) { return sys_accept4(fd, upeer_sockaddr, upeer_addrlen, 0); } But why does that result in SYSC_accept4() being a different symbol? The difference between the two appears to be rather dumb as well: ffffffff8170f940 <SyS_accept4>: ffffffff8170f940: e9 cb dd ff ff jmpq ffffffff8170d710 <SYSC_accept4> Using GCC 7.2.0. What am I missing? Thanks, Ingo
On Sun, Jan 28, 2018 at 11:15:15AM -0800, Linus Torvalds wrote: > Is that "long long" part of the example on purpose? Because that's > likely the only really nasty part about any ptregs wrapper: some > arguments aren't _one_ register, they are two. And "long long" is the > simplest example, even though in practice the type is most often > "loff_t". > > You won't see this on 64-bit architectures, but it's visible on 32-bit ones. > > We may have to do wrappers for those, and error out for 'long long'. > We already do that for some cases, for compat system calls. See for > example > > COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd, > const struct compat_iovec __user *,vec, > compat_ulong_t, vlen, u32, pos_low, u32, pos_high) > { > loff_t pos = ((loff_t)pos_high << 32) | pos_low; > > return do_compat_preadv64(fd, vec, vlen, pos, 0); > } > > where we have the issue of a 64-bit value being split over two > registers even on 64-bit, due to it being a compat interface for 32 > bit. > > But if we pick up the values by hand from ptregs in a wrapper, we'll > have this issue even for native calls on 32-bit. ... and have more of that logics arch-dependent than one might expect; it's *not* just "split each 64bit argument into a pair of 32bit ones, combine in the body". I tried to do something to reduce the amount of remaining wrappers several years ago. Trying to automate that gets very ugly very fast - there are architectures like mips where you get alignment requirements ($6/$7 can hold that, $5/$6 can't), creating padding arguments, etc. FWIW, that affects lookup_dcookie() (64,32,32) truncate64(), ftruncate64() (32,64) fadvise64_64(), sync_file_range() (32,64,64,32) readahead() (32,64,32) fadvise64() (32,64,32,32) fallocate(), sync_file_range2() (32,32,64,64) fanotify_mark() (32,32,64,32,32) pread64(), pwrite64() (32,32,32,64) Giving each of those a compat wrapper of its own is already annoying (looking at that again, we should at least unify pread64() and pwrite64() compat wrappers - grep for sys_pread64 and you'll see), but giving each an explicit wrapper for native ones? Ouch.
On Sun, Jan 28, 2018 at 12:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > ... and have more of that logics arch-dependent than one might expect; > it's *not* just "split each 64bit argument into a pair of 32bit ones, > combine in the body". Right. The paired registers tend to have special arch-specific rules, making this particularly painful. I suspect that the best solution (and here "best" doesn't mean "pretty", but doable) is to rely on some of the limits we have: In particular, we don't want to convert all architectures to the ptregs approach _anyway_, so we need to have a model where the architecture sets up some config option. End result: we don't have to handle the generic case, we just have to have a way for an architecture to set up for this. And for a _first_ implementation, we can simply just limit things to x86-64 (and then maybe arm64) which don't even have this issue. Longer-term, the other thing that helps us is that we have already limited our calling convention to a maximum of six registers (NOTE! Not six _arguments_). So even when we want to handle the 32-bit case, the combinatorial space is at least fairly limited, and we can make the compiler notice them (ie the ptregs accessor case can simply be made to error out if you try to access a 64-bit type from one single register). End result: we will need the per-architecture macros that access arguments from the 'struct ptregs', so we'll have some macro for "argument1(ptregs)". The 64-bit argument for 32-bit case would end up having to have a few more of those architecture-specific oddities. So not just "argument1(ptregs)", but "argument2_32_64(ptregs)" or something that says "get me argument 2, when the first argument is 32-bit and I want a 64-bit one". So _if_ a 32-bit architecture wants this, such an architecture would probably have to implement 10-20 of those odd special cases. They'll be ugly and nasty and have little sanity to them, but they'd be simple one-liners. So hacky. But not necessarily even needed (I really don't think the kind of people who care about paranoid security are going to run 32-bit kernels), and if people want to, not _complicated_, just annoying. Linus
On Sun, Jan 28, 2018 at 12:42:24PM -0800, Linus Torvalds wrote: > The 64-bit argument for 32-bit case would end up having to have a few > more of those architecture-specific oddities. So not just > "argument1(ptregs)", but "argument2_32_64(ptregs)" or something that > says "get me argument 2, when the first argument is 32-bit and I want > a 64-bit one". Yeah, but... You either get to give SYSCALL_DEFINE more than just the number of arguments (SYSCALL_DEFINE_WDDW) or you need to go for rather grotty macros to pick that information. That's pretty much what I'd tried; it hadn't been fun at all...
On Sun, Jan 28, 2018 at 10:50:31PM +0000, Al Viro wrote: > On Sun, Jan 28, 2018 at 12:42:24PM -0800, Linus Torvalds wrote: > > > The 64-bit argument for 32-bit case would end up having to have a few > > more of those architecture-specific oddities. So not just > > "argument1(ptregs)", but "argument2_32_64(ptregs)" or something that > > says "get me argument 2, when the first argument is 32-bit and I want > > a 64-bit one". > > Yeah, but... You either get to give SYSCALL_DEFINE more than just > the number of arguments (SYSCALL_DEFINE_WDDW) or you need to go > for rather grotty macros to pick that information. That's pretty > much what I'd tried; it hadn't been fun at all... FWIW, going through the notes from back then (with some personal comments censored - parts of that were definitely in the actionable territory): ------ * s390 aside, the headache comes from combination of calling conventions for 64bit arguments on 32bit and [speculation regarding libc maintainers qualities] * All architectures in question[1] treat syscall arguments as either 32bit (arith types <= 32bit, pointers) or 64bit. Prototypical case is f(int a1, int a2, ....); let L1, L2, ... be the sequence of locations used to pass them (might be GPR, might be stack locations). Anything that doesn't involve long long uses the same sequence; in theory, we could have different sets of registers for e.g. pointers and integers, but nothing we care about seems to be doing that. * wrt 64bit ints there are two variants: one is to simply treat them as pair of 32bit ones (i.e. take the next two elements of the sequence), another is to skip an element and then use the next two. Some architectures always go for the first variant; x86, s390 and, surprisingly, sparc are that way. arm, mips, parisc, ppc and tile go for the second variant when odd number of 32bit values had been passed so far. * argument passing for syscalls is something almost, but not entirely different. First of all, we don't want to pass them on stack (no surprise); mips o32 ABI is trouble in that respect, everything else manages to use registers (so do other mips ABI variants). Next, we are generally limited to 6 words. No worries, right? We don't have syscalls with more than 6 arguments, and ones with 64bit args still fit into 32*6 bits. Too fucking bad - akpm has introduced int sys_fadvise64(int fd, loff_t offset, size_t len, int advice) and then topped it with long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice) Note that this sucker already has 32*6 bits total *AND* 64bit argument in odd position. arm, mips and ppc folks were not amused (IIRC, rmk got downright sarcastic at the time; not quite Patrician-level, what with the lack of resources, but...) That had been compounded by sync_file_range(2), with identical braind^WAPI. The latter got a saner variant (sync_file_range2(2)) and newer architectures should take that. fadvise64_64(2) has not. BTW, that had been a headache for other 32bit architectures as well - at least c6x and metag are in the same boat. Different solutions with that one - some split those 64bit into 32bit on C level and repackage them into 64bit in stubs, some reorder the arguments so that 64bit ones are at good offsets. * for syscalls like pread64/pwrite64, the situation is less dire. Some still pass the misaligned 64bit arg as a pair of C-level 32bit ones, some accept the padding. * to make things even more interesting, libc (all of them) pass a bunch of 64bit args as explicit pairs. Which creates no end of amusing situations - will this argument of this syscall end up with LSB first? Last? According to endianness? Opposite? Different rules for different architectures? Onna stick. Inna bun. And that's cuttin' me own throat... [speculations regarding various habits of libc maintainers] * it might be possible to teach COMPAT_SYSCALL_DEFINE to insert padding and combine the halves. Cost: collecting the information about the number of words passed so far; messy, but maybe I'm missing some clever solution. However, that doesn't do a damn thing for libc-inflicted idiocies, so it might or might not be worth it. In some cases the mapping from what libc is feeding to the kernel to actual long long passed from the glue into C side of things is just too fucking irregular[2]. ------ [1] i.e. 32bit side of biarch ones; I was interested in COMPAT_SYSCALL_DEFINE back then. [2] looking at that now, parisc has fanotify_mark() passing halves of 64bit arg in the order opposite to that for truncate64(). On the same parisc. From the same libc. And then there's lookup_dcookie(), which might or might not be correct there.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a78186d826d7..d3f244a447c5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -207,6 +207,16 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event) __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) #define __PROTECT(...) asmlinkage_protect(__VA_ARGS__) + +/* + * For a syscall like long foo(void *a, long long b), this defines: + * + * static inline long SYSC_foo(void *a, long long b): the actual code + * + * asmlinkage long SyS_foo(long a, long long b): wrapper that calls SYSC_foo + * + * asmlinkage long sys_foo(void *a, long long b): alias of SyS_foo + */ #define __SYSCALL_DEFINEx(x, name, ...) \ asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \ __attribute__((alias(__stringify(SyS##name)))); \
__SYSCALL_DEFINE is rather magical. Add a bit of documentation. Signed-off-by: Andy Lutomirski <luto@kernel.org> --- include/linux/syscalls.h | 10 ++++++++++ 1 file changed, 10 insertions(+)