Message ID | 1464048292-30136-2-git-send-email-ynorov@caviumnetworks.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
From: Yury Norov <ynorov@caviumnetworks.com> Date: Tue, 24 May 2016 03:04:30 +0300 > +To clear that top halves, automatic wrappers are introduced. They clear all > +required registers before passing control to regular syscall handler. Why have one of these for every single compat system call, rather than simply clearing the top half of all of these registers unconditionally in the 32-bit system call trap before the system call is invoked? That's what we do on sparc64. And with that, you only need wrappers for the case where there needs to be proper sign extention of a 32-bit signed argument.
On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote: > From: Yury Norov <ynorov@caviumnetworks.com> > Date: Tue, 24 May 2016 03:04:30 +0300 > > > +To clear that top halves, automatic wrappers are introduced. They clear all > > +required registers before passing control to regular syscall handler. > > Why have one of these for every single compat system call, rather than > simply clearing the top half of all of these registers unconditionally > in the 32-bit system call trap before the system call is invoked? > > That's what we do on sparc64. > > And with that, you only need wrappers for the case where there needs > to be proper sign extention of a 32-bit signed argument. It was discussed as one of possible solutions. The downside of it is that we cannot pass 64-bit types (like off_t) in single register. The other downside is that we clear top halves for every single syscall, and it looks excessive. So, from spark64 and s390 approaches we choosed second.
From: Yury Norov <ynorov@caviumnetworks.com> Date: Wed, 25 May 2016 23:03:27 +0300 > On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote: >> From: Yury Norov <ynorov@caviumnetworks.com> >> Date: Tue, 24 May 2016 03:04:30 +0300 >> >> > +To clear that top halves, automatic wrappers are introduced. They clear all >> > +required registers before passing control to regular syscall handler. >> >> Why have one of these for every single compat system call, rather than >> simply clearing the top half of all of these registers unconditionally >> in the 32-bit system call trap before the system call is invoked? >> >> That's what we do on sparc64. >> >> And with that, you only need wrappers for the case where there needs >> to be proper sign extention of a 32-bit signed argument. > > It was discussed as one of possible solutions. The downside of it is > that we cannot pass 64-bit types (like off_t) in single register. Wrappers can be added for the cases where you'd like to do that. > The other downside is that we clear top halves for every single > syscall, and it looks excessive. So, from spark64 and s390 approaches > we choosed second. It's like 4 cpu cycles even on crappy sparc64 cpus which only dual issue. :) And that's a pretty low cost for the benefits if you ask me.
On Wednesday, May 25, 2016 1:21:45 PM CEST David Miller wrote: > From: Yury Norov <ynorov@caviumnetworks.com> > Date: Wed, 25 May 2016 23:03:27 +0300 > > > On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote: > >> From: Yury Norov <ynorov@caviumnetworks.com> > >> Date: Tue, 24 May 2016 03:04:30 +0300 > >> > >> > +To clear that top halves, automatic wrappers are introduced. They clear all > >> > +required registers before passing control to regular syscall handler. > >> > >> Why have one of these for every single compat system call, rather than > >> simply clearing the top half of all of these registers unconditionally > >> in the 32-bit system call trap before the system call is invoked? > >> > >> That's what we do on sparc64. > >> > >> And with that, you only need wrappers for the case where there needs > >> to be proper sign extention of a 32-bit signed argument. > > > > It was discussed as one of possible solutions. The downside of it is > > that we cannot pass 64-bit types (like off_t) in single register. > > Wrappers can be added for the cases where you'd like to do that. If we clear the upper halves on the initial entry, we can't use a wrapper to restore them, so would have to instead pass them as register pairs as we do on the other 32-bit architectures. > > The other downside is that we clear top halves for every single > > syscall, and it looks excessive. So, from spark64 and s390 approaches > > we choosed second. > > It's like 4 cpu cycles even on crappy sparc64 cpus which only dual > issue. :) > > And that's a pretty low cost for the benefits if you ask me. To clarify what we are talking about: These syscalls that normally pass 64-bit arguments as register pairs are intentionally overridden to make them faster on ilp32 mode compare to other compat modes: +#define compat_sys_fadvise64_64 sys_fadvise64_64 +#define compat_sys_fallocate sys_fallocate +#define compat_sys_ftruncate64 sys_ftruncate +#define compat_sys_lookup_dcookie sys_lookup_dcookie +#define compat_sys_readahead sys_readahead +#define compat_sys_sync_file_range sys_sync_file_range +#define compat_sys_truncate64 sys_truncate +#define sys_llseek sys_lseek +static unsigned long compat_sys_pread64(unsigned int fd, + compat_uptr_t __user *ubuf, compat_size_t count, off_t offset) +{ + return sys_pread64(fd, (char *) ubuf, count, offset); +} + +static unsigned long compat_sys_pwrite64(unsigned int fd, + compat_uptr_t __user *ubuf, compat_size_t count, off_t offset) +{ + return sys_pwrite64(fd, (char *) ubuf, count, offset); +} If we use the normal calling conventions, we could remove these overrides along with the respective special-case handling in glibc. None of them look particularly performance-sensitive, but I could be wrong there. Arnd
From: Arnd Bergmann <arnd@arndb.de> Date: Wed, 25 May 2016 22:47:33 +0200 > If we use the normal calling conventions, we could remove these overrides > along with the respective special-case handling in glibc. None of them > look particularly performance-sensitive, but I could be wrong there. You could set the lowest bit in the system call entry pointer to indicate the upper-half clears should be elided.
On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote: > From: Arnd Bergmann <arnd@arndb.de> > Date: Wed, 25 May 2016 22:47:33 +0200 > > > If we use the normal calling conventions, we could remove these overrides > > along with the respective special-case handling in glibc. None of them > > look particularly performance-sensitive, but I could be wrong there. > > You could set the lowest bit in the system call entry pointer to indicate > the upper-half clears should be elided. Right, but that would introduce an extra conditional branch in the syscall hotpath, and likely eliminate the gains from passing the loff_t arguments in a single register instead of a pair. Arnd
From: Arnd Bergmann <arnd@arndb.de> Date: Wed, 25 May 2016 23:01:06 +0200 > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote: >> From: Arnd Bergmann <arnd@arndb.de> >> Date: Wed, 25 May 2016 22:47:33 +0200 >> >> > If we use the normal calling conventions, we could remove these overrides >> > along with the respective special-case handling in glibc. None of them >> > look particularly performance-sensitive, but I could be wrong there. >> >> You could set the lowest bit in the system call entry pointer to indicate >> the upper-half clears should be elided. > > Right, but that would introduce an extra conditional branch in the syscall > hotpath, and likely eliminate the gains from passing the loff_t arguments > in a single register instead of a pair. Ok, then, how much are you really gaining from avoiding a 'shift' and an 'or' to build the full 64-bit value? 3 cycles? Maybe 4? And the executing the wrappers, those have a non-trivial cost too. Cost wise, this seems like it all cancels out in the end, but what do I know?
On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote: > From: Arnd Bergmann <arnd@arndb.de> > Date: Wed, 25 May 2016 23:01:06 +0200 > > > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote: > >> From: Arnd Bergmann <arnd@arndb.de> > >> Date: Wed, 25 May 2016 22:47:33 +0200 > >> > >> > If we use the normal calling conventions, we could remove these overrides > >> > along with the respective special-case handling in glibc. None of them > >> > look particularly performance-sensitive, but I could be wrong there. > >> > >> You could set the lowest bit in the system call entry pointer to indicate > >> the upper-half clears should be elided. > > > > Right, but that would introduce an extra conditional branch in the syscall > > hotpath, and likely eliminate the gains from passing the loff_t arguments > > in a single register instead of a pair. > > Ok, then, how much are you really gaining from avoiding a 'shift' and > an 'or' to build the full 64-bit value? 3 cycles? Maybe 4? It's possible a few more cycles overall. Whether this is noticeable, I can't really tell without some benchmarks (e.g. a getpid wrapper zeroing top 32-bit of all possible 6 arguments, called in a loop). On arm64 with ILP32 we have three types of syscalls w.r.t. parameter width (I guess that's true for all other compat implementations): 1. User syscall definition with 32-bit arguments, kernel handling 32-bit arguments 2. User 32-bit arguments, kernel 64-bit arguments 3. User 64-bit arguments, kernel 64-bit arguments For (1), the AArch64 ABI (AAPCS) allows us to ignore the garbage in the top 32-bit of a 64-bit register as long as the callee has 32-bit arguments (IOW, the generated code will use 32-git Wn instead of 64-bit Xn registers). In this case, zeroing the top 32-bit of all 6 arguments is unnecessary. In the 2nd case, we need sign or zero extension of 32-bit arguments. For sign extension we would still need a wrapper as the generic one can only zero-extend without knowing the underlying type. How many cases do we have where sign extension is required (off_t is a signed type but does it actually make sense as a negative value)? The __SC_WRAP and COMPAT_SYSCALL_WRAP macros introduced by patches 3-5 in this series handle such conversion for both sign and unsigned arguments. We don't have such problem with AArch32 tasks since the architecture guarantees zeroing or preserving the top half of all registers. For (3), with the current ILP32 approach we wouldn't need any wrapper. If we are to pass the argument as two 32-bit values, we would need both the user (glibc) to split the argument and the kernel to re-construct it. This would be in addition to any default top 32-bit zeroing on kernel entry. The overhead may be lost in the noise (we need some data) but IIRC our decision was mostly based on a cleaner user implementation for point (3) above. Since an AArch64/ILP32 process can freely use 64-bit registers, we found it nicer to be able to pass such value directly to the kernel. Reusing the s390 macros should reduce the amount of new code added to the kernel. While writing the above, I realised the current ILP32 patches still miss on converting pointers passed from user space (unless I got myself confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx() macros take care of zero or sign extension via __SC_COMPAT_CAST(). However, we have two more existing cases which I don't see covered: a) Native syscalls taking a pointer argument and invoked directly from ILP32. For example, sys_read() takes a pointer but I don't see any __SC_WRAP added by patch 5 b) Current compat syscalls taking a pointer argument. For example, compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes it is a 64-bit variable. I don't see where the upper half is zeroed We can solve (a) by adding more __SC_WRAP annotations in the generic unistd.h. For (b), we would need an __SC_DELOUSE with a bit of penalty on AArch32/compat support where it isn't needed. So maybe davem has a point on the overall impact of always zeroing the upper half of the arguments ;) (both from a performance and maintainability perspective). I guess this part of the ABI is still up for discussion.
On 26/05/16 15:20, Catalin Marinas wrote: > While writing the above, I realised the current ILP32 patches still miss > on converting pointers passed from user space (unless I got myself > confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx() > macros take care of zero or sign extension via __SC_COMPAT_CAST(). > However, we have two more existing cases which I don't see covered: > > a) Native syscalls taking a pointer argument and invoked directly from > ILP32. For example, sys_read() takes a pointer but I don't see any > __SC_WRAP added by patch 5 > > b) Current compat syscalls taking a pointer argument. For example, > compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes > it is a 64-bit variable. I don't see where the upper half is zeroed > on x32 sign/zero extension is currently left to userspace, which is difficult to deal with, (long long)arg does the wrong thing for pointer args. > We can solve (a) by adding more __SC_WRAP annotations in the generic > unistd.h. For (b), we would need an __SC_DELOUSE with a bit of penalty > on AArch32/compat support where it isn't needed. So maybe davem has a > point on the overall impact of always zeroing the upper half of the > arguments ;) (both from a performance and maintainability perspective). > I guess this part of the ABI is still up for discussion. >
On Thu, May 26, 2016 at 03:50:01PM +0100, Szabolcs Nagy wrote: > On 26/05/16 15:20, Catalin Marinas wrote: > > While writing the above, I realised the current ILP32 patches still miss > > on converting pointers passed from user space (unless I got myself > > confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx() > > macros take care of zero or sign extension via __SC_COMPAT_CAST(). > > However, we have two more existing cases which I don't see covered: > > > > a) Native syscalls taking a pointer argument and invoked directly from > > ILP32. For example, sys_read() takes a pointer but I don't see any > > __SC_WRAP added by patch 5 > > > > b) Current compat syscalls taking a pointer argument. For example, > > compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes > > it is a 64-bit variable. I don't see where the upper half is zeroed > > on x32 sign/zero extension is currently left to userspace, > which is difficult to deal with, (long long)arg does the > wrong thing for pointer args. I agree, I don't think we should leave sign/zero extension to user. We should do it in the kernel either in a way similar to s390 (specific __SC_COMPAT_CAST, __SC_DELOUSE) or by always zeroing the arguments upper half on kernel entry with a few additional wrappers (where we have 64-bit arguments or they require sign extension). The latter has the disadvantage of having to split 64-bit arguments in user space while the former adds more maintenance burden to the kernel. I can't comment on performance aspects without some real numbers.
From: Catalin Marinas <catalin.marinas@arm.com> Date: Thu, 26 May 2016 15:20:58 +0100 > We can solve (a) by adding more __SC_WRAP annotations in the generic > unistd.h. ... I really think it's much more robust to clear the tops of the registers by default. Then you won't be auditing constantly and adding more and more wrappers. You can't even quantify the performance gains for me in any precise way. Whatever you gain by avoiding the 64-bit decompostion/reconstitution for those few system calls with 64-bit registers, you are losing by calling the wrappers for more common system calls, more often. "it's more natural to pass 64-bit values in a register" is not a clear justification for this change. This looks way over engineered to me.
On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote: > From: Arnd Bergmann <arnd@arndb.de> > Date: Wed, 25 May 2016 23:01:06 +0200 > > > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote: > >> From: Arnd Bergmann <arnd@arndb.de> > >> Date: Wed, 25 May 2016 22:47:33 +0200 > >> > >> > If we use the normal calling conventions, we could remove these overrides > >> > along with the respective special-case handling in glibc. None of them > >> > look particularly performance-sensitive, but I could be wrong there. > >> > >> You could set the lowest bit in the system call entry pointer to indicate > >> the upper-half clears should be elided. > > > > Right, but that would introduce an extra conditional branch in the syscall > > hotpath, and likely eliminate the gains from passing the loff_t arguments > > in a single register instead of a pair. > > Ok, then, how much are you really gaining from avoiding a 'shift' and > an 'or' to build the full 64-bit value? 3 cycles? Maybe 4? 4 cycles in kernel and ~same cost in glibc to create a pair. And 8 'mov's that exist for every syscall, even yield(). > And the executing the wrappers, those have a non-trivial cost too. The cost is pretty trivial though. See kernel/compat_wrapper.o: COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode); 0: a9bf7bfd stp x29, x30, [sp,#-16]! 4: 910003fd mov x29, sp 8: 2a0003e0 mov w0, w0 c: 94000000 bl 0 <sys_creat> 10: a8c17bfd ldp x29, x30, [sp],#16 14: d65f03c0 ret > Cost wise, this seems like it all cancels out in the end, but what > do I know? I think you know something, and I also think Heiko and other s390 guys know something as well. So I'd like to listen their arguments here. For me spark64 way is looking reasonable only because it's really simple and takes less coding. I'll try it on some branch and share here what happened.
On Thu, May 26, 2016 at 11:48:19PM +0300, Yury Norov wrote: > On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote: > > From: Arnd Bergmann <arnd@arndb.de> > > Date: Wed, 25 May 2016 23:01:06 +0200 > > > > > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote: > > >> From: Arnd Bergmann <arnd@arndb.de> > > >> Date: Wed, 25 May 2016 22:47:33 +0200 > > >> > > >> > If we use the normal calling conventions, we could remove these overrides > > >> > along with the respective special-case handling in glibc. None of them > > >> > look particularly performance-sensitive, but I could be wrong there. > > >> > > >> You could set the lowest bit in the system call entry pointer to indicate > > >> the upper-half clears should be elided. > > > > > > Right, but that would introduce an extra conditional branch in the syscall > > > hotpath, and likely eliminate the gains from passing the loff_t arguments > > > in a single register instead of a pair. > > > > Ok, then, how much are you really gaining from avoiding a 'shift' and > > an 'or' to build the full 64-bit value? 3 cycles? Maybe 4? > > 4 cycles in kernel and ~same cost in glibc to create a pair. It would take a single instruction per argument in the kernel to do shift+or and maybe 1-2 more instructions to move the remaining arguments in place (we do this for a few wrappers in arch/arm64/kernel/entry32.S). And the glibc counterpart. > And 8 'mov's that exist for every syscall, even yield(). > > > And the executing the wrappers, those have a non-trivial cost too. > > The cost is pretty trivial though. See kernel/compat_wrapper.o: > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode); > 0: a9bf7bfd stp x29, x30, [sp,#-16]! > 4: 910003fd mov x29, sp > 8: 2a0003e0 mov w0, w0 > c: 94000000 bl 0 <sys_creat> > 10: a8c17bfd ldp x29, x30, [sp],#16 > 14: d65f03c0 ret I would say the above could be more expensive than 8 movs (16 bytes to write, read, a branch and a ret). You can also add the I-cache locality, having wrappers for each syscalls instead of a single place for zeroing the upper half (where no other wrapper is necessary). Can we trick the compiler into doing a tail call optimisation. This could have simply been: COMPAT_SYSCALL_WRAP2(creat, ...): mov w0, w0 b <sys_creat> > > Cost wise, this seems like it all cancels out in the end, but what > > do I know? > > I think you know something, and I also think Heiko and other s390 guys > know something as well. So I'd like to listen their arguments here. > > For me spark64 way is looking reasonable only because it's really simple > and takes less coding. I'll try it on some branch and share here what happened. The kernel code will definitely look simpler ;). It would be good to see if there actually is any performance impact. Even with 16 more cycles on syscall entry, would they be lost in the noise? You don't need a full implementation, just some dummy mov x0, x0 on the entry path.
On Thu, May 26, 2016 at 11:29:45PM +0100, Catalin Marinas wrote: > On Thu, May 26, 2016 at 11:48:19PM +0300, Yury Norov wrote: > > On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote: > > > From: Arnd Bergmann <arnd@arndb.de> > > > Date: Wed, 25 May 2016 23:01:06 +0200 > > > > > > > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote: > > > >> From: Arnd Bergmann <arnd@arndb.de> > > > >> Date: Wed, 25 May 2016 22:47:33 +0200 > > > >> > > > >> > If we use the normal calling conventions, we could remove these overrides > > > >> > along with the respective special-case handling in glibc. None of them > > > >> > look particularly performance-sensitive, but I could be wrong there. > > > >> > > > >> You could set the lowest bit in the system call entry pointer to indicate > > > >> the upper-half clears should be elided. > > > > > > > > Right, but that would introduce an extra conditional branch in the syscall > > > > hotpath, and likely eliminate the gains from passing the loff_t arguments > > > > in a single register instead of a pair. > > > > > > Ok, then, how much are you really gaining from avoiding a 'shift' and > > > an 'or' to build the full 64-bit value? 3 cycles? Maybe 4? > > > > 4 cycles in kernel and ~same cost in glibc to create a pair. > > It would take a single instruction per argument in the kernel to do > shift+or and maybe 1-2 more instructions to move the remaining arguments > in place (we do this for a few wrappers in arch/arm64/kernel/entry32.S). > And the glibc counterpart. > > > And 8 'mov's that exist for every syscall, even yield(). > > > > > And the executing the wrappers, those have a non-trivial cost too. > > > > The cost is pretty trivial though. See kernel/compat_wrapper.o: > > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode); > > 0: a9bf7bfd stp x29, x30, [sp,#-16]! > > 4: 910003fd mov x29, sp > > 8: 2a0003e0 mov w0, w0 > > c: 94000000 bl 0 <sys_creat> > > 10: a8c17bfd ldp x29, x30, [sp],#16 > > 14: d65f03c0 ret > > I would say the above could be more expensive than 8 movs (16 bytes to > write, read, a branch and a ret). You can also add the I-cache locality, > having wrappers for each syscalls instead of a single place for zeroing > the upper half (where no other wrapper is necessary). > > Can we trick the compiler into doing a tail call optimisation. This > could have simply been: > > COMPAT_SYSCALL_WRAP2(creat, ...): > mov w0, w0 > b <sys_creat> What you talk about was in my initial version. But Heiko insisted on having all wrappers together. http://www.spinics.net/lists/linux-s390/msg11593.html Grep your email for discussion. > > > > Cost wise, this seems like it all cancels out in the end, but what > > > do I know? > > > > I think you know something, and I also think Heiko and other s390 guys > > know something as well. So I'd like to listen their arguments here. > > > > For me spark64 way is looking reasonable only because it's really simple > > and takes less coding. I'll try it on some branch and share here what happened. > > The kernel code will definitely look simpler ;). It would be good to see > if there actually is any performance impact. Even with 16 more cycles on > syscall entry, would they be lost in the noise? You don't need a full > implementation, just some dummy mov x0, x0 on the entry path. > > -- > Catalin
On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote: > From: Yury Norov <ynorov@caviumnetworks.com> > Date: Tue, 24 May 2016 03:04:30 +0300 > > > +To clear that top halves, automatic wrappers are introduced. They clear all > > +required registers before passing control to regular syscall handler. > > Why have one of these for every single compat system call, rather than > simply clearing the top half of all of these registers unconditionally > in the 32-bit system call trap before the system call is invoked? > > That's what we do on sparc64. > > And with that, you only need wrappers for the case where there needs > to be proper sign extention of a 32-bit signed argument. That would probably also work for arm. On s390 we still have these odd 31 bit pointers in compat mode which require us to clear 33 bits instead of 32 bits. That makes up for appr. one third of all system calls. The additional wrappers are only for zero/sign extension, where I count a total of 27 on s390. The reason for doing this in C was the constant copy-paste error rate, when adding new system calls plus I got a rid of a lot of unnecessary asm code.
> > > The cost is pretty trivial though. See kernel/compat_wrapper.o: > > > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode); > > > 0: a9bf7bfd stp x29, x30, [sp,#-16]! > > > 4: 910003fd mov x29, sp > > > 8: 2a0003e0 mov w0, w0 > > > c: 94000000 bl 0 <sys_creat> > > > 10: a8c17bfd ldp x29, x30, [sp],#16 > > > 14: d65f03c0 ret > > > > I would say the above could be more expensive than 8 movs (16 bytes to > > write, read, a branch and a ret). You can also add the I-cache locality, > > having wrappers for each syscalls instead of a single place for zeroing > > the upper half (where no other wrapper is necessary). > > > > Can we trick the compiler into doing a tail call optimisation. This > > could have simply been: > > > > COMPAT_SYSCALL_WRAP2(creat, ...): > > mov w0, w0 > > b <sys_creat> > > What you talk about was in my initial version. But Heiko insisted on having all > wrappers together. > http://www.spinics.net/lists/linux-s390/msg11593.html > > Grep your email for discussion. I think Catalin's question was more about why there is even a stack frame generated. It looks like it is not necessary. I did ask this too a couple of months ago, when we discussed this. > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > do I know? > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > know something as well. So I'd like to listen their arguments here. If it comes to 64 bit arguments for compat system calls: s390 also has an x32-like ABI extension which allows user space to use full 64 bit registers. As far as I know hardly anybody ever made use of that. However even if that would be widely used, to me it wouldn't make sense to add new compat system calls which allow 64 bit arguments, simply because something like c = (u32)a | (u64)b << 32; can be done with a single 1-cycle instruction. It's just not worth the extra effort to maintain additional system call variants.
On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote: > > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > > do I know? > > > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > > know something as well. So I'd like to listen their arguments here. > > If it comes to 64 bit arguments for compat system calls: s390 also has an > x32-like ABI extension which allows user space to use full 64 bit > registers. As far as I know hardly anybody ever made use of that. > > However even if that would be widely used, to me it wouldn't make sense to > add new compat system calls which allow 64 bit arguments, simply because > something like > > c = (u32)a | (u64)b << 32; > > can be done with a single 1-cycle instruction. It's just not worth the > extra effort to maintain additional system call variants. For reference, both tile and mips also have separate 32-bit ABIs that are only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile does it like s390 and passes 64-bit arguments as pairs, while MIPS and x86 and pass them as single registers. Tile is very similar to arm64 because it also uses the generic system call table, which I think is a good argument to keep them in sync. Arnd
On Fri, May 27, 2016 at 08:03:57AM +0200, Heiko Carstens wrote: > > > > The cost is pretty trivial though. See kernel/compat_wrapper.o: > > > > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode); > > > > 0: a9bf7bfd stp x29, x30, [sp,#-16]! > > > > 4: 910003fd mov x29, sp > > > > 8: 2a0003e0 mov w0, w0 > > > > c: 94000000 bl 0 <sys_creat> > > > > 10: a8c17bfd ldp x29, x30, [sp],#16 > > > > 14: d65f03c0 ret > > > > > > I would say the above could be more expensive than 8 movs (16 bytes to > > > write, read, a branch and a ret). You can also add the I-cache locality, > > > having wrappers for each syscalls instead of a single place for zeroing > > > the upper half (where no other wrapper is necessary). > > > > > > Can we trick the compiler into doing a tail call optimisation. This > > > could have simply been: > > > > > > COMPAT_SYSCALL_WRAP2(creat, ...): > > > mov w0, w0 > > > b <sys_creat> > > > > What you talk about was in my initial version. But Heiko insisted on having all > > wrappers together. > > http://www.spinics.net/lists/linux-s390/msg11593.html > > > > Grep your email for discussion. > > I think Catalin's question was more about why there is even a stack frame > generated. It looks like it is not necessary. I did ask this too a couple > of months ago, when we discussed this. Indeed, I was questioning the need for prologue/epilogue, not the use of COMPAT_SYSCALL_WRAPx(). Maybe something like __naked would do. > > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > > do I know? > > > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > > know something as well. So I'd like to listen their arguments here. > > If it comes to 64 bit arguments for compat system calls: s390 also has an > x32-like ABI extension which allows user space to use full 64 bit > registers. As far as I know hardly anybody ever made use of that. > > However even if that would be widely used, to me it wouldn't make sense to > add new compat system calls which allow 64 bit arguments, simply because > something like > > c = (u32)a | (u64)b << 32; > > can be done with a single 1-cycle instruction. It's just not worth the > extra effort to maintain additional system call variants. If we split 64-bit arguments in two, we can go a step further and avoid most of the COMPAT_SYSCALL_WRAPx annotations in favour of a common upper-half zeroing of the argument registers on ILP32 syscall entry. There would be a few exceptions where we need to re-build 64-bit arguments on sign-extend.
On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote: > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote: > > > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > > > do I know? > > > > > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > > > know something as well. So I'd like to listen their arguments here. > > > > If it comes to 64 bit arguments for compat system calls: s390 also has an > > x32-like ABI extension which allows user space to use full 64 bit > > registers. As far as I know hardly anybody ever made use of that. > > > > However even if that would be widely used, to me it wouldn't make sense to > > add new compat system calls which allow 64 bit arguments, simply because > > something like > > > > c = (u32)a | (u64)b << 32; > > > > can be done with a single 1-cycle instruction. It's just not worth the > > extra effort to maintain additional system call variants. > > For reference, both tile and mips also have separate 32-bit ABIs that are > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile > does it like s390 and passes 64-bit arguments as pairs, while MIPS > and x86 and pass them as single registers. AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed by the user when a 32-bit value is passed. We could require the same on AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C libraries on this.
On Thu, May 26, 2016 at 12:43:44PM -0700, David Miller wrote: > From: Catalin Marinas <catalin.marinas@arm.com> > Date: Thu, 26 May 2016 15:20:58 +0100 > > > We can solve (a) by adding more __SC_WRAP annotations in the generic > > unistd.h. > ... > > I really think it's much more robust to clear the tops of the registers > by default. Then you won't be auditing constantly and adding more and > more wrappers. I think we could avoid adding a new __SC_WRAP by redefining __SYSCALL for ILP32 to always invoke a wrapper. But given the wrapper overhead, cache locality, I don't think we would notice any performance difference in either case. > You can't even quantify the performance gains for me in any precise > way. Whatever you gain by avoiding the 64-bit > decompostion/reconstitution for those few system calls with 64-bit > registers, you are losing by calling the wrappers for more common > system calls, more often. I hope Yury can provide some numbers. All being equal, I would go for the lowest code maintenance cost (which is probably less annotations and wrappers). > "it's more natural to pass 64-bit values in a register" is not a clear > justification for this change. It's more related to how we went about the ILP32 ABI. We initially asked for a 64-bit native ABI similar to x32 until the libc-alpha community raised some POSIX compliance issues on time structures. So we decided to go for a 32-bit-like ABI while keeping the syscall interface close to the AArch64/ILP32 procedure calling standard (64-bit values passed in a single register). And now we have this discussion, revisiting this decision again (which is perfectly fine, we better get it right before any merging plans; thanks for your input).
On Friday, May 27, 2016 10:30:52 AM CEST Catalin Marinas wrote: > On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote: > > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote: > > > > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > > > > do I know? > > > > > > > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > > > > know something as well. So I'd like to listen their arguments here. > > > > > > If it comes to 64 bit arguments for compat system calls: s390 also has an > > > x32-like ABI extension which allows user space to use full 64 bit > > > registers. As far as I know hardly anybody ever made use of that. > > > > > > However even if that would be widely used, to me it wouldn't make sense to > > > add new compat system calls which allow 64 bit arguments, simply because > > > something like > > > > > > c = (u32)a | (u64)b << 32; > > > > > > can be done with a single 1-cycle instruction. It's just not worth the > > > extra effort to maintain additional system call variants. > > > > For reference, both tile and mips also have separate 32-bit ABIs that are > > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile > > does it like s390 and passes 64-bit arguments as pairs, while MIPS > > and x86 and pass them as single registers. > > AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed > by the user when a 32-bit value is passed. We could require the same on > AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C > libraries on this. It's not about trusting a C library, it's about ensuring malicious code cannot pass argumentst that the kernel code assumes will never happen. Arnd
On Fri, May 27, 2016 at 12:49:11PM +0200, Arnd Bergmann wrote: > On Friday, May 27, 2016 10:30:52 AM CEST Catalin Marinas wrote: > > On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote: > > > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote: > > > > > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > > > > > do I know? > > > > > > > > > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > > > > > know something as well. So I'd like to listen their arguments here. > > > > > > > > If it comes to 64 bit arguments for compat system calls: s390 also has an > > > > x32-like ABI extension which allows user space to use full 64 bit > > > > registers. As far as I know hardly anybody ever made use of that. > > > > > > > > However even if that would be widely used, to me it wouldn't make sense to > > > > add new compat system calls which allow 64 bit arguments, simply because > > > > something like > > > > > > > > c = (u32)a | (u64)b << 32; > > > > > > > > can be done with a single 1-cycle instruction. It's just not worth the > > > > extra effort to maintain additional system call variants. > > > > > > For reference, both tile and mips also have separate 32-bit ABIs that are > > > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile > > > does it like s390 and passes 64-bit arguments as pairs, while MIPS > > > and x86 and pass them as single registers. > > > > AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed > > by the user when a 32-bit value is passed. We could require the same on > > AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C > > libraries on this. > > It's not about trusting a C library, it's about ensuring malicious code > cannot pass argumentst that the kernel code assumes will never happen. At least for pointers and sizes, we have additional checks in place already, like __access_ok(). Most of the syscalls should be safe since they either go through some compat functions taking 32-bit arguments or are routed to native functions which already need to cope with a full random 64-bit value. On arm64, I think the only risk comes from syscall handlers expecting 32-bit arguments but using 64-bit types. Apart from pointer types, I don't expect this to happen but we could enforce it via a BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)) in __SC_DELOUSE as per the s390 implementation. With ILP32 if we go for 64-bit off_t, those syscalls would be routed directly to the native layer.
On Fri, May 27, 2016 at 02:04:47PM +0100, Catalin Marinas wrote: > On Fri, May 27, 2016 at 12:49:11PM +0200, Arnd Bergmann wrote: > > On Friday, May 27, 2016 10:30:52 AM CEST Catalin Marinas wrote: > > > On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote: > > > > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote: > > > > > > > > > Cost wise, this seems like it all cancels out in the end, but what > > > > > > > > > do I know? > > > > > > > > > > > > > > > > I think you know something, and I also think Heiko and other s390 guys > > > > > > > > know something as well. So I'd like to listen their arguments here. > > > > > > > > > > If it comes to 64 bit arguments for compat system calls: s390 also has an > > > > > x32-like ABI extension which allows user space to use full 64 bit > > > > > registers. As far as I know hardly anybody ever made use of that. > > > > > > > > > > However even if that would be widely used, to me it wouldn't make sense to > > > > > add new compat system calls which allow 64 bit arguments, simply because > > > > > something like > > > > > > > > > > c = (u32)a | (u64)b << 32; > > > > > > > > > > can be done with a single 1-cycle instruction. It's just not worth the > > > > > extra effort to maintain additional system call variants. > > > > > > > > For reference, both tile and mips also have separate 32-bit ABIs that are > > > > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile > > > > does it like s390 and passes 64-bit arguments as pairs, while MIPS > > > > and x86 and pass them as single registers. > > > > > > AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed > > > by the user when a 32-bit value is passed. We could require the same on > > > AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C > > > libraries on this. > > > > It's not about trusting a C library, it's about ensuring malicious code > > cannot pass argumentst that the kernel code assumes will never happen. > > At least for pointers and sizes, we have additional checks in place > already, like __access_ok(). Most of the syscalls should be safe since > they either go through some compat functions taking 32-bit arguments or > are routed to native functions which already need to cope with a full > random 64-bit value. It's not a good idea to rely on current implementation. Implementation may be changed and it's impossible to check each and every patch against register top-halves correctness. > > On arm64, I think the only risk comes from syscall handlers expecting > 32-bit arguments but using 64-bit types. Apart from pointer types, I > don't expect this to happen but we could enforce it via a > BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)) in __SC_DELOUSE as per > the s390 implementation. With ILP32 if we go for 64-bit off_t, those > syscalls would be routed directly to the native layer. > 64-bit off_t doesn't imply we'd rout it directly. At first glance it's looking reasonable but there are other considerations like simplicity and unification with aarch32 that may become more important. That's what David pointed out. So, we have 3 options for now: 1. Clear top halves in entry.S which means we pass off_t as a pair. The cost is performance (didn't measure it yet and doubt about it makes serious impact). The advantage is simplicity and unification with aarch32, as I mentioned above. And David likes it. And it mininizes the amount of changes on glibc side. 2. Clear top halves in in separated file hosted wrappers. 3. Clear top halves in I-cache and tail optimization friendly in-site wrappers. 2 and 3 are the same from ABI point of view. 2 is the worst for me as it is the most complex in implementation and I-cache and tail optimization non-friendly. But Heiko likes it. 3 is what Catalin is talking about, and it was my initial approach. Though I didn't made compiler to do tail optimization, I think we can do it. But 2 is what we have now. And I'd choose it. We'll never get ilp32 done if will roll back previously agreed decisions again and again. Yury. > -- > Catalin
On Fri, May 27, 2016 at 07:58:06PM +0300, Yury Norov wrote: > So, we have 3 options for now: > 1. Clear top halves in entry.S which means we pass off_t as a pair. > The cost is performance (didn't measure it yet and doubt about it > makes serious impact). The advantage is simplicity and unification with > aarch32, as I mentioned above. And David likes it. And it mininizes > the amount of changes on glibc side. > 2. Clear top halves in in separated file hosted wrappers. > 3. Clear top halves in I-cache and tail optimization friendly in-site wrappers. > > 2 and 3 are the same from ABI point of view. > > 2 is the worst for me as it is the most complex in implementation and > I-cache and tail optimization non-friendly. But Heiko likes it. > > 3 is what Catalin is talking about, and it was my initial approach. > Though I didn't made compiler to do tail optimization, I think we can > do it. I don't fully understand the difference between 2 and 3. My comment was more around annotating the wrappers in (2) with __naked to no longer generate function prologue/epilogue. They would still be in a separate kernel/compat_wrapper.c file. I can't figure out how you would have in-place wrappers for all syscalls. You can indeed handle the current COMPAT_SYSCALL_DEFINE via __SC_DELOUSE (and penalising the AArch32/compat support slightly) but there is no solution for native SYSCALL_DEFINE functions to do it in-place. > But 2 is what we have now. And I'd choose it. We'll never get ilp32 done > if will roll back previously agreed decisions again and again. I would rather roll back a decision than going ahead with a wrong one. Note that this is *ABI*, not a driver that you can fix upstream later. Since yesterday, I realised that (2) requires further annotations and wrapping for the native and compat syscalls used by ILP32 just to cope with pointers. Also given davem's comments, (1) starts to look a bit more appealing (I don't like reverting such decisions either, I'd have to review the code again and again).
diff --git a/Documentation/adding-syscalls.txt b/Documentation/adding-syscalls.txt index cc2d4ac..d02a6bd 100644 --- a/Documentation/adding-syscalls.txt +++ b/Documentation/adding-syscalls.txt @@ -341,6 +341,38 @@ To summarize, you need: - instance of __SC_COMP not __SYSCALL in include/uapi/asm-generic/unistd.h +Compatibility System Calls Wrappers +-------------------------------- + +Some architectures prevent 32-bit userspace from access to top halves of 64-bit +registers, but some not. It's not a problem if specific argument is the same +size in kernel and userspace. It also is not a problem if system call is already +handled by compatible routine. Otherwise we'd take care of it. Usually, glibc +and compiler handles register's top halve, but from kernel side, we cannot rely +on it, as malicious code may cause incorrect behaviour and/or security +vulnerabilities. + +For now, only s390 and arm64/ilp32 are affected. + +To clear that top halves, automatic wrappers are introduced. They clear all +required registers before passing control to regular syscall handler. + +If your architecture allows userspace code to access top halves of register, +you need to: + - enable COMPAT_WRAPPER in configuration file; + - declare: "#define __SC_WRAP(nr, sym) [nr] = compat_##sym,", just before + compatible syscall table declaration, if you use generic unistd; or + - declare compat wrappers manually, if you use non-generic syscall table. + The list of unsafe syscalls is in kernel/compat_wrapper. + +If you write new syscall, make sure, its arguments are the same size in both +64- and 32-bits modes. If no, and if there's no explicit compat version for +syscall handler, you need to: + - declare compat version prototype in 'include/linux/compat.h'; + - in 'include/uapi/asm-generic/unistd.h' declare syscall with macro '__SC_WRAP' + instead of '__SYSCALL'; + - add corresponding line to 'kernel/compat_wrapper.c' to let it generate wrapper. + Compatibility System Calls (x86) --------------------------------
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com> --- Documentation/adding-syscalls.txt | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)