[01/23] all: syscall wrappers: add documentation

Message ID	1464048292-30136-2-git-send-email-ynorov@caviumnetworks.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Yury Norov <ynorov@caviumnetworks.com> To: <arnd@arndb.de>, <catalin.marinas@arm.com>, <linux-arm-kernel@lists.infradead.org>, <linux-kernel@vger.kernel.org>, <linux-doc@vger.kernel.org>, <linux-arch@vger.kernel.org>, <linux-s390@vger.kernel.org>, <libc-alpha@sourceware.org> Subject: [PATCH 01/23] all: syscall wrappers: add documentation Date: Tue, 24 May 2016 03:04:30 +0300 Message-ID: <1464048292-30136-2-git-send-email-ynorov@caviumnetworks.com> In-Reply-To: <1464048292-30136-1-git-send-email-ynorov@caviumnetworks.com> References: <1464048292-30136-1-git-send-email-ynorov@caviumnetworks.com> MIME-Version: 1.0 SpamDiagnosticOutput: 1:23 SpamDiagnosticMetadata: NSPM Precedence: list Cc: kilobyte@angband.pl, pinskia@gmail.com, szabolcs.nagy@arm.com, Nathan_Lynch@mentor.com, heiko.carstens@de.ibm.com, agraf@suse.de, geert@linux-m68k.org, Prasun.Kapoor@caviumnetworks.com, klimov.linux@gmail.com, broonie@kernel.org, ynorov@caviumnetworks.com, schwidefsky@de.ibm.com, bamvor.zhangjian@huawei.com, philipp.tomsich@theobroma-systems.com, joseph@codesourcery.com, christoph.muellner@theobroma-systems.com Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Yury Norov May 24, 2016, 12:04 a.m. UTC

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
---
 Documentation/adding-syscalls.txt | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

David Miller May 25, 2016, 7:30 p.m. UTC | #1

From: Yury Norov <ynorov@caviumnetworks.com>
Date: Tue, 24 May 2016 03:04:30 +0300

> +To clear that top halves, automatic wrappers are introduced. They clear all
> +required registers before passing control to regular syscall handler.

Why have one of these for every single compat system call, rather than
simply clearing the top half of all of these registers unconditionally
in the 32-bit system call trap before the system call is invoked?

That's what we do on sparc64.

And with that, you only need wrappers for the case where there needs
to be proper sign extention of a 32-bit signed argument.

Yury Norov May 25, 2016, 8:03 p.m. UTC | #2

On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote:
> From: Yury Norov <ynorov@caviumnetworks.com>
> Date: Tue, 24 May 2016 03:04:30 +0300
> 
> > +To clear that top halves, automatic wrappers are introduced. They clear all
> > +required registers before passing control to regular syscall handler.
> 
> Why have one of these for every single compat system call, rather than
> simply clearing the top half of all of these registers unconditionally
> in the 32-bit system call trap before the system call is invoked?
> 
> That's what we do on sparc64.
> 
> And with that, you only need wrappers for the case where there needs
> to be proper sign extention of a 32-bit signed argument.

It was discussed as one of possible solutions. The downside of it is
that we cannot pass 64-bit types (like off_t) in single register.
The other downside is that we clear top halves for every single
syscall, and it looks excessive. So, from spark64 and s390 approaches
we choosed second.

David Miller May 25, 2016, 8:21 p.m. UTC | #3

From: Yury Norov <ynorov@caviumnetworks.com>
Date: Wed, 25 May 2016 23:03:27 +0300

> On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote:
>> From: Yury Norov <ynorov@caviumnetworks.com>
>> Date: Tue, 24 May 2016 03:04:30 +0300
>> 
>> > +To clear that top halves, automatic wrappers are introduced. They clear all
>> > +required registers before passing control to regular syscall handler.
>> 
>> Why have one of these for every single compat system call, rather than
>> simply clearing the top half of all of these registers unconditionally
>> in the 32-bit system call trap before the system call is invoked?
>> 
>> That's what we do on sparc64.
>> 
>> And with that, you only need wrappers for the case where there needs
>> to be proper sign extention of a 32-bit signed argument.
> 
> It was discussed as one of possible solutions. The downside of it is
> that we cannot pass 64-bit types (like off_t) in single register.

Wrappers can be added for the cases where you'd like to do that.

> The other downside is that we clear top halves for every single
> syscall, and it looks excessive. So, from spark64 and s390 approaches
> we choosed second.

It's like 4 cpu cycles even on crappy sparc64 cpus which only dual
issue. :)

And that's a pretty low cost for the benefits if you ask me.

Arnd Bergmann May 25, 2016, 8:47 p.m. UTC | #4

On Wednesday, May 25, 2016 1:21:45 PM CEST David Miller wrote:
> From: Yury Norov <ynorov@caviumnetworks.com>
> Date: Wed, 25 May 2016 23:03:27 +0300
> 
> > On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote:
> >> From: Yury Norov <ynorov@caviumnetworks.com>
> >> Date: Tue, 24 May 2016 03:04:30 +0300
> >> 
> >> > +To clear that top halves, automatic wrappers are introduced. They clear all
> >> > +required registers before passing control to regular syscall handler.
> >> 
> >> Why have one of these for every single compat system call, rather than
> >> simply clearing the top half of all of these registers unconditionally
> >> in the 32-bit system call trap before the system call is invoked?
> >> 
> >> That's what we do on sparc64.
> >> 
> >> And with that, you only need wrappers for the case where there needs
> >> to be proper sign extention of a 32-bit signed argument.
> > 
> > It was discussed as one of possible solutions. The downside of it is
> > that we cannot pass 64-bit types (like off_t) in single register.
> 
> Wrappers can be added for the cases where you'd like to do that.

If we clear the upper halves on the initial entry, we can't use a wrapper
to restore them, so would have to instead pass them as register
pairs as we do on the other 32-bit architectures.

> > The other downside is that we clear top halves for every single
> > syscall, and it looks excessive. So, from spark64 and s390 approaches
> > we choosed second.
> 
> It's like 4 cpu cycles even on crappy sparc64 cpus which only dual
> issue. :)
> 
> And that's a pretty low cost for the benefits if you ask me.

To clarify what we are talking about: These syscalls that normally
pass 64-bit arguments as register pairs are intentionally overridden
to make them faster on ilp32 mode compare to other compat modes:

+#define compat_sys_fadvise64_64        sys_fadvise64_64
+#define compat_sys_fallocate           sys_fallocate
+#define compat_sys_ftruncate64         sys_ftruncate
+#define compat_sys_lookup_dcookie      sys_lookup_dcookie
+#define compat_sys_readahead           sys_readahead
+#define compat_sys_sync_file_range     sys_sync_file_range
+#define compat_sys_truncate64          sys_truncate
+#define sys_llseek                     sys_lseek
+static unsigned long compat_sys_pread64(unsigned int fd,
+               compat_uptr_t __user *ubuf, compat_size_t count, off_t offset)
+{
+       return sys_pread64(fd, (char *) ubuf, count, offset);
+}
+
+static unsigned long compat_sys_pwrite64(unsigned int fd,
+               compat_uptr_t __user *ubuf, compat_size_t count, off_t offset)
+{
+       return sys_pwrite64(fd, (char *) ubuf, count, offset);
+}

If we use the normal calling conventions, we could remove these overrides
along with the respective special-case handling in glibc. None of them
look particularly performance-sensitive, but I could be wrong there.

	Arnd

David Miller May 25, 2016, 8:50 p.m. UTC | #5

From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 25 May 2016 22:47:33 +0200

> If we use the normal calling conventions, we could remove these overrides
> along with the respective special-case handling in glibc. None of them
> look particularly performance-sensitive, but I could be wrong there.

You could set the lowest bit in the system call entry pointer to indicate
the upper-half clears should be elided.

Arnd Bergmann May 25, 2016, 9:01 p.m. UTC | #6

On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Wed, 25 May 2016 22:47:33 +0200
> 
> > If we use the normal calling conventions, we could remove these overrides
> > along with the respective special-case handling in glibc. None of them
> > look particularly performance-sensitive, but I could be wrong there.
> 
> You could set the lowest bit in the system call entry pointer to indicate
> the upper-half clears should be elided.

Right, but that would introduce an extra conditional branch in the syscall
hotpath, and likely eliminate the gains from passing the loff_t arguments
in a single register instead of a pair.

	Arnd

David Miller May 25, 2016, 9:28 p.m. UTC | #7

From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 25 May 2016 23:01:06 +0200

> On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
>> From: Arnd Bergmann <arnd@arndb.de>
>> Date: Wed, 25 May 2016 22:47:33 +0200
>> 
>> > If we use the normal calling conventions, we could remove these overrides
>> > along with the respective special-case handling in glibc. None of them
>> > look particularly performance-sensitive, but I could be wrong there.
>> 
>> You could set the lowest bit in the system call entry pointer to indicate
>> the upper-half clears should be elided.
> 
> Right, but that would introduce an extra conditional branch in the syscall
> hotpath, and likely eliminate the gains from passing the loff_t arguments
> in a single register instead of a pair.

Ok, then, how much are you really gaining from avoiding a 'shift' and
an 'or' to build the full 64-bit value?  3 cycles?  Maybe 4?

And the executing the wrappers, those have a non-trivial cost too.

Cost wise, this seems like it all cancels out in the end, but what
do I know?

Catalin Marinas May 26, 2016, 2:20 p.m. UTC | #8

On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Wed, 25 May 2016 23:01:06 +0200
> 
> > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
> >> From: Arnd Bergmann <arnd@arndb.de>
> >> Date: Wed, 25 May 2016 22:47:33 +0200
> >> 
> >> > If we use the normal calling conventions, we could remove these overrides
> >> > along with the respective special-case handling in glibc. None of them
> >> > look particularly performance-sensitive, but I could be wrong there.
> >> 
> >> You could set the lowest bit in the system call entry pointer to indicate
> >> the upper-half clears should be elided.
> > 
> > Right, but that would introduce an extra conditional branch in the syscall
> > hotpath, and likely eliminate the gains from passing the loff_t arguments
> > in a single register instead of a pair.
> 
> Ok, then, how much are you really gaining from avoiding a 'shift' and
> an 'or' to build the full 64-bit value?  3 cycles?  Maybe 4?

It's possible a few more cycles overall. Whether this is noticeable, I
can't really tell without some benchmarks (e.g. a getpid wrapper zeroing
top 32-bit of all possible 6 arguments, called in a loop).

On arm64 with ILP32 we have three types of syscalls w.r.t. parameter
width (I guess that's true for all other compat implementations):

1. User syscall definition with 32-bit arguments, kernel handling 32-bit
   arguments

2. User 32-bit arguments, kernel 64-bit arguments

3. User 64-bit arguments, kernel 64-bit arguments

For (1), the AArch64 ABI (AAPCS) allows us to ignore the garbage in the
top 32-bit of a 64-bit register as long as the callee has 32-bit
arguments (IOW, the generated code will use 32-git Wn instead of 64-bit
Xn registers). In this case, zeroing the top 32-bit of all 6 arguments
is unnecessary.

In the 2nd case, we need sign or zero extension of 32-bit arguments. For
sign extension we would still need a wrapper as the generic one can only
zero-extend without knowing the underlying type. How many cases do we
have where sign extension is required (off_t is a signed type but does
it actually make sense as a negative value)? The __SC_WRAP and
COMPAT_SYSCALL_WRAP macros introduced by patches 3-5 in this series
handle such conversion for both sign and unsigned arguments.

We don't have such problem with AArch32 tasks since the architecture
guarantees zeroing or preserving the top half of all registers.

For (3), with the current ILP32 approach we wouldn't need any wrapper.
If we are to pass the argument as two 32-bit values, we would need both
the user (glibc) to split the argument and the kernel to re-construct
it. This would be in addition to any default top 32-bit zeroing on
kernel entry.

The overhead may be lost in the noise (we need some data) but IIRC our
decision was mostly based on a cleaner user implementation for point (3)
above. Since an AArch64/ILP32 process can freely use 64-bit registers,
we found it nicer to be able to pass such value directly to the kernel.
Reusing the s390 macros should reduce the amount of new code added to
the kernel.

While writing the above, I realised the current ILP32 patches still miss
on converting pointers passed from user space (unless I got myself
confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx()
macros take care of zero or sign extension via __SC_COMPAT_CAST().
However, we have two more existing cases which I don't see covered:

a) Native syscalls taking a pointer argument and invoked directly from
   ILP32. For example, sys_read() takes a pointer but I don't see any
   __SC_WRAP added by patch 5

b) Current compat syscalls taking a pointer argument. For example,
   compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes
   it is a 64-bit variable. I don't see where the upper half is zeroed

We can solve (a) by adding more __SC_WRAP annotations in the generic
unistd.h. For (b), we would need an __SC_DELOUSE with a bit of penalty
on AArch32/compat support where it isn't needed. So maybe davem has a
point on the overall impact of always zeroing the upper half of the
arguments ;) (both from a performance and maintainability perspective).
I guess this part of the ABI is still up for discussion.

Szabolcs Nagy May 26, 2016, 2:50 p.m. UTC | #9

On 26/05/16 15:20, Catalin Marinas wrote:
> While writing the above, I realised the current ILP32 patches still miss
> on converting pointers passed from user space (unless I got myself
> confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx()
> macros take care of zero or sign extension via __SC_COMPAT_CAST().
> However, we have two more existing cases which I don't see covered:
> 
> a) Native syscalls taking a pointer argument and invoked directly from
>    ILP32. For example, sys_read() takes a pointer but I don't see any
>    __SC_WRAP added by patch 5
> 
> b) Current compat syscalls taking a pointer argument. For example,
>    compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes
>    it is a 64-bit variable. I don't see where the upper half is zeroed
> 

on x32 sign/zero extension is currently left to userspace,
which is difficult to deal with, (long long)arg does the
wrong thing for pointer args.

> We can solve (a) by adding more __SC_WRAP annotations in the generic
> unistd.h. For (b), we would need an __SC_DELOUSE with a bit of penalty
> on AArch32/compat support where it isn't needed. So maybe davem has a
> point on the overall impact of always zeroing the upper half of the
> arguments ;) (both from a performance and maintainability perspective).
> I guess this part of the ABI is still up for discussion.
>

Catalin Marinas May 26, 2016, 3:19 p.m. UTC | #10

On Thu, May 26, 2016 at 03:50:01PM +0100, Szabolcs Nagy wrote:
> On 26/05/16 15:20, Catalin Marinas wrote:
> > While writing the above, I realised the current ILP32 patches still miss
> > on converting pointers passed from user space (unless I got myself
> > confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx()
> > macros take care of zero or sign extension via __SC_COMPAT_CAST().
> > However, we have two more existing cases which I don't see covered:
> > 
> > a) Native syscalls taking a pointer argument and invoked directly from
> >    ILP32. For example, sys_read() takes a pointer but I don't see any
> >    __SC_WRAP added by patch 5
> > 
> > b) Current compat syscalls taking a pointer argument. For example,
> >    compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes
> >    it is a 64-bit variable. I don't see where the upper half is zeroed
> 
> on x32 sign/zero extension is currently left to userspace,
> which is difficult to deal with, (long long)arg does the
> wrong thing for pointer args.

I agree, I don't think we should leave sign/zero extension to user. We
should do it in the kernel either in a way similar to s390 (specific
__SC_COMPAT_CAST, __SC_DELOUSE) or by always zeroing the arguments upper
half on kernel entry with a few additional wrappers (where we have
64-bit arguments or they require sign extension). The latter has the
disadvantage of having to split 64-bit arguments in user space while the
former adds more maintenance burden to the kernel.

I can't comment on performance aspects without some real numbers.

David Miller May 26, 2016, 7:43 p.m. UTC | #11

From: Catalin Marinas <catalin.marinas@arm.com>
Date: Thu, 26 May 2016 15:20:58 +0100

> We can solve (a) by adding more __SC_WRAP annotations in the generic
> unistd.h.
 ...

I really think it's much more robust to clear the tops of the registers
by default.  Then you won't be auditing constantly and adding more and
more wrappers.

You can't even quantify the performance gains for me in any precise
way.  Whatever you gain by avoiding the 64-bit
decompostion/reconstitution for those few system calls with 64-bit
registers, you are losing by calling the wrappers for more common
system calls, more often.

"it's more natural to pass 64-bit values in a register" is not a clear
justification for this change.

This looks way over engineered to me.

Yury Norov May 26, 2016, 8:48 p.m. UTC | #12

On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Wed, 25 May 2016 23:01:06 +0200
> 
> > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
> >> From: Arnd Bergmann <arnd@arndb.de>
> >> Date: Wed, 25 May 2016 22:47:33 +0200
> >> 
> >> > If we use the normal calling conventions, we could remove these overrides
> >> > along with the respective special-case handling in glibc. None of them
> >> > look particularly performance-sensitive, but I could be wrong there.
> >> 
> >> You could set the lowest bit in the system call entry pointer to indicate
> >> the upper-half clears should be elided.
> > 
> > Right, but that would introduce an extra conditional branch in the syscall
> > hotpath, and likely eliminate the gains from passing the loff_t arguments
> > in a single register instead of a pair.
> 
> Ok, then, how much are you really gaining from avoiding a 'shift' and
> an 'or' to build the full 64-bit value?  3 cycles?  Maybe 4?

4 cycles in kernel and ~same cost in glibc to create a pair. And 8
'mov's that exist for every syscall, even yield().

> And the executing the wrappers, those have a non-trivial cost too.

The cost is pretty trivial though. See kernel/compat_wrapper.o:
COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode);
0:   a9bf7bfd        stp     x29, x30, [sp,#-16]!
4:   910003fd        mov     x29, sp
8:   2a0003e0        mov     w0, w0
c:   94000000        bl      0 <sys_creat>
10:  a8c17bfd        ldp     x29, x30, [sp],#16
14:  d65f03c0        ret

> Cost wise, this seems like it all cancels out in the end, but what
> do I know?

I think you know something, and I also think Heiko and other s390 guys
know something as well. So I'd like to listen their arguments here.

For me spark64 way is looking reasonable only because it's really simple
and takes less coding. I'll try it on some branch and share here what happened.

Catalin Marinas May 26, 2016, 10:29 p.m. UTC | #13

On Thu, May 26, 2016 at 11:48:19PM +0300, Yury Norov wrote:
> On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> > Date: Wed, 25 May 2016 23:01:06 +0200
> > 
> > > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
> > >> From: Arnd Bergmann <arnd@arndb.de>
> > >> Date: Wed, 25 May 2016 22:47:33 +0200
> > >> 
> > >> > If we use the normal calling conventions, we could remove these overrides
> > >> > along with the respective special-case handling in glibc. None of them
> > >> > look particularly performance-sensitive, but I could be wrong there.
> > >> 
> > >> You could set the lowest bit in the system call entry pointer to indicate
> > >> the upper-half clears should be elided.
> > > 
> > > Right, but that would introduce an extra conditional branch in the syscall
> > > hotpath, and likely eliminate the gains from passing the loff_t arguments
> > > in a single register instead of a pair.
> > 
> > Ok, then, how much are you really gaining from avoiding a 'shift' and
> > an 'or' to build the full 64-bit value?  3 cycles?  Maybe 4?
> 
> 4 cycles in kernel and ~same cost in glibc to create a pair.

It would take a single instruction per argument in the kernel to do
shift+or and maybe 1-2 more instructions to move the remaining arguments
in place (we do this for a few wrappers in arch/arm64/kernel/entry32.S).
And the glibc counterpart.

> And 8 'mov's that exist for every syscall, even yield().
> 
> > And the executing the wrappers, those have a non-trivial cost too.
> 
> The cost is pretty trivial though. See kernel/compat_wrapper.o:
> COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode);
> 0:   a9bf7bfd        stp     x29, x30, [sp,#-16]!
> 4:   910003fd        mov     x29, sp
> 8:   2a0003e0        mov     w0, w0
> c:   94000000        bl      0 <sys_creat>
> 10:  a8c17bfd        ldp     x29, x30, [sp],#16
> 14:  d65f03c0        ret

I would say the above could be more expensive than 8 movs (16 bytes to
write, read, a branch and a ret). You can also add the I-cache locality,
having wrappers for each syscalls instead of a single place for zeroing
the upper half (where no other wrapper is necessary).

Can we trick the compiler into doing a tail call optimisation. This
could have simply been:

COMPAT_SYSCALL_WRAP2(creat, ...):
	mov	w0, w0
	b	<sys_creat>

> > Cost wise, this seems like it all cancels out in the end, but what
> > do I know?
> 
> I think you know something, and I also think Heiko and other s390 guys
> know something as well. So I'd like to listen their arguments here.
> 
> For me spark64 way is looking reasonable only because it's really simple
> and takes less coding. I'll try it on some branch and share here what happened.

The kernel code will definitely look simpler ;). It would be good to see
if there actually is any performance impact. Even with 16 more cycles on
syscall entry, would they be lost in the noise? You don't need a full
implementation, just some dummy mov x0, x0 on the entry path.

Yury Norov May 27, 2016, 12:37 a.m. UTC | #14

On Thu, May 26, 2016 at 11:29:45PM +0100, Catalin Marinas wrote:
> On Thu, May 26, 2016 at 11:48:19PM +0300, Yury Norov wrote:
> > On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote:
> > > From: Arnd Bergmann <arnd@arndb.de>
> > > Date: Wed, 25 May 2016 23:01:06 +0200
> > > 
> > > > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
> > > >> From: Arnd Bergmann <arnd@arndb.de>
> > > >> Date: Wed, 25 May 2016 22:47:33 +0200
> > > >> 
> > > >> > If we use the normal calling conventions, we could remove these overrides
> > > >> > along with the respective special-case handling in glibc. None of them
> > > >> > look particularly performance-sensitive, but I could be wrong there.
> > > >> 
> > > >> You could set the lowest bit in the system call entry pointer to indicate
> > > >> the upper-half clears should be elided.
> > > > 
> > > > Right, but that would introduce an extra conditional branch in the syscall
> > > > hotpath, and likely eliminate the gains from passing the loff_t arguments
> > > > in a single register instead of a pair.
> > > 
> > > Ok, then, how much are you really gaining from avoiding a 'shift' and
> > > an 'or' to build the full 64-bit value?  3 cycles?  Maybe 4?
> > 
> > 4 cycles in kernel and ~same cost in glibc to create a pair.
> 
> It would take a single instruction per argument in the kernel to do
> shift+or and maybe 1-2 more instructions to move the remaining arguments
> in place (we do this for a few wrappers in arch/arm64/kernel/entry32.S).
> And the glibc counterpart.
> 
> > And 8 'mov's that exist for every syscall, even yield().
> > 
> > > And the executing the wrappers, those have a non-trivial cost too.
> > 
> > The cost is pretty trivial though. See kernel/compat_wrapper.o:
> > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode);
> > 0:   a9bf7bfd        stp     x29, x30, [sp,#-16]!
> > 4:   910003fd        mov     x29, sp
> > 8:   2a0003e0        mov     w0, w0
> > c:   94000000        bl      0 <sys_creat>
> > 10:  a8c17bfd        ldp     x29, x30, [sp],#16
> > 14:  d65f03c0        ret
> 
> I would say the above could be more expensive than 8 movs (16 bytes to
> write, read, a branch and a ret). You can also add the I-cache locality,
> having wrappers for each syscalls instead of a single place for zeroing
> the upper half (where no other wrapper is necessary).
> 
> Can we trick the compiler into doing a tail call optimisation. This
> could have simply been:
> 
> COMPAT_SYSCALL_WRAP2(creat, ...):
> 	mov	w0, w0
> 	b	<sys_creat>

What you talk about was in my initial version. But Heiko insisted on having all
wrappers together.
http://www.spinics.net/lists/linux-s390/msg11593.html

Grep your email for discussion.

> 
> > > Cost wise, this seems like it all cancels out in the end, but what
> > > do I know?
> > 
> > I think you know something, and I also think Heiko and other s390 guys
> > know something as well. So I'd like to listen their arguments here.
> > 
> > For me spark64 way is looking reasonable only because it's really simple
> > and takes less coding. I'll try it on some branch and share here what happened.
> 
> The kernel code will definitely look simpler ;). It would be good to see
> if there actually is any performance impact. Even with 16 more cycles on
> syscall entry, would they be lost in the noise? You don't need a full
> implementation, just some dummy mov x0, x0 on the entry path.
> 
> -- 
> Catalin

Heiko Carstens May 27, 2016, 5:52 a.m. UTC | #15

On Wed, May 25, 2016 at 12:30:17PM -0700, David Miller wrote:
> From: Yury Norov <ynorov@caviumnetworks.com>
> Date: Tue, 24 May 2016 03:04:30 +0300
> 
> > +To clear that top halves, automatic wrappers are introduced. They clear all
> > +required registers before passing control to regular syscall handler.
> 
> Why have one of these for every single compat system call, rather than
> simply clearing the top half of all of these registers unconditionally
> in the 32-bit system call trap before the system call is invoked?
> 
> That's what we do on sparc64.
> 
> And with that, you only need wrappers for the case where there needs
> to be proper sign extention of a 32-bit signed argument.

That would probably also work for arm. On s390 we still have these odd 31
bit pointers in compat mode which require us to clear 33 bits instead of 32
bits. That makes up for appr. one third of all system calls.

The additional wrappers are only for zero/sign extension, where I count a
total of 27 on s390.

The reason for doing this in C was the constant copy-paste error rate, when
adding new system calls plus I got a rid of a lot of unnecessary asm code.

Heiko Carstens May 27, 2016, 6:03 a.m. UTC | #16

> > > The cost is pretty trivial though. See kernel/compat_wrapper.o:
> > > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode);
> > > 0:   a9bf7bfd        stp     x29, x30, [sp,#-16]!
> > > 4:   910003fd        mov     x29, sp
> > > 8:   2a0003e0        mov     w0, w0
> > > c:   94000000        bl      0 <sys_creat>
> > > 10:  a8c17bfd        ldp     x29, x30, [sp],#16
> > > 14:  d65f03c0        ret
> > 
> > I would say the above could be more expensive than 8 movs (16 bytes to
> > write, read, a branch and a ret). You can also add the I-cache locality,
> > having wrappers for each syscalls instead of a single place for zeroing
> > the upper half (where no other wrapper is necessary).
> > 
> > Can we trick the compiler into doing a tail call optimisation. This
> > could have simply been:
> > 
> > COMPAT_SYSCALL_WRAP2(creat, ...):
> > 	mov	w0, w0
> > 	b	<sys_creat>
> 
> What you talk about was in my initial version. But Heiko insisted on having all
> wrappers together.
> http://www.spinics.net/lists/linux-s390/msg11593.html
> 
> Grep your email for discussion.

I think Catalin's question was more about why there is even a stack frame
generated. It looks like it is not necessary. I did ask this too a couple
of months ago, when we discussed this.

> > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > do I know?
> > > 
> > > I think you know something, and I also think Heiko and other s390 guys
> > > know something as well. So I'd like to listen their arguments here.

If it comes to 64 bit arguments for compat system calls: s390 also has an
x32-like ABI extension which allows user space to use full 64 bit
registers. As far as I know hardly anybody ever made use of that.

However even if that would be widely used, to me it wouldn't make sense to
add new compat system calls which allow 64 bit arguments, simply because
something like

c = (u32)a | (u64)b << 32;

can be done with a single 1-cycle instruction. It's just not worth the
extra effort to maintain additional system call variants.

Arnd Bergmann May 27, 2016, 8:42 a.m. UTC | #17

On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote:
> > > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > > do I know?
> > > > 
> > > > I think you know something, and I also think Heiko and other s390 guys
> > > > know something as well. So I'd like to listen their arguments here.
> 
> If it comes to 64 bit arguments for compat system calls: s390 also has an
> x32-like ABI extension which allows user space to use full 64 bit
> registers. As far as I know hardly anybody ever made use of that.
> 
> However even if that would be widely used, to me it wouldn't make sense to
> add new compat system calls which allow 64 bit arguments, simply because
> something like
> 
> c = (u32)a | (u64)b << 32;
> 
> can be done with a single 1-cycle instruction. It's just not worth the
> extra effort to maintain additional system call variants.

For reference, both tile and mips also have separate 32-bit ABIs that are
only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile
does it like s390 and passes 64-bit arguments as pairs, while MIPS
and x86 and pass them as single registers.

Tile is very similar to arm64 because it also uses the generic system
call table, which I think is a good argument to keep them in sync.

	Arnd

Catalin Marinas May 27, 2016, 9:01 a.m. UTC | #18

On Fri, May 27, 2016 at 08:03:57AM +0200, Heiko Carstens wrote:
> > > > The cost is pretty trivial though. See kernel/compat_wrapper.o:
> > > > COMPAT_SYSCALL_WRAP2(creat, const char __user *, pathname, umode_t, mode);
> > > > 0:   a9bf7bfd        stp     x29, x30, [sp,#-16]!
> > > > 4:   910003fd        mov     x29, sp
> > > > 8:   2a0003e0        mov     w0, w0
> > > > c:   94000000        bl      0 <sys_creat>
> > > > 10:  a8c17bfd        ldp     x29, x30, [sp],#16
> > > > 14:  d65f03c0        ret
> > > 
> > > I would say the above could be more expensive than 8 movs (16 bytes to
> > > write, read, a branch and a ret). You can also add the I-cache locality,
> > > having wrappers for each syscalls instead of a single place for zeroing
> > > the upper half (where no other wrapper is necessary).
> > > 
> > > Can we trick the compiler into doing a tail call optimisation. This
> > > could have simply been:
> > > 
> > > COMPAT_SYSCALL_WRAP2(creat, ...):
> > > 	mov	w0, w0
> > > 	b	<sys_creat>
> > 
> > What you talk about was in my initial version. But Heiko insisted on having all
> > wrappers together.
> > http://www.spinics.net/lists/linux-s390/msg11593.html
> > 
> > Grep your email for discussion.
> 
> I think Catalin's question was more about why there is even a stack frame
> generated. It looks like it is not necessary. I did ask this too a couple
> of months ago, when we discussed this.

Indeed, I was questioning the need for prologue/epilogue, not the use of
COMPAT_SYSCALL_WRAPx(). Maybe something like __naked would do.

> > > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > > do I know?
> > > > 
> > > > I think you know something, and I also think Heiko and other s390 guys
> > > > know something as well. So I'd like to listen their arguments here.
> 
> If it comes to 64 bit arguments for compat system calls: s390 also has an
> x32-like ABI extension which allows user space to use full 64 bit
> registers. As far as I know hardly anybody ever made use of that.
> 
> However even if that would be widely used, to me it wouldn't make sense to
> add new compat system calls which allow 64 bit arguments, simply because
> something like
> 
> c = (u32)a | (u64)b << 32;
> 
> can be done with a single 1-cycle instruction. It's just not worth the
> extra effort to maintain additional system call variants.

If we split 64-bit arguments in two, we can go a step further and avoid
most of the COMPAT_SYSCALL_WRAPx annotations in favour of a common
upper-half zeroing of the argument registers on ILP32 syscall entry.
There would be a few exceptions where we need to re-build 64-bit
arguments on sign-extend.

Catalin Marinas May 27, 2016, 9:30 a.m. UTC | #19

On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote:
> On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote:
> > > > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > > > do I know?
> > > > > 
> > > > > I think you know something, and I also think Heiko and other s390 guys
> > > > > know something as well. So I'd like to listen their arguments here.
> > 
> > If it comes to 64 bit arguments for compat system calls: s390 also has an
> > x32-like ABI extension which allows user space to use full 64 bit
> > registers. As far as I know hardly anybody ever made use of that.
> > 
> > However even if that would be widely used, to me it wouldn't make sense to
> > add new compat system calls which allow 64 bit arguments, simply because
> > something like
> > 
> > c = (u32)a | (u64)b << 32;
> > 
> > can be done with a single 1-cycle instruction. It's just not worth the
> > extra effort to maintain additional system call variants.
> 
> For reference, both tile and mips also have separate 32-bit ABIs that are
> only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile
> does it like s390 and passes 64-bit arguments as pairs, while MIPS
> and x86 and pass them as single registers.

AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed
by the user when a 32-bit value is passed. We could require the same on
AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C
libraries on this.

Catalin Marinas May 27, 2016, 10:10 a.m. UTC | #20

On Thu, May 26, 2016 at 12:43:44PM -0700, David Miller wrote:
> From: Catalin Marinas <catalin.marinas@arm.com>
> Date: Thu, 26 May 2016 15:20:58 +0100
> 
> > We can solve (a) by adding more __SC_WRAP annotations in the generic
> > unistd.h.
>  ...
> 
> I really think it's much more robust to clear the tops of the registers
> by default.  Then you won't be auditing constantly and adding more and
> more wrappers.

I think we could avoid adding a new __SC_WRAP by redefining __SYSCALL
for ILP32 to always invoke a wrapper. But given the wrapper overhead,
cache locality, I don't think we would notice any performance difference
in either case.

> You can't even quantify the performance gains for me in any precise
> way.  Whatever you gain by avoiding the 64-bit
> decompostion/reconstitution for those few system calls with 64-bit
> registers, you are losing by calling the wrappers for more common
> system calls, more often.

I hope Yury can provide some numbers. All being equal, I would go for
the lowest code maintenance cost (which is probably less annotations and
wrappers).

> "it's more natural to pass 64-bit values in a register" is not a clear
> justification for this change.

It's more related to how we went about the ILP32 ABI. We initially asked
for a 64-bit native ABI similar to x32 until the libc-alpha community
raised some POSIX compliance issues on time structures. So we decided to
go for a 32-bit-like ABI while keeping the syscall interface close to
the AArch64/ILP32 procedure calling standard (64-bit values passed in a
single register). And now we have this discussion, revisiting this
decision again (which is perfectly fine, we better get it right before
any merging plans; thanks for your input).

Arnd Bergmann May 27, 2016, 10:49 a.m. UTC | #21

On Friday, May 27, 2016 10:30:52 AM CEST Catalin Marinas wrote:
> On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote:
> > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote:
> > > > > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > > > > do I know?
> > > > > > 
> > > > > > I think you know something, and I also think Heiko and other s390 guys
> > > > > > know something as well. So I'd like to listen their arguments here.
> > > 
> > > If it comes to 64 bit arguments for compat system calls: s390 also has an
> > > x32-like ABI extension which allows user space to use full 64 bit
> > > registers. As far as I know hardly anybody ever made use of that.
> > > 
> > > However even if that would be widely used, to me it wouldn't make sense to
> > > add new compat system calls which allow 64 bit arguments, simply because
> > > something like
> > > 
> > > c = (u32)a | (u64)b << 32;
> > > 
> > > can be done with a single 1-cycle instruction. It's just not worth the
> > > extra effort to maintain additional system call variants.
> > 
> > For reference, both tile and mips also have separate 32-bit ABIs that are
> > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile
> > does it like s390 and passes 64-bit arguments as pairs, while MIPS
> > and x86 and pass them as single registers.
> 
> AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed
> by the user when a 32-bit value is passed. We could require the same on
> AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C
> libraries on this.

It's not about trusting a C library, it's about ensuring malicious code
cannot pass argumentst that the kernel code assumes will never happen.

	Arnd

Catalin Marinas May 27, 2016, 1:04 p.m. UTC | #22

On Fri, May 27, 2016 at 12:49:11PM +0200, Arnd Bergmann wrote:
> On Friday, May 27, 2016 10:30:52 AM CEST Catalin Marinas wrote:
> > On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote:
> > > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote:
> > > > > > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > > > > > do I know?
> > > > > > > 
> > > > > > > I think you know something, and I also think Heiko and other s390 guys
> > > > > > > know something as well. So I'd like to listen their arguments here.
> > > > 
> > > > If it comes to 64 bit arguments for compat system calls: s390 also has an
> > > > x32-like ABI extension which allows user space to use full 64 bit
> > > > registers. As far as I know hardly anybody ever made use of that.
> > > > 
> > > > However even if that would be widely used, to me it wouldn't make sense to
> > > > add new compat system calls which allow 64 bit arguments, simply because
> > > > something like
> > > > 
> > > > c = (u32)a | (u64)b << 32;
> > > > 
> > > > can be done with a single 1-cycle instruction. It's just not worth the
> > > > extra effort to maintain additional system call variants.
> > > 
> > > For reference, both tile and mips also have separate 32-bit ABIs that are
> > > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile
> > > does it like s390 and passes 64-bit arguments as pairs, while MIPS
> > > and x86 and pass them as single registers.
> > 
> > AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed
> > by the user when a 32-bit value is passed. We could require the same on
> > AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C
> > libraries on this.
> 
> It's not about trusting a C library, it's about ensuring malicious code
> cannot pass argumentst that the kernel code assumes will never happen.

At least for pointers and sizes, we have additional checks in place
already, like __access_ok(). Most of the syscalls should be safe since
they either go through some compat functions taking 32-bit arguments or
are routed to native functions which already need to cope with a full
random 64-bit value.

On arm64, I think the only risk comes from syscall handlers expecting
32-bit arguments but using 64-bit types. Apart from pointer types, I
don't expect this to happen but we could enforce it via a
BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)) in __SC_DELOUSE as per
the s390 implementation. With ILP32 if we go for 64-bit off_t, those
syscalls would be routed directly to the native layer.

Yury Norov May 27, 2016, 4:58 p.m. UTC | #23

On Fri, May 27, 2016 at 02:04:47PM +0100, Catalin Marinas wrote:
> On Fri, May 27, 2016 at 12:49:11PM +0200, Arnd Bergmann wrote:
> > On Friday, May 27, 2016 10:30:52 AM CEST Catalin Marinas wrote:
> > > On Fri, May 27, 2016 at 10:42:59AM +0200, Arnd Bergmann wrote:
> > > > On Friday, May 27, 2016 8:03:57 AM CEST Heiko Carstens wrote:
> > > > > > > > > Cost wise, this seems like it all cancels out in the end, but what
> > > > > > > > > do I know?
> > > > > > > > 
> > > > > > > > I think you know something, and I also think Heiko and other s390 guys
> > > > > > > > know something as well. So I'd like to listen their arguments here.
> > > > > 
> > > > > If it comes to 64 bit arguments for compat system calls: s390 also has an
> > > > > x32-like ABI extension which allows user space to use full 64 bit
> > > > > registers. As far as I know hardly anybody ever made use of that.
> > > > > 
> > > > > However even if that would be widely used, to me it wouldn't make sense to
> > > > > add new compat system calls which allow 64 bit arguments, simply because
> > > > > something like
> > > > > 
> > > > > c = (u32)a | (u64)b << 32;
> > > > > 
> > > > > can be done with a single 1-cycle instruction. It's just not worth the
> > > > > extra effort to maintain additional system call variants.
> > > > 
> > > > For reference, both tile and mips also have separate 32-bit ABIs that are
> > > > only used on 64-bit kernels (aside from the normal 32-bit ABI). Tile
> > > > does it like s390 and passes 64-bit arguments as pairs, while MIPS
> > > > and x86 and pass them as single registers.
> > > 
> > > AFAIK, x32 also requires that the upper half of a 64-bit reg is zeroed
> > > by the user when a 32-bit value is passed. We could require the same on
> > > AArch64/ILP32 but I'm a bit uneasy on trusting a multitude of C
> > > libraries on this.
> > 
> > It's not about trusting a C library, it's about ensuring malicious code
> > cannot pass argumentst that the kernel code assumes will never happen.
> 
> At least for pointers and sizes, we have additional checks in place
> already, like __access_ok(). Most of the syscalls should be safe since
> they either go through some compat functions taking 32-bit arguments or
> are routed to native functions which already need to cope with a full
> random 64-bit value.

It's not a good idea to rely on current implementation. Implementation
may be changed and it's impossible to check each and every patch
against register top-halves correctness.

> 
> On arm64, I think the only risk comes from syscall handlers expecting
> 32-bit arguments but using 64-bit types. Apart from pointer types, I
> don't expect this to happen but we could enforce it via a
> BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)) in __SC_DELOUSE as per
> the s390 implementation. With ILP32 if we go for 64-bit off_t, those
> syscalls would be routed directly to the native layer.
> 

64-bit off_t doesn't imply we'd rout it directly. At first glance it's
looking reasonable but there are other considerations like simplicity and
unification with aarch32 that may become more important. That's what
David pointed out.

So, we have 3 options for now:
1. Clear top halves in entry.S which means we pass off_t as a pair.
   The cost is performance (didn't measure it yet and doubt about it
   makes serious impact). The advantage is simplicity and unification with
   aarch32, as I mentioned above. And David likes it. And it mininizes
   the amount of changes on glibc side.
2. Clear top halves in in separated file hosted wrappers.
3. Clear top halves in I-cache and tail optimization friendly in-site wrappers.

2 and 3 are the same from ABI point of view.

2 is the worst for me as it is the most complex in implementation and 
I-cache and tail optimization non-friendly. But Heiko likes it.

3 is what Catalin is talking about, and it was my initial approach.
Though I didn't made compiler to do tail optimization, I think we can
do it.

But 2 is what we have now. And I'd choose it. We'll never get ilp32 done
if will roll back previously agreed decisions again and again.

Yury.

> -- 
> Catalin

Catalin Marinas May 27, 2016, 5:36 p.m. UTC | #24

On Fri, May 27, 2016 at 07:58:06PM +0300, Yury Norov wrote:
> So, we have 3 options for now:
> 1. Clear top halves in entry.S which means we pass off_t as a pair.
>    The cost is performance (didn't measure it yet and doubt about it
>    makes serious impact). The advantage is simplicity and unification with
>    aarch32, as I mentioned above. And David likes it. And it mininizes
>    the amount of changes on glibc side.
> 2. Clear top halves in in separated file hosted wrappers.
> 3. Clear top halves in I-cache and tail optimization friendly in-site wrappers.
> 
> 2 and 3 are the same from ABI point of view.
> 
> 2 is the worst for me as it is the most complex in implementation and 
> I-cache and tail optimization non-friendly. But Heiko likes it.
>  
> 3 is what Catalin is talking about, and it was my initial approach.
> Though I didn't made compiler to do tail optimization, I think we can
> do it.

I don't fully understand the difference between 2 and 3. My comment was
more around annotating the wrappers in (2) with __naked to no longer
generate function prologue/epilogue. They would still be in a separate
kernel/compat_wrapper.c file.

I can't figure out how you would have in-place wrappers for all
syscalls. You can indeed handle the current COMPAT_SYSCALL_DEFINE via
__SC_DELOUSE (and penalising the AArch32/compat support slightly) but
there is no solution for native SYSCALL_DEFINE functions to do it
in-place.

> But 2 is what we have now. And I'd choose it. We'll never get ilp32 done
> if will roll back previously agreed decisions again and again.

I would rather roll back a decision than going ahead with a wrong one.
Note that this is *ABI*, not a driver that you can fix upstream later.

Since yesterday, I realised that (2) requires further annotations and
wrapping for the native and compat syscalls used by ILP32 just to cope
with pointers. Also given davem's comments, (1) starts to look a bit
more appealing (I don't like reverting such decisions either, I'd have
to review the code again and again).

[01/23] all: syscall wrappers: add documentation

Commit Message

Comments

Patch