mbox series

[0/4] uapi, vfs: Change the mount API UAPI [ver #2]

Message ID 155800752418.4037.9567789434648701032.stgit@warthog.procyon.org.uk (mailing list archive)
Headers show
Series uapi, vfs: Change the mount API UAPI [ver #2] | expand

Message

David Howells May 16, 2019, 11:52 a.m. UTC
Hi Linus, Al,

Here are some patches that make changes to the mount API UAPI and two of
them really need applying, before -rc1 - if they're going to be applied at
all.

The following changes are made:

 (1) Make the file descriptors returned by open_tree(), fsopen(), fspick()
     and fsmount() O_CLOEXEC by default and remove the flags that allow
     this to be specified from the UAPI, shuffling other flags down as
     appropriate.  fcntl() can still be used to change the flag.

 (2) Make the name of the anon inode object "[fscontext]" with square
     brackets to match other users.

 (3) Fix the numbering of the mount API syscalls to be in the common space
     rather than in the arch-specific space.

 (4) Wire up the mount API syscalls on all arches (it's only on x86
     currently).

Thanks,
David
---
Christian Brauner (2):
      uapi, fs: make all new mount api fds cloexec by default
      uapi, fsopen: use square brackets around "fscontext"

David Howells (2):
      uapi, x86: Fix the syscall numbering of the mount API syscalls
      uapi: Wire up the mount API syscalls on non-x86 arches


 arch/alpha/kernel/syscalls/syscall.tbl      |    6 ++++++
 arch/arm/tools/syscall.tbl                  |    6 ++++++
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |   12 ++++++++++++
 arch/ia64/kernel/syscalls/syscall.tbl       |    6 ++++++
 arch/m68k/kernel/syscalls/syscall.tbl       |    6 ++++++
 arch/microblaze/kernel/syscalls/syscall.tbl |    6 ++++++
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    6 ++++++
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    6 ++++++
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    6 ++++++
 arch/parisc/kernel/syscalls/syscall.tbl     |    6 ++++++
 arch/powerpc/kernel/syscalls/syscall.tbl    |    6 ++++++
 arch/s390/kernel/syscalls/syscall.tbl       |    6 ++++++
 arch/sh/kernel/syscalls/syscall.tbl         |    6 ++++++
 arch/sparc/kernel/syscalls/syscall.tbl      |    6 ++++++
 arch/x86/entry/syscalls/syscall_32.tbl      |   12 ++++++------
 arch/x86/entry/syscalls/syscall_64.tbl      |   12 ++++++------
 arch/xtensa/kernel/syscalls/syscall.tbl     |    6 ++++++
 fs/fsopen.c                                 |   15 +++++++--------
 fs/namespace.c                              |   11 ++++-------
 include/uapi/asm-generic/unistd.h           |   14 +++++++++++++-
 include/uapi/linux/mount.h                  |   18 +++---------------
 22 files changed, 136 insertions(+), 44 deletions(-)

Comments

Al Viro May 16, 2019, 4:22 p.m. UTC | #1
On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
> 
> Hi Linus, Al,
> 
> Here are some patches that make changes to the mount API UAPI and two of
> them really need applying, before -rc1 - if they're going to be applied at
> all.

I'm fine with 2--4, but I'm not convinced that cloexec-by-default crusade
makes any sense.  Could somebody give coherent arguments in favour of
abandoning the existing conventions?
Al Viro May 16, 2019, 4:31 p.m. UTC | #2
On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
> On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
> > 
> > Hi Linus, Al,
> > 
> > Here are some patches that make changes to the mount API UAPI and two of
> > them really need applying, before -rc1 - if they're going to be applied at
> > all.
> 
> I'm fine with 2--4, but I'm not convinced that cloexec-by-default crusade
> makes any sense.  Could somebody give coherent arguments in favour of
> abandoning the existing conventions?

To elaborate: existing syscalls (open, socket, pipe, accept, epoll_create,
etc., etc.) are not cloexec-by-default and that's not going to change,
simply because it would be break the living hell out of existing userland
code.

IOW, the userland has to worry about leaking stuff over sensitive execve(),
no matter what.  All this change does is complicate things for userland
programmer - which syscall belongs to which class.

Where's the benefit?  I could buy an argument about gradually changing
over to APIs that are cloexec-by-default across the board, except for
the obvious fact that it's not going to happen; not with the things
like open() involved.
Christian Brauner May 16, 2019, 4:31 p.m. UTC | #3
On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
> On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
> > 
> > Hi Linus, Al,
> > 
> > Here are some patches that make changes to the mount API UAPI and two of
> > them really need applying, before -rc1 - if they're going to be applied at
> > all.
> 
> I'm fine with 2--4, but I'm not convinced that cloexec-by-default crusade
> makes any sense.  Could somebody give coherent arguments in favour of
> abandoning the existing conventions?

So as I said in the commit message. From a userspace perspective it's
more of an issue if one accidently leaks an fd to a task during exec.

Also, most of the time one does not want to inherit an fd during an
exec. It is a hazzle to always have to specify an extra flag.

As Al pointed out to me open() semantics are not going anywhere. Sure,
no argument there at all.
But the idea of making fds cloexec by default is only targeted at fds
that come from separate syscalls. fsopen(), open_tree_clone(), etc. they
all return fds independent of open() so it's really easy to have them
cloexec by default without regressing anyone and we also remove the need
for a bunch of separate flags for each syscall to turn them into
cloexec-fds. I mean, those for syscalls came with 4 separate flags to be
able to specify that the returned fd should be made cloexec. The other
way around, cloexec by default, fcntl() to remove the cloexec bit is way
saner imho.

Christian
Al Viro May 16, 2019, 4:50 p.m. UTC | #4
[linux-abi cc'd]

On Thu, May 16, 2019 at 06:31:52PM +0200, Christian Brauner wrote:
> On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
> > On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
> > > 
> > > Hi Linus, Al,
> > > 
> > > Here are some patches that make changes to the mount API UAPI and two of
> > > them really need applying, before -rc1 - if they're going to be applied at
> > > all.
> > 
> > I'm fine with 2--4, but I'm not convinced that cloexec-by-default crusade
> > makes any sense.  Could somebody give coherent arguments in favour of
> > abandoning the existing conventions?
> 
> So as I said in the commit message. From a userspace perspective it's
> more of an issue if one accidently leaks an fd to a task during exec.
> 
> Also, most of the time one does not want to inherit an fd during an
> exec. It is a hazzle to always have to specify an extra flag.
> 
> As Al pointed out to me open() semantics are not going anywhere. Sure,
> no argument there at all.
> But the idea of making fds cloexec by default is only targeted at fds
> that come from separate syscalls. fsopen(), open_tree_clone(), etc. they
> all return fds independent of open() so it's really easy to have them
> cloexec by default without regressing anyone and we also remove the need
> for a bunch of separate flags for each syscall to turn them into
> cloexec-fds. I mean, those for syscalls came with 4 separate flags to be
> able to specify that the returned fd should be made cloexec. The other
> way around, cloexec by default, fcntl() to remove the cloexec bit is way
> saner imho.

Re separate flags - it is, in principle, a valid argument.  OTOH, I'm not
sure if they need to be separate - they all have the same value and
I don't see any reason for that to change...

Only tangentially related, but I wonder if something like close_range(from, to)
would be a more useful approach...  That kind of open-coded loops is not
rare in userland and kernel-side code can do them much cheaper.  Something
like
	/* that exec is sensitive */
	unshare(CLONE_FILES);
	/* we don't want anything past stderr here */
	close_range(3, ~0U);
	execve(....);
on the userland side of thing.  Comments?
Christian Brauner May 16, 2019, 5:01 p.m. UTC | #5
On May 16, 2019 6:50:22 PM GMT+02:00, Al Viro <viro@zeniv.linux.org.uk> wrote:
>[linux-abi cc'd]
>
>On Thu, May 16, 2019 at 06:31:52PM +0200, Christian Brauner wrote:
>> On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
>> > On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
>> > > 
>> > > Hi Linus, Al,
>> > > 
>> > > Here are some patches that make changes to the mount API UAPI and
>two of
>> > > them really need applying, before -rc1 - if they're going to be
>applied at
>> > > all.
>> > 
>> > I'm fine with 2--4, but I'm not convinced that cloexec-by-default
>crusade
>> > makes any sense.  Could somebody give coherent arguments in favour
>of
>> > abandoning the existing conventions?
>> 
>> So as I said in the commit message. From a userspace perspective it's
>> more of an issue if one accidently leaks an fd to a task during exec.
>> 
>> Also, most of the time one does not want to inherit an fd during an
>> exec. It is a hazzle to always have to specify an extra flag.
>> 
>> As Al pointed out to me open() semantics are not going anywhere.
>Sure,
>> no argument there at all.
>> But the idea of making fds cloexec by default is only targeted at fds
>> that come from separate syscalls. fsopen(), open_tree_clone(), etc.
>they
>> all return fds independent of open() so it's really easy to have them
>> cloexec by default without regressing anyone and we also remove the
>need
>> for a bunch of separate flags for each syscall to turn them into
>> cloexec-fds. I mean, those for syscalls came with 4 separate flags to
>be
>> able to specify that the returned fd should be made cloexec. The
>other
>> way around, cloexec by default, fcntl() to remove the cloexec bit is
>way
>> saner imho.
>
>Re separate flags - it is, in principle, a valid argument.  OTOH, I'm
>not
>sure if they need to be separate - they all have the same value and
>I don't see any reason for that to change...
>
>Only tangentially related, but I wonder if something like
>close_range(from, to)
>would be a more useful approach...  That kind of open-coded loops is
>not
>rare in userland and kernel-side code can do them much cheaper. 
>Something
>like
>	/* that exec is sensitive */
>	unshare(CLONE_FILES);
>	/* we don't want anything past stderr here */
>	close_range(3, ~0U);
>	execve(....);
>on the userland side of thing.  Comments?

Very much in favor of that!
That'd be a neat new addition.
Dmitry V. Levin May 16, 2019, 8:23 p.m. UTC | #6
[looks like linux-abi is a typo, Cc'ed linux-api instead]

On Thu, May 16, 2019 at 05:50:22PM +0100, Al Viro wrote:
> [linux-abi cc'd]
> 
> On Thu, May 16, 2019 at 06:31:52PM +0200, Christian Brauner wrote:
> > On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
> > > On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
> > > > 
> > > > Hi Linus, Al,
> > > > 
> > > > Here are some patches that make changes to the mount API UAPI and two of
> > > > them really need applying, before -rc1 - if they're going to be applied at
> > > > all.
> > > 
> > > I'm fine with 2--4, but I'm not convinced that cloexec-by-default crusade
> > > makes any sense.  Could somebody give coherent arguments in favour of
> > > abandoning the existing conventions?
> > 
> > So as I said in the commit message. From a userspace perspective it's
> > more of an issue if one accidently leaks an fd to a task during exec.
> > 
> > Also, most of the time one does not want to inherit an fd during an
> > exec. It is a hazzle to always have to specify an extra flag.
> > 
> > As Al pointed out to me open() semantics are not going anywhere. Sure,
> > no argument there at all.
> > But the idea of making fds cloexec by default is only targeted at fds
> > that come from separate syscalls. fsopen(), open_tree_clone(), etc. they
> > all return fds independent of open() so it's really easy to have them
> > cloexec by default without regressing anyone and we also remove the need
> > for a bunch of separate flags for each syscall to turn them into
> > cloexec-fds. I mean, those for syscalls came with 4 separate flags to be
> > able to specify that the returned fd should be made cloexec. The other
> > way around, cloexec by default, fcntl() to remove the cloexec bit is way
> > saner imho.
> 
> Re separate flags - it is, in principle, a valid argument.  OTOH, I'm not
> sure if they need to be separate - they all have the same value and
> I don't see any reason for that to change...
> 
> Only tangentially related, but I wonder if something like close_range(from, to)
> would be a more useful approach...  That kind of open-coded loops is not
> rare in userland and kernel-side code can do them much cheaper.  Something
> like
> 	/* that exec is sensitive */
> 	unshare(CLONE_FILES);
> 	/* we don't want anything past stderr here */
> 	close_range(3, ~0U);
> 	execve(....);
> on the userland side of thing.  Comments?

glibc people need a syscall to implement closefrom properly, see
https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c14
Christian Brauner May 17, 2019, 6:54 a.m. UTC | #7
On May 16, 2019 10:23:31 PM GMT+02:00, "Dmitry V. Levin" <ldv@altlinux.org> wrote:
>[looks like linux-abi is a typo, Cc'ed linux-api instead]
>
>On Thu, May 16, 2019 at 05:50:22PM +0100, Al Viro wrote:
>> [linux-abi cc'd]
>> 
>> On Thu, May 16, 2019 at 06:31:52PM +0200, Christian Brauner wrote:
>> > On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
>> > > On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
>> > > > 
>> > > > Hi Linus, Al,
>> > > > 
>> > > > Here are some patches that make changes to the mount API UAPI
>and two of
>> > > > them really need applying, before -rc1 - if they're going to be
>applied at
>> > > > all.
>> > > 
>> > > I'm fine with 2--4, but I'm not convinced that cloexec-by-default
>crusade
>> > > makes any sense.  Could somebody give coherent arguments in
>favour of
>> > > abandoning the existing conventions?
>> > 
>> > So as I said in the commit message. From a userspace perspective
>it's
>> > more of an issue if one accidently leaks an fd to a task during
>exec.
>> > 
>> > Also, most of the time one does not want to inherit an fd during an
>> > exec. It is a hazzle to always have to specify an extra flag.
>> > 
>> > As Al pointed out to me open() semantics are not going anywhere.
>Sure,
>> > no argument there at all.
>> > But the idea of making fds cloexec by default is only targeted at
>fds
>> > that come from separate syscalls. fsopen(), open_tree_clone(), etc.
>they
>> > all return fds independent of open() so it's really easy to have
>them
>> > cloexec by default without regressing anyone and we also remove the
>need
>> > for a bunch of separate flags for each syscall to turn them into
>> > cloexec-fds. I mean, those for syscalls came with 4 separate flags
>to be
>> > able to specify that the returned fd should be made cloexec. The
>other
>> > way around, cloexec by default, fcntl() to remove the cloexec bit
>is way
>> > saner imho.
>> 
>> Re separate flags - it is, in principle, a valid argument.  OTOH, I'm
>not
>> sure if they need to be separate - they all have the same value and
>> I don't see any reason for that to change...
>> 
>> Only tangentially related, but I wonder if something like
>close_range(from, to)
>> would be a more useful approach...  That kind of open-coded loops is
>not
>> rare in userland and kernel-side code can do them much cheaper. 
>Something
>> like
>> 	/* that exec is sensitive */
>> 	unshare(CLONE_FILES);
>> 	/* we don't want anything past stderr here */
>> 	close_range(3, ~0U);
>> 	execve(....);
>> on the userland side of thing.  Comments?
>
>glibc people need a syscall to implement closefrom properly, see
>https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c14

I have a prototype for close_range().
I'll send it out after rc1.

Christian
Christian Brauner May 17, 2019, 7:01 a.m. UTC | #8
On May 16, 2019 6:50:22 PM GMT+02:00, Al Viro <viro@zeniv.linux.org.uk> wrote:
>[linux-abi cc'd]
>
>On Thu, May 16, 2019 at 06:31:52PM +0200, Christian Brauner wrote:
>> On Thu, May 16, 2019 at 05:22:59PM +0100, Al Viro wrote:
>> > On Thu, May 16, 2019 at 12:52:04PM +0100, David Howells wrote:
>> > > 
>> > > Hi Linus, Al,
>> > > 
>> > > Here are some patches that make changes to the mount API UAPI and
>two of
>> > > them really need applying, before -rc1 - if they're going to be
>applied at
>> > > all.
>> > 
>> > I'm fine with 2--4, but I'm not convinced that cloexec-by-default
>crusade
>> > makes any sense.  Could somebody give coherent arguments in favour
>of
>> > abandoning the existing conventions?
>> 
>> So as I said in the commit message. From a userspace perspective it's
>> more of an issue if one accidently leaks an fd to a task during exec.
>> 
>> Also, most of the time one does not want to inherit an fd during an
>> exec. It is a hazzle to always have to specify an extra flag.
>> 
>> As Al pointed out to me open() semantics are not going anywhere.
>Sure,
>> no argument there at all.
>> But the idea of making fds cloexec by default is only targeted at fds
>> that come from separate syscalls. fsopen(), open_tree_clone(), etc.
>they
>> all return fds independent of open() so it's really easy to have them
>> cloexec by default without regressing anyone and we also remove the
>need
>> for a bunch of separate flags for each syscall to turn them into
>> cloexec-fds. I mean, those for syscalls came with 4 separate flags to
>be
>> able to specify that the returned fd should be made cloexec. The
>other
>> way around, cloexec by default, fcntl() to remove the cloexec bit is
>way
>> saner imho.
>
>Re separate flags - it is, in principle, a valid argument.  OTOH, I'm
>not
>sure if they need to be separate - they all have the same value and
>I don't see any reason for that to change...

One last thing I'd like to point out is that
we already have syscalls and ioctls that
return cloexec fds. So the consistency
argument is kinda dead.

If you still prefer to have cloexec flags
for the 4 new syscalls then yes,
if they could at least all have the same name
(FSMOUNT_CLOEXEC?) that would be good.

>
>Only tangentially related, but I wonder if something like
>close_range(from, to)
>would be a more useful approach...  That kind of open-coded loops is
>not
>rare in userland and kernel-side code can do them much cheaper. 
>Something
>like
>	/* that exec is sensitive */
>	unshare(CLONE_FILES);
>	/* we don't want anything past stderr here */
>	close_range(3, ~0U);
>	execve(....);
>on the userland side of thing.  Comments?

Said it before but, the list was mistyped so again:
I think that's a great idea.
I have a prototype for close_range(start, end, flags).
I'll wait after rc1 and then send it out.

Christian
David Howells May 17, 2019, 7:13 a.m. UTC | #9
Christian Brauner <christian@brauner.io> wrote:

> If you still prefer to have cloexec flags
> for the 4 new syscalls then yes,
> if they could at least all have the same name
> (FSMOUNT_CLOEXEC?) that would be good.

They don't all have the same value (see OPEN_TREE_CLOEXEC).

Note that I also don't want to blindly #define them to O_CLOEXEC because it's
not necessarily the same value on all arches.  Currently it can be 02000000,
010000000 or 0x400000 for instance, which means that if it's sharing a mask
with other flags, at least three bits have to be reserved for it or we have to
have arch-dependent bit juggling.

One thing I like about your approach of just making them O_CLOEXEC by default
and removing the constants is that it avoids this mess entirely.

David
Miklos Szeredi May 17, 2019, 7:25 a.m. UTC | #10
On Fri, May 17, 2019 at 9:13 AM David Howells <dhowells@redhat.com> wrote:
>
> Christian Brauner <christian@brauner.io> wrote:
>
> > If you still prefer to have cloexec flags
> > for the 4 new syscalls then yes,
> > if they could at least all have the same name
> > (FSMOUNT_CLOEXEC?) that would be good.
>
> They don't all have the same value (see OPEN_TREE_CLOEXEC).
>
> Note that I also don't want to blindly #define them to O_CLOEXEC because it's
> not necessarily the same value on all arches.  Currently it can be 02000000,
> 010000000 or 0x400000 for instance, which means that if it's sharing a mask
> with other flags, at least three bits have to be reserved for it or we have to
> have arch-dependent bit juggling.
>
> One thing I like about your approach of just making them O_CLOEXEC by default
> and removing the constants is that it avoids this mess entirely.

+1.

Confusion caused by inconsistency of naming is going to hurt more than
inconsistency of semantics wrt. open(2).

Thanks,
Miklos
Christian Brauner May 17, 2019, 7:27 a.m. UTC | #11
On May 17, 2019 9:13:26 AM GMT+02:00, David Howells <dhowells@redhat.com> wrote:
>Christian Brauner <christian@brauner.io> wrote:
>
>> If you still prefer to have cloexec flags
>> for the 4 new syscalls then yes,
>> if they could at least all have the same name
>> (FSMOUNT_CLOEXEC?) that would be good.
>
>They don't all have the same value (see OPEN_TREE_CLOEXEC).
>
>Note that I also don't want to blindly #define them to O_CLOEXEC
>because it's
>not necessarily the same value on all arches.  Currently it can be
>02000000,
>010000000 or 0x400000 for instance, which means that if it's sharing a
>mask
>with other flags, at least three bits have to be reserved for it or we
>have to
>have arch-dependent bit juggling.


Ugh. Right, I forgot about that entirely.

Christian

>
>One thing I like about your approach of just making them O_CLOEXEC by
>default
>and removing the constants is that it avoids this mess entirely.
>
>David