diff mbox

[30/32] vfs: Allow cloning of a mount tree with open(O_PATH|O_CLONE_MOUNT) [ver #8]

Message ID 152720691829.9073.10564431140980997005.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

David Howells May 25, 2018, 12:08 a.m. UTC
Make it possible to clone a mount tree with a new pair of open flags that
are used in conjunction with O_PATH:

 (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.

 (2) O_NON_RECURSIVE - Don't clone recursively.

Note that it's not a good idea to reuse other flags (such as O_CREAT)
because the open routine for O_PATH does not give an error if any other
flags are used in conjunction with O_PATH, but rather just masks off any it
doesn't use.

The resultant file struct is marked FMODE_NEED_UNMOUNT to as it pins an
extra reference for the mount.  This will be cleared by the upcoming
move_mount() syscall when it successfully moves a cloned mount into the
filesystem tree.

Note that care needs to be taken with the error handling in do_o_path() in
the case that vfs_open() fails as the path may or may not have been
attached to the file struct and FMODE_NEED_UNMOUNT may or may not be set.
Note that O_DIRECT | O_PATH could be a problem with error handling too.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fcntl.c                       |    2 +-
 fs/internal.h                    |    1 +
 fs/namei.c                       |   26 ++++++++++++++++++----
 fs/namespace.c                   |   44 ++++++++++++++++++++++++++++++++++++++
 fs/open.c                        |    7 +++++-
 include/linux/fcntl.h            |    3 ++-
 include/uapi/asm-generic/fcntl.h |    8 +++++++
 7 files changed, 83 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig June 1, 2018, 6:26 a.m. UTC | #1
On Fri, May 25, 2018 at 01:08:38AM +0100, David Howells wrote:
> Make it possible to clone a mount tree with a new pair of open flags that
> are used in conjunction with O_PATH:
> 
>  (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.
> 
>  (2) O_NON_RECURSIVE - Don't clone recursively.

Err.  I don't think we should use up two O_* flags for something
only useful for your new mount API.  Don't we have a better place
to for these flags?

Instead of overloading this on open having a specific syscalls just
seems like a much saner idea.
Al Viro June 1, 2018, 6:39 a.m. UTC | #2
On Thu, May 31, 2018 at 11:26:54PM -0700, Christoph Hellwig wrote:
> On Fri, May 25, 2018 at 01:08:38AM +0100, David Howells wrote:
> > Make it possible to clone a mount tree with a new pair of open flags that
> > are used in conjunction with O_PATH:
> > 
> >  (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.
> > 
> >  (2) O_NON_RECURSIVE - Don't clone recursively.
> 
> Err.  I don't think we should use up two O_* flags for something
> only useful for your new mount API.  Don't we have a better place
> to for these flags?
> 
> Instead of overloading this on open having a specific syscalls just
> seems like a much saner idea.

It's not just mount API; these can be used independently of that.
Think of the uses where you pass those to ...at() and you'll see
a bunch of applications of that thing.
Amir Goldstein June 1, 2018, 8:02 a.m. UTC | #3
[added linux-api]

On Fri, May 25, 2018 at 3:08 AM, David Howells <dhowells@redhat.com> wrote:
> Make it possible to clone a mount tree with a new pair of open flags that
> are used in conjunction with O_PATH:
>
>  (1) O_CLONE_MOUNT - Clone the mount or mount tree at the path.
>
>  (2) O_NON_RECURSIVE - Don't clone recursively.
>
> Note that it's not a good idea to reuse other flags (such as O_CREAT)
> because the open routine for O_PATH does not give an error if any other
> flags are used in conjunction with O_PATH, but rather just masks off any it
> doesn't use.
>
> The resultant file struct is marked FMODE_NEED_UNMOUNT to as it pins an
> extra reference for the mount.  This will be cleared by the upcoming
> move_mount() syscall when it successfully moves a cloned mount into the
> filesystem tree.
>
> Note that care needs to be taken with the error handling in do_o_path() in
> the case that vfs_open() fails as the path may or may not have been
> attached to the file struct and FMODE_NEED_UNMOUNT may or may not be set.
> Note that O_DIRECT | O_PATH could be a problem with error handling too.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
[...]

> @@ -977,8 +979,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
>                  * If we have O_PATH in the open flag. Then we
>                  * cannot have anything other than the below set of flags
>                  */
> -               flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
> +               flags &= (O_DIRECTORY | O_NOFOLLOW | O_PATH |
> +                         O_CLONE_MOUNT | O_NON_RECURSIVE);
>                 acc_mode = 0;
> +       } else if (flags & (O_CLONE_MOUNT | O_NON_RECURSIVE)) {
> +               return -EINVAL;

Reject O_NON_RECURSIVE without O_CLONE_MOUNT?
That would free at least one flag combination for future use.

Doesn't it make more sense for user API to opt-into
O_RECURSIVE_CLONE, rather than opt-out of it?


>         }
>
>         op->open_flag = flags;
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 27dc7a60693e..8f60e2244740 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -9,7 +9,8 @@
>         (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>          O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
>          FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> -        O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> +        O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | \
> +        O_CLONE_MOUNT | O_NON_RECURSIVE)
>
>  #ifndef force_o_largefile
>  #define force_o_largefile() (BITS_PER_LONG != 32)
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 0b1c7e35090c..f533e35ea19b 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -88,6 +88,14 @@
>  #define __O_TMPFILE    020000000
>  #endif
>
> +#ifndef O_CLONE_MOUNT
> +#define O_CLONE_MOUNT  040000000       /* Used with O_PATH to clone the mount subtree at path */
> +#endif
> +
> +#ifndef O_NON_RECURSIVE
> +#define O_NON_RECURSIVE        0100000000      /* Used with O_CLONE_MOUNT to only clone one mount */
> +#endif
> +
>  /* a horrid kludge trying to make sure that this will fail on old kernels */
>  #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>  #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
>

I am not sure what are the consequences of opening O_PATH with old kernel
and getting an open file, can't think of anything bad.
Can the same be claimed for O_PATH|O_CLONE_MOUNT?

Wouldn't it be better to apply the O_TMPFILE kludge to the new
open flag, so that apps can check if O_CLONE_MOUNT feature is supported
by kernel?

Thanks,
Amir.
David Howells June 1, 2018, 8:27 a.m. UTC | #4
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> > Instead of overloading this on open having a specific syscalls just
> > seems like a much saner idea.
> 
> It's not just mount API; these can be used independently of that.
> Think of the uses where you pass those to ...at() and you'll see
> a bunch of applications of that thing.

I kind of agree with Christoph on this point.  Yes, you can use the resultant
fd for other things, but that doesn't mean it has to be obtained initially
through open() or openat() rather than, say, a new pick_mount() syscall.

Further, having more parameters available gives us the opportunity to change
the settings on any mounts we create at the point of creation.

David
David Howells June 1, 2018, 8:42 a.m. UTC | #5
Amir Goldstein <amir73il@gmail.com> wrote:

> Reject O_NON_RECURSIVE without O_CLONE_MOUNT?

Yes, I should add that.

> I am not sure what are the consequences of opening O_PATH with old kernel
> and getting an open file, can't think of anything bad.
> Can the same be claimed for O_PATH|O_CLONE_MOUNT?

Yes, actually, there can be consequences.  Some files have side effects.
Think open("/dev/foobar", O_PATH).

> Wouldn't it be better to apply the O_TMPFILE kludge to the new
> open flag, so that apps can check if O_CLONE_MOUNT feature is supported
> by kernel?

Ugh.  The problem is that the O_TMPFILE kludge can't be done because O_PATH
currently just masks off any bits it's not interested in rather than giving an
error.

Even the O_TMPFILE kludge doesn't protect you against someone having set
random unassigned bits when testing on a kernel that didn't support it.

And this bit:

	/*
	 * Clear out all open flags we don't know about so that we don't report
	 * them in fcntl(F_GETFD) or similar interfaces.
	 */
	flags &= VALID_OPEN_FLAGS;

is just plain wrong.  Effectively, it allows userspace to set random reserved
bits without consequences.  It should give an error instead.

Probably we should really replace open() and openat() both before we can
allocate any further open flags.

</grumble>

David
Al Viro June 2, 2018, 3:09 a.m. UTC | #6
On Fri, Jun 01, 2018 at 09:27:43AM +0100, David Howells wrote:
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > > Instead of overloading this on open having a specific syscalls just
> > > seems like a much saner idea.
> > 
> > It's not just mount API; these can be used independently of that.
> > Think of the uses where you pass those to ...at() and you'll see
> > a bunch of applications of that thing.
> 
> I kind of agree with Christoph on this point.  Yes, you can use the resultant
> fd for other things, but that doesn't mean it has to be obtained initially
> through open() or openat() rather than, say, a new pick_mount() syscall.
> 
> Further, having more parameters available gives us the opportunity to change
> the settings on any mounts we create at the point of creation.

open_subtree(int dirfd, const char *pathname, int flags), then?  How would
flags be interpreted?  What I see mapping at that thing is
	* equivalent of O_PATH open
	* clone subtree, O_PATH open root
	* clone one mount, O_PATH open root
and apparently you want to add (orthogonal to that)
	* make shared/slave/private/unbindable
	* ditto with recursion?
	* same for nodev/nosuid/noexec/noatime/nodiratime/relatime/ro/?
as well as usual AT_... flags (empty path, follow)

Choose the encoding...
Al Viro June 2, 2018, 3:42 a.m. UTC | #7
On Sat, Jun 02, 2018 at 04:09:14AM +0100, Al Viro wrote:
> On Fri, Jun 01, 2018 at 09:27:43AM +0100, David Howells wrote:
> > Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > 
> > > > Instead of overloading this on open having a specific syscalls just
> > > > seems like a much saner idea.
> > > 
> > > It's not just mount API; these can be used independently of that.
> > > Think of the uses where you pass those to ...at() and you'll see
> > > a bunch of applications of that thing.
> > 
> > I kind of agree with Christoph on this point.  Yes, you can use the resultant
> > fd for other things, but that doesn't mean it has to be obtained initially
> > through open() or openat() rather than, say, a new pick_mount() syscall.
> > 
> > Further, having more parameters available gives us the opportunity to change
> > the settings on any mounts we create at the point of creation.
> 
> open_subtree(int dirfd, const char *pathname, int flags), then?  How would
> flags be interpreted?  What I see mapping at that thing is
> 	* equivalent of O_PATH open
> 	* clone subtree, O_PATH open root
> 	* clone one mount, O_PATH open root
> and apparently you want to add (orthogonal to that)
> 	* make shared/slave/private/unbindable
> 	* ditto with recursion?
> 	* same for nodev/nosuid/noexec/noatime/nodiratime/relatime/ro/?
> as well as usual AT_... flags (empty path, follow)
> 
> Choose the encoding...

_If_ I'm interpreting that correctly, that should be something like a bitmap
of attributes to modify + values to set for each.  Let's see -
	propagation	1 + 2 bits
	nodev		1 + 1
	noexec		1 + 1
	nosuid		1 + 1
	ro		1 + 1
	atime		1 + 3
That's 15 bits.  On top of that, we have 1 bit for "clone or original"
and 1 bit for "recursive or single-mount".  As well as AT_EMPTY_PATH,
and AT_NO_AUTOMOUNT (inconvenient, since these are fixed bits).  In
principle, that does fit into int, with some space to spare...

Is that what you have in mind?
Al Viro June 2, 2018, 4:04 a.m. UTC | #8
On Sat, Jun 02, 2018 at 04:42:56AM +0100, Al Viro wrote:
> _If_ I'm interpreting that correctly, that should be something like a bitmap
> of attributes to modify + values to set for each.  Let's see -
> 	propagation	1 + 2 bits
> 	nodev		1 + 1
> 	noexec		1 + 1
> 	nosuid		1 + 1
> 	ro		1 + 1
> 	atime		1 + 3
> That's 15 bits.  On top of that, we have 1 bit for "clone or original"
> and 1 bit for "recursive or single-mount".  As well as AT_EMPTY_PATH,
> and AT_NO_AUTOMOUNT (inconvenient, since these are fixed bits).  In
> principle, that does fit into int, with some space to spare...
> 
> Is that what you have in mind?

TBH, I would probably prefer separate mount_setattr(2) for that kind
of work, with something like
	int mount_setattr(int dirfd, const char *path, int flags, int attr)
*not* opening any files.
flags:
	AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE
attr:
	MOUNT_SETATTR_DEV		(1<<0)
	MOUNT_SETATTR_NODEV		(1<<0)|(1<<1)
	MOUNT_SETATTR_EXEC		(1<<2)
	MOUNT_SETATTR_NOEXEC		(1<<2)|(1<<3)
	MOUNT_SETATTR_SUID		(1<<4)
	MOUNT_SETATTR_NOSUID		(1<<4)|(1<<5)
	MOUNT_SETATTR_RW		(1<<6)
	MOUNT_SETATTR_RO		(1<<6)|(1<<7)
	MOUNT_SETATTR_RELATIME		(1<<8)
	MOUNT_SETATTR_NOATIME		(1<<8)|(1<<9)
	MOUNT_SETATTR_NODIRATIME	(1<<8)|(2<<9)
	MOUNT_SETATTR_STRICTATIME	(1<<8)|(3<<9)
	MOUNT_SETATTR_PRIVATE		(1<<11)
	MOUNT_SETATTR_SHARED		(1<<11)|(1<<12)
	MOUNT_SETATTR_SLAVE		(1<<11)|(2<<12)
	MOUNT_SETATTR_UNBINDABLE	(1<<11)|(3<<12)

With either openat() used as in this series, or explicit
	int open_tree(int dirfd, const char *path, int flags)
returning a descriptor, with flags being
	AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE, AT_CLONE
with AT_RECURSIVE without AT_CLONE being an error.  Hell, might
even add AT_UMOUNT for "open root and detach, to be dissolved on
close", incompatible with AT_CLONE.

Comments?
David Howells June 2, 2018, 3:45 p.m. UTC | #9
Al Viro <viro@ZenIV.linux.org.uk> wrote:

> TBH, I would probably prefer separate mount_setattr(2) for that kind
> of work, with something like
> 	int mount_setattr(int dirfd, const char *path, int flags, int attr)
> *not* opening any files.
> flags:
> 	AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE

I would call these MOUNT_SETATTR_* rather than AT_*.

> attr:
> 	MOUNT_SETATTR_DEV		(1<<0)
> 	MOUNT_SETATTR_NODEV		(1<<0)|(1<<1)
> 	MOUNT_SETATTR_EXEC		(1<<2)
> 	MOUNT_SETATTR_NOEXEC		(1<<2)|(1<<3)
> 	MOUNT_SETATTR_SUID		(1<<4)
> 	MOUNT_SETATTR_NOSUID		(1<<4)|(1<<5)
> 	MOUNT_SETATTR_RW		(1<<6)
> 	MOUNT_SETATTR_RO		(1<<6)|(1<<7)
> 	MOUNT_SETATTR_RELATIME		(1<<8)
> 	MOUNT_SETATTR_NOATIME		(1<<8)|(1<<9)
> 	MOUNT_SETATTR_NODIRATIME	(1<<8)|(2<<9)
> 	MOUNT_SETATTR_STRICTATIME	(1<<8)|(3<<9)
> 	MOUNT_SETATTR_PRIVATE		(1<<11)
> 	MOUNT_SETATTR_SHARED		(1<<11)|(1<<12)
> 	MOUNT_SETATTR_SLAVE		(1<<11)|(2<<12)
> 	MOUNT_SETATTR_UNBINDABLE	(1<<11)|(3<<12)

So, I like this generally, some notes though:

I wonder if this should be two separate parameters, a mask and the settings?
I'm not sure that's worth it since some of the mask bits would cover multiple
settings.

Also, should NODIRATIME be separate from the other *ATIME flags?  I do also
like treating some of these settings as enumerations rather than a set of
bits.

I would make the prototype:

	int mount_setattr(int dirfd, const char *path,
			  unsigned int flags, unsigned int attr,
			  void *reserved5);

Further, do we want to say you can either change the propagation type *or*
reconfigure the mountpoint restrictions, but not both at the same time?

> With either openat() used as in this series, or explicit
> 	int open_tree(int dirfd, const char *path, int flags)

Maybe open_mount(), grab_mount() or pick_mount()?

I wonder if fsopen()/fspick() should be create_fs()/open_fs()...

> returning a descriptor, with flags being
> 	AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE, AT_CLONE
> with AT_RECURSIVE without AT_CLONE being an error.

You also need an O_CLOEXEC equivalent.

I would make it:

	OPEN_TREE_CLOEXEC		0x00000001
	OPEN_TREE_EMPTY_PATH		0x00000002
	OPEN_TREE_FOLLOW_SYMLINK	0x00000004
	OPEN_TREE_NO_AUTOMOUNT		0x00000008
	OPEN_TREE_CLONE			0x00000010
	OPEN_TREE_RECURSIVE		0x00000020

adding the follow-symlinks so that you don't grab a symlink target by
accident.  (Can you actually mount on top of a symlink?)

> Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on
> close", incompatible with AT_CLONE.

Cute.  Guess you could do:

	fd = open_mount(..., OPEN_TREE_DETACH);
	mount_setattr(fd, "",
		      MOUNT_SETATTR_EMPTY_PATH,
		      MOUNT_SETATTR_NOSUID, NULL);
	move_mount(fd, "", ...);

David
Al Viro June 2, 2018, 5:49 p.m. UTC | #10
On Sat, Jun 02, 2018 at 04:45:21PM +0100, David Howells wrote:
> Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > TBH, I would probably prefer separate mount_setattr(2) for that kind
> > of work, with something like
> > 	int mount_setattr(int dirfd, const char *path, int flags, int attr)
> > *not* opening any files.
> > flags:
> > 	AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE
> 
> I would call these MOUNT_SETATTR_* rather than AT_*.

Why?  AT_EMPTY_PATH/AT_NO_AUTOMOUNT are common with other ...at()
syscalls; AT_RECURSIVE - maybe, but it's still more like AT_...
namespace fodder, IMO.

> > attr:
> > 	MOUNT_SETATTR_DEV		(1<<0)
> > 	MOUNT_SETATTR_NODEV		(1<<0)|(1<<1)
> > 	MOUNT_SETATTR_EXEC		(1<<2)
> > 	MOUNT_SETATTR_NOEXEC		(1<<2)|(1<<3)
> > 	MOUNT_SETATTR_SUID		(1<<4)
> > 	MOUNT_SETATTR_NOSUID		(1<<4)|(1<<5)
> > 	MOUNT_SETATTR_RW		(1<<6)
> > 	MOUNT_SETATTR_RO		(1<<6)|(1<<7)
> > 	MOUNT_SETATTR_RELATIME		(1<<8)
> > 	MOUNT_SETATTR_NOATIME		(1<<8)|(1<<9)
> > 	MOUNT_SETATTR_NODIRATIME	(1<<8)|(2<<9)
> > 	MOUNT_SETATTR_STRICTATIME	(1<<8)|(3<<9)
> > 	MOUNT_SETATTR_PRIVATE		(1<<11)
> > 	MOUNT_SETATTR_SHARED		(1<<11)|(1<<12)
> > 	MOUNT_SETATTR_SLAVE		(1<<11)|(2<<12)
> > 	MOUNT_SETATTR_UNBINDABLE	(1<<11)|(3<<12)
> 
> So, I like this generally, some notes though:
> 
> I wonder if this should be two separate parameters, a mask and the settings?
> I'm not sure that's worth it since some of the mask bits would cover multiple
> settings.

Nah, better put those bits in the same word, as in above.  Here bits 0, 2, 4, 6,
8 and 11 tell which attributes are to be modified, with values to set living
in bits 1, 3, 5, 7, 9--10 and 12--13.  Look at the constants above..

> Also, should NODIRATIME be separate from the other *ATIME flags?  I do also
> like treating some of these settings as enumerations rather than a set of
> bits.

Huh?  That's precisely what I'm doing there: bit 8 is "want to change atime
settings", bits 9 and 10 hold a 4-element enumeration (rel/no/nodir/strict).
Similar for propagation settings (bit 11 indicates that we want to set those,
bits 12 and 13 - 4-element enum)...

> I would make the prototype:
> 
> 	int mount_setattr(int dirfd, const char *path,
> 			  unsigned int flags, unsigned int attr,
> 			  void *reserved5);
> 
> Further, do we want to say you can either change the propagation type *or*
> reconfigure the mountpoint restrictions, but not both at the same time?

Why?  MOUNT_SETATTR_PRIVATE | MOUNT_NOATIME | MOUNT_SUID, i.e. 00101100010000,
i.e. 0xb10 for "turn nosuid off, switch atime polcy to noatime, change propagation
to private, leave everything else as-is"...

And for fsck sake, what's that "void *reserved5" for?

> > With either openat() used as in this series, or explicit
> > 	int open_tree(int dirfd, const char *path, int flags)
> 
> Maybe open_mount(), grab_mount() or pick_mount()?
> 
> I wonder if fsopen()/fspick() should be create_fs()/open_fs()...
> 
> > returning a descriptor, with flags being
> > 	AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_RECURSIVE, AT_CLONE
> > with AT_RECURSIVE without AT_CLONE being an error.
> 
> You also need an O_CLOEXEC equivalent.

Point.

> I would make it:
> 
> 	OPEN_TREE_CLOEXEC		0x00000001

Why not O_CLOEXEC, as with epoll_create()/timerfd_create()/etc.?

> 	OPEN_TREE_EMPTY_PATH		0x00000002
> 	OPEN_TREE_FOLLOW_SYMLINK	0x00000004
> 	OPEN_TREE_NO_AUTOMOUNT		0x00000008

Why?  How are those different from normal AT_EMPTY_PATH/AT_NO_AUTOMOUNT?

> 	OPEN_TREE_CLONE			0x00000010
> 	OPEN_TREE_RECURSIVE		0x00000020
> 
> adding the follow-symlinks so that you don't grab a symlink target by
> accident.  (Can you actually mount on top of a symlink?)

You can't - mount(2) uses LOOKUP_FOLLOW for mountpoint (well, user_path(),
actually).

> > Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on
> > close", incompatible with AT_CLONE.
> 
> Cute.  Guess you could do:
> 
> 	fd = open_mount(..., OPEN_TREE_DETACH);
> 	mount_setattr(fd, "",
> 		      MOUNT_SETATTR_EMPTY_PATH,
> 		      MOUNT_SETATTR_NOSUID, NULL);
> 	move_mount(fd, "", ...);
diff mbox

Patch

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 60bc5bf2f4cf..42a53cf03737 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1028,7 +1028,7 @@  static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(19 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
 		     HWEIGHT32(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/internal.h b/fs/internal.h
index c29552e0522f..e3460a2e6b59 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -75,6 +75,7 @@  extern struct vfsmount *lookup_mnt(const struct path *);
 extern int finish_automount(struct vfsmount *, struct path *);
 
 extern int sb_prepare_remount_readonly(struct super_block *);
+extern int copy_mount_for_o_path(struct path *, struct path *, bool);
 
 extern void __init mnt_init(void);
 
diff --git a/fs/namei.c b/fs/namei.c
index 5cbd980b4031..acb8e27d4288 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3458,13 +3458,29 @@  static int do_tmpfile(struct nameidata *nd, unsigned flags,
 
 static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 {
-	struct path path;
-	int error = path_lookupat(nd, flags, &path);
-	if (!error) {
-		audit_inode(nd->name, path.dentry, 0);
-		error = vfs_open(&path, file, current_cred());
+	struct path path, tmp;
+	int error;
+
+	error = path_lookupat(nd, flags, &path);
+	if (error)
+		return error;
+
+	if (file->f_flags & O_CLONE_MOUNT) {
+		error = copy_mount_for_o_path(
+			&path, &tmp, !(file->f_flags & O_NON_RECURSIVE));
 		path_put(&path);
+		if (error < 0)
+			return error;
+		path = tmp;
 	}
+
+	audit_inode(nd->name, path.dentry, 0);
+	error = vfs_open(&path, file, current_cred());
+	if (error < 0 &&
+	    (flags & O_CLONE_MOUNT) &&
+	    !(file->f_mode & FMODE_NEED_UNMOUNT))
+		__detach_mounts(path.dentry);
+	path_put(&path);
 	return error;
 }
 
diff --git a/fs/namespace.c b/fs/namespace.c
index dba680aa1ea4..e73cfcdfb3d1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2218,6 +2218,50 @@  static int do_loopback(struct path *path, const char *old_name,
 	return err;
 }
 
+/*
+ * Copy the mount or mount subtree at the specified path for
+ * open(O_PATH|O_CLONE_MOUNT).
+ */
+int copy_mount_for_o_path(struct path *from, struct path *to, bool recurse)
+{
+	struct mountpoint *mp;
+	struct mount *mnt = NULL, *f = real_mount(from->mnt);
+	int ret;
+
+	mp = lock_mount(from);
+	if (IS_ERR(mp))
+		return PTR_ERR(mp);
+
+	ret = -EINVAL;
+	if (IS_MNT_UNBINDABLE(f))
+		goto out_unlock;
+
+	if (!check_mnt(f) && from->dentry->d_op != &ns_dentry_operations)
+		goto out_unlock;
+
+	if (!recurse && has_locked_children(f, from->dentry))
+		goto out_unlock;
+
+	if (recurse)
+		mnt = copy_tree(f, from->dentry, CL_COPY_MNT_NS_FILE);
+	else
+		mnt = clone_mnt(f, from->dentry, 0);
+	if (IS_ERR(mnt)) {
+		ret = PTR_ERR(mnt);
+		goto out_unlock;
+	}
+
+	mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+	to->mnt = &mnt->mnt;
+	to->dentry = dget(from->dentry);
+	ret = 0;
+
+out_unlock:
+	unlock_mount(mp);
+	return ret;
+}
+
 static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 {
 	int error = 0;
diff --git a/fs/open.c b/fs/open.c
index 79a8a1bd740d..27ce9c60345a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -748,6 +748,8 @@  static int do_dentry_open(struct file *f,
 
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode |= FMODE_PATH;
+		if (f->f_flags & O_CLONE_MOUNT)
+			f->f_mode |= FMODE_NEED_UNMOUNT;
 		f->f_op = &empty_fops;
 		goto done;
 	}
@@ -977,8 +979,11 @@  static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		 * If we have O_PATH in the open flag. Then we
 		 * cannot have anything other than the below set of flags
 		 */
-		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+		flags &= (O_DIRECTORY | O_NOFOLLOW | O_PATH |
+			  O_CLONE_MOUNT | O_NON_RECURSIVE);
 		acc_mode = 0;
+	} else if (flags & (O_CLONE_MOUNT | O_NON_RECURSIVE)) {
+		return -EINVAL;
 	}
 
 	op->open_flag = flags;
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 27dc7a60693e..8f60e2244740 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,8 @@ 
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | \
+	 O_CLONE_MOUNT | O_NON_RECURSIVE)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (BITS_PER_LONG != 32)
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0b1c7e35090c..f533e35ea19b 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,14 @@ 
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLONE_MOUNT
+#define O_CLONE_MOUNT	040000000	/* Used with O_PATH to clone the mount subtree at path */
+#endif
+
+#ifndef O_NON_RECURSIVE
+#define O_NON_RECURSIVE	0100000000	/* Used with O_CLONE_MOUNT to only clone one mount */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)