Message ID | 20180603005532.GZ30522@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Jun 3, 2018 at 2:55 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Sat, Jun 02, 2018 at 06:49:58PM +0100, Al Viro wrote: > >> > > Hell, might even add AT_UMOUNT for "open root and detach, to be dissolved on >> > > close", incompatible with AT_CLONE. >> > >> > Cute. Guess you could do: >> > >> > fd = open_mount(..., OPEN_TREE_DETACH); >> > mount_setattr(fd, "", >> > MOUNT_SETATTR_EMPTY_PATH, >> > MOUNT_SETATTR_NOSUID, NULL); >> > move_mount(fd, "", ...); > > Hadn't added that yet, but as for the rest of open_tree() - see > vfs.git#mount-open_tree. open() and its flags are not touched at all. > Other changes compared to the previous: > * may_mount() is required for OPEN_TREE_CLONE > * sys_ni.c cruft is dropped - those make no sense until and unless > those syscalls become conditional. > > Appears to work, combined with move_mount() it yields eqiuvalents of > mount --{move,bind,rbind}. Combined with mount_setattr(2) (when that > gets added) we'll get mount -o remount,bind,nodev et.al. fsopen = create fsfd fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd fspick = path -> fsfd move_mount = attach mountfd or move existing fsinfo = info from path open_tree = new mountfd from path or clone mount_setattr = set attr on mountfd Notice that fsmount() encompasses mount_setattr() + move_mount() functionality. Split those out and leave fsmount() to actually do the "fsfd ->mountfd" translation? fsinfo() name suggests it's in the same class as fsopen/fsmount/fspick, operating on fsfd object, but's it's not and I think that's slightly confusing. Rename move_mount() -> mount_move()? Also does it make sense to make the cloning behavior of open_tree() optional? Without cloning it's just a plain open(O_PATH). That way it could be renamed mount_clone(). Thanks, Miklos
Miklos Szeredi <miklos@szeredi.hu> wrote:
> fsinfo = info from path
Actually, I was thinking of making fsinfo() detect if it's been given an fsfd
and go through an fs_context operation instead in that case.
David
On Mon, Jun 04, 2018 at 12:34:44PM +0200, Miklos Szeredi wrote: > fsopen = create fsfd > fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd > fspick = path -> fsfd > move_mount = attach mountfd or move existing > fsinfo = info from path > open_tree = new mountfd from path or clone > mount_setattr = set attr on mountfd > > Notice that fsmount() encompasses mount_setattr() + move_mount() > functionality. Split those out and leave fsmount() to actually do > the "fsfd ->mountfd" translation? Might make sense. > fsinfo() name suggests it's in the same class as > fsopen/fsmount/fspick, operating on fsfd object, but's it's not and I > think that's slightly confusing. > > Rename move_mount() -> mount_move()? mount_move_bikeshed_bikeshed_bikeshed(), surely? > Also does it make sense to make the cloning behavior of open_tree() > optional? Without cloning it's just a plain open(O_PATH). That way > it could be renamed mount_clone(). Umm... I'm not sure about that one. If nothing else, OPEN_TREE_DETACH might be a good idea, in which case cloning is not the primary effect; hell knows.
On Mon, Jun 04, 2018 at 04:52:05PM +0100, Al Viro wrote: > On Mon, Jun 04, 2018 at 12:34:44PM +0200, Miklos Szeredi wrote: > > > fsopen = create fsfd > > fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd > > fspick = path -> fsfd > > move_mount = attach mountfd or move existing > > fsinfo = info from path > > open_tree = new mountfd from path or clone > > mount_setattr = set attr on mountfd > > > > Notice that fsmount() encompasses mount_setattr() + move_mount() > > functionality. Split those out and leave fsmount() to actually do > > the "fsfd ->mountfd" translation? > > Might make sense. FWIW, to make it clear: fsmount(2) in this series actually does *NOT* attach it to the tree. Commit message definitely needs updating - as it is, it's +SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags, + void *, reserved4, void *, reserved5) PS: IMO these reserved... arguments are in bad taste; if anyone has good reasons for that practice in ABI design, I'd like to hear those.
On Sun, Jun 03, 2018 at 01:55:37AM +0100, Al Viro wrote: > +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags) > +{ > + struct file *file; > + struct path path; > + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW; > + bool detached = flags & OPEN_TREE_CLONE; > + int error; > + int fd; > + > + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC); Why do we need OPEN_TREE_CLOEXEC? Wouldn't we be better off just making the fd returned by open_tree implicitly close-on-exec? I can think of no good reason for these file descriptors to be inherited across exec() and if someone comes up with such a reason, fcntl(F_SETFD) is not an expensive call to make.
On Mon, Jun 04, 2018 at 10:16:30AM -0700, Matthew Wilcox wrote: > On Sun, Jun 03, 2018 at 01:55:37AM +0100, Al Viro wrote: > > +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags) > > +{ > > + struct file *file; > > + struct path path; > > + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW; > > + bool detached = flags & OPEN_TREE_CLONE; > > + int error; > > + int fd; > > + > > + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC); > > Why do we need OPEN_TREE_CLOEXEC? Wouldn't we be better off just making > the fd returned by open_tree implicitly close-on-exec? I can think of > no good reason for these file descriptors to be inherited across exec() How are they different from any file descriptor? It's not as if it was something usable only for mounting stuff - again, you can use them with any ...at() syscalls. > and if someone comes up with such a reason, fcntl(F_SETFD) is not an > expensive call to make.
On Mon, Jun 4, 2018 at 5:52 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Mon, Jun 04, 2018 at 12:34:44PM +0200, Miklos Szeredi wrote: > >> fsopen = create fsfd >> fsmount = fsfd -> mountfd & set attr on mountfd & attach mountfd >> fspick = path -> fsfd >> move_mount = attach mountfd or move existing >> fsinfo = info from path >> open_tree = new mountfd from path or clone >> mount_setattr = set attr on mountfd >> >> Notice that fsmount() encompasses mount_setattr() + move_mount() >> functionality. Split those out and leave fsmount() to actually do >> the "fsfd ->mountfd" translation? > > Might make sense. > FWIW, to make it clear: fsmount(2) in this series actually does *NOT* > attach it to the tree. Ah, that leaves the mount_setattr() functionality to split out. I'd be more happy to rid this new API of all the old MS_* crap and have have a new set of attributes, that just apply to mounts. It will also need two args: a bitmap of new attributes and a mask to tell us which attributes to change. > Commit message definitely needs updating - as it > is, it's > > +SYSCALL_DEFINE5(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags, > + void *, reserved4, void *, reserved5) > > PS: IMO these reserved... arguments are in bad taste; if anyone has good reasons > for that practice in ABI design, I'd like to hear those. Agreed. A flags argument is often wise to add even if currently unused (and should be checked for undefined flags), but adding a random number of pointers doesn't seem to make a lot of sense. > >> fsinfo() name suggests it's in the same class as >> fsopen/fsmount/fspick, operating on fsfd object, but's it's not and I >> think that's slightly confusing. >> >> Rename move_mount() -> mount_move()? > > mount_move_bikeshed_bikeshed_bikeshed(), surely? Consistent naming for related functions... not unheard of in API design. The above set definitely does not qualify. >> Also does it make sense to make the cloning behavior of open_tree() >> optional? Without cloning it's just a plain open(O_PATH). That way >> it could be renamed mount_clone(). > > Umm... I'm not sure about that one. If nothing else, OPEN_TREE_DETACH > might be a good idea, in which case cloning is not the primary effect; > hell knows. So conceptually we have the following distinct mount tree operations: treefd = clone(path); treefd = detach(path); attach(treefd, path); move(path1, path2); The detach/move/attach trio are more related in functionality, while clone and detach have the same signature. I'm not sure either. Thanks, Miklos
On Mon, Jun 4, 2018 at 7:35 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Mon, Jun 04, 2018 at 10:16:30AM -0700, Matthew Wilcox wrote: >> On Sun, Jun 03, 2018 at 01:55:37AM +0100, Al Viro wrote: >> > +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags) >> > +{ >> > + struct file *file; >> > + struct path path; >> > + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW; >> > + bool detached = flags & OPEN_TREE_CLONE; >> > + int error; >> > + int fd; >> > + >> > + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC); >> >> Why do we need OPEN_TREE_CLOEXEC? Wouldn't we be better off just making >> the fd returned by open_tree implicitly close-on-exec? I can think of >> no good reason for these file descriptors to be inherited across exec() > > How are they different from any file descriptor? It's not as if it was > something usable only for mounting stuff - again, you can use them > with any ...at() syscalls. Defaulting to close on exec helps keep out clutter from the API. Is there a disadvantage to needing an explicit fcntl(F_SETFD) call to disable close on exec? Thanks, Miklos
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 14a2f996e543..b2b44ecd2b17 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -397,3 +397,4 @@ 383 i386 statx sys_statx __ia32_sys_statx 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents +391 i386 open_tree sys_open_tree __ia32_sys_open_tree diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index cd36232ab62f..d6f4949378e7 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -342,6 +342,7 @@ 331 common pkey_free __x64_sys_pkey_free 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents +339 common open_tree __x64_sys_open_tree # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/file_table.c b/fs/file_table.c index 7ec0b3e5f05d..7480271a0d21 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -189,6 +189,7 @@ static void __fput(struct file *file) struct dentry *dentry = file->f_path.dentry; struct vfsmount *mnt = file->f_path.mnt; struct inode *inode = file->f_inode; + fmode_t mode = file->f_mode; might_sleep(); @@ -209,14 +210,14 @@ static void __fput(struct file *file) file->f_op->release(inode, file); security_file_free(file); if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL && - !(file->f_mode & FMODE_PATH))) { + !(mode & FMODE_PATH))) { cdev_put(inode->i_cdev); } fops_put(file->f_op); put_pid(file->f_owner.pid); - if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) + if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) i_readcount_dec(inode); - if (file->f_mode & FMODE_WRITER) { + if (mode & FMODE_WRITER) { put_write_access(inode); __mnt_drop_write(mnt); } @@ -224,6 +225,8 @@ static void __fput(struct file *file) file->f_path.mnt = NULL; file->f_inode = NULL; file_free(file); + if (unlikely(mode & FMODE_NEED_UNMOUNT)) + dissolve_on_fput(mnt); dput(dentry); mntput(mnt); } diff --git a/fs/internal.h b/fs/internal.h index 980d005b21b4..b55575b9b55c 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -85,6 +85,7 @@ extern void __mnt_drop_write(struct vfsmount *); extern void __mnt_drop_write_file(struct file *); extern void mnt_drop_write_file_path(struct file *); +extern void dissolve_on_fput(struct vfsmount *); /* * fs_struct.c */ diff --git a/fs/namespace.c b/fs/namespace.c index 5f75969adff1..3281fea98cf0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -20,12 +20,14 @@ #include <linux/init.h> /* init_rootfs */ #include <linux/fs_struct.h> /* get_fs_root et.al. */ #include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */ +#include <linux/file.h> #include <linux/uaccess.h> #include <linux/proc_ns.h> #include <linux/magic.h> #include <linux/bootmem.h> #include <linux/task_work.h> #include <linux/sched/task.h> +#include <uapi/linux/mount.h> #include "pnode.h" #include "internal.h" @@ -1839,6 +1841,16 @@ struct vfsmount *collect_mounts(const struct path *path) return &tree->mnt; } +void dissolve_on_fput(struct vfsmount *mnt) +{ + namespace_lock(); + lock_mount_hash(); + mntget(mnt); + umount_tree(real_mount(mnt), UMOUNT_SYNC); + unlock_mount_hash(); + namespace_unlock(); +} + void drop_collected_mounts(struct vfsmount *mnt) { namespace_lock(); @@ -2198,6 +2210,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry) return false; } +static struct mount *__do_loopback(struct path *old_path, int recurse) +{ + struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt); + + if (IS_MNT_UNBINDABLE(old)) + return mnt; + + if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations) + return mnt; + + if (!recurse && has_locked_children(old, old_path->dentry)) + return mnt; + + if (recurse) + mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE); + else + mnt = clone_mnt(old, old_path->dentry, 0); + + if (!IS_ERR(mnt)) + mnt->mnt.mnt_flags &= ~MNT_LOCKED; + + return mnt; +} + /* * do loopback mount. */ @@ -2205,7 +2241,7 @@ static int do_loopback(struct path *path, const char *old_name, int recurse) { struct path old_path; - struct mount *mnt = NULL, *old, *parent; + struct mount *mnt = NULL, *parent; struct mountpoint *mp; int err; if (!old_name || !*old_name) @@ -2219,38 +2255,21 @@ static int do_loopback(struct path *path, const char *old_name, goto out; mp = lock_mount(path); - err = PTR_ERR(mp); - if (IS_ERR(mp)) + if (IS_ERR(mp)) { + err = PTR_ERR(mp); goto out; + } - old = real_mount(old_path.mnt); parent = real_mount(path->mnt); - - err = -EINVAL; - if (IS_MNT_UNBINDABLE(old)) - goto out2; - if (!check_mnt(parent)) goto out2; - if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations) - goto out2; - - if (!recurse && has_locked_children(old, old_path.dentry)) - goto out2; - - if (recurse) - mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE); - else - mnt = clone_mnt(old, old_path.dentry, 0); - + mnt = __do_loopback(&old_path, recurse); if (IS_ERR(mnt)) { err = PTR_ERR(mnt); goto out2; } - mnt->mnt.mnt_flags &= ~MNT_LOCKED; - err = graft_tree(mnt, parent, mp); if (err) { lock_mount_hash(); @@ -2264,6 +2283,75 @@ static int do_loopback(struct path *path, const char *old_name, return err; } +SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags) +{ + struct file *file; + struct path path; + int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW; + bool detached = flags & OPEN_TREE_CLONE; + int error; + int fd; + + BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC); + + if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE | + AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE | + OPEN_TREE_CLOEXEC)) + return -EINVAL; + + if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE) + return -EINVAL; + + if (flags & AT_NO_AUTOMOUNT) + lookup_flags &= ~LOOKUP_AUTOMOUNT; + if (flags & AT_SYMLINK_NOFOLLOW) + lookup_flags &= ~LOOKUP_FOLLOW; + if (flags & AT_EMPTY_PATH) + lookup_flags |= LOOKUP_EMPTY; + + if (detached && !may_mount()) + return -EPERM; + + fd = get_unused_fd_flags(flags & O_CLOEXEC); + if (fd < 0) + return fd; + + error = user_path_at(dfd, filename, lookup_flags, &path); + if (error) + goto out; + + if (detached) { + struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE); + if (IS_ERR(mnt)) { + error = PTR_ERR(mnt); + goto out2; + } + mntput(path.mnt); + path.mnt = &mnt->mnt; + } + + file = dentry_open(&path, O_PATH, current_cred()); + if (IS_ERR(file)) { + error = PTR_ERR(file); + goto out3; + } + + if (detached) + file->f_mode |= FMODE_NEED_UNMOUNT; + path_put(&path); + fd_install(fd, file); + return fd; + +out3: + if (detached) + dissolve_on_fput(path.mnt); +out2: + path_put(&path); +out: + put_unused_fd(fd); + return error; +} + static int change_mount_flags(struct vfsmount *mnt, int ms_flags) { int error = 0; diff --git a/include/linux/fs.h b/include/linux/fs.h index 482563fe549c..706b4605bc26 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -154,6 +154,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File is capable of returning -EAGAIN if I/O will block */ #define FMODE_NOWAIT ((__force fmode_t)0x8000000) +/* File represents mount that needs unmounting */ +#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000) + /* * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector * that indicates that they should check the contents of the iovec are diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 811172fcb916..925483aba03a 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -896,7 +896,7 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val); asmlinkage long sys_pkey_free(int pkey); asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags, unsigned mask, struct statx __user *buffer); - +asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags); /* * Architecture-specific system calls diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 6448cdd9a350..594b85f7cb86 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -90,5 +90,7 @@ #define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */ #define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */ +#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */ + #endif /* _UAPI_LINUX_FCNTL_H */ diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h new file mode 100644 index 000000000000..b9c3b46210db --- /dev/null +++ b/include/uapi/linux/mount.h @@ -0,0 +1,7 @@ +#ifndef _UAPI_LINUX_MOUNT_H +#define _UAPI_LINUX_MOUNT_H + +#define OPEN_TREE_CLONE 1 +#define OPEN_TREE_CLOEXEC O_CLOEXEC + +#endif /* _UAPI_LINUX_MOUNT_H */