Message ID | 20200522085723.29007-1-mszeredi@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | ovl: make private mounts longterm |
On Fri, May 22, 2020 at 10:57:23AM +0200, Miklos Szeredi wrote:
> Overlayfs is using clone_private_mount() to create internal mounts for
> underlying layers. These are used for operations requiring a path, such as
> dentry_open().
>
> Since these private mounts are not in any namespace they are treated as
> short term, "detached" mounts and mntput() involves taking the global
> mount_lock, which can result in serious cacheline pingpong.
>
> Make these private mounts longterm instead, which trade the penalty on
> mntput() for a slightly longer shutdown time due to an added RCU grace
> period when putting these mounts.
>
> Introduce a new helper kern_unmount_many() that can take care of multiple
> longterm mounts with a single RCU grace period.

Umm...

1) Documentation/filesystems/porting - something along the lines
of "clone_private_mount() returns a longterm mount now, so the proper
destructor of its result is kern_unmount()"

2) the name kern_unmount_many() has an unfortunate clash with
fput_many(), with arguments that look similar and mean something
entirely different. How about kern_unmount_array()?

3)
> -	mntput(ofs->upper_mnt);
> -	for (i = 1; i < ofs->numlayer; i++) {
> -		iput(ofs->layers[i].trap);
> -		mntput(ofs->layers[i].mnt);
> +
> +	if (!ofs->layers) {
> +		/* Deal with partial setup */
> +		kern_unmount(ofs->upper_mnt);
> +	} else {
> +		/* Hack! Reuse ofs->layers as a mounts array */
> +		struct vfsmount **mounts = (struct vfsmount **) ofs->layers;
> +
> +		for (i = 0; i < ofs->numlayer; i++) {
> +			iput(ofs->layers[i].trap);
> +			mounts[i] = ofs->layers[i].mnt;
> +		}
> +		kern_unmount_many(mounts, ofs->numlayer);
> +		kfree(ofs->layers);

That's _way_ too subtle. AFAICS, you rely upon ->upper_mnt == ->layers[0].mnt,
->layers[0].trap == NULL, without even mentioning that. And the hack you do
mention... Yecchhh... How many layers are possible, again?
On Fri, May 22, 2020 at 6:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, May 22, 2020 at 10:57:23AM +0200, Miklos Szeredi wrote:
> > Overlayfs is using clone_private_mount() to create internal mounts for
> > underlying layers. These are used for operations requiring a path, such as
> > dentry_open().
> >
> > Since these private mounts are not in any namespace they are treated as
> > short term, "detached" mounts and mntput() involves taking the global
> > mount_lock, which can result in serious cacheline pingpong.
> >
> > Make these private mounts longterm instead, which trade the penalty on
> > mntput() for a slightly longer shutdown time due to an added RCU grace
> > period when putting these mounts.
> >
> > Introduce a new helper kern_unmount_many() that can take care of multiple
> > longterm mounts with a single RCU grace period.
>
> Umm...
>
> 1) Documentation/filesystems/porting - something along the lines
> of "clone_private_mount() returns a longterm mount now, so the proper
> destructor of its result is kern_unmount()"
>
> 2) the name kern_unmount_many() has an unfortunate clash with
> fput_many(), with arguments that look similar and mean something
> entirely different. How about kern_unmount_array()?
>
> 3)
> > -	mntput(ofs->upper_mnt);
> > -	for (i = 1; i < ofs->numlayer; i++) {
> > -		iput(ofs->layers[i].trap);
> > -		mntput(ofs->layers[i].mnt);
> > +
> > +	if (!ofs->layers) {
> > +		/* Deal with partial setup */
> > +		kern_unmount(ofs->upper_mnt);
> > +	} else {
> > +		/* Hack! Reuse ofs->layers as a mounts array */
> > +		struct vfsmount **mounts = (struct vfsmount **) ofs->layers;
> > +
> > +		for (i = 0; i < ofs->numlayer; i++) {
> > +			iput(ofs->layers[i].trap);
> > +			mounts[i] = ofs->layers[i].mnt;
> > +		}
> > +		kern_unmount_many(mounts, ofs->numlayer);
> > +		kfree(ofs->layers);
>
> That's _way_ too subtle. AFAICS, you rely upon ->upper_mnt == ->layers[0].mnt,
> ->layers[0].trap == NULL, without even mentioning that. And the hack you do
> mention... Yecchhh... How many layers are possible, again?

500, mounts array would fit inside a page and a page can be allocated
with __GFP_NOFAIL. But why bother? It's not all that bad, is it?

Thanks,
Miklos
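The "fits inside a page" remark can be checked with a little arithmetic. The following is a minimal userspace sketch, not kernel code: the 500-layer figure comes from the reply above, while the 4096-byte page size, the `vfsmount_stub` type, and the helper names are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration: the layer limit quoted in the thread
 * and a 4096-byte page. Neither constant is taken from kernel headers. */
#define MAX_LAYERS_ASSUMED 500
#define PAGE_SIZE_ASSUMED  4096

struct vfsmount_stub;	/* opaque stand-in for struct vfsmount */

/* Bytes needed for an array of nr mount pointers. */
static size_t mounts_array_bytes(size_t nr)
{
	return nr * sizeof(struct vfsmount_stub *);
}

/* Non-zero when such an array fits in one (assumed) page. */
static int fits_in_page(size_t nr)
{
	return mounts_array_bytes(nr) <= PAGE_SIZE_ASSUMED;
}
```

With 8-byte pointers the array needs 500 * 8 = 4000 bytes, which is under 4096; with 4-byte pointers it needs 2000 bytes.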
> > > -	mntput(ofs->upper_mnt);
> > > -	for (i = 1; i < ofs->numlayer; i++) {
> > > -		iput(ofs->layers[i].trap);
> > > -		mntput(ofs->layers[i].mnt);
> > > +
> > > +	if (!ofs->layers) {
> > > +		/* Deal with partial setup */
> > > +		kern_unmount(ofs->upper_mnt);
> > > +	} else {
> > > +		/* Hack! Reuse ofs->layers as a mounts array */
> > > +		struct vfsmount **mounts = (struct vfsmount **) ofs->layers;
> > > +
> > > +		for (i = 0; i < ofs->numlayer; i++) {
> > > +			iput(ofs->layers[i].trap);
> > > +			mounts[i] = ofs->layers[i].mnt;
> > > +		}
> > > +		kern_unmount_many(mounts, ofs->numlayer);
> > > +		kfree(ofs->layers);
> >
> > That's _way_ too subtle. AFAICS, you rely upon ->upper_mnt == ->layers[0].mnt,
> > ->layers[0].trap == NULL, without even mentioning that. And the hack you do
> > mention... Yecchhh... How many layers are possible, again?
>
> 500, mounts array would fit inside a page and a page can be allocated
> with __GFP_NOFAIL. But why bother? It's not all that bad, is it?

FWIW, it seems fine to me.
We can transfer the reference from upperdir_trap to layers[0].trap
when initializing layers[0] for the sake of clarity.

Thanks,
Amir.
On Fri, May 22, 2020 at 7:02 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > -	mntput(ofs->upper_mnt);
> > > > -	for (i = 1; i < ofs->numlayer; i++) {
> > > > -		iput(ofs->layers[i].trap);
> > > > -		mntput(ofs->layers[i].mnt);
> > > > +
> > > > +	if (!ofs->layers) {
> > > > +		/* Deal with partial setup */
> > > > +		kern_unmount(ofs->upper_mnt);
> > > > +	} else {
> > > > +		/* Hack! Reuse ofs->layers as a mounts array */
> > > > +		struct vfsmount **mounts = (struct vfsmount **) ofs->layers;
> > > > +
> > > > +		for (i = 0; i < ofs->numlayer; i++) {
> > > > +			iput(ofs->layers[i].trap);
> > > > +			mounts[i] = ofs->layers[i].mnt;
> > > > +		}
> > > > +		kern_unmount_many(mounts, ofs->numlayer);
> > > > +		kfree(ofs->layers);
> > >
> > > That's _way_ too subtle. AFAICS, you rely upon ->upper_mnt == ->layers[0].mnt,
> > > ->layers[0].trap == NULL, without even mentioning that. And the hack you do
> > > mention... Yecchhh... How many layers are possible, again?
> >
> > 500, mounts array would fit inside a page and a page can be allocated
> > with __GFP_NOFAIL. But why bother? It's not all that bad, is it?
>
> FWIW, it seems fine to me.
> We can transfer the reference from upperdir_trap to layers[0].trap
> when initializing layers[0] for the sake of clarity.

Right, we should just get rid of ofs->upper_mnt and ofs->upperdir_trap
and use ofs->layers[0] to store those.

Thanks,
Miklos
On Fri, May 22, 2020 at 08:53:49PM +0200, Miklos Szeredi wrote:
> On Fri, May 22, 2020 at 7:02 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > > -	mntput(ofs->upper_mnt);
> > > > > -	for (i = 1; i < ofs->numlayer; i++) {
> > > > > -		iput(ofs->layers[i].trap);
> > > > > -		mntput(ofs->layers[i].mnt);
> > > > > +
> > > > > +	if (!ofs->layers) {
> > > > > +		/* Deal with partial setup */
> > > > > +		kern_unmount(ofs->upper_mnt);
> > > > > +	} else {
> > > > > +		/* Hack! Reuse ofs->layers as a mounts array */
> > > > > +		struct vfsmount **mounts = (struct vfsmount **) ofs->layers;
> > > > > +
> > > > > +		for (i = 0; i < ofs->numlayer; i++) {
> > > > > +			iput(ofs->layers[i].trap);
> > > > > +			mounts[i] = ofs->layers[i].mnt;
> > > > > +		}
> > > > > +		kern_unmount_many(mounts, ofs->numlayer);
> > > > > +		kfree(ofs->layers);
> > > >
> > > > That's _way_ too subtle. AFAICS, you rely upon ->upper_mnt == ->layers[0].mnt,
> > > > ->layers[0].trap == NULL, without even mentioning that. And the hack you do
> > > > mention... Yecchhh... How many layers are possible, again?
> > >
> > > 500, mounts array would fit inside a page and a page can be allocated
> > > with __GFP_NOFAIL. But why bother? It's not all that bad, is it?
> >
> > FWIW, it seems fine to me.
> > We can transfer the reference from upperdir_trap to layers[0].trap
> > when initializing layers[0] for the sake of clarity.
>
> Right, we should just get rid of ofs->upper_mnt and ofs->upperdir_trap
> and use ofs->layers[0] to store those.

For that you'd need to allocate ->layers before you get to ovl_get_upper(),
though. I'm not saying it's a bad idea - doing plain memory allocations
before anything else tends to make failure exits cleaner; it's just that
it'll take some massage. Basically, do ovl_split_lowerdirs() early,
then allocate everything you need, then do lookups, etc., filling that
stuff.

Regarding this series - the points regarding the name choice and the
need to document the calling conventions change still remain.
On Fri, May 22, 2020 at 9:56 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, May 22, 2020 at 08:53:49PM +0200, Miklos Szeredi wrote:
> > Right, we should just get rid of ofs->upper_mnt and ofs->upperdir_trap
> > and use ofs->layers[0] to store those.
>
> For that you'd need to allocate ->layers before you get to ovl_get_upper(),
> though. I'm not saying it's a bad idea - doing plain memory allocations
> before anything else tends to make failure exits cleaner; it's just that
> it'll take some massage. Basically, do ovl_split_lowerdirs() early,
> then allocate everything you need, then do lookups, etc., filling that
> stuff.

That was exactly the plan I set out.

> Regarding this series - the points regarding the name choice and the
> need to document the calling conventions change still remain.

Agreed.

Thanks,
Miklos
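The ordering agreed on above (allocate the layers array first, fill slots as lookups succeed, keep the upper layer in slot 0) might look roughly like the following userspace model. This is only a sketch of the pattern, not overlayfs code: `model_layer`, `model_fs`, and all three helpers are hypothetical names, and `void *` stands in for the kernel's mount and inode pointers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's struct ovl_layer / struct ovl_fs. */
struct model_layer {
	void *mnt;	/* stand-in for struct vfsmount * */
	void *trap;	/* stand-in for struct inode * */
};

struct model_fs {
	struct model_layer *layers;	/* layers[0] holds the upper layer */
	unsigned int numlayer;
};

/* Step 1: plain memory allocation, done before any lookup can fail.
 * calloc() zeroes the slots, so unfilled entries stay NULL. */
static int model_fs_alloc(struct model_fs *fs, unsigned int numlayer)
{
	if (numlayer == 0)
		return -1;
	fs->layers = calloc(numlayer, sizeof(*fs->layers));
	if (!fs->layers)
		return -1;
	fs->numlayer = numlayer;
	return 0;
}

/* Step 2: fill a slot once the lookup for that layer has succeeded. */
static void model_fs_set_layer(struct model_fs *fs, unsigned int idx,
			       void *mnt, void *trap)
{
	fs->layers[idx].mnt = mnt;
	fs->layers[idx].trap = trap;
}

/* Teardown: one uniform loop with no special case for the upper layer.
 * Returns how many mounts would have been released. */
static unsigned int model_fs_free(struct model_fs *fs)
{
	unsigned int i, released = 0;

	for (i = 0; i < fs->numlayer; i++)
		if (fs->layers[i].mnt)
			released++;
	free(fs->layers);
	fs->layers = NULL;
	fs->numlayer = 0;
	return released;
}
```

Because every slot starts out NULL, a setup that fails part-way can call the same teardown function without tracking how far it got.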
diff --git a/fs/namespace.c b/fs/namespace.c
index a28e4db075ed..5d16d87b6b8b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1879,6 +1879,9 @@ struct vfsmount *clone_private_mount(const struct path *path)
 	if (IS_ERR(new_mnt))
 		return ERR_CAST(new_mnt);
 
+	/* Longterm mount to be removed by kern_unmount*() */
+	new_mnt->mnt_ns = MNT_NS_INTERNAL;
+
 	return &new_mnt->mnt;
 }
 EXPORT_SYMBOL_GPL(clone_private_mount);
@@ -3804,6 +3807,19 @@ void kern_unmount(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(kern_unmount);
 
+void kern_unmount_many(struct vfsmount *mnt[], unsigned int num)
+{
+	unsigned int i;
+
+	for (i = 0; i < num; i++)
+		if (mnt[i])
+			real_mount(mnt[i])->mnt_ns = NULL;
+	synchronize_rcu_expedited();
+	for (i = 0; i < num; i++)
+		mntput(mnt[i]);
+}
+EXPORT_SYMBOL(kern_unmount_many);
+
 bool our_mnt(struct vfsmount *mnt)
 {
 	return check_mnt(real_mount(mnt));
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 60dfb27bc12b..a938dd2521b2 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -225,12 +225,21 @@ static void ovl_free_fs(struct ovl_fs *ofs)
 	dput(ofs->workbasedir);
 	if (ofs->upperdir_locked)
 		ovl_inuse_unlock(ofs->upper_mnt->mnt_root);
-	mntput(ofs->upper_mnt);
-	for (i = 1; i < ofs->numlayer; i++) {
-		iput(ofs->layers[i].trap);
-		mntput(ofs->layers[i].mnt);
+
+	if (!ofs->layers) {
+		/* Deal with partial setup */
+		kern_unmount(ofs->upper_mnt);
+	} else {
+		/* Hack! Reuse ofs->layers as a mounts array */
+		struct vfsmount **mounts = (struct vfsmount **) ofs->layers;
+
+		for (i = 0; i < ofs->numlayer; i++) {
+			iput(ofs->layers[i].trap);
+			mounts[i] = ofs->layers[i].mnt;
+		}
+		kern_unmount_many(mounts, ofs->numlayer);
+		kfree(ofs->layers);
 	}
-	kfree(ofs->layers);
 	for (i = 0; i < ofs->numfs; i++)
 		free_anon_bdev(ofs->fs[i].pseudo_dev);
 	kfree(ofs->fs);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index bf8cc4108b8f..e3e994bfcecb 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -109,4 +109,6 @@ extern unsigned int sysctl_mount_max;
 
 extern bool path_is_mountpoint(const struct path *path);
 
+extern void kern_unmount_many(struct vfsmount *mnt[], unsigned int num);
+
 #endif /* _LINUX_MOUNT_H */
Overlayfs is using clone_private_mount() to create internal mounts for
underlying layers. These are used for operations requiring a path, such as
dentry_open().

Since these private mounts are not in any namespace they are treated as
short term, "detached" mounts and mntput() involves taking the global
mount_lock, which can result in serious cacheline pingpong.

Make these private mounts longterm instead, which trade the penalty on
mntput() for a slightly longer shutdown time due to an added RCU grace
period when putting these mounts.

Introduce a new helper kern_unmount_many() that can take care of multiple
longterm mounts with a single RCU grace period.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/namespace.c        | 16 ++++++++++++++++
 fs/overlayfs/super.c  | 19 ++++++++++++++-----
 include/linux/mount.h |  2 ++
 3 files changed, 32 insertions(+), 5 deletions(-)
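To illustrate why a batched helper is worth having, here is a small userspace model of the two teardown strategies. It is a sketch under stated assumptions, not the kernel implementation: every `model_` name is hypothetical, the grace period is reduced to a counter, and the batched helper uses the `_array` suffix suggested in review rather than the `_many` name in the posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; none of these are kernel functions or types. */
struct model_mount {
	int longterm;	/* models mnt_ns == MNT_NS_INTERNAL */
	int refcount;
};

static unsigned int grace_periods;	/* simulated RCU grace periods */

/* Stand-in for waiting on an RCU grace period: just count the wait. */
static void model_synchronize_rcu(void)
{
	grace_periods++;
}

/* Stand-in for mntput(): drop one reference, tolerate NULL. */
static void model_mntput(struct model_mount *m)
{
	if (m)
		m->refcount--;
}

/* One mount at a time: each call pays for its own grace period. */
static void model_kern_unmount(struct model_mount *m)
{
	if (!m)
		return;
	m->longterm = 0;
	model_synchronize_rcu();
	model_mntput(m);
}

/* Batched: clear the longterm mark on every mount, wait once,
 * then drop all the references. */
static void model_kern_unmount_array(struct model_mount *mnt[],
				     unsigned int num)
{
	unsigned int i;

	for (i = 0; i < num; i++)
		if (mnt[i])
			mnt[i]->longterm = 0;
	model_synchronize_rcu();
	for (i = 0; i < num; i++)
		model_mntput(mnt[i]);
}
```

In this model, tearing down N mounts one by one costs N simulated grace periods, while the batched helper costs one regardless of N, which is the trade-off the commit message describes for overlayfs shutdown.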