Message ID: 20240626201129.272750-3-lkarpins@redhat.com (mailing list archive)
State: New
Series: fs/namespace: defer RCU sync for MNT_DETACH umount
On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote:
> +++ b/fs/namespace.c
> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
>  static DECLARE_RWSEM(namespace_sem);
>  static HLIST_HEAD(unmounted);	/* protected by namespace_sem */
>  static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> +static bool lazy_unlock = false; /* protected by namespace_sem */

That's a pretty ugly way of doing it. How about this?

+++ b/fs/namespace.c
@@ -1553,7 +1553,7 @@ int may_umount(struct vfsmount *mnt)
 
 EXPORT_SYMBOL(may_umount);
 
-static void namespace_unlock(void)
+static void __namespace_unlock(bool lazy)
 {
 	struct hlist_head head;
 	struct hlist_node *p;
@@ -1570,7 +1570,8 @@ static void namespace_unlock(void)
 	if (likely(hlist_empty(&head)))
 		return;
 
-	synchronize_rcu_expedited();
+	if (!lazy)
+		synchronize_rcu_expedited();
 
 	hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
 		hlist_del(&m->mnt_umount);
@@ -1578,6 +1579,11 @@ static void namespace_unlock(void)
 	}
 }
 
+static inline void namespace_unlock(void)
+{
+	__namespace_unlock(false);
+}
+
 static inline void namespace_lock(void)
 {
 	down_write(&namespace_sem);
@@ -1798,7 +1804,7 @@ static int do_umount(struct mount *mnt, int flags)
 	}
 out:
 	unlock_mount_hash();
-	namespace_unlock();
+	__namespace_unlock(flags & MNT_DETACH);
 	return retval;
 }

(other variants on this theme might be to pass the flags to
__namespace_unlock() and check MNT_DETACH there)
On 27/6/24 04:47, Matthew Wilcox wrote: > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: >> +++ b/fs/namespace.c >> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; >> static DECLARE_RWSEM(namespace_sem); >> static HLIST_HEAD(unmounted); /* protected by namespace_sem */ >> static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ >> +static bool lazy_unlock = false; /* protected by namespace_sem */ > That's a pretty ugly way of doing it. How about this? Ha! That was my original thought but I also didn't much like changing all the callers. I don't really like the proliferation of these small helper functions either but if everyone is happy to do this I think it's a great idea. Ian > > +++ b/fs/namespace.c > @@ -1553,7 +1553,7 @@ int may_umount(struct vfsmount *mnt) > > EXPORT_SYMBOL(may_umount); > > -static void namespace_unlock(void) > +static void __namespace_unlock(bool lazy) > { > struct hlist_head head; > struct hlist_node *p; > @@ -1570,7 +1570,8 @@ static void namespace_unlock(void) > if (likely(hlist_empty(&head))) > return; > > - synchronize_rcu_expedited(); > + if (!lazy) > + synchronize_rcu_expedited(); > > hlist_for_each_entry_safe(m, p, &head, mnt_umount) { > hlist_del(&m->mnt_umount); > @@ -1578,6 +1579,11 @@ static void namespace_unlock(void) > } > } > > +static inline void namespace_unlock(void) > +{ > + __namespace_unlock(false); > +} > + > static inline void namespace_lock(void) > { > down_write(&namespace_sem); > @@ -1798,7 +1804,7 @@ static int do_umount(struct mount *mnt, int flags) > } > out: > unlock_mount_hash(); > - namespace_unlock(); > + __namespace_unlock(flags & MNT_DETACH); > return retval; > } > > > (other variants on this theme might be to pass the flags to > __namespace_unlock() and check MNT_DETACH there) >
On Thu 27-06-24 09:11:14, Ian Kent wrote: > On 27/6/24 04:47, Matthew Wilcox wrote: > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: > > > +++ b/fs/namespace.c > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; > > > static DECLARE_RWSEM(namespace_sem); > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */ > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ > > > +static bool lazy_unlock = false; /* protected by namespace_sem */ > > That's a pretty ugly way of doing it. How about this? > > Ha! > > That was my original thought but I also didn't much like changing all the > callers. > > I don't really like the proliferation of these small helper functions either > but if everyone > > is happy to do this I think it's a great idea. So I know you've suggested removing synchronize_rcu_expedited() call in your comment to v2. But I wonder why is it safe? I *thought* synchronize_rcu_expedited() is there to synchronize the dropping of the last mnt reference (and maybe something else) - see the comment at the beginning of mntput_no_expire() - and this change would break that? Honza
On Thu, Jun 27, 2024 at 01:54:18PM GMT, Jan Kara wrote: > On Thu 27-06-24 09:11:14, Ian Kent wrote: > > On 27/6/24 04:47, Matthew Wilcox wrote: > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: > > > > +++ b/fs/namespace.c > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; > > > > static DECLARE_RWSEM(namespace_sem); > > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */ > > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */ > > > That's a pretty ugly way of doing it. How about this? > > > > Ha! > > > > That was my original thought but I also didn't much like changing all the > > callers. > > > > I don't really like the proliferation of these small helper functions either > > but if everyone > > > > is happy to do this I think it's a great idea. > > So I know you've suggested removing synchronize_rcu_expedited() call in > your comment to v2. But I wonder why is it safe? I *thought* > synchronize_rcu_expedited() is there to synchronize the dropping of the > last mnt reference (and maybe something else) - see the comment at the > beginning of mntput_no_expire() - and this change would break that? Yes. During umount mnt->mnt_ns will be set to NULL with namespace_sem and the mount seqlock held. mntput() doesn't acquire namespace_sem as that would get rather problematic during path lookup. It also elides lock_mount_hash() by looking at mnt->mnt_ns because that's set to NULL when a mount is actually unmounted. So iirc synchronize_rcu_expedited() will ensure that it is actually the system call that shuts down all the mounts it put on the umounted list and not some other task that also called mntput() as that would cause pretty blatant EBUSY issues. 
So callers that come before mnt->mnt_ns = NULL simply return of course
but callers that come after mnt->mnt_ns = NULL will acquire
lock_mount_hash() _under_ rcu_read_lock(). These callers see an elevated
reference count and thus simply return while namespace_lock()'s
synchronize_rcu_expedited() prevents the system call from making
progress.

But I also don't see it working without risk even with MNT_DETACH. It
still has potential to cause issues in userspace. Any program that
always passes MNT_DETACH simply to ensure that even in the very rare
case that a mount might still be busy is unmounted might now end up
seeing increased EBUSY failures for mounts that didn't actually need to
be unmounted with MNT_DETACH. In other words, this is only innocuous if
userspace only uses MNT_DETACH for stuff they actually know is busy when
they're trying to unmount. And I don't think that's the case.
On 27/6/24 19:54, Jan Kara wrote:
> On Thu 27-06-24 09:11:14, Ian Kent wrote:
>> On 27/6/24 04:47, Matthew Wilcox wrote:
>>> On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote:
>>>> +++ b/fs/namespace.c
>>>> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
>>>> static DECLARE_RWSEM(namespace_sem);
>>>> static HLIST_HEAD(unmounted); /* protected by namespace_sem */
>>>> static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
>>>> +static bool lazy_unlock = false; /* protected by namespace_sem */
>>> That's a pretty ugly way of doing it. How about this?
>> Ha!
>>
>> That was my original thought but I also didn't much like changing all
>> the callers.
>>
>> I don't really like the proliferation of these small helper functions
>> either but if everyone is happy to do this I think it's a great idea.
> So I know you've suggested removing synchronize_rcu_expedited() call in
> your comment to v2. But I wonder why is it safe? I *thought*
> synchronize_rcu_expedited() is there to synchronize the dropping of the
> last mnt reference (and maybe something else) - see the comment at the
> beginning of mntput_no_expire() - and this change would break that?

Interesting, because of the definition of lazy umount I didn't look
closely enough at that.

But I wonder, how exactly would that race occur, is holding the rcu read
lock sufficient since the rcu'd mount free won't be done until it's
released (at least I think that's how rcu works).

In this case, when lazy is true, the mount will have been detached in
umount_tree() and mnt->mnt_ns set to NULL under the namespace sem write
lock. So that condition in mntput_no_expire() won't be true and the
mount will no longer be found by the VFS.

I guess the question then becomes will any outstanding lockless path
walks race with this with only the rcu read lock to protect it,
Christian?

Ian
On 27/6/24 23:16, Christian Brauner wrote: > On Thu, Jun 27, 2024 at 01:54:18PM GMT, Jan Kara wrote: >> On Thu 27-06-24 09:11:14, Ian Kent wrote: >>> On 27/6/24 04:47, Matthew Wilcox wrote: >>>> On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: >>>>> +++ b/fs/namespace.c >>>>> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; >>>>> static DECLARE_RWSEM(namespace_sem); >>>>> static HLIST_HEAD(unmounted); /* protected by namespace_sem */ >>>>> static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ >>>>> +static bool lazy_unlock = false; /* protected by namespace_sem */ >>>> That's a pretty ugly way of doing it. How about this? >>> Ha! >>> >>> That was my original thought but I also didn't much like changing all the >>> callers. >>> >>> I don't really like the proliferation of these small helper functions either >>> but if everyone >>> >>> is happy to do this I think it's a great idea. >> So I know you've suggested removing synchronize_rcu_expedited() call in >> your comment to v2. But I wonder why is it safe? I *thought* >> synchronize_rcu_expedited() is there to synchronize the dropping of the >> last mnt reference (and maybe something else) - see the comment at the >> beginning of mntput_no_expire() - and this change would break that? > Yes. During umount mnt->mnt_ns will be set to NULL with namespace_sem > and the mount seqlock held. mntput() doesn't acquire namespace_sem as > that would get rather problematic during path lookup. It also elides > lock_mount_hash() by looking at mnt->mnt_ns because that's set to NULL > when a mount is actually unmounted. > > So iirc synchronize_rcu_expedited() will ensure that it is actually the > system call that shuts down all the mounts it put on the umounted list > and not some other task that also called mntput() as that would cause > pretty blatant EBUSY issues. 
>
> So callers that come before mnt->mnt_ns = NULL simply return of course
> but callers that come after mnt->mnt_ns = NULL will acquire
> lock_mount_hash() _under_ rcu_read_lock(). These callers see an elevated
> reference count and thus simply return while namespace_lock()'s
> synchronize_rcu_expedited() prevents the system call from making
> progress.
>
> But I also don't see it working without risk even with MNT_DETACH. It
> still has potential to cause issues in userspace. Any program that
> always passes MNT_DETACH simply to ensure that even in the very rare
> case that a mount might still be busy is unmounted might now end up
> seeing increased EBUSY failures for mounts that didn't actually need to
> be unmounted with MNT_DETACH. In other words, this is only innocuous if
> userspace only uses MNT_DETACH for stuff they actually know is busy when
> they're trying to unmount. And I don't think that's the case.

I'm sorry but how does an MNT_DETACH umount system call return EBUSY? I
can't see how that can happen.

I have used lazy umount a lot over the years and I haven't had problems
with it. There is a tendency to think there might be problems using it
but I've never been able to spot them.

Ian
On Fri 28-06-24 10:58:54, Ian Kent wrote:
>
> On 27/6/24 19:54, Jan Kara wrote:
> > On Thu 27-06-24 09:11:14, Ian Kent wrote:
> > > On 27/6/24 04:47, Matthew Wilcox wrote:
> > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote:
> > > > > +++ b/fs/namespace.c
> > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
> > > > > static DECLARE_RWSEM(namespace_sem);
> > > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */
> > > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
> > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */
> > > > That's a pretty ugly way of doing it. How about this?
> > > Ha!
> > >
> > > That was my original thought but I also didn't much like changing all
> > > the callers.
> > >
> > > I don't really like the proliferation of these small helper functions
> > > either but if everyone is happy to do this I think it's a great idea.
> > So I know you've suggested removing synchronize_rcu_expedited() call in
> > your comment to v2. But I wonder why is it safe? I *thought*
> > synchronize_rcu_expedited() is there to synchronize the dropping of the
> > last mnt reference (and maybe something else) - see the comment at the
> > beginning of mntput_no_expire() - and this change would break that?
>
> Interesting, because of the definition of lazy umount I didn't look closely
> enough at that.
>
> But I wonder, how exactly would that race occur, is holding the rcu read
> lock sufficient since the rcu'd mount free won't be done until it's
> released (at least I think that's how rcu works).

I'm concerned about a race like:

[path lookup]				[umount -l]
...
path_put()
  mntput(mnt)
    mntput_no_expire(m)
      rcu_read_lock();
      if (likely(READ_ONCE(mnt->mnt_ns))) {
					do_umount()
					  umount_tree()
					    ...
					    mnt->mnt_ns = NULL;
					    ...
					namespace_unlock()
					  mntput(&m->mnt)
					    mntput_no_expire(mnt)
					      smp_mb();
					      mnt_add_count(mnt, -1);
					      count = mnt_get_count(mnt);
					      if (count != 0) {
						...
						return;
					      }
        mnt_add_count(mnt, -1);
        rcu_read_unlock();
        return;
	-> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount!

And this scenario is exactly prevented by synchronize_rcu() in
namespace_unlock().

								Honza
On Fri, Jun 28, 2024 at 11:17:43AM GMT, Ian Kent wrote: > > On 27/6/24 23:16, Christian Brauner wrote: > > On Thu, Jun 27, 2024 at 01:54:18PM GMT, Jan Kara wrote: > > > On Thu 27-06-24 09:11:14, Ian Kent wrote: > > > > On 27/6/24 04:47, Matthew Wilcox wrote: > > > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: > > > > > > +++ b/fs/namespace.c > > > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; > > > > > > static DECLARE_RWSEM(namespace_sem); > > > > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */ > > > > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ > > > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */ > > > > > That's a pretty ugly way of doing it. How about this? > > > > Ha! > > > > > > > > That was my original thought but I also didn't much like changing all the > > > > callers. > > > > > > > > I don't really like the proliferation of these small helper functions either > > > > but if everyone > > > > > > > > is happy to do this I think it's a great idea. > > > So I know you've suggested removing synchronize_rcu_expedited() call in > > > your comment to v2. But I wonder why is it safe? I *thought* > > > synchronize_rcu_expedited() is there to synchronize the dropping of the > > > last mnt reference (and maybe something else) - see the comment at the > > > beginning of mntput_no_expire() - and this change would break that? > > Yes. During umount mnt->mnt_ns will be set to NULL with namespace_sem > > and the mount seqlock held. mntput() doesn't acquire namespace_sem as > > that would get rather problematic during path lookup. It also elides > > lock_mount_hash() by looking at mnt->mnt_ns because that's set to NULL > > when a mount is actually unmounted. 
> >
> > So iirc synchronize_rcu_expedited() will ensure that it is actually the
> > system call that shuts down all the mounts it put on the umounted list
> > and not some other task that also called mntput() as that would cause
> > pretty blatant EBUSY issues.
> >
> > So callers that come before mnt->mnt_ns = NULL simply return of course
> > but callers that come after mnt->mnt_ns = NULL will acquire
> > lock_mount_hash() _under_ rcu_read_lock(). These callers see an elevated
> > reference count and thus simply return while namespace_lock()'s
> > synchronize_rcu_expedited() prevents the system call from making
> > progress.
> >
> > But I also don't see it working without risk even with MNT_DETACH. It
> > still has potential to cause issues in userspace. Any program that
> > always passes MNT_DETACH simply to ensure that even in the very rare
> > case that a mount might still be busy is unmounted might now end up
> > seeing increased EBUSY failures for mounts that didn't actually need to
> > be unmounted with MNT_DETACH. In other words, this is only innocuous if
> > userspace only uses MNT_DETACH for stuff they actually know is busy when
> > they're trying to unmount. And I don't think that's the case.
>
> I'm sorry but how does an MNT_DETACH umount system call return EBUSY, I
> can't see how that can happen?

It's not the umount() call that's the problem. Say you have the
following sequence:

(1) mount(ext4-device, /mnt)
    umount(/mnt, 0)
    mount(ext4-device, /mnt)

If that ext4 filesystem isn't in use anymore then umount() will succeed.
The same task can immediately issue a second mount() call on the same
device and it must succeed.

Today the behavior for this is the same whether or not the caller uses
MNT_DETACH.
So:

(2) mount(ext4-device, /mnt)
    umount(/mnt, MNT_DETACH)
    mount(ext4-device, /mnt)

All that MNT_DETACH does is to skip the check for busy mounts otherwise
it's identical to a regular umount. So (1) and (2) will behave the same
as long as the filesystem isn't used anymore.

But afaict with your changes this wouldn't be true anymore. If someone
uses (2) on a filesystem that isn't busy then they might end up getting
EBUSY on the second mount. And if I'm right then that's potentially a
rather visible change.
On Fri, Jun 28, 2024 at 2:54 PM Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Jun 28, 2024 at 11:17:43AM GMT, Ian Kent wrote: > > > > On 27/6/24 23:16, Christian Brauner wrote: > > > On Thu, Jun 27, 2024 at 01:54:18PM GMT, Jan Kara wrote: > > > > On Thu 27-06-24 09:11:14, Ian Kent wrote: > > > > > On 27/6/24 04:47, Matthew Wilcox wrote: > > > > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: > > > > > > > +++ b/fs/namespace.c > > > > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; > > > > > > > static DECLARE_RWSEM(namespace_sem); > > > > > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */ > > > > > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ > > > > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */ > > > > > > That's a pretty ugly way of doing it. How about this? > > > > > Ha! > > > > > > > > > > That was my original thought but I also didn't much like changing all the > > > > > callers. > > > > > > > > > > I don't really like the proliferation of these small helper functions either > > > > > but if everyone > > > > > > > > > > is happy to do this I think it's a great idea. > > > > So I know you've suggested removing synchronize_rcu_expedited() call in > > > > your comment to v2. But I wonder why is it safe? I *thought* > > > > synchronize_rcu_expedited() is there to synchronize the dropping of the > > > > last mnt reference (and maybe something else) - see the comment at the > > > > beginning of mntput_no_expire() - and this change would break that? > > > Yes. During umount mnt->mnt_ns will be set to NULL with namespace_sem > > > and the mount seqlock held. mntput() doesn't acquire namespace_sem as > > > that would get rather problematic during path lookup. It also elides > > > lock_mount_hash() by looking at mnt->mnt_ns because that's set to NULL > > > when a mount is actually unmounted. 
> > > > > > So iirc synchronize_rcu_expedited() will ensure that it is actually the > > > system call that shuts down all the mounts it put on the umounted list > > > and not some other task that also called mntput() as that would cause > > > pretty blatant EBUSY issues. > > > > > > So callers that come before mnt->mnt_ns = NULL simply return of course > > > but callers that come after mnt->mnt_ns = NULL will acquire > > > lock_mount_hash() _under_ rcu_read_lock(). These callers see an elevated > > > reference count and thus simply return while namespace_lock()'s > > > synchronize_rcu_expedited() prevents the system call from making > > > progress. > > > > > > But I also don't see it working without risk even with MNT_DETACH. It > > > still has potential to cause issues in userspace. Any program that > > > always passes MNT_DETACH simply to ensure that even in the very rare > > > case that a mount might still be busy is unmounted might now end up > > > seeing increased EBUSY failures for mounts that didn't actually need to > > > be unmounted with MNT_DETACH. In other words, this is only inocuous if > > > userspace only uses MNT_DETACH for stuff they actually know is busy when > > > they're trying to unmount. And I don't think that's the case. > > > > > I'm sorry but how does an MNT_DETACH umount system call return EBUSY, I > > can't > > > > see how that can happen? > > Not the umount() call is the problem. Say you have the following > sequence: > > (1) mount(ext4-device, /mnt) > umount(/mnt, 0) > mount(ext4-device, /mnt) > > If that ext4 filesystem isn't in use anymore then umount() will succeed. > The same task can immediately issue a second mount() call on the same > device and it must succeed. > > Today the behavior for this is the same whether or no the caller uses > MNT_DETACH. 
So: > > (2) mount(ext4-device, /mnt) > umount(/mnt, MNT_DETACH) > mount(ext4-device, /mnt) > > All that MNT_DETACH does is to skip the check for busy mounts otherwise > it's identical to a regular umount. So (1) and (2) will behave the same > as long as the filesystem isn't used anymore. > > But afaict with your changes this wouldn't be true anymore. If someone > uses (2) on a filesystem that isn't busy then they might end up getting > EBUSY on the second mount. And if I'm right then that's potentially a > rather visible change. This is rather unfortunate, as the synchronize_rcu call is quite expensive. In particular on a real-time kernel where there are no expedited RCUs. This is causing container startup to be slow, as there are several umount(MNT_DETACH) happening during container setup (after the pivot_root, etc). Maybe we can add a umount flag for users that don't need the current behaviour wrt EBUSY? In the container usecase the important part is that the old mounts are disconnected from the child namespace and not really what the mount busy state is (typically it is still mounted in the parent namespace anyway).
On 28/6/24 23:13, Alexander Larsson wrote: > On Fri, Jun 28, 2024 at 2:54 PM Christian Brauner <brauner@kernel.org> wrote: >> On Fri, Jun 28, 2024 at 11:17:43AM GMT, Ian Kent wrote: >>> On 27/6/24 23:16, Christian Brauner wrote: >>>> On Thu, Jun 27, 2024 at 01:54:18PM GMT, Jan Kara wrote: >>>>> On Thu 27-06-24 09:11:14, Ian Kent wrote: >>>>>> On 27/6/24 04:47, Matthew Wilcox wrote: >>>>>>> On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: >>>>>>>> +++ b/fs/namespace.c >>>>>>>> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; >>>>>>>> static DECLARE_RWSEM(namespace_sem); >>>>>>>> static HLIST_HEAD(unmounted); /* protected by namespace_sem */ >>>>>>>> static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ >>>>>>>> +static bool lazy_unlock = false; /* protected by namespace_sem */ >>>>>>> That's a pretty ugly way of doing it. How about this? >>>>>> Ha! >>>>>> >>>>>> That was my original thought but I also didn't much like changing all the >>>>>> callers. >>>>>> >>>>>> I don't really like the proliferation of these small helper functions either >>>>>> but if everyone >>>>>> >>>>>> is happy to do this I think it's a great idea. >>>>> So I know you've suggested removing synchronize_rcu_expedited() call in >>>>> your comment to v2. But I wonder why is it safe? I *thought* >>>>> synchronize_rcu_expedited() is there to synchronize the dropping of the >>>>> last mnt reference (and maybe something else) - see the comment at the >>>>> beginning of mntput_no_expire() - and this change would break that? >>>> Yes. During umount mnt->mnt_ns will be set to NULL with namespace_sem >>>> and the mount seqlock held. mntput() doesn't acquire namespace_sem as >>>> that would get rather problematic during path lookup. It also elides >>>> lock_mount_hash() by looking at mnt->mnt_ns because that's set to NULL >>>> when a mount is actually unmounted. 
>>>> >>>> So iirc synchronize_rcu_expedited() will ensure that it is actually the >>>> system call that shuts down all the mounts it put on the umounted list >>>> and not some other task that also called mntput() as that would cause >>>> pretty blatant EBUSY issues. >>>> >>>> So callers that come before mnt->mnt_ns = NULL simply return of course >>>> but callers that come after mnt->mnt_ns = NULL will acquire >>>> lock_mount_hash() _under_ rcu_read_lock(). These callers see an elevated >>>> reference count and thus simply return while namespace_lock()'s >>>> synchronize_rcu_expedited() prevents the system call from making >>>> progress. >>>> >>>> But I also don't see it working without risk even with MNT_DETACH. It >>>> still has potential to cause issues in userspace. Any program that >>>> always passes MNT_DETACH simply to ensure that even in the very rare >>>> case that a mount might still be busy is unmounted might now end up >>>> seeing increased EBUSY failures for mounts that didn't actually need to >>>> be unmounted with MNT_DETACH. In other words, this is only inocuous if >>>> userspace only uses MNT_DETACH for stuff they actually know is busy when >>>> they're trying to unmount. And I don't think that's the case. >>>> >>> I'm sorry but how does an MNT_DETACH umount system call return EBUSY, I >>> can't >>> >>> see how that can happen? >> Not the umount() call is the problem. Say you have the following >> sequence: >> >> (1) mount(ext4-device, /mnt) >> umount(/mnt, 0) >> mount(ext4-device, /mnt) >> >> If that ext4 filesystem isn't in use anymore then umount() will succeed. >> The same task can immediately issue a second mount() call on the same >> device and it must succeed. >> >> Today the behavior for this is the same whether or no the caller uses >> MNT_DETACH. 
>> So:
>>
>> (2) mount(ext4-device, /mnt)
>>     umount(/mnt, MNT_DETACH)
>>     mount(ext4-device, /mnt)
>>
>> All that MNT_DETACH does is to skip the check for busy mounts otherwise
>> it's identical to a regular umount. So (1) and (2) will behave the same
>> as long as the filesystem isn't used anymore.
>>
>> But afaict with your changes this wouldn't be true anymore. If someone
>> uses (2) on a filesystem that isn't busy then they might end up getting
>> EBUSY on the second mount. And if I'm right then that's potentially a
>> rather visible change.

I'm not sure this change affects the likelihood of an EBUSY return in
the described case, in fact it looks like it does the opposite.

I always thought the rcu delay was to ensure concurrent path walks "see"
the umount not to ensure correct operation of the following mntput()(s).

Isn't the sequence of operations roughly, resolve path, lock, detach,
release lock, rcu wait, mntput() subordinate mounts, put path.

So the mount gets detached in the critical section, then we wait
followed by the mntput()(s). The catch is that not waiting might
increase the likelihood that concurrent path walks don't see the umount
(so that possibly the umount goes away before the walks see the umount)
but I'm not certain. What looks to be as much of a problem is mntput()
racing with a concurrent mount because while the detach is done in the
critical section the super block instance list deletion is not and the
wait will make the race possibility more likely. What's more mntput()
delegates the mount cleanup (which deletes the list instance) to a
workqueue job so this can also occur serially in a following mount
command.

In fact I might have seen exactly this behavior in a recent xfs-tests
run where I was puzzled to see occasional EBUSY return on mounting of
mounts that should not have been in use following their umount.

So I think there are problems here but I don't think the removal of the
wait for lazy umount is the worst of it.
The question then becomes, to start with, how do we resolve this unjustified EBUSY return. Perhaps a completion (used between the umount and mount system calls) would work well here? > This is rather unfortunate, as the synchronize_rcu call is quite > expensive. In particular on a real-time kernel where there are no > expedited RCUs. This is causing container startup to be slow, as there > are several umount(MNT_DETACH) happening during container setup (after > the pivot_root, etc). > > Maybe we can add a umount flag for users that don't need the current > behaviour wrt EBUSY? In the container usecase the important part is > that the old mounts are disconnected from the child namespace and not > really what the mount busy state is (typically it is still mounted in > the parent namespace anyway). > I think it's a little too soon to try and work out what to do about the speed of umount, lazy or not. Umount has taken progressively longer over the years and is in fact quite slow now. I'm really not sure what to do about that having looked at it a number of times without joy. Nevertheless, I believe we need to find a way to do this or something like it to reduce the delays involved in umount. Ian
On 28/6/24 19:13, Jan Kara wrote: > On Fri 28-06-24 10:58:54, Ian Kent wrote: >> On 27/6/24 19:54, Jan Kara wrote: >>> On Thu 27-06-24 09:11:14, Ian Kent wrote: >>>> On 27/6/24 04:47, Matthew Wilcox wrote: >>>>> On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: >>>>>> +++ b/fs/namespace.c >>>>>> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; >>>>>> static DECLARE_RWSEM(namespace_sem); >>>>>> static HLIST_HEAD(unmounted); /* protected by namespace_sem */ >>>>>> static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ >>>>>> +static bool lazy_unlock = false; /* protected by namespace_sem */ >>>>> That's a pretty ugly way of doing it. How about this? >>>> Ha! >>>> >>>> That was my original thought but I also didn't much like changing all the >>>> callers. >>>> >>>> I don't really like the proliferation of these small helper functions either >>>> but if everyone >>>> >>>> is happy to do this I think it's a great idea. >>> So I know you've suggested removing synchronize_rcu_expedited() call in >>> your comment to v2. But I wonder why is it safe? I *thought* >>> synchronize_rcu_expedited() is there to synchronize the dropping of the >>> last mnt reference (and maybe something else) - see the comment at the >>> beginning of mntput_no_expire() - and this change would break that? >> Interesting, because of the definition of lazy umount I didn't look closely >> enough at that. >> >> But I wonder, how exactly would that race occur, is holding the rcu read >> lock sufficient since the rcu'd mount free won't be done until it's >> released (at least I think that's how rcu works). > I'm concerned about a race like: > > [path lookup] [umount -l] > ... > path_put() > mntput(mnt) > mntput_no_expire(m) > rcu_read_lock(); > if (likely(READ_ONCE(mnt->mnt_ns))) { > do_umount() > umount_tree() > ... > mnt->mnt_ns = NULL; > ... 
> namespace_unlock() > mntput(&m->mnt) > mntput_no_expire(mnt) > smp_mb(); > mnt_add_count(mnt, -1); > count = mnt_get_count(mnt); > if (count != 0) { > ... > return; > mnt_add_count(mnt, -1); > rcu_read_unlock(); > return; > -> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount! > } > > And this scenario is exactly prevented by synchronize_rcu() in > namespace_unlock(). I just wanted to say that I don't have a reply to this yet, having been distracted looking at the concern that Christian raised, in fact this looks like it will be hard to grok ... Ian
> I always thought the rcu delay was to ensure concurrent path walks "see" the > > umount not to ensure correct operation of the following mntput()(s). > > > Isn't the sequence of operations roughly, resolve path, lock, deatch, > release > > lock, rcu wait, mntput() subordinate mounts, put path. The crucial bit is really that synchronize_rcu_expedited() ensures that the final mntput() won't happen until path walk leaves RCU mode. This allows callers like legitimize_mnt() which are called with only the RCU read-lock during lazy path walk to simply check for MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see that this mount is MNT_SYNC_UMOUNT then they know that the mount won't be freed until an RCU grace period is up and so they know that they can simply put the reference count they took _without having to actually call mntput()_. Because if they did have to call mntput() they might end up shutting the filesystem down instead of umount() and that will cause said EBUSY errors I mentioned in my earlier mails. > > > So the mount gets detached in the critical section, then we wait followed by > > the mntput()(s). The catch is that not waiting might increase the likelyhood > > that concurrent path walks don't see the umount (so that possibly the umount > > goes away before the walks see the umount) but I'm not certain. What looks > to > > be as much of a problem is mntput() racing with a concurrent mount beacase > while > > the detach is done in the critical section the super block instance list > deletion > > is not and the wait will make the race possibility more likely. What's more Concurrent mounters of the same filesystem will wait for each other via grab_super(). That has its own logic based on sb->s_active which goes to zero when all mounts are gone. > > mntput() delegates the mount cleanup (which deletes the list instance) to a > > workqueue job so this can also occur serially in a following mount command. 
No, that only happens when it's a kthread. Regular umount() call goes via task work which finishes before the caller returns to userspace (same as closing files work). > > > In fact I might have seen exactly this behavior in a recent xfs-tests run > where I > > was puzzled to see occasional EBUSY return on mounting of mounts that should > not > > have been in use following their umount. That's usually very much other bugs. See commit 2ae4db5647d8 ("fs: don't misleadingly warn during thaw operations") in vfs.fixes for example. > > > So I think there are problems here but I don't think the removal of the wait > for > > lazy umount is the worst of it. > > > The question then becomes, to start with, how do we resolve this unjustified > EBUSY > > return. Perhaps a completion (used between the umount and mount system > calls) would > > work well here? Again, this already exists deeper down the stack...
On 1/7/24 13:50, Christian Brauner wrote: >> I always thought the rcu delay was to ensure concurrent path walks "see" the >> >> umount not to ensure correct operation of the following mntput()(s). >> >> >> Isn't the sequence of operations roughly, resolve path, lock, deatch, >> release >> >> lock, rcu wait, mntput() subordinate mounts, put path. > The crucial bit is really that synchronize_rcu_expedited() ensures that > the final mntput() won't happen until path walk leaves RCU mode. > > This allows caller's like legitimize_mnt() which are called with only > the RCU read-lock during lazy path walk to simple check for > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > be freed until an RCU grace period is up and so they know that they can > simply put the reference count they took _without having to actually > call mntput()_. > > Because if they did have to call mntput() they might end up shutting the > filesystem down instead of umount() and that will cause said EBUSY > errors I mentioned in my earlier mails. Yes, I get that, the problem with this was always whether lockless path walks would correctly see the mount had become invalid when being checked for legitimacy. > >> >> So the mount gets detached in the critical section, then we wait followed by >> >> the mntput()(s). The catch is that not waiting might increase the likelyhood >> >> that concurrent path walks don't see the umount (so that possibly the umount >> >> goes away before the walks see the umount) but I'm not certain. What looks >> to >> >> be as much of a problem is mntput() racing with a concurrent mount beacase >> while >> >> the detach is done in the critical section the super block instance list >> deletion >> >> is not and the wait will make the race possibility more likely. What's more > Concurrent mounters of the same filesystem will wait for each other via > grab_super(). 
That has it's own logic based on sb->s_active which goes > to zero when all mounts are gone. Yep, missed that, I'm too hasty, thanks for your patience. > >> mntput() delegates the mount cleanup (which deletes the list instance) to a >> >> workqueue job so this can also occur serially in a following mount command. > No, that only happens when it's a kthread. Regular umount() call goes > via task work which finishes before the caller returns to userspace > (same as closing files work). Umm, misread that, oops! Ian > >> >> In fact I might have seen exactly this behavior in a recent xfs-tests run >> where I >> >> was puzzled to see occasional EBUSY return on mounting of mounts that should >> not >> >> have been in use following their umount. > That's usually very much other bugs. See commit 2ae4db5647d8 ("fs: don't > misleadingly warn during thaw operations") in vfs.fixes for example. > >> >> So I think there are problems here but I don't think the removal of the wait >> for >> >> lazy umount is the worst of it. >> >> >> The question then becomes, to start with, how do we resolve this unjustified >> EBUSY >> >> return. Perhaps a completion (used between the umount and mount system >> calls) would >> >> work well here? > Again, this already exists deeper down the stack... >
On Mon, Jul 1, 2024 at 7:50 AM Christian Brauner <brauner@kernel.org> wrote: > > > I always thought the rcu delay was to ensure concurrent path walks "see" the > > > > umount not to ensure correct operation of the following mntput()(s). > > > > > > Isn't the sequence of operations roughly, resolve path, lock, deatch, > > release > > > > lock, rcu wait, mntput() subordinate mounts, put path. > > The crucial bit is really that synchronize_rcu_expedited() ensures that > the final mntput() won't happen until path walk leaves RCU mode. > > This allows caller's like legitimize_mnt() which are called with only > the RCU read-lock during lazy path walk to simple check for > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > be freed until an RCU grace period is up and so they know that they can > simply put the reference count they took _without having to actually > call mntput()_. > > Because if they did have to call mntput() they might end up shutting the > filesystem down instead of umount() and that will cause said EBUSY > errors I mentioned in my earlier mails. But such behaviour could be kept even without an expedited RCU sync. Such as in my alternative patch for this: https://www.spinics.net/lists/linux-fsdevel/msg270117.html I.e. we would still guarantee the final mntput() is called, but not block the return of the unmount call.
On Mon 01-07-24 10:41:40, Alexander Larsson wrote: > On Mon, Jul 1, 2024 at 7:50 AM Christian Brauner <brauner@kernel.org> wrote: > > > > > I always thought the rcu delay was to ensure concurrent path walks "see" the > > > > > > umount not to ensure correct operation of the following mntput()(s). > > > > > > > > > Isn't the sequence of operations roughly, resolve path, lock, deatch, > > > release > > > > > > lock, rcu wait, mntput() subordinate mounts, put path. > > > > The crucial bit is really that synchronize_rcu_expedited() ensures that > > the final mntput() won't happen until path walk leaves RCU mode. > > > > This allows caller's like legitimize_mnt() which are called with only > > the RCU read-lock during lazy path walk to simple check for > > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > > be freed until an RCU grace period is up and so they know that they can > > simply put the reference count they took _without having to actually > > call mntput()_. > > > > Because if they did have to call mntput() they might end up shutting the > > filesystem down instead of umount() and that will cause said EBUSY > > errors I mentioned in my earlier mails. > > But such behaviour could be kept even without an expedited RCU sync. > Such as in my alternative patch for this: > https://www.spinics.net/lists/linux-fsdevel/msg270117.html > > I.e. we would still guarantee the final mput is called, but not block > the return of the unmount call. So FWIW the approach of handing off the remainder of namespace_unlock() into rcu callback for lazy unmount looks workable to me. 
Just as Al Viro pointed out, you cannot do all the stuff right from the RCU callback as the context doesn't allow all the work to happen there, so you just need to queue work from the RCU callback and then do the real work from there (but OTOH you can avoid the task work in mntput_no_expire() in that case - will need a bit of refactoring). Honza
On Mon, Jul 01, 2024 at 10:41:40AM GMT, Alexander Larsson wrote: > On Mon, Jul 1, 2024 at 7:50 AM Christian Brauner <brauner@kernel.org> wrote: > > > > > I always thought the rcu delay was to ensure concurrent path walks "see" the > > > > > > umount not to ensure correct operation of the following mntput()(s). > > > > > > > > > Isn't the sequence of operations roughly, resolve path, lock, deatch, > > > release > > > > > > lock, rcu wait, mntput() subordinate mounts, put path. > > > > The crucial bit is really that synchronize_rcu_expedited() ensures that > > the final mntput() won't happen until path walk leaves RCU mode. > > > > This allows caller's like legitimize_mnt() which are called with only > > the RCU read-lock during lazy path walk to simple check for > > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > > be freed until an RCU grace period is up and so they know that they can > > simply put the reference count they took _without having to actually > > call mntput()_. > > > > Because if they did have to call mntput() they might end up shutting the > > filesystem down instead of umount() and that will cause said EBUSY > > errors I mentioned in my earlier mails. > > But such behaviour could be kept even without an expedited RCU sync. > Such as in my alternative patch for this: > https://www.spinics.net/lists/linux-fsdevel/msg270117.html > > I.e. we would still guarantee the final mput is called, but not block > the return of the unmount call. That's fine but the patch as sent doesn't work is my point. It'll cause exactly the issues described earlier, no? So I'm confused why this version simply ended up removing synchronize_rcu_expedited() when the proposed soluton seems to have been to use queue_rcu_work(). 
But anyway, my concern with this is still that this changes the way MNT_DETACH behaves when you shut down a non-busy filesystem with MNT_DETACH, as outlined in my other mail. If you find a workable version I'm not entirely opposed to trying this, but I wouldn't be surprised if this causes user-visible issues for anyone that uses MNT_DETACH on an unused filesystem.
On Mon, Jul 01, 2024 at 12:15:36PM GMT, Jan Kara wrote: > On Mon 01-07-24 10:41:40, Alexander Larsson wrote: > > On Mon, Jul 1, 2024 at 7:50 AM Christian Brauner <brauner@kernel.org> wrote: > > > > > > > I always thought the rcu delay was to ensure concurrent path walks "see" the > > > > > > > > umount not to ensure correct operation of the following mntput()(s). > > > > > > > > > > > > Isn't the sequence of operations roughly, resolve path, lock, deatch, > > > > release > > > > > > > > lock, rcu wait, mntput() subordinate mounts, put path. > > > > > > The crucial bit is really that synchronize_rcu_expedited() ensures that > > > the final mntput() won't happen until path walk leaves RCU mode. > > > > > > This allows caller's like legitimize_mnt() which are called with only > > > the RCU read-lock during lazy path walk to simple check for > > > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > > > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > > > be freed until an RCU grace period is up and so they know that they can > > > simply put the reference count they took _without having to actually > > > call mntput()_. > > > > > > Because if they did have to call mntput() they might end up shutting the > > > filesystem down instead of umount() and that will cause said EBUSY > > > errors I mentioned in my earlier mails. > > > > But such behaviour could be kept even without an expedited RCU sync. > > Such as in my alternative patch for this: > > https://www.spinics.net/lists/linux-fsdevel/msg270117.html > > > > I.e. we would still guarantee the final mput is called, but not block > > the return of the unmount call. > > So FWIW the approach of handing off the remainder of namespace_unlock() > into rcu callback for lazy unmount looks workable to me. 
> Just as Al Viro > pointed out you cannot do all the stuff right from the RCU callback as the > context doesn't allow all the work to happen there, so you just need to > queue work from RCU callback and then do the real work from there (but OTOH > you can avoid the task work in mnput_noexpire() in that case - will need a > bit of refactoring). Yes, but that wasn't what this patch did. As I said, I'm not opposed to trying a _working_ version of this, but I suspect we'll slightly change MNT_DETACH and cause user-visible changes. (But then we may end up adding MNT_ASYNC or something, which I wouldn't consider the worst idea ever.)
On 1/7/24 13:50, Christian Brauner wrote: >> I always thought the rcu delay was to ensure concurrent path walks "see" the >> >> umount not to ensure correct operation of the following mntput()(s). >> >> >> Isn't the sequence of operations roughly, resolve path, lock, deatch, >> release >> >> lock, rcu wait, mntput() subordinate mounts, put path. Sorry but I'm still having trouble understanding the role of the rcu wait. > The crucial bit is really that synchronize_rcu_expedited() ensures that > the final mntput() won't happen until path walk leaves RCU mode. Sure, that's easily seen, even for me, but the rcu read lock is held for the duration of the rcu walk and not released until leaving rcu walk mode and, on fail, it switches to ref walk mode and restarts. So the mount struct won't be freed from under the process in rcu walk mode, correct? > > This allows caller's like legitimize_mnt() which are called with only > the RCU read-lock during lazy path walk to simple check for > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > be freed until an RCU grace period is up and so they know that they can > simply put the reference count they took _without having to actually > call mntput()_. > > Because if they did have to call mntput() they might end up shutting the > filesystem down instead of umount() and that will cause said EBUSY > errors I mentioned in my earlier mails. Again, I get this too, but where is the need for the rcu wait in this? Originally I had the notion that it was to ensure any path walkers had seen the mount become invalid before tearing down things that enable the detection, but suddenly I don't get that any more ... Please help me out here, I just don't get the need (and I'm sure there is one) for the rcu wait. Ian > >> >> So the mount gets detached in the critical section, then we wait followed by >> >> the mntput()(s). 
The catch is that not waiting might increase the likelyhood >> >> that concurrent path walks don't see the umount (so that possibly the umount >> >> goes away before the walks see the umount) but I'm not certain. What looks >> to >> >> be as much of a problem is mntput() racing with a concurrent mount beacase >> while >> >> the detach is done in the critical section the super block instance list >> deletion >> >> is not and the wait will make the race possibility more likely. What's more > Concurrent mounters of the same filesystem will wait for each other via > grab_super(). That has it's own logic based on sb->s_active which goes > to zero when all mounts are gone. > >> mntput() delegates the mount cleanup (which deletes the list instance) to a >> >> workqueue job so this can also occur serially in a following mount command. > No, that only happens when it's a kthread. Regular umount() call goes > via task work which finishes before the caller returns to userspace > (same as closing files work). > >> >> In fact I might have seen exactly this behavior in a recent xfs-tests run >> where I >> >> was puzzled to see occasional EBUSY return on mounting of mounts that should >> not >> >> have been in use following their umount. > That's usually very much other bugs. See commit 2ae4db5647d8 ("fs: don't > misleadingly warn during thaw operations") in vfs.fixes for example. > >> >> So I think there are problems here but I don't think the removal of the wait >> for >> >> lazy umount is the worst of it. >> >> >> The question then becomes, to start with, how do we resolve this unjustified >> EBUSY >> >> return. Perhaps a completion (used between the umount and mount system >> calls) would >> >> work well here? > Again, this already exists deeper down the stack...
On Tue, Jul 02, 2024 at 09:29:49AM GMT, Ian Kent wrote: > On 1/7/24 13:50, Christian Brauner wrote: > > > I always thought the rcu delay was to ensure concurrent path walks "see" the > > > > > > umount not to ensure correct operation of the following mntput()(s). > > > > > > > > > Isn't the sequence of operations roughly, resolve path, lock, deatch, > > > release > > > > > > lock, rcu wait, mntput() subordinate mounts, put path. > > Sorry but I'm still having trouble understanding the role of the rcu wait. Ok, maybe I'm missing what you're after. > > > > The crucial bit is really that synchronize_rcu_expedited() ensures that > > the final mntput() won't happen until path walk leaves RCU mode. > > Sure, that's easily seen, even for me, but the rcu read lock is held for > > the duration of the rcu walk and not released until leaving rcu walk more > > and, on fail, switches to ref walk mode and restarts. So the mount struct > > won't be freed from under the process in rcu walk mode, correct? Yes. > > > > > > This allows caller's like legitimize_mnt() which are called with only > > the RCU read-lock during lazy path walk to simple check for > > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see > > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't > > be freed until an RCU grace period is up and so they know that they can > > simply put the reference count they took _without having to actually > > call mntput()_. > > > > Because if they did have to call mntput() they might end up shutting the > > filesystem down instead of umount() and that will cause said EBUSY > > errors I mentioned in my earlier mails. > > Again, I get this too, but where is the need for the rcu wait in this? The rcu wait is there to allow lazy path walk to not call mntput(). Otherwise you'll see these EBUSY issues. Maybe I'm confused now but you just said that you got it. 
One thing that I had misremembered though is that a lazy umount won't set MNT_SYNC_UMOUNT on the mounts it kills. So really for that case synchronize_rcu_expedited() won't matter. > Originally I had the notion that it was to ensure any path walkers had seen > > the mount become invalid before tearing down things that enable the > detection > > but suddenly I don't get that any more ... For a regular umount, concurrent lazy path walkers want to be able to not steal the last umount. So the synchronize_*() ensures that they all see MNT_SYNC_UMOUNT and don't need to call mntput(). Afaict, what you're thinking about is handled by call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt) in cleanup_mnt(), which is always called and makes sure that anyone still holding an rcu_read_lock() over that mount can access the data. That's how I had always understood it. > > > Please help me out here, I just don't get the need (and I'm sure there is > > one) for the rcu wait. > > > Ian > > > > > > > > > So the mount gets detached in the critical section, then we wait followed by > > > > > > the mntput()(s). The catch is that not waiting might increase the likelyhood > > > > > > that concurrent path walks don't see the umount (so that possibly the umount > > > > > > goes away before the walks see the umount) but I'm not certain. What looks > > > to > > > > > > be as much of a problem is mntput() racing with a concurrent mount beacase > > > while > > > > > > the detach is done in the critical section the super block instance list > > > deletion > > > > > > is not and the wait will make the race possibility more likely. What's more > > Concurrent mounters of the same filesystem will wait for each other via > > grab_super(). That has it's own logic based on sb->s_active which goes > > to zero when all mounts are gone. > > > > > mntput() delegates the mount cleanup (which deletes the list instance) to a > > > > > > workqueue job so this can also occur serially in a following mount command. 
> > No, that only happens when it's a kthread. Regular umount() call goes > > via task work which finishes before the caller returns to userspace > > (same as closing files work). > > > > > > > > In fact I might have seen exactly this behavior in a recent xfs-tests run > > > where I > > > > > > was puzzled to see occasional EBUSY return on mounting of mounts that should > > > not > > > > > > have been in use following their umount. > > That's usually very much other bugs. See commit 2ae4db5647d8 ("fs: don't > > misleadingly warn during thaw operations") in vfs.fixes for example. > > > > > > > > So I think there are problems here but I don't think the removal of the wait > > > for > > > > > > lazy umount is the worst of it. > > > > > > > > > The question then becomes, to start with, how do we resolve this unjustified > > > EBUSY > > > > > > return. Perhaps a completion (used between the umount and mount system > > > calls) would > > > > > > work well here? > > Again, this already exists deeper down the stack...
On Fri, Jun 28, 2024 at 01:13:45PM GMT, Jan Kara wrote: > On Fri 28-06-24 10:58:54, Ian Kent wrote: > > > > On 27/6/24 19:54, Jan Kara wrote: > > > On Thu 27-06-24 09:11:14, Ian Kent wrote: > > > > On 27/6/24 04:47, Matthew Wilcox wrote: > > > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: > > > > > > +++ b/fs/namespace.c > > > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; > > > > > > static DECLARE_RWSEM(namespace_sem); > > > > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */ > > > > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ > > > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */ > > > > > That's a pretty ugly way of doing it. How about this? > > > > Ha! > > > > > > > > That was my original thought but I also didn't much like changing all the > > > > callers. > > > > > > > > I don't really like the proliferation of these small helper functions either > > > > but if everyone > > > > > > > > is happy to do this I think it's a great idea. > > > So I know you've suggested removing synchronize_rcu_expedited() call in > > > your comment to v2. But I wonder why is it safe? I *thought* > > > synchronize_rcu_expedited() is there to synchronize the dropping of the > > > last mnt reference (and maybe something else) - see the comment at the > > > beginning of mntput_no_expire() - and this change would break that? > > > > Interesting, because of the definition of lazy umount I didn't look closely > > enough at that. > > > > But I wonder, how exactly would that race occur, is holding the rcu read > > lock sufficient since the rcu'd mount free won't be done until it's > > released (at least I think that's how rcu works). > > I'm concerned about a race like: > > [path lookup] [umount -l] > ... > path_put() > mntput(mnt) > mntput_no_expire(m) > rcu_read_lock(); > if (likely(READ_ONCE(mnt->mnt_ns))) { > do_umount() > umount_tree() > ... 
> mnt->mnt_ns = NULL; > ... > namespace_unlock() > mntput(&m->mnt) > mntput_no_expire(mnt) > smp_mb(); > mnt_add_count(mnt, -1); > count = mnt_get_count(mnt); > if (count != 0) { > ... > return; > mnt_add_count(mnt, -1); > rcu_read_unlock(); > return; > -> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount! > } Yeah, I think that's a valid concern. mntput_no_expire() requires that the last reference is dropped after an rcu grace period and that can only be done by synchronize_rcu_*() (it could be reworked but that would be quite ugly). See also mnt_make_shortterm() callers for kernel-initiated unmounts.
On 2/7/24 12:58, Christian Brauner wrote: > On Fri, Jun 28, 2024 at 01:13:45PM GMT, Jan Kara wrote: >> On Fri 28-06-24 10:58:54, Ian Kent wrote: >>> On 27/6/24 19:54, Jan Kara wrote: >>>> On Thu 27-06-24 09:11:14, Ian Kent wrote: >>>>> On 27/6/24 04:47, Matthew Wilcox wrote: >>>>>> On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: >>>>>>> +++ b/fs/namespace.c >>>>>>> @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; >>>>>>> static DECLARE_RWSEM(namespace_sem); >>>>>>> static HLIST_HEAD(unmounted); /* protected by namespace_sem */ >>>>>>> static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ >>>>>>> +static bool lazy_unlock = false; /* protected by namespace_sem */ >>>>>> That's a pretty ugly way of doing it. How about this? >>>>> Ha! >>>>> >>>>> That was my original thought but I also didn't much like changing all the >>>>> callers. >>>>> >>>>> I don't really like the proliferation of these small helper functions either >>>>> but if everyone >>>>> >>>>> is happy to do this I think it's a great idea. >>>> So I know you've suggested removing synchronize_rcu_expedited() call in >>>> your comment to v2. But I wonder why is it safe? I *thought* >>>> synchronize_rcu_expedited() is there to synchronize the dropping of the >>>> last mnt reference (and maybe something else) - see the comment at the >>>> beginning of mntput_no_expire() - and this change would break that? >>> Interesting, because of the definition of lazy umount I didn't look closely >>> enough at that. >>> >>> But I wonder, how exactly would that race occur, is holding the rcu read >>> lock sufficient since the rcu'd mount free won't be done until it's >>> released (at least I think that's how rcu works). >> I'm concerned about a race like: >> >> [path lookup] [umount -l] >> ... >> path_put() >> mntput(mnt) >> mntput_no_expire(m) >> rcu_read_lock(); >> if (likely(READ_ONCE(mnt->mnt_ns))) { >> do_umount() >> umount_tree() >> ... >> mnt->mnt_ns = NULL; >> ... 
>> namespace_unlock() >> mntput(&m->mnt) >> mntput_no_expire(mnt) >> smp_mb(); >> mnt_add_count(mnt, -1); >> count = mnt_get_count(mnt); >> if (count != 0) { >> ... >> return; >> mnt_add_count(mnt, -1); >> rcu_read_unlock(); >> return; >> -> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount! >> } > Yeah, I think that's a valid concern. mntput_no_expire() requires that > the last reference is dropped after an rcu grace period and that can > only be done by synchronize_rcu_*() (It could be reworked but that would > be quite ugly.). See also mnt_make_shortterm() caller's for kernel > initiated unmounts. I've thought about this a couple of times now. Isn't it the case here that the path lookup thread will have taken a reference (because it's calling path_put()) and the umount will have taken a reference on system call entry? So for the mount being umounted the starting count will be at least three; then if the umount mntput() is called from namespace_unlock() it will correctly see count != 0, and the path lookup mntput() will release its reference, finally leaving the mntput() of the path_put() from the top-level system call function to release the last reference. Once again I find myself thinking this should be independent of the rcu wait, because only path walks done before the mount being detached can be happening, the lockless walks are done holding the rcu read lock, and how likely is it that a ref-walk path lookup (which should follow a failed rcu-walk in this case) has been able to grab a reference anyway? I think the only reason the wait could be significant is to prevent changes to the structures concerned causing problems because they happen earlier than can be tolerated. That I can understand. Mmm ... I feel like I'm starting to sound like a broken record ... oops! Ian
On Tue 02-07-24 15:01:54, Ian Kent wrote: > On 2/7/24 12:58, Christian Brauner wrote: > > On Fri, Jun 28, 2024 at 01:13:45PM GMT, Jan Kara wrote: > > > On Fri 28-06-24 10:58:54, Ian Kent wrote: > > > > On 27/6/24 19:54, Jan Kara wrote: > > > > > On Thu 27-06-24 09:11:14, Ian Kent wrote: > > > > > > On 27/6/24 04:47, Matthew Wilcox wrote: > > > > > > > On Wed, Jun 26, 2024 at 04:07:49PM -0400, Lucas Karpinski wrote: > > > > > > > > +++ b/fs/namespace.c > > > > > > > > @@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init; > > > > > > > > static DECLARE_RWSEM(namespace_sem); > > > > > > > > static HLIST_HEAD(unmounted); /* protected by namespace_sem */ > > > > > > > > static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */ > > > > > > > > +static bool lazy_unlock = false; /* protected by namespace_sem */ > > > > > > > That's a pretty ugly way of doing it. How about this? > > > > > > Ha! > > > > > > > > > > > > That was my original thought but I also didn't much like changing all the > > > > > > callers. > > > > > > > > > > > > I don't really like the proliferation of these small helper functions either > > > > > > but if everyone > > > > > > > > > > > > is happy to do this I think it's a great idea. > > > > > So I know you've suggested removing synchronize_rcu_expedited() call in > > > > > your comment to v2. But I wonder why is it safe? I *thought* > > > > > synchronize_rcu_expedited() is there to synchronize the dropping of the > > > > > last mnt reference (and maybe something else) - see the comment at the > > > > > beginning of mntput_no_expire() - and this change would break that? > > > > Interesting, because of the definition of lazy umount I didn't look closely > > > > enough at that. > > > > > > > > But I wonder, how exactly would that race occur, is holding the rcu read > > > > lock sufficient since the rcu'd mount free won't be done until it's > > > > released (at least I think that's how rcu works). 
> > > I'm concerned about a race like: > > > > > > [path lookup] [umount -l] > > > ... > > > path_put() > > > mntput(mnt) > > > mntput_no_expire(m) > > > rcu_read_lock(); > > > if (likely(READ_ONCE(mnt->mnt_ns))) { > > > do_umount() > > > umount_tree() > > > ... > > > mnt->mnt_ns = NULL; > > > ... > > > namespace_unlock() > > > mntput(&m->mnt) > > > mntput_no_expire(mnt) > > > smp_mb(); > > > mnt_add_count(mnt, -1); > > > count = mnt_get_count(mnt); > > > if (count != 0) { > > > ... > > > return; > > > mnt_add_count(mnt, -1); > > > rcu_read_unlock(); > > > return; > > > -> KABOOM, mnt->mnt_count dropped to 0 but nobody cleaned up the mount! > > > } > > Yeah, I think that's a valid concern. mntput_no_expire() requires that > > the last reference is dropped after an rcu grace period and that can > > only be done by synchronize_rcu_*() (It could be reworked but that would > > be quite ugly.). See also mnt_make_shortterm() caller's for kernel > > initiated unmounts. > > I've thought about this a couple of times now. > > Isn't it the case here that the path lookup thread will have taken a > reference (because it's calling path_put()) and the umount will have > taken a reference on system call entry. Yes, path lookup has taken a reference to mnt in this case. The umount syscall also has a reference to the mount in the struct path it has got from user_path_at(). But note that a single umount call can end up tearing down the whole tree of mounts AFAICT (in umount_tree()), so you cannot really rely on the fact that the syscall holds a ref to the mount it is tearing down. Secondly, even if path_umount() were holding a reference to the mount being torn down, it is trivial to extend the race window so that the task doing 'umount -l' continues until it gets past mntput_no_expire() in path_umount(), and only then does the task doing path_put() wake up, and you get the same problem. 
> So for the mount being umounted the starting count will be at least
> three. Then, if the umount mntput() is called from namespace_unlock(),
> it will correctly see count != 0, and the path lookup mntput() will
> release its reference, finally leaving the mntput() of the path_put()
> from the top level system call function to release the last reference.
>
> Once again I find myself thinking this should be independent of the rcu
> wait, because only path walks done before the mount was detached can be
> happening, and the lockless walks are done holding the rcu read lock.
> And how likely is it that a ref-walk path lookup (which should follow a
> failed rcu-walk in this case) has been able to grab a reference anyway?

No, it really is not independent of the RCU wait. mntput_no_expire() uses
RCU for proper synchronization of the dropping of the last mount
reference. AFAIU there's a rule: after you clear mnt->mnt_ns, you must
wait for an RCU grace period to expire before you can drop the mnt
reference you are holding. RCU path walk uses this fact, but any other
part of the kernel calling mntput_no_expire() implicitly depends on this
behavior as well. And the changes to the lazy umount path must not break
this rule (or they have to come up with a different way to synchronize
the dropping of the last mnt reference).

								Honza
On Mon, Jul 01, 2024 at 02:10:31PM GMT, Christian Brauner wrote:
> On Mon, Jul 01, 2024 at 10:41:40AM GMT, Alexander Larsson wrote:
> > On Mon, Jul 1, 2024 at 7:50 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > > I always thought the rcu delay was to ensure concurrent path walks "see" the
> > > > > umount, not to ensure correct operation of the following mntput()(s).
> > > > >
> > > > > Isn't the sequence of operations roughly: resolve path, lock, detach,
> > > > > release lock, rcu wait, mntput() subordinate mounts, put path?
> > > The crucial bit is really that synchronize_rcu_expedited() ensures that
> > > the final mntput() won't happen until path walk leaves RCU mode.
> > >
> > > This allows callers like legitimize_mnt(), which are called with only
> > > the RCU read-lock held during lazy path walk, to simply check for
> > > MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see
> > > that this mount is MNT_SYNC_UMOUNT then they know that the mount won't
> > > be freed until an RCU grace period is up, and so they know that they can
> > > simply put the reference count they took _without having to actually
> > > call mntput()_.
> > >
> > > Because if they did have to call mntput() they might end up shutting the
> > > filesystem down instead of umount() and that will cause said EBUSY
> > > errors I mentioned in my earlier mails.
> > But such behaviour could be kept even without an expedited RCU sync,
> > such as in my alternative patch for this:
> > https://www.spinics.net/lists/linux-fsdevel/msg270117.html
> >
> > I.e. we would still guarantee the final mntput() is called, but not
> > block the return of the umount call.
> That's fine but the patch as sent doesn't work is my point. It'll cause
> exactly the issues described earlier, no? So I'm confused why this
> version simply ended up removing synchronize_rcu_expedited() when
> the proposed solution seems to have been to use queue_rcu_work().
> But anyway, my concern with this is still that this changes the way
> MNT_DETACH behaves when you shut down a non-busy filesystem with
> MNT_DETACH, as outlined in my other mail.
>
> If you find a workable version I'm not entirely opposed to trying this,
> but I wouldn't be surprised if this causes user-visible issues for
> anyone that uses MNT_DETACH on a non-used filesystem.

Correction: I misremembered that umount_tree() is called with UMOUNT_SYNC
only in the case that umount() isn't called with MNT_DETACH. I mentioned
this yesterday in the thread, but just in case you missed it I want to
spell it out in detail as well.

This is relevant because UMOUNT_SYNC will raise MNT_SYNC_UMOUNT on all
mounts it unmounts. And that ends up being checked in legitimize_mnt() to
ensure that legitimize_mnt() doesn't call mntput() during path lookup,
risking EBUSY for a umount(..., 0) + mount() sequence for the same
filesystem.

But for umount(..., MNT_DETACH), UMOUNT_SYNC isn't passed, so
MNT_SYNC_UMOUNT isn't raised on the mount, and legitimize_mnt() may end
up doing the last mntput() and cleaning up the filesystem.

In other words, a umount(..., MNT_DETACH) caller needs to be prepared to
deal with EBUSY for a umount(..., MNT_DETACH) + mount() sequence.

So I think we can certainly try this, as long as we make it via
queue_rcu_work() to handle the other mntput_no_expire() grace period
dependency we discussed upthread.

Thanks for making me take a closer look.
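The queue_rcu_work() direction mentioned here could look roughly like the
following. This is an untested sketch, not a patch from this thread: the
deferred_umount container, defer_free_mounts() helper, and the exact hook
point in namespace_unlock() are all assumptions; only rcu_work,
INIT_RCU_WORK(), and queue_rcu_work() are real kernel APIs.

```c
/* Hypothetical container carrying a detached umount tree to a worker. */
struct deferred_umount {
	struct rcu_work rwork;
	struct hlist_head head;
};

static void defer_free_mounts(struct work_struct *work)
{
	struct deferred_umount *d;
	struct hlist_node *p;
	struct mount *m;

	/* queue_rcu_work() guarantees a full RCU grace period has
	 * elapsed before this runs, standing in for the
	 * synchronize_rcu_expedited() the caller skipped. */
	d = container_of(to_rcu_work(work), struct deferred_umount, rwork);
	hlist_for_each_entry_safe(m, p, &d->head, mnt_umount) {
		hlist_del(&m->mnt_umount);
		mntput(&m->mnt);
	}
	kfree(d);
}

/* In namespace_unlock(), for the MNT_DETACH case, instead of the
 * synchronize_rcu_expedited() + mntput() loop (d allocated earlier,
 * falling back to the synchronous path on allocation failure): */
	hlist_move_list(&head, &d->head);
	INIT_RCU_WORK(&d->rwork, defer_free_mounts);
	queue_rcu_work(system_unbound_wq, &d->rwork);
```

This keeps the rule discussed above intact: the final mntput() on each
detached mount still happens only after a grace period, but the umount
syscall no longer waits for it.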
On 3/7/24 17:22, Christian Brauner wrote:
> On Mon, Jul 01, 2024 at 02:10:31PM GMT, Christian Brauner wrote:
>> On Mon, Jul 01, 2024 at 10:41:40AM GMT, Alexander Larsson wrote:
>>> On Mon, Jul 1, 2024 at 7:50 AM Christian Brauner <brauner@kernel.org> wrote:
>>>>> I always thought the rcu delay was to ensure concurrent path walks "see" the
>>>>> umount, not to ensure correct operation of the following mntput()(s).
>>>>>
>>>>> Isn't the sequence of operations roughly: resolve path, lock, detach,
>>>>> release lock, rcu wait, mntput() subordinate mounts, put path?
>>>> The crucial bit is really that synchronize_rcu_expedited() ensures that
>>>> the final mntput() won't happen until path walk leaves RCU mode.
>>>>
>>>> This allows callers like legitimize_mnt(), which are called with only
>>>> the RCU read-lock held during lazy path walk, to simply check for
>>>> MNT_SYNC_UMOUNT and see that the mnt is about to be killed. If they see
>>>> that this mount is MNT_SYNC_UMOUNT then they know that the mount won't
>>>> be freed until an RCU grace period is up, and so they know that they can
>>>> simply put the reference count they took _without having to actually
>>>> call mntput()_.
>>>>
>>>> Because if they did have to call mntput() they might end up shutting the
>>>> filesystem down instead of umount() and that will cause said EBUSY
>>>> errors I mentioned in my earlier mails.
>>> But such behaviour could be kept even without an expedited RCU sync,
>>> such as in my alternative patch for this:
>>> https://www.spinics.net/lists/linux-fsdevel/msg270117.html
>>>
>>> I.e. we would still guarantee the final mntput() is called, but not
>>> block the return of the umount call.
>> That's fine but the patch as sent doesn't work is my point. It'll cause
>> exactly the issues described earlier, no? So I'm confused why this
>> version simply ended up removing synchronize_rcu_expedited() when
>> the proposed solution seems to have been to use queue_rcu_work().
>>
>> But anyway, my concern with this is still that this changes the way
>> MNT_DETACH behaves when you shut down a non-busy filesystem with
>> MNT_DETACH, as outlined in my other mail.
>>
>> If you find a workable version I'm not entirely opposed to trying this,
>> but I wouldn't be surprised if this causes user-visible issues for
>> anyone that uses MNT_DETACH on a non-used filesystem.
> Correction: I misremembered that umount_tree() is called with
> UMOUNT_SYNC only in the case that umount() isn't called with MNT_DETACH.
> I mentioned this yesterday in the thread but just in case you missed it
> I want to spell it out in detail as well.

Thanks Christian, I did see that, yep.

There's also the seqlock in there to alert the legitimize that it needs
to restart using ref-walk.

> This is relevant because UMOUNT_SYNC will raise MNT_SYNC_UMOUNT on all
> mounts it unmounts. And that ends up being checked in legitimize_mnt()
> to ensure that legitimize_mnt() doesn't call mntput() during path
> lookup, risking EBUSY for a umount(..., 0) + mount() sequence for the
> same filesystem.
>
> But for umount(..., MNT_DETACH), UMOUNT_SYNC isn't passed, so
> MNT_SYNC_UMOUNT isn't raised on the mount, and legitimize_mnt() may
> end up doing the last mntput() and cleaning up the filesystem.
>
> In other words, a umount(..., MNT_DETACH) caller needs to be prepared to
> deal with EBUSY for a umount(..., MNT_DETACH) + mount() sequence.
>
> So I think we can certainly try this, as long as we make it via
> queue_rcu_work() to handle the other mntput_no_expire() grace period
> dependency we discussed upthread.
>
> Thanks for making me take a closer look.

I'm still not sure I fully understand the subtleties of how this works.
I think I'll need to do a deep dive into the rcu code and then revisit
the umount code. At least I won't be idle, ;(

Nevertheless I have to thank both you and Honza for your efforts and
tolerance.

Ian
diff --git a/fs/namespace.c b/fs/namespace.c
index 5a51315c6678..5d889e05dd14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -78,6 +78,7 @@ static struct kmem_cache *mnt_cache __ro_after_init;
 static DECLARE_RWSEM(namespace_sem);
 static HLIST_HEAD(unmounted); /* protected by namespace_sem */
 static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */
+static bool lazy_unlock = false; /* protected by namespace_sem */
 
 struct mount_kattr {
 	unsigned int attr_set;
@@ -1555,6 +1556,7 @@ EXPORT_SYMBOL(may_umount);
 
 static void namespace_unlock(void)
 {
+	bool lazy;
 	struct hlist_head head;
 	struct hlist_node *p;
 	struct mount *m;
@@ -1563,6 +1565,9 @@ static void namespace_unlock(void)
 	hlist_move_list(&unmounted, &head);
 	list_splice_init(&ex_mountpoints, &list);
 
+	lazy = lazy_unlock;
+	lazy_unlock = false;
+
 	up_write(&namespace_sem);
 
 	shrink_dentry_list(&list);
@@ -1570,7 +1575,8 @@ static void namespace_unlock(void)
 	if (likely(hlist_empty(&head)))
 		return;
 
-	synchronize_rcu_expedited();
+	if (!lazy)
+		synchronize_rcu_expedited();
 
 	hlist_for_each_entry_safe(m, p, &head, mnt_umount) {
 		hlist_del(&m->mnt_umount);
@@ -1798,6 +1804,7 @@ static int do_umount(struct mount *mnt, int flags)
 	}
 out:
 	unlock_mount_hash();
+	lazy_unlock = flags & MNT_DETACH ? true : false;
 	namespace_unlock();
 	return retval;
 }