diff mbox

[v2] vfs: introduce UMOUNT_WAIT which waits for umount completion

Message ID 20170921003409.GA11365@jaegeuk-macbookpro.roam.corp.google.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jaegeuk Kim Sept. 21, 2017, 12:34 a.m. UTC
On 09/20, Al Viro wrote:
> On Wed, Sep 20, 2017 at 10:38:31AM -0700, Jaegeuk Kim wrote:
> > This patch introduces UMOUNT_WAIT flag for umount(2) which let user wait for
> > umount(2) to complete filesystem shutdown. This should fix a kernel panic
> > triggered when a living filesystem tries to access dead block device after
> > device_shutdown done by kernel_restart as below.
> 
> NAK.  This is just papering over the race you've got; it does not fix it.
> You count upon the kernel threads in question having already gotten past
> scheduling delayed fput, but what's there to guarantee that?  You are
> essentially adding a "flush all pending fput that had already been
> scheduled" syscall.  It
> 	a) doesn't belong in umount(2) and
> 	b) doesn't fix the race.
>
> It might change the timing enough to have your specific reproducer survive,
> but that kind of approach is simply wrong.
> Incidentally, the name is a misnomer - it does *NOT* wait for completion of
> fs shutdown.  Proof: have a filesystem mounted in two namespaces and issue
> that thing in one of them.  Then observe how it's still alive, well and
> accessible in another.

Yes, I wrote the description incorrectly. Let me try describing UMOUNT_WAIT
which waits for any pending delayed works only like what you said. In normal
cases where other namespace is still active, this doesn't work at all.

> The only case that gets affected by it is when another mount is heading for
> shutdown and is in a very specific part of that.  That is waited for.
> If it's just before *OR* just past that stage, you are fucked.
> 
> And yes, "just past" is also affected.  Look:
> CPU1: delayed_fput()
>         struct llist_node *node = llist_del_all(&delayed_fput_list);
> delayed_fput_list() is empty now
>         llist_for_each_entry_safe(f, t, node, f_u.fu_llist)
>                 __fput(f);
> CPU2: your umount UMOUNT_WAIT
> 	flush_delayed_fput()
> 		does nothing, the list is empty

		how about waiting for workqueue completion here?

> 	....

	If all the __fput()s are not finished, do_umount() will return -EBUSY.

> 	flush_scheduled_work()
> 		waits for delayed_fput() to finish
> CPU1:
> 	finish __fput()
> 	call mntput() from it
> 	schedule_delayed_work(&delayed_mntput_work, 1);
> CPU2:
> 	OK, everything scheduled prior to call of flush_scheduled_work() is completed,
> we are done.
> 	return from umount(2)
> 	(in bogus userland code) tell it to shut devices down
> ...
> oops, that delayed_mntput_work we'd scheduled there got to run.  Too bad...

Is this doable?

---
 fs/file_table.c      |  6 ++++++
 fs/namespace.c       | 26 +++++++++++++++++++++++++-
 include/linux/file.h |  1 +
 include/linux/fs.h   |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

Comments

Al Viro Sept. 21, 2017, 2:42 a.m. UTC | #1
On Wed, Sep 20, 2017 at 05:34:09PM -0700, Jaegeuk Kim wrote:
> > 	flush_delayed_fput()
> > 		does nothing, the list is empty
> 
> 		how about waiting for workqueue completion here?
> 
> > 	....
> 
> 	If all the __fput()s are not finished, do_umount() will return -EBUSY.

Hell, no.  That's only when they are all on the same vfsmount.  And in that
case you don't need any waiting - if any of those mntput() is not past the
unlock_mount_hash() in mntput_no_expire(), you will get -EBUSY.  And if they
all are, the caller of umount(2) will end up dropping the last reference.  
In which case the shutdown will be scheduled via task_work_add() and processed
before umount(2) returns to userland.

The whole problem is that you have several vfsmounts over the same filesystem
(== same struct super_block), some of them held by kernel threads of yours.
umount(2) doesn't affect those and isn't affected by those.  What you do is,
AFAICS,
	ask the kernel threads to start shutting down
	umount()
	shut device down, hoping that all vfsmounts that used
to be held by those threads are gone by that point.

Your patch tries to stick "flush the pending work" in the umount().
With no warranty that it will catch that stuff in the stage where
flushing will affect anything.

> +void flush_delayed_fput_wait(void)
> +{
> +	delayed_fput(NULL);
> +	flush_delayed_work(&delayed_fput_work);
> +}

> +void flush_delayed_mntput_wait(void)
> +{
> +	delayed_mntput(NULL);
> +	flush_delayed_work(&delayed_mntput_work);
> +}

It's still a broken approach.  What I don't understand is why bother
with that sort of brittle logics in the first place.  Why not simply
open the damn thing with O_EXCL before proceeding to device shutdown?
And if you get "busy" from that, wait and retry...
Jaegeuk Kim Sept. 21, 2017, 5:02 a.m. UTC | #2
On 09/21, Al Viro wrote:
> On Wed, Sep 20, 2017 at 05:34:09PM -0700, Jaegeuk Kim wrote:
> > > 	flush_delayed_fput()
> > > 		does nothing, the list is empty
> > 
> > 		how about waiting for workqueue completion here?
> > 
> > > 	....
> > 
> > 	If all the __fput()s are not finished, do_umount() will return -EBUSY.
> 
> Hell, no.  That's only when they are all on the same vfsmount.  And in that
> case you don't need any waiting - if any of those mntput() is not past the
> unlock_mount_hash() in mntput_no_expire(), you will get -EBUSY.  And if they
> all are, the caller of umount(2) will end up dropping the last reference.  
> In which case the shutdown will be scheduled via task_work_add() and processed
> before umount(2) returns to userland.

Yes, what I'm trying to do with this flag would be waiting for releasing
mnt_count in the same vfsmount as well as sb->s_active across namespaces.
So, here at first, I wanted to wait for delayed_fput which grabs mnt_count
in the same vfsmount, so that do_umount() could be succeeded in time. If
this is the last remaining namespace, waiting for delayed_mntput enables
us to shut the filesystem down by task_work at the end of umount(2).

> The whole problem is that you have several vfsmounts over the same filesystem
> (== same struct super_block), some of them held by kernel threads of yours.
> umount(2) doesn't affect those and isn't affected by those.  What you do is,
> AFAICS,
> 	ask the kernel threads to start shutting down
> 	umount()
> 	shut device down, hoping that all vfsmounts that used
> to be held by those threads are gone by that point.

Yes, and actually, android retries umount(2) for several seconds, if it gets
failure. So, first I thought it'd be better to make umount() more deterministic.

> Your patch tries to stick "flush the pending work" in the umount().
> With no warranty that it will catch that stuff in the stage where
> flushing will affect anything.
> 
> > +void flush_delayed_fput_wait(void)
> > +{
> > +	delayed_fput(NULL);
> > +	flush_delayed_work(&delayed_fput_work);
> > +}
> 
> > +void flush_delayed_mntput_wait(void)
> > +{
> > +	delayed_mntput(NULL);
> > +	flush_delayed_work(&delayed_mntput_work);
> > +}
> 
> It's still a broken approach.  What I don't understand is why bother
> with that sort of brittle logics in the first place.  Why not simply
> open the damn thing with O_EXCL before proceeding to device shutdown?
> And if you get "busy" from that, wait and retry...

I'm not sure how many times we can retry and wait for this. IMHO, it'd be better
to use this together with the new flag, since this can detect unclosed namespace
given successful umount(2).
Theodore Ts'o Sept. 21, 2017, 2:48 p.m. UTC | #3
On Wed, Sep 20, 2017 at 10:02:37PM -0700, Jaegeuk Kim wrote:
> 
> Yes, and actually, android retries umount(2) for several seconds, if it gets
> failure. So, first I thought it'd be better to make umount() more deterministic.
> 
> I'm not sure how many times we can retry and wait for this. IMHO, it'd be better
> to use this together with the new flag, since this can detect unclosed namespace
> given successful umount(2).

Even if you could make umount wait for the binder deferred workqueue
to exit, if it takes time, it takes time.  And that's not the core
VFS's fault.  It's binder's.

So... why not fix this by adding a binder ioctl which forces its
workqueue to be flushed, and use that *before* you try calling mount?

If that binder ioctl takes minutes or hours to complete, then that's a
binder bug, and needs to be fixed in binder.

       	    	      	    	  - Ted
Jaegeuk Kim Sept. 21, 2017, 5:16 p.m. UTC | #4
On 09/21, Theodore Ts'o wrote:
> On Wed, Sep 20, 2017 at 10:02:37PM -0700, Jaegeuk Kim wrote:
> > 
> > Yes, and actually, android retries umount(2) for several seconds, if it gets
> > failure. So, first I thought it'd be better to make umount() more deterministic.
> > 
> > I'm not sure how many times we can retry and wait for this. IMHO, it'd be better
> > to use this together with the new flag, since this can detect unclosed namespace
> > given successful umount(2).
> 
> Even if you could make umount wait for the binder deferred workqueue
> to exit, if it takes time, it takes time.  And that's not the core
> VFS's fault.  It's binder's.
> 
> So... why not fix this by adding a binder ioctl which forces its
> workqueue to be flushed, and use that *before* you try calling mount?

Yes, we can, but I'm not 100% sure this happens only by binder. This is caused
by delayed_fput/mntput called by any kernel thread which would be much generic.

> If that binder ioctl takes minutes or hours to complete, then that's a
> binder bug, and needs to be fixed in binder.
> 
>        	    	      	    	  - Ted
diff mbox

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index 72e861a35a7f..35b32ffdb934 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -263,6 +263,12 @@  void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
+void flush_delayed_fput_wait(void)
+{
+	delayed_fput(NULL);
+	flush_delayed_work(&delayed_fput_work);
+}
+
 void fput(struct file *file)
 {
 	if (atomic_long_dec_and_test(&file->f_count)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..e2586a38c83c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@ 
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1133,6 +1134,12 @@  static void delayed_mntput(struct work_struct *unused)
 }
 static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
 
+void flush_delayed_mntput_wait(void)
+{
+	delayed_mntput(NULL);
+	flush_delayed_work(&delayed_mntput_work);
+}
+
 static void mntput_no_expire(struct mount *mnt)
 {
 	rcu_read_lock();
@@ -1629,7 +1636,8 @@  SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1653,11 +1661,27 @@  SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
 
+	/* flush delayed_fput to put mnt_count */
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput_wait();
+
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+
+	if (!retval && (flags & UMOUNT_WAIT)) {
+		/*
+		 * If the last delayed_fput() is called during do_umount()
+		 * and makes mnt_count zero, we need to guarantee to register
+		 * delayed_mntput by waiting for delayed_fput work again.
+		 */
+		flush_delayed_fput_wait();
+
+		/* flush delayed_mntput_work to put sb->s_active */
+		flush_delayed_mntput_wait();
+	}
 out:
 	return retval;
 }
diff --git a/include/linux/file.h b/include/linux/file.h
index 61eb82cbafba..ffb4236cde39 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -84,6 +84,7 @@  extern void put_unused_fd(unsigned int fd);
 extern void fd_install(unsigned int fd, struct file *file);
 
 extern void flush_delayed_fput(void);
+extern void flush_delayed_fput_wait(void);
 extern void __fput_sync(struct file *);
 
 #endif /* __LINUX_FILE_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..69f0fd53c9c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1278,6 +1278,7 @@  struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */