[v2] vfs: introduce UMOUNT_WAIT which waits for umount completion

Message ID 20170920173831.GA7151@jaegeuk-macbookpro.roam.corp.google.com (mailing list archive)
State New, archived

Commit Message

Jaegeuk Kim Sept. 20, 2017, 5:38 p.m. UTC
This patch introduces a UMOUNT_WAIT flag for umount(2) which lets the user
wait for umount(2) to complete filesystem shutdown. This should fix a kernel
panic triggered when a live filesystem tries to access a dead block device
after device_shutdown() has been run by kernel_restart(), as shown below.

Notation: mount(N), where N is that mount's mnt_get_count()

1. create_new_namespaces() creates ns1 and ns2,

  /data(1)    ns1(1)    ns2(1)
    |          |          |
     ---------------------
               |
        sb->s_active = 3

2. After that, binder_proc_clear_zombies() for ns2 and ns1 triggers
  - delayed_fput()
    - delayed_mntput_work(ns2)

  /data(1)    ns1(1)
    |          |
     ----------
          |
    sb->s_active = 2

3. umount() for /data succeeds.

  ns1(1)
    |
 sb->s_active = 1

4. device_shutdown() by init

5. delayed_mntput_work(ns1) finally runs
  - put_super(), since sb->s_active = 0
    - -EIO
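
The timeline above can be replayed in a small userspace model. This is only
a sketch: struct model_sb, model_put_super() and model_replay_timeline() are
hypothetical names invented for illustration, not kernel APIs, and the model
assumes that the final reference drop is the only point that needs the block
device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_sb {
	int s_active;		/* stand-in for sb->s_active */
	bool device_alive;	/* false once device_shutdown() has run */
};

/* Drop one active reference; only the last drop touches the device. */
int model_put_super(struct model_sb *sb)
{
	assert(sb->s_active > 0);
	if (--sb->s_active > 0)
		return 0;	/* other mounts still pin the superblock */
	/* Final shutdown needs I/O, which fails on a dead device. */
	return sb->device_alive ? 0 : -EIO;
}

/* Replay steps 1-5 and return the result of the last put. */
int model_replay_timeline(void)
{
	struct model_sb sb = { .s_active = 3, .device_alive = true }; /* step 1 */
	int ret;

	ret = model_put_super(&sb);	/* step 2: delayed_mntput_work(ns2), 3 -> 2 */
	assert(ret == 0);
	ret = model_put_super(&sb);	/* step 3: umount /data, 2 -> 1 */
	assert(ret == 0);
	sb.device_alive = false;	/* step 4: device_shutdown() */
	return model_put_super(&sb);	/* step 5: delayed_mntput_work(ns1), 1 -> 0 */
}
```

In this model the last put happens after the device is gone, so
model_replay_timeline() reports -EIO, matching step 5.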

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/namespace.c     | 12 +++++++++++-
 include/linux/fs.h |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Al Viro Sept. 20, 2017, 6:38 p.m. UTC | #1
On Wed, Sep 20, 2017 at 10:38:31AM -0700, Jaegeuk Kim wrote:
> This patch introduces a UMOUNT_WAIT flag for umount(2) which lets the user
> wait for umount(2) to complete filesystem shutdown. This should fix a kernel
> panic triggered when a live filesystem tries to access a dead block device
> after device_shutdown() has been run by kernel_restart(), as shown below.

NAK.  This is just papering over the race you've got; it does not fix it.
You count upon the kernel threads in question having already gotten past
scheduling delayed fput, but what's there to guarantee that?  You are
essentially adding a "flush all pending fput that had already been
scheduled" syscall.  It
	a) doesn't belong in umount(2) and
	b) doesn't fix the race.
It might change the timing enough to have your specific reproducer survive,
but that kind of approach is simply wrong.

Incidentally, the name is a misnomer - it does *NOT* wait for completion of
fs shutdown.  Proof: have a filesystem mounted in two namespaces and issue
that thing in one of them.  Then observe how it's still alive, well and
accessible in another.
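
A minimal userspace sketch of that two-namespace case, assuming each
namespace's mount holds exactly one active reference on the superblock.
struct model_fs and both functions are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a filesystem shared by several mount namespaces. */
struct model_fs {
	int s_active;	/* one reference per namespace that has it mounted */
};

/* Unmount in one namespace; returns true only if the fs was shut down. */
bool model_umount_one_ns(struct model_fs *fs)
{
	assert(fs->s_active > 0);
	fs->s_active--;
	return fs->s_active == 0;
}

/* Mounted in two namespaces, unmounted in one: is it still alive? */
bool model_alive_after_single_umount(void)
{
	struct model_fs fs = { .s_active = 2 };
	bool shut_down = model_umount_one_ns(&fs);

	return !shut_down && fs.s_active == 1;
}
```

No amount of waiting inside that one umount call changes the outcome in this
model, because the remaining reference belongs to the other namespace.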

The only case that gets affected by it is when another mount is heading for
shutdown and is in a very specific part of that.  That is waited for.
If it's just before *OR* just past that stage, you are fucked.

And yes, "just past" is also affected.  Look:
CPU1: delayed_fput()
        struct llist_node *node = llist_del_all(&delayed_fput_list);
	(delayed_fput_list is empty now)
        llist_for_each_entry_safe(f, t, node, f_u.fu_llist)
                __fput(f);
CPU2: your umount UMOUNT_WAIT
	flush_delayed_fput()
		does nothing, the list is empty
	....
	flush_scheduled_work()
		waits for delayed_fput() to finish
CPU1:
	finish __fput()
	call mntput() from it
	schedule_delayed_work(&delayed_mntput_work, 1);
CPU2:
	OK, everything scheduled prior to the call of flush_scheduled_work()
	has completed, we are done.
	return from umount(2)
	(in bogus userland code) tell it to shut devices down
...
oops, that delayed_mntput_work we'd scheduled there got to run.  Too bad...
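
The interleaving above can be replayed deterministically in a single-threaded
userspace model. This is a sketch under one stated assumption: a flush only
waits for work queued before the flush began. struct model_wq and the
model_*() helpers are hypothetical names, not the kernel workqueue API, and
the one-jiffy delay of the real delayed work is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_MAX_WORK = 8 };

/* Hypothetical workqueue model: items are identified by queueing order. */
struct model_wq {
	bool done[MODEL_MAX_WORK];
	int queued;	/* number of items queued so far */
};

/* Queue one work item and return its index. */
int model_queue_work(struct model_wq *wq)
{
	assert(wq->queued < MODEL_MAX_WORK);
	wq->done[wq->queued] = false;
	return wq->queued++;
}

/* A flush snapshots what is queued at the moment it starts ... */
int model_flush_begin(const struct model_wq *wq)
{
	return wq->queued;
}

/* ... and returns once everything up to that snapshot has completed. */
void model_flush_finish(struct model_wq *wq, int snapshot)
{
	for (int i = 0; i < snapshot; i++)
		wq->done[i] = true;
}

/* Replay the CPU1/CPU2 interleaving; true means mntput work was missed. */
bool model_mntput_left_pending(void)
{
	struct model_wq wq = { .queued = 0 };
	int fput_work = model_queue_work(&wq);	 /* CPU1: delayed_fput() running */
	int snapshot = model_flush_begin(&wq);	 /* CPU2: flush starts */
	int mntput_work = model_queue_work(&wq); /* CPU1: __fput() -> mntput() */

	model_flush_finish(&wq, snapshot);	 /* CPU2: flush returns */
	assert(wq.done[fput_work]);
	return !wq.done[mntput_work];		 /* still pending after umount */
}
```

In this model the work queued by CPU1 after the flush started is still pending
when the flush returns, which is the window described above.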

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..f2c15c4f6e23 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@ 
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1629,7 +1630,8 @@  SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1653,11 +1655,19 @@  SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
 
+	/* flush delayed_fput to put mnt_count */
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput();
+
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+
+	/* flush delayed_mntput_work to put sb->s_active */
+	if (!retval && (flags & UMOUNT_WAIT))
+		flush_scheduled_work();
 out:
 	return retval;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..69f0fd53c9c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1278,6 +1278,7 @@  struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */