diff mbox

vfs: introduce UMOUNT_WAIT which waits for umount completion

Message ID 20170913200941.39420-1-jaegeuk@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Jaegeuk Kim Sept. 13, 2017, 8:09 p.m. UTC
This patch introduces UMOUNT_WAIT flag for umount(2) which let user wait for
its completion. This would fix a kernel panic caused by block device access by
filesystem, after device_shutdown during kernel_restart. This can happen due
to delayed umount -- reboot process already succeeded to unmount filesystem,
but its instance is sitll alive.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/namespace.c     | 12 +++++++++++-
 include/linux/fs.h |  1 +
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Al Viro Sept. 13, 2017, 11:04 p.m. UTC | #1
On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> +	if (!retval && (flags & UMOUNT_WAIT)) {
> +		if (likely(!(current->flags & PF_KTHREAD)))
> +			task_work_run();

This is complete crap.  The same damn thing will be done by
caller of sys_umount() pretty much immediately afterwards.
I'm not sure what it is that you are trying to paper over,
but this is just plain wrong.

What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
and what would be supplying it to umount(2)?
Jaegeuk Kim Sept. 13, 2017, 11:31 p.m. UTC | #2
Hi Al,

On 09/14, Al Viro wrote:
> On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> > +	if (!retval && (flags & UMOUNT_WAIT)) {
> > +		if (likely(!(current->flags & PF_KTHREAD)))
> > +			task_work_run();
> 
> This is complete crap.  The same damn thing will be done by
> caller of sys_umount() pretty much immediately afterwards.
> I'm not sure what it is that you are trying to paper over,
> but this is just plain wrong.

Okay.

> What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
> and what would be supplying it to umount(2)?

When android tries to reboot the system, it calls umount(2) without any flag.
Then, mntput_no_expire() will add delayed_mntput_work() which finally does
cleanup_mnt() later. In the mean time, android proceeded to shutdown all
the UFS devices, but filesystem would be still alive and tries to trigger
some I/Os. At this moment, I'd like to avoid EIO, since ext4 can issue kernel
panic because of error=panic.

So, what I'm trying to do is 1) adding a flag to wait for umount() completion,
2) issuing umount(2) with UMOUNT_WAIT in android. Then, it can guarantee there'd
be no I/Os after sucessful umount().

Could you please correct me?

Thanks,
Al Viro Sept. 13, 2017, 11:44 p.m. UTC | #3
On Wed, Sep 13, 2017 at 04:31:16PM -0700, Jaegeuk Kim wrote:
> Hi Al,
> 
> On 09/14, Al Viro wrote:
> > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> > > +	if (!retval && (flags & UMOUNT_WAIT)) {
> > > +		if (likely(!(current->flags & PF_KTHREAD)))
> > > +			task_work_run();
> > 
> > This is complete crap.  The same damn thing will be done by
> > caller of sys_umount() pretty much immediately afterwards.
> > I'm not sure what it is that you are trying to paper over,
> > but this is just plain wrong.
> 
> Okay.
> 
> > What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
> > and what would be supplying it to umount(2)?
> 
> When android tries to reboot the system, it calls umount(2) without any flag.
> Then, mntput_no_expire() will add delayed_mntput_work() which finally does
> cleanup_mnt() later. In the mean time, android proceeded to shutdown all
> the UFS devices.

Why has task_work_add() failed?  Or is that umount(2) issued by a kernel thread?
Jaegeuk Kim Sept. 14, 2017, 1:10 a.m. UTC | #4
On 09/14, Al Viro wrote:
> On Wed, Sep 13, 2017 at 04:31:16PM -0700, Jaegeuk Kim wrote:
> > Hi Al,
> > 
> > On 09/14, Al Viro wrote:
> > > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote:
> > > > +	if (!retval && (flags & UMOUNT_WAIT)) {
> > > > +		if (likely(!(current->flags & PF_KTHREAD)))
> > > > +			task_work_run();
> > > 
> > > This is complete crap.  The same damn thing will be done by
> > > caller of sys_umount() pretty much immediately afterwards.
> > > I'm not sure what it is that you are trying to paper over,
> > > but this is just plain wrong.
> > 
> > Okay.
> > 
> > > What _is_ the semantics of UMOUNT_WAIT?  What does it guarantee,
> > > and what would be supplying it to umount(2)?
> > 
> > When android tries to reboot the system, it calls umount(2) without any flag.
> > Then, mntput_no_expire() will add delayed_mntput_work() which finally does
> > cleanup_mnt() later. In the mean time, android proceeded to shutdown all
> > the UFS devices.
> 
> Why has task_work_add() failed?  Or is that umount(2) issued by a kernel thread?

Android triggers umount(2) by init process, which is definitely not a kernel
thread. But, we've seen some kernel panics which say umount(2) was succeeded,
but ext4 triggered a kernel panic due to EIO after then like below. I'm also
not sure task_work_run() would be also safe enoughly. May I ask where I can
find sys_umount() calls task_work_run()?

[254012.860565] c4  12426 [<ffffff909b6a7ebc>] panic+0x184/0x37c
[254012.860589] c4  12426 [<ffffff909b88e724>] __ext4_abort+0x198/0x19c
[254012.860606] c4  12426 [<ffffff909b897468>] ext4_put_super+0x80/0x2b4
[254012.860629] c4  12426 [<ffffff909b7e7274>] generic_shutdown_super+0x68/0xd0
[254012.860646] c4  12426 [<ffffff909b7e87fc>] kill_block_super+0x1c/0x5c
[254012.860663] c4  12426 [<ffffff909b7e70dc>] deactivate_locked_super+0x5c/0xc0
[254012.860679] c4  12426 [<ffffff909b7e71a8>] deactivate_super+0x68/0x74
[254012.860696] c4  12426 [<ffffff909b80a25c>] cleanup_mnt+0xb0/0x12c
[254012.860712] c4  12426 [<ffffff909b80a310>] delayed_mntput+0x38/0x4c
[254012.860737] c4  12426 [<ffffff909b6c6524>] process_one_work+0x1e0/0x490
[254012.860753] c4  12426 [<ffffff909b6c6094>] worker_thread+0x314/0x494
[254012.860771] c4  12426 [<ffffff909b6cb35c>] kthread+0xdc/0xec
[254012.860790] c4  12426 [<ffffff909b683860>] ret_from_fork+0x10/0x30
Al Viro Sept. 14, 2017, 1:30 a.m. UTC | #5
On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:

> Android triggers umount(2) by init process, which is definitely not a kernel
> thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> not sure task_work_run() would be also safe enoughly. May I ask where I can
> find sys_umount() calls task_work_run()?

ret_{fast,slow}_syscall ->
	slow_work_pending ->
		do_work_pending() ->
			tracehook_notify_resume() ->
				task_work_run()

It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
having called one of those and before returning to userland.  What is guaranteed
is that after successful task_work_add() the damn thing will be run in context
of originating process before it returns from syscall.  So any subsequent
syscalls from that process are guaranteed to happen after the work has run.
The same happens if the process exits rather than returns to userland (do_exit() ->
exit_task_work() -> task_work_run()), but for that you would need it to die in
umount(2) (e.g. get kill -9 delivered on the way out).

Please, check if you are seeing task_work_add() failure in there and if you do,
I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
                        if (!task_work_add(task, &mnt->mnt_rcu, true))
                                return;
and see what (if anything) gets printed.
Al Viro Sept. 14, 2017, 6:37 p.m. UTC | #6
On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote:
> On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:
> 
> > Android triggers umount(2) by init process, which is definitely not a kernel
> > thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> > but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> > not sure task_work_run() would be also safe enoughly. May I ask where I can
> > find sys_umount() calls task_work_run()?
> 
> ret_{fast,slow}_syscall ->
> 	slow_work_pending ->
> 		do_work_pending() ->
> 			tracehook_notify_resume() ->
> 				task_work_run()
> 
> It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
> having called one of those and before returning to userland.  What is guaranteed
> is that after successful task_work_add() the damn thing will be run in context
> of originating process before it returns from syscall.  So any subsequent
> syscalls from that process are guaranteed to happen after the work has run.
> The same happens if the process exits rather than returns to userland (do_exit() ->
> exit_task_work() -> task_work_run()), but for that you would need it to die in
> umount(2) (e.g. get kill -9 delivered on the way out).
> 
> Please, check if you are seeing task_work_add() failure in there and if you do,
> I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
>                         if (!task_work_add(task, &mnt->mnt_rcu, true))
>                                 return;
> and see what (if anything) gets printed.

AFAICS, for task_work_add() to fail here we need a final mntput() to be run
in context of a thread that already had exit_signals() run *and* subsequent
task_work_run() run to completion (with all pending callbacks executed, along
with all callbacks added by those, etc.)

For that to have happened during umount(2) we would've needed
	* killing signal delivered while going through the syscall
	* final mntput() to have been done *NOT* from sys_umount() (otherwise
the work would've been added before we got to exit_signals())
	* final mntput() to have been done *NOT* from any task_work callbacks
(otherwise it would've been added before we'd observed a combination of empty
list of pending work with PF_EXITING)

I really want to see the stack trace of that failing task_work_add(), if that's
what actually happens there.  What kind of a reproducer do you have for that?
Jaegeuk Kim Sept. 14, 2017, 7:14 p.m. UTC | #7
On 09/14, Al Viro wrote:
> On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote:
> > On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote:
> > 
> > > Android triggers umount(2) by init process, which is definitely not a kernel
> > > thread. But, we've seen some kernel panics which say umount(2) was succeeded,
> > > but ext4 triggered a kernel panic due to EIO after then like below. I'm also
> > > not sure task_work_run() would be also safe enoughly. May I ask where I can
> > > find sys_umount() calls task_work_run()?
> > 
> > ret_{fast,slow}_syscall ->
> > 	slow_work_pending ->
> > 		do_work_pending() ->
> > 			tracehook_notify_resume() ->
> > 				task_work_run()
> > 
> > It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after
> > having called one of those and before returning to userland.  What is guaranteed
> > is that after successful task_work_add() the damn thing will be run in context
> > of originating process before it returns from syscall.  So any subsequent
> > syscalls from that process are guaranteed to happen after the work has run.
> > The same happens if the process exits rather than returns to userland (do_exit() ->
> > exit_task_work() -> task_work_run()), but for that you would need it to die in
> > umount(2) (e.g. get kill -9 delivered on the way out).
> > 
> > Please, check if you are seeing task_work_add() failure in there and if you do,
> > I would like to see a stack trace.  IOW, slap WARN_ON(1); right after
> >                         if (!task_work_add(task, &mnt->mnt_rcu, true))
> >                                 return;
> > and see what (if anything) gets printed.
> 
> AFAICS, for task_work_add() to fail here we need a final mntput() to be run
> in context of a thread that already had exit_signals() run *and* subsequent
> task_work_run() run to completion (with all pending callbacks executed, along
> with all callbacks added by those, etc.)
> 
> For that to have happened during umount(2) we would've needed
> 	* killing signal delivered while going through the syscall
> 	* final mntput() to have been done *NOT* from sys_umount() (otherwise
> the work would've been added before we got to exit_signals())
> 	* final mntput() to have been done *NOT* from any task_work callbacks
> (otherwise it would've been added before we'd observed a combination of empty
> list of pending work with PF_EXITING)
> 
> I really want to see the stack trace of that failing task_work_add(), if that's
> what actually happens there.  What kind of a reproducer do you have for that?

I've got this error from Android user, so there's no reproducer unfortunately.
So, I wrote a script capturing WARN_ON after reboot running at every minute, but
couldn't have got the error since yesterday so far.
Theodore Ts'o Sept. 15, 2017, 10:12 p.m. UTC | #8
On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote:
> 
> So, I digged it in more detail, and found, in drivers/android/binder.c [1],
> - binder_ioctl()
>  - create a kernel thread
>  - zombie_cleanup_check()
>   - binder_defer_work()
>     - queue_work(..., &binder_deferred_work);
> 
> - binder_deferred_func()
>  - binder_clear_zombies()
>   - binder_proc_clear_zombies()
>    - put_files_struct()
>     - close_files()
>      - filp_close()
>       - fput()
> 
> It seems binder holds some proc files.

If binder was holding some files open, then umount should have failed
with EBUSY, no?

Does Android use mount namespaces at all?

					- Ted
Jaegeuk Kim Sept. 15, 2017, 11:29 p.m. UTC | #9
On 09/15, Theodore Ts'o wrote:
> On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote:
> > 
> > So, I digged it in more detail, and found, in drivers/android/binder.c [1],
> > - binder_ioctl()
> >  - create a kernel thread
> >  - zombie_cleanup_check()
> >   - binder_defer_work()
> >     - queue_work(..., &binder_deferred_work);
> > 
> > - binder_deferred_func()
> >  - binder_clear_zombies()
> >   - binder_proc_clear_zombies()
> >    - put_files_struct()
> >     - close_files()
> >      - filp_close()
> >       - fput()
> > 
> > It seems binder holds some proc files.
> 
> If binder was holding some files open, then umount should have failed
> with EBUSY, no?

Based on what I've got some traces so far,

- binder_ioctl
 - create a kernel thread
 - zombie_cleanup_check
  - binder_defer_work
   - queue_work(..., &binder_deferred_work);

- binder_deferred_func
 - binder_clear_zombies
  - binder_proc_clear_zombies
   - put_files_struct
     - close_files
      - filp_close
       - fput

- delayed_fput
 ...
 - file_free
 - dput
                                                init
                                                - umount
 - mntput
  - mntput_no_expire
                                                - do_umount
						 - mnt_get_count() > 2
                                                - mntput_no_expire
                                                 - mnt_add_count(-1);
   - mnt_add_count(-1);
                                                 - mnt_get_count() return;
                                                 - return 0;
   - delayed_mntput_work
                                                - device_shutdown
    - ext4_put_super()
     - EIO, and panic if error=panic


The mntput() in delayed_fput() is the last function call. So before that moment,
sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not
sure why it check over 2 tho.

> 
> Does Android use mount namespaces at all?
> 
> 					- Ted
Al Viro Sept. 15, 2017, 11:43 p.m. UTC | #10
On Fri, Sep 15, 2017 at 04:29:11PM -0700, Jaegeuk Kim wrote:

> The mntput() in delayed_fput() is the last function call. So before that moment,
> sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not
> sure why it check over 2 tho.

Because it has just grabbed a reference itself, in addition to the one that keeps
the damn thing alive (due to being mounted).  So it bloody well should have
triggered -EBUSY, if they refer to the same vfsmount.
Amir Goldstein Sept. 16, 2017, 7:11 a.m. UTC | #11
On Sat, Sep 16, 2017 at 1:12 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote:
>>
>> So, I digged it in more detail, and found, in drivers/android/binder.c [1],
>> - binder_ioctl()
>>  - create a kernel thread
>>  - zombie_cleanup_check()
>>   - binder_defer_work()
>>     - queue_work(..., &binder_deferred_work);
>>
>> - binder_deferred_func()
>>  - binder_clear_zombies()
>>   - binder_proc_clear_zombies()
>>    - put_files_struct()
>>     - close_files()
>>      - filp_close()
>>       - fput()
>>
>> It seems binder holds some proc files.
>
> If binder was holding some files open, then umount should have failed
> with EBUSY, no?
>
> Does Android use mount namespaces at all?
>

Extensively.
Every user (i.e. from multi user) has its own mount ns with private /data
Every app has its own mount ns, with /sdcard mounted to one of 3 FUSE
sdcard mounts depending of app storage permission (none, rdonly, rdwr).

Amir.
Jaegeuk Kim Sept. 19, 2017, 3:55 p.m. UTC | #12
On 09/16, Al Viro wrote:
> On Fri, Sep 15, 2017 at 04:29:11PM -0700, Jaegeuk Kim wrote:
> 
> > The mntput() in delayed_fput() is the last function call. So before that moment,
> > sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not
> > sure why it check over 2 tho.
> 
> Because it has just grabbed a reference itself, in addition to the one that keeps
> the damn thing alive (due to being mounted).  So it bloody well should have
> triggered -EBUSY, if they refer to the same vfsmount.

I've tracked another view in terms of mnt_get_count() and sb->s_active based on
namespaces, and could get the below scenario for instance.

Term: namespace(mnt_get_count())

1. create_new_namespaces() creates ns1 and ns2,

  /data(1)    ns1(1)    ns2(1)
    |          |          |
     ---------------------
               |
        sb->s_active = 3

2. after binder_proc_clear_zombies() for ns2 and ns1 triggers
  - delayed_fput()
    - delayed_mntput_work(ns2)

  /data(1)    ns1(1)
    |          |
     ----------
          |
    sb->s_active = 2

3. umount() for /data is successed.

  ns1(1)
    |
 sb->s_active = 1

4. device_shutdown() by init

5.  - delayed_mntput_work(ns1)
     - put_super(), since sb->s_active = 0
       - -EIO

Please let me know, if I'm missing something.

Thanks,
diff mbox

Patch

diff --git a/fs/namespace.c b/fs/namespace.c
index f8893dc6a989..b1ac89915b10 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -21,6 +21,7 @@ 
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
+#include <linux/file.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
@@ -1629,7 +1630,8 @@  SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	int retval;
 	int lookup_flags = 0;
 
-	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW |
+			UMOUNT_WAIT))
 		return -EINVAL;
 
 	if (!may_mount())
@@ -1652,12 +1654,20 @@  SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	retval = -EPERM;
 	if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN))
 		goto dput_and_out;
+	if (flags & UMOUNT_WAIT)
+		flush_delayed_fput();
 
 	retval = do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
 	mntput_no_expire(mnt);
+	if (!retval && (flags & UMOUNT_WAIT)) {
+		if (likely(!(current->flags & PF_KTHREAD)))
+			task_work_run();
+		else
+			flush_scheduled_work();
+	}
 out:
 	return retval;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 83341a6a553e..cb62af7a03e7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1283,6 +1283,7 @@  struct mm_struct;
 #define MNT_DETACH	0x00000002	/* Just detach from the tree */
 #define MNT_EXPIRE	0x00000004	/* Mark for expiry */
 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
+#define UMOUNT_WAIT	0x00000010	/* Wait to unmount completely */
 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
 
 /* sb->s_iflags */