Message ID | 20170913200941.39420-1-jaegeuk@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote: > + if (!retval && (flags & UMOUNT_WAIT)) { > + if (likely(!(current->flags & PF_KTHREAD))) > + task_work_run(); This is complete crap. The same damn thing will be done by caller of sys_umount() pretty much immediately afterwards. I'm not sure what it is that you are trying to paper over, but this is just plain wrong. What _is_ the semantics of UMOUNT_WAIT? What does it guarantee, and what would be supplying it to umount(2)?
Hi Al, On 09/14, Al Viro wrote: > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote: > > + if (!retval && (flags & UMOUNT_WAIT)) { > > + if (likely(!(current->flags & PF_KTHREAD))) > > + task_work_run(); > > This is complete crap. The same damn thing will be done by > caller of sys_umount() pretty much immediately afterwards. > I'm not sure what it is that you are trying to paper over, > but this is just plain wrong. Okay. > What _is_ the semantics of UMOUNT_WAIT? What does it guarantee, > and what would be supplying it to umount(2)? When android tries to reboot the system, it calls umount(2) without any flag. Then, mntput_no_expire() will add delayed_mntput_work() which finally does cleanup_mnt() later. In the mean time, android proceeded to shutdown all the UFS devices, but filesystem would be still alive and tries to trigger some I/Os. At this moment, I'd like to avoid EIO, since ext4 can issue kernel panic because of error=panic. So, what I'm trying to do is 1) adding a flag to wait for umount() completion, 2) issuing umount(2) with UMOUNT_WAIT in android. Then, it can guarantee there'd be no I/Os after sucessful umount(). Could you please correct me? Thanks,
On Wed, Sep 13, 2017 at 04:31:16PM -0700, Jaegeuk Kim wrote: > Hi Al, > > On 09/14, Al Viro wrote: > > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote: > > > + if (!retval && (flags & UMOUNT_WAIT)) { > > > + if (likely(!(current->flags & PF_KTHREAD))) > > > + task_work_run(); > > > > This is complete crap. The same damn thing will be done by > > caller of sys_umount() pretty much immediately afterwards. > > I'm not sure what it is that you are trying to paper over, > > but this is just plain wrong. > > Okay. > > > What _is_ the semantics of UMOUNT_WAIT? What does it guarantee, > > and what would be supplying it to umount(2)? > > When android tries to reboot the system, it calls umount(2) without any flag. > Then, mntput_no_expire() will add delayed_mntput_work() which finally does > cleanup_mnt() later. In the mean time, android proceeded to shutdown all > the UFS devices. Why has task_work_add() failed? Or is that umount(2) issued by a kernel thread?
On 09/14, Al Viro wrote: > On Wed, Sep 13, 2017 at 04:31:16PM -0700, Jaegeuk Kim wrote: > > Hi Al, > > > > On 09/14, Al Viro wrote: > > > On Wed, Sep 13, 2017 at 01:09:41PM -0700, Jaegeuk Kim wrote: > > > > + if (!retval && (flags & UMOUNT_WAIT)) { > > > > + if (likely(!(current->flags & PF_KTHREAD))) > > > > + task_work_run(); > > > > > > This is complete crap. The same damn thing will be done by > > > caller of sys_umount() pretty much immediately afterwards. > > > I'm not sure what it is that you are trying to paper over, > > > but this is just plain wrong. > > > > Okay. > > > > > What _is_ the semantics of UMOUNT_WAIT? What does it guarantee, > > > and what would be supplying it to umount(2)? > > > > When android tries to reboot the system, it calls umount(2) without any flag. > > Then, mntput_no_expire() will add delayed_mntput_work() which finally does > > cleanup_mnt() later. In the mean time, android proceeded to shutdown all > > the UFS devices. > > Why has task_work_add() failed? Or is that umount(2) issued by a kernel thread? Android triggers umount(2) by init process, which is definitely not a kernel thread. But, we've seen some kernel panics which say umount(2) was succeeded, but ext4 triggered a kernel panic due to EIO after then like below. I'm also not sure task_work_run() would be also safe enoughly. May I ask where I can find sys_umount() calls task_work_run()? [254012.860565] c4 12426 [<ffffff909b6a7ebc>] panic+0x184/0x37c [254012.860589] c4 12426 [<ffffff909b88e724>] __ext4_abort+0x198/0x19c [254012.860606] c4 12426 [<ffffff909b897468>] ext4_put_super+0x80/0x2b4 [254012.860629] c4 12426 [<ffffff909b7e7274>] generic_shutdown_super+0x68/0xd0 [254012.860646] c4 12426 [<ffffff909b7e87fc>] kill_block_super+0x1c/0x5c [254012.860663] c4 12426 [<ffffff909b7e70dc>] deactivate_locked_super+0x5c/0xc0 [254012.860679] c4 12426 [<ffffff909b7e71a8>] deactivate_super+0x68/0x74 [254012.860696] c4 12426 [<ffffff909b80a25c>] cleanup_mnt+0xb0/0x12c [254012.860712] c4 12426 [<ffffff909b80a310>] delayed_mntput+0x38/0x4c [254012.860737] c4 12426 [<ffffff909b6c6524>] process_one_work+0x1e0/0x490 [254012.860753] c4 12426 [<ffffff909b6c6094>] worker_thread+0x314/0x494 [254012.860771] c4 12426 [<ffffff909b6cb35c>] kthread+0xdc/0xec [254012.860790] c4 12426 [<ffffff909b683860>] ret_from_fork+0x10/0x30
On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote: > Android triggers umount(2) by init process, which is definitely not a kernel > thread. But, we've seen some kernel panics which say umount(2) was succeeded, > but ext4 triggered a kernel panic due to EIO after then like below. I'm also > not sure task_work_run() would be also safe enoughly. May I ask where I can > find sys_umount() calls task_work_run()? ret_{fast,slow}_syscall -> slow_work_pending -> do_work_pending() -> tracehook_notify_resume() -> task_work_run() It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after having called one of those and before returning to userland. What is guaranteed is that after successful task_work_add() the damn thing will be run in context of originating process before it returns from syscall. So any subsequent syscalls from that process are guaranteed to happen after the work has run. The same happens if the process exits rather than returns to userland (do_exit() -> exit_task_work() -> task_work_run()), but for that you would need it to die in umount(2) (e.g. get kill -9 delivered on the way out). Please, check if you are seeing task_work_add() failure in there and if you do, I would like to see a stack trace. IOW, slap WARN_ON(1); right after if (!task_work_add(task, &mnt->mnt_rcu, true)) return; and see what (if anything) gets printed.
On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote: > On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote: > > > Android triggers umount(2) by init process, which is definitely not a kernel > > thread. But, we've seen some kernel panics which say umount(2) was succeeded, > > but ext4 triggered a kernel panic due to EIO after then like below. I'm also > > not sure task_work_run() would be also safe enoughly. May I ask where I can > > find sys_umount() calls task_work_run()? > > ret_{fast,slow}_syscall -> > slow_work_pending -> > do_work_pending() -> > tracehook_notify_resume() -> > task_work_run() > > It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after > having called one of those and before returning to userland. What is guaranteed > is that after successful task_work_add() the damn thing will be run in context > of originating process before it returns from syscall. So any subsequent > syscalls from that process are guaranteed to happen after the work has run. > The same happens if the process exits rather than returns to userland (do_exit() -> > exit_task_work() -> task_work_run()), but for that you would need it to die in > umount(2) (e.g. get kill -9 delivered on the way out). > > Please, check if you are seeing task_work_add() failure in there and if you do, > I would like to see a stack trace. IOW, slap WARN_ON(1); right after > if (!task_work_add(task, &mnt->mnt_rcu, true)) > return; > and see what (if anything) gets printed. AFAICS, for task_work_add() to fail here we need a final mntput() to be run in context of a thread that already had exit_signals() run *and* subsequent task_work_run() run to completion (with all pending callbacks executed, along with all callbacks added by those, etc.) For that to have happened during umount(2) we would've needed * killing signal delivered while going through the syscall * final mntput() to have been done *NOT* from sys_umount() (otherwise the work would've been added before we got to exit_signals()) * final mntput() to have been done *NOT* from any task_work callbacks (otherwise it would've been added before we'd observed a combination of empty list of pending work with PF_EXITING) I really want to see the stack trace of that failing task_work_add(), if that's what actually happens there. What kind of a reproducer do you have for that?
On 09/14, Al Viro wrote: > On Thu, Sep 14, 2017 at 02:30:17AM +0100, Al Viro wrote: > > On Wed, Sep 13, 2017 at 06:10:48PM -0700, Jaegeuk Kim wrote: > > > > > Android triggers umount(2) by init process, which is definitely not a kernel > > > thread. But, we've seen some kernel panics which say umount(2) was succeeded, > > > but ext4 triggered a kernel panic due to EIO after then like below. I'm also > > > not sure task_work_run() would be also safe enoughly. May I ask where I can > > > find sys_umount() calls task_work_run()? > > > > ret_{fast,slow}_syscall -> > > slow_work_pending -> > > do_work_pending() -> > > tracehook_notify_resume() -> > > task_work_run() > > > > It's not sys_umount() (or any other sys_...()) - it's syscall dispatcher after > > having called one of those and before returning to userland. What is guaranteed > > is that after successful task_work_add() the damn thing will be run in context > > of originating process before it returns from syscall. So any subsequent > > syscalls from that process are guaranteed to happen after the work has run. > > The same happens if the process exits rather than returns to userland (do_exit() -> > > exit_task_work() -> task_work_run()), but for that you would need it to die in > > umount(2) (e.g. get kill -9 delivered on the way out). > > > > Please, check if you are seeing task_work_add() failure in there and if you do, > > I would like to see a stack trace. IOW, slap WARN_ON(1); right after > > if (!task_work_add(task, &mnt->mnt_rcu, true)) > > return; > > and see what (if anything) gets printed. > > AFAICS, for task_work_add() to fail here we need a final mntput() to be run > in context of a thread that already had exit_signals() run *and* subsequent > task_work_run() run to completion (with all pending callbacks executed, along > with all callbacks added by those, etc.) > > For that to have happened during umount(2) we would've needed > * killing signal delivered while going through the syscall > * final mntput() to have been done *NOT* from sys_umount() (otherwise > the work would've been added before we got to exit_signals()) > * final mntput() to have been done *NOT* from any task_work callbacks > (otherwise it would've been added before we'd observed a combination of empty > list of pending work with PF_EXITING) > > I really want to see the stack trace of that failing task_work_add(), if that's > what actually happens there. What kind of a reproducer do you have for that? I've got this error from Android user, so there's no reproducer unfortunately. So, I wrote a script capturing WARN_ON after reboot running at every minute, but couldn't have got the error since yesterday so far.
On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote: > > So, I digged it in more detail, and found, in drivers/android/binder.c [1], > - binder_ioctl() > - create a kernel thread > - zombie_cleanup_check() > - binder_defer_work() > - queue_work(..., &binder_deferred_work); > > - binder_deferred_func() > - binder_clear_zombies() > - binder_proc_clear_zombies() > - put_files_struct() > - close_files() > - filp_close() > - fput() > > It seems binder holds some proc files. If binder was holding some files open, then umount should have failed with EBUSY, no? Does Android use mount namespaces at all? - Ted
On 09/15, Theodore Ts'o wrote: > On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote: > > > > So, I digged it in more detail, and found, in drivers/android/binder.c [1], > > - binder_ioctl() > > - create a kernel thread > > - zombie_cleanup_check() > > - binder_defer_work() > > - queue_work(..., &binder_deferred_work); > > > > - binder_deferred_func() > > - binder_clear_zombies() > > - binder_proc_clear_zombies() > > - put_files_struct() > > - close_files() > > - filp_close() > > - fput() > > > > It seems binder holds some proc files. > > If binder was holding some files open, then umount should have failed > with EBUSY, no? Based on what I've got some traces so far, - binder_ioctl - create a kernel thread - zombie_cleanup_check - binder_defer_work - queue_work(..., &binder_deferred_work); - binder_deferred_func - binder_clear_zombies - binder_proc_clear_zombies - put_files_struct - close_files - filp_close - fput - delayed_fput ... - file_free - dput init - umount - mntput - mntput_no_expire - do_umount - mnt_get_count() > 2 - mntput_no_expire - mnt_add_count(-1); - mnt_add_count(-1); - mnt_get_count() return; - return 0; - delayed_mntput_work - device_shutdown - ext4_put_super() - EIO, and panic if error=panic The mntput() in delayed_fput() is the last function call. So before that moment, sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not sure why it check over 2 tho. > > Does Android use mount namespaces at all? > > - Ted
On Fri, Sep 15, 2017 at 04:29:11PM -0700, Jaegeuk Kim wrote: > The mntput() in delayed_fput() is the last function call. So before that moment, > sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not > sure why it check over 2 tho. Because it has just grabbed a reference itself, in addition to the one that keeps the damn thing alive (due to being mounted). So it bloody well should have triggered -EBUSY, if they refer to the same vfsmount.
On Sat, Sep 16, 2017 at 1:12 AM, Theodore Ts'o <tytso@mit.edu> wrote: > On Fri, Sep 15, 2017 at 11:44:33AM -0700, Jaegeuk Kim wrote: >> >> So, I digged it in more detail, and found, in drivers/android/binder.c [1], >> - binder_ioctl() >> - create a kernel thread >> - zombie_cleanup_check() >> - binder_defer_work() >> - queue_work(..., &binder_deferred_work); >> >> - binder_deferred_func() >> - binder_clear_zombies() >> - binder_proc_clear_zombies() >> - put_files_struct() >> - close_files() >> - filp_close() >> - fput() >> >> It seems binder holds some proc files. > > If binder was holding some files open, then umount should have failed > with EBUSY, no? > > Does Android use mount namespaces at all? > Extensively. Every user (i.e. from multi user) has its own mount ns with private /data Every app has its own mount ns, with /sdcard mounted to one of 3 FUSE sdcard mounts depending of app storage permission (none, rdonly, rdwr). Amir.
On 09/16, Al Viro wrote: > On Fri, Sep 15, 2017 at 04:29:11PM -0700, Jaegeuk Kim wrote: > > > The mntput() in delayed_fput() is the last function call. So before that moment, > > sys_umount() may see mnt_get_count() as 2, so it avoids EBUSY condition. I'm not > > sure why it check over 2 tho. > > Because it has just grabbed a reference itself, in addition to the one that keeps > the damn thing alive (due to being mounted). So it bloody well should have > triggered -EBUSY, if they refer to the same vfsmount. I've tracked another view in terms of mnt_get_count() and sb->s_active based on namespaces, and could get the below scenario for instance. Term: namespace(mnt_get_count()) 1. create_new_namespaces() creates ns1 and ns2, /data(1) ns1(1) ns2(1) | | | --------------------- | sb->s_active = 3 2. after binder_proc_clear_zombies() for ns2 and ns1 triggers - delayed_fput() - delayed_mntput_work(ns2) /data(1) ns1(1) | | ---------- | sb->s_active = 2 3. umount() for /data is successed. ns1(1) | sb->s_active = 1 4. device_shutdown() by init 5. - delayed_mntput_work(ns1) - put_super(), since sb->s_active = 0 - -EIO Please let me know, if I'm missing something. Thanks,
diff --git a/fs/namespace.c b/fs/namespace.c index f8893dc6a989..b1ac89915b10 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -21,6 +21,7 @@ #include <linux/fs_struct.h> /* get_fs_root et.al. */ #include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */ #include <linux/uaccess.h> +#include <linux/file.h> #include <linux/proc_ns.h> #include <linux/magic.h> #include <linux/bootmem.h> @@ -1629,7 +1630,8 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags) int retval; int lookup_flags = 0; - if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW)) + if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW | + UMOUNT_WAIT)) return -EINVAL; if (!may_mount()) @@ -1652,12 +1654,20 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags) retval = -EPERM; if (flags & MNT_FORCE && !capable(CAP_SYS_ADMIN)) goto dput_and_out; + if (flags & UMOUNT_WAIT) + flush_delayed_fput(); retval = do_umount(mnt, flags); dput_and_out: /* we mustn't call path_put() as that would clear mnt_expiry_mark */ dput(path.dentry); mntput_no_expire(mnt); + if (!retval && (flags & UMOUNT_WAIT)) { + if (likely(!(current->flags & PF_KTHREAD))) + task_work_run(); + else + flush_scheduled_work(); + } out: return retval; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 83341a6a553e..cb62af7a03e7 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1283,6 +1283,7 @@ struct mm_struct; #define MNT_DETACH 0x00000002 /* Just detach from the tree */ #define MNT_EXPIRE 0x00000004 /* Mark for expiry */ #define UMOUNT_NOFOLLOW 0x00000008 /* Don't follow symlink on umount */ +#define UMOUNT_WAIT 0x00000010 /* Wait to unmount completely */ #define UMOUNT_UNUSED 0x80000000 /* Flag guaranteed to be unused */ /* sb->s_iflags */
This patch introduces UMOUNT_WAIT flag for umount(2) which let user wait for its completion. This would fix a kernel panic caused by block device access by filesystem, after device_shutdown during kernel_restart. This can happen due to delayed umount -- reboot process already succeeded to unmount filesystem, but its instance is sitll alive. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> --- fs/namespace.c | 12 +++++++++++- include/linux/fs.h | 1 + 2 files changed, 12 insertions(+), 1 deletion(-)