Message ID | 20190404105255.12189-1-amir73il@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] acct: fix possible deadlock in acct_pin_kill | expand |
On Thu, Apr 04, 2019 at 01:52:55PM +0300, Amir Goldstein wrote: > This looks like an old bug, pre-dating the "Fixes" commit, but the > "Fixes" commit did not handle it properly. > > The bug recently surfaced as a lockdep possible deadlock warning > with commit d1d04ef8572b ("ovl: stack file ops"). > > When acct_on() replaces one acct file with another, it takes sb_writers > lock on new file sb and calls acct_pin_kill(old) before releasing the > sb_writers lock. > > If new file is on the same fs as old file, acct_pin_kill(old) fail to > file_start_write_trylock() and skip writing the old file, because > sb_writers (of new) is already taken by acct_on(). > > If new file is not on same fs as old file, this ordering violation > creates an unneeded dependency between new sb_writers and old sb_writers, > which may later be reported as possible deadlock. > > This could result in an actual deadlock if acct file is replaced from > an old file in overlayfs over "real fs" to a new file in "real fs". > acct_on() takes freeze protection on "real fs" and tries to write to > overlayfs file. overlayfs is not freeze protected so do_acct_process() > can carry on with __kernel_write() to overlayfs file, which would > try to take freeze protection on "real fs" and deadlock. Huh? sb_writers is taken when we *open* the new file. Then we replace its ->path.mnt with a clone and transfer the write count from the original to new one. And close the old file while we are at it. From sb_writers POV mainline has sb_start_write(new_sb) // in open sb_start_write(new_sb) // mnt_want_write() on clone last write to old_sb, then sb_end_write(old_sb) // acct_pin_kill() sb_end_write(new_sb) // mnt_drop_write(mnt) and you flip the order of the last two lines. Could you explain how exactly does your patch help whatever problem overlayfs has?
On Thu, Apr 04, 2019 at 07:44:48PM +0100, Al Viro wrote: > Huh? sb_writers is taken when we *open* the new file. Then we replace > its ->path.mnt with a clone and transfer the write count from the original > to new one. And close the old file while we are at it. > > >From sb_writers POV mainline has > sb_start_write(new_sb) // in open > sb_start_write(new_sb) // mnt_want_write() on clone > last write to old_sb, then sb_end_write(old_sb) // acct_pin_kill() > sb_end_write(new_sb) // mnt_drop_write(mnt) > and you flip the order of the last two lines. > > Could you explain how exactly does your patch help whatever > problem overlayfs has? Ow... Sorry, -ENOCOFFEE... Let me wake up properly and look at that thing again.
On Thu, Apr 04, 2019 at 07:49:44PM +0100, Al Viro wrote: > On Thu, Apr 04, 2019 at 07:44:48PM +0100, Al Viro wrote: > > > Huh? sb_writers is taken when we *open* the new file. Then we replace > > its ->path.mnt with a clone and transfer the write count from the original > > to new one. And close the old file while we are at it. > > > > >From sb_writers POV mainline has > > sb_start_write(new_sb) // in open > > sb_start_write(new_sb) // mnt_want_write() on clone > > last write to old_sb, then sb_end_write(old_sb) // acct_pin_kill() > > sb_end_write(new_sb) // mnt_drop_write(mnt) > > and you flip the order of the last two lines. > > > > Could you explain how exactly does your patch help whatever > > problem overlayfs has? > > Ow... Sorry, -ENOCOFFEE... > > Let me wake up properly and look at that thing again. OK, so... My first reaction had been complete BS. However, the same goes for your analysis - it's not an ordering problem at all. What happens is that we are replacing file->path.mnt with a clone and we want the write count contribution (file is opened for write) to be transferred. That's it. We do *NOT* want any kind of freeze protection for the duration of switchover. IOW, the solution is to switch to __mnt_{want,drop}_write() for that switchover; we don't want to mess with freeze protection at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> --- diff --git a/fs/internal.h b/fs/internal.h index 6a8b71643af4..2e7362837a6e 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -89,9 +89,7 @@ extern int sb_prepare_remount_readonly(struct super_block *); extern void __init mnt_init(void); -extern int __mnt_want_write(struct vfsmount *); extern int __mnt_want_write_file(struct file *); -extern void __mnt_drop_write(struct vfsmount *); extern void __mnt_drop_write_file(struct file *); /* diff --git a/include/linux/mount.h b/include/linux/mount.h index 9197ddbf35fb..bf8cc4108b8f 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -87,6 +87,8 @@ extern bool mnt_may_suid(struct vfsmount *mnt); struct path; extern struct vfsmount *clone_private_mount(const struct path *path); +extern int __mnt_want_write(struct vfsmount *); +extern void __mnt_drop_write(struct vfsmount *); struct file_system_type; extern struct vfsmount *fc_mount(struct fs_context *fc); diff --git a/kernel/acct.c b/kernel/acct.c index addf7732fb56..81f9831a7859 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -227,7 +227,7 @@ static int acct_on(struct filename *pathname) filp_close(file, NULL); return PTR_ERR(internal); } - err = mnt_want_write(internal); + err = __mnt_want_write(internal); if (err) { mntput(internal); kfree(acct); @@ -252,7 +252,7 @@ static int acct_on(struct filename *pathname) old = xchg(&ns->bacct, &acct->pin); mutex_unlock(&acct->lock); pin_kill(old); - mnt_drop_write(mnt); + __mnt_drop_write(mnt); mntput(mnt); return 0; }
On Thu, Apr 04, 2019 at 01:52:55PM +0300, Amir Goldstein wrote: > If new file is on the same fs as old file, acct_pin_kill(old) fail to > file_start_write_trylock() and skip writing the old file, because > sb_writers (of new) is already taken by acct_on(). The above is BS, BTW. sb_start_write() does *not* make file_start_write_trylock() to fail. It's basically percpu_down_read() vs. percpu_down_read_trylock(). sb_wait_write() would have file_start_write_trylock() fail (as it should - its caller is freeze_super(); we want an exclusion with attempts to start extra writes there). sb_start_write() sure as hell doesn't - if it would have, we would get its failures from things like e.g. truncate(2) somewhere on the same fs. We don't want to mess with anything freeze-related in acct_on(), but the bug you are refering to in this part really doesn't exist.
On Thu, Apr 4, 2019 at 10:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > ... > OK, so... My first reaction had been complete BS. However, the > same goes for your analysis - it's not an ordering problem at all. > What happens is that we are replacing file->path.mnt with a clone > and we want the write count contribution (file is opened for write) > to be transferred. That's it. We do *NOT* want any kind of > freeze protection for the duration of switchover. > > IOW, the solution is to switch to __mnt_{want,drop}_write() for that > switchover; we don't want to mess with freeze protection at all. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cool. That works for me. Thanks for setting this straight. You may add: Tested-by: Amir Goldstein <amir73il@gmail.com> Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com Thanks, Amir. > --- > diff --git a/fs/internal.h b/fs/internal.h > index 6a8b71643af4..2e7362837a6e 100644 > --- a/fs/internal.h > +++ b/fs/internal.h > @@ -89,9 +89,7 @@ extern int sb_prepare_remount_readonly(struct super_block *); > > extern void __init mnt_init(void); > > -extern int __mnt_want_write(struct vfsmount *); > extern int __mnt_want_write_file(struct file *); > -extern void __mnt_drop_write(struct vfsmount *); > extern void __mnt_drop_write_file(struct file *); > > /* > diff --git a/include/linux/mount.h b/include/linux/mount.h > index 9197ddbf35fb..bf8cc4108b8f 100644 > --- a/include/linux/mount.h > +++ b/include/linux/mount.h > @@ -87,6 +87,8 @@ extern bool mnt_may_suid(struct vfsmount *mnt); > > struct path; > extern struct vfsmount *clone_private_mount(const struct path *path); > +extern int __mnt_want_write(struct vfsmount *); > +extern void __mnt_drop_write(struct vfsmount *); > > struct file_system_type; > extern struct vfsmount *fc_mount(struct fs_context *fc); > diff --git a/kernel/acct.c b/kernel/acct.c > index addf7732fb56..81f9831a7859 100644 > --- a/kernel/acct.c > +++ b/kernel/acct.c > @@ -227,7 +227,7 @@ static int acct_on(struct filename *pathname) > filp_close(file, NULL); > return PTR_ERR(internal); > } > - err = mnt_want_write(internal); > + err = __mnt_want_write(internal); > if (err) { > mntput(internal); > kfree(acct); > @@ -252,7 +252,7 @@ static int acct_on(struct filename *pathname) > old = xchg(&ns->bacct, &acct->pin); > mutex_unlock(&acct->lock); > pin_kill(old); > - mnt_drop_write(mnt); > + __mnt_drop_write(mnt); > mntput(mnt); > return 0; > }
On Thu, Apr 4, 2019 at 10:33 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Thu, Apr 4, 2019 at 10:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > ... > > OK, so... My first reaction had been complete BS. However, the > > same goes for your analysis - it's not an ordering problem at all. > > What happens is that we are replacing file->path.mnt with a clone > > and we want the write count contribution (file is opened for write) > > to be transferred. That's it. We do *NOT* want any kind of > > freeze protection for the duration of switchover. > > > > IOW, the solution is to switch to __mnt_{want,drop}_write() for that > > switchover; we don't want to mess with freeze protection at all. > > > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > > Cool. That works for me. > Thanks for setting this straight. > > You may add: > Tested-by: Amir Goldstein <amir73il@gmail.com> > Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com > > Thanks, > Amir. > > > --- > > diff --git a/fs/internal.h b/fs/internal.h > > index 6a8b71643af4..2e7362837a6e 100644 > > --- a/fs/internal.h > > +++ b/fs/internal.h > > @@ -89,9 +89,7 @@ extern int sb_prepare_remount_readonly(struct super_block *); > > > > extern void __init mnt_init(void); > > > > -extern int __mnt_want_write(struct vfsmount *); > > extern int __mnt_want_write_file(struct file *); > > -extern void __mnt_drop_write(struct vfsmount *); > > extern void __mnt_drop_write_file(struct file *); > > > > /* > > diff --git a/include/linux/mount.h b/include/linux/mount.h > > index 9197ddbf35fb..bf8cc4108b8f 100644 > > --- a/include/linux/mount.h > > +++ b/include/linux/mount.h > > @@ -87,6 +87,8 @@ extern bool mnt_may_suid(struct vfsmount *mnt); > > > > struct path; > > extern struct vfsmount *clone_private_mount(const struct path *path); > > +extern int __mnt_want_write(struct vfsmount *); > > +extern void __mnt_drop_write(struct vfsmount *); > > Al, One minor nit. If you place these function declarations by their definition order above their wrappers, it would be nicer + patch should apply cleanly to stable v4.4+. As it is, it applies with a minor conflict. Just a suggestion. Thanks, Amir.
diff --git a/kernel/acct.c b/kernel/acct.c index addf7732fb56..c09355a7ae46 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -251,8 +251,8 @@ static int acct_on(struct filename *pathname) rcu_read_lock(); old = xchg(&ns->bacct, &acct->pin); mutex_unlock(&acct->lock); - pin_kill(old); mnt_drop_write(mnt); + pin_kill(old); mntput(mnt); return 0; }
This looks like an old bug, pre-dating the "Fixes" commit, but the "Fixes" commit did not handle it properly. The bug recently surfaced as a lockdep possible deadlock warning with commit d1d04ef8572b ("ovl: stack file ops"). When acct_on() replaces one acct file with another, it takes sb_writers lock on new file sb and calls acct_pin_kill(old) before releasing the sb_writers lock. If new file is on the same fs as old file, acct_pin_kill(old) fail to file_start_write_trylock() and skip writing the old file, because sb_writers (of new) is already taken by acct_on(). If new file is not on same fs as old file, this ordering violation creates an unneeded dependency between new sb_writers and old sb_writers, which may later be reported as possible deadlock. This could result in an actual deadlock if acct file is replaced from an old file in overlayfs over "real fs" to a new file in "real fs". acct_on() takes freeze protection on "real fs" and tries to write to overlayfs file. overlayfs is not freeze protected so do_acct_process() can carry on with __kernel_write() to overlayfs file, which would try to take freeze protection on "real fs" and deadlock. Reproducer of lockdep possible deadlock warning: mkdir -p mnt upper lower work touch upper/x upper/y mount -t overlay none mnt -olowerdir=lower,upperdir=upper,workdir=work accton mnt/x accton upper/y ====================================================== WARNING: possible circular locking dependency detected 4.19.0-rc1-xfstests #3424 Not tainted ------------------------------------------------------ accton/1390 is trying to acquire lock: 00000000e0585aa5 (&acct->lock#2){+.+.}, at: acct_pin_kill+0x1b/0x76 but task is already holding lock: 000000003692505a (sb_writers#6){.+.+}, at: mnt_want_write+0x1d/0x42 Reported-and-tested-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com Fixes: 59eda0e07f43 ("new fs_pin killing logics") Signed-off-by: Amir Goldstein <amir73il@gmail.com> --- Hi Al, I've pinged you several times about this patch. Not sure if you had missed it or haven't had the time to look at it yet. syzbot has been complaining about the bug for a while now. Fixes label claims to fix your commit, but I believe the bug was already there before your commit. As you can see, I have a reproducer to demonstrate the manifestation of the bug in the case of switching acct file from overlayfs to real fs. This started to manifest with overlayfs stacked f_op. I have made another claim about correctness of acct_on() when switching files on the same fs which seems obvious from the code, but did not bother to try to reproducer, because I doubt if anyone cares. Thanks, Amir. Changes from v1: - Add Reported-and-tested-by syzbot kernel/acct.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)