[v2] acct: fix possible deadlock in acct_pin_kill
diff mbox series

Message ID 20190404105255.12189-1-amir73il@gmail.com
State New
Headers show
Series
  • [v2] acct: fix possible deadlock in acct_pin_kill
Related show

Commit Message

Amir Goldstein April 4, 2019, 10:52 a.m. UTC
This looks like an old bug, pre-dating the "Fixes" commit, but the
"Fixes" commit did not handle it properly.

The bug recently surfaced as a lockdep possible deadlock warning
with commit d1d04ef8572b ("ovl: stack file ops").

When acct_on() replaces one acct file with another, it takes sb_writers
lock on new file sb and calls acct_pin_kill(old) before releasing the
sb_writers lock.

If new file is on the same fs as old file, acct_pin_kill(old) fail to
file_start_write_trylock() and skip writing the old file, because
sb_writers (of new) is already taken by acct_on().

If new file is not on same fs as old file, this ordering violation
creates an unneeded dependency between new sb_writers and old sb_writers,
which may later be reported as possible deadlock.

This could result in an actual deadlock if acct file is replaced from
an old file in overlayfs over "real fs" to a new file in "real fs".
acct_on() takes freeze protection on "real fs" and tries to write to
overlayfs file. overlayfs is not freeze protected so do_acct_process()
can carry on with __kernel_write() to overlayfs file, which would
try to take freeze protection on "real fs" and deadlock.

Reproducer of lockdep possible deadlock warning:

  mkdir -p mnt upper lower work
  touch upper/x upper/y
  mount -t overlay none mnt -olowerdir=lower,upperdir=upper,workdir=work
  accton mnt/x
  accton upper/y

 ======================================================
 WARNING: possible circular locking dependency detected
 4.19.0-rc1-xfstests #3424 Not tainted
 ------------------------------------------------------
 accton/1390 is trying to acquire lock:
 00000000e0585aa5 (&acct->lock#2){+.+.}, at: acct_pin_kill+0x1b/0x76

 but task is already holding lock:
 000000003692505a (sb_writers#6){.+.+}, at: mnt_want_write+0x1d/0x42

Reported-and-tested-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
Fixes: 59eda0e07f43 ("new fs_pin killing logics")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi Al,

I've pinged you several times about this patch.
Not sure if you had missed it or haven't had the time to look at
it yet. syzbot has been complaining about the bug for a while now.

Fixes label claims to fix your commit, but I believe the bug was
already there before your commit.

As you can see, I have a reproducer to demonstrate the manifestation of
the bug in the case of switching acct file from overlayfs to real fs.
This started to manifest with overlayfs stacked f_op.

I have made another claim about correctness of acct_on() when switching
files on the same fs which seems obvious from the code, but did not
bother to try to reproducer, because I doubt if anyone cares.

Thanks,
Amir.

Changes from v1:
- Add Reported-and-tested-by syzbot

 kernel/acct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Al Viro April 4, 2019, 6:44 p.m. UTC | #1
On Thu, Apr 04, 2019 at 01:52:55PM +0300, Amir Goldstein wrote:
> This looks like an old bug, pre-dating the "Fixes" commit, but the
> "Fixes" commit did not handle it properly.
> 
> The bug recently surfaced as a lockdep possible deadlock warning
> with commit d1d04ef8572b ("ovl: stack file ops").
> 
> When acct_on() replaces one acct file with another, it takes sb_writers
> lock on new file sb and calls acct_pin_kill(old) before releasing the
> sb_writers lock.
>
> If new file is on the same fs as old file, acct_pin_kill(old) fail to
> file_start_write_trylock() and skip writing the old file, because
> sb_writers (of new) is already taken by acct_on().
> 
> If new file is not on same fs as old file, this ordering violation
> creates an unneeded dependency between new sb_writers and old sb_writers,
> which may later be reported as possible deadlock.
> 
> This could result in an actual deadlock if acct file is replaced from
> an old file in overlayfs over "real fs" to a new file in "real fs".
> acct_on() takes freeze protection on "real fs" and tries to write to
> overlayfs file. overlayfs is not freeze protected so do_acct_process()
> can carry on with __kernel_write() to overlayfs file, which would
> try to take freeze protection on "real fs" and deadlock.

Huh?  sb_writers is taken when we *open* the new file.  Then we replace
its ->path.mnt with a clone and transfer the write count from the original
to new one.  And close the old file while we are at it.

From sb_writers POV mainline has
	sb_start_write(new_sb)	// in open
	sb_start_write(new_sb)	// mnt_want_write() on clone
	last write to old_sb, then sb_end_write(old_sb) // acct_pin_kill()
	sb_end_write(new_sb)	// mnt_drop_write(mnt)
and you flip the order of the last two lines.

Could you explain how exactly does your patch help whatever
problem overlayfs has?
Al Viro April 4, 2019, 6:49 p.m. UTC | #2
On Thu, Apr 04, 2019 at 07:44:48PM +0100, Al Viro wrote:

> Huh?  sb_writers is taken when we *open* the new file.  Then we replace
> its ->path.mnt with a clone and transfer the write count from the original
> to new one.  And close the old file while we are at it.
> 
> >From sb_writers POV mainline has
> 	sb_start_write(new_sb)	// in open
> 	sb_start_write(new_sb)	// mnt_want_write() on clone
> 	last write to old_sb, then sb_end_write(old_sb) // acct_pin_kill()
> 	sb_end_write(new_sb)	// mnt_drop_write(mnt)
> and you flip the order of the last two lines.
> 
> Could you explain how exactly does your patch help whatever
> problem overlayfs has?

Ow...  Sorry, -ENOCOFFEE...

Let me wake up properly and look at that thing again.
Al Viro April 4, 2019, 7:05 p.m. UTC | #3
On Thu, Apr 04, 2019 at 07:49:44PM +0100, Al Viro wrote:
> On Thu, Apr 04, 2019 at 07:44:48PM +0100, Al Viro wrote:
> 
> > Huh?  sb_writers is taken when we *open* the new file.  Then we replace
> > its ->path.mnt with a clone and transfer the write count from the original
> > to new one.  And close the old file while we are at it.
> > 
> > >From sb_writers POV mainline has
> > 	sb_start_write(new_sb)	// in open
> > 	sb_start_write(new_sb)	// mnt_want_write() on clone
> > 	last write to old_sb, then sb_end_write(old_sb) // acct_pin_kill()
> > 	sb_end_write(new_sb)	// mnt_drop_write(mnt)
> > and you flip the order of the last two lines.
> > 
> > Could you explain how exactly does your patch help whatever
> > problem overlayfs has?
> 
> Ow...  Sorry, -ENOCOFFEE...
> 
> Let me wake up properly and look at that thing again.

OK, so...  My first reaction had been complete BS.  However, the
same goes for your analysis - it's not an ordering problem at all.
What happens is that we are replacing file->path.mnt with a clone
and we want the write count contribution (file is opened for write)
to be transferred.  That's it.  We do *NOT* want any kind of
freeze protection for the duration of switchover.

IOW, the solution is to switch to __mnt_{want,drop}_write() for that
switchover; we don't want to mess with freeze protection at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/internal.h b/fs/internal.h
index 6a8b71643af4..2e7362837a6e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -89,9 +89,7 @@ extern int sb_prepare_remount_readonly(struct super_block *);
 
 extern void __init mnt_init(void);
 
-extern int __mnt_want_write(struct vfsmount *);
 extern int __mnt_want_write_file(struct file *);
-extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 
 /*
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 9197ddbf35fb..bf8cc4108b8f 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -87,6 +87,8 @@ extern bool mnt_may_suid(struct vfsmount *mnt);
 
 struct path;
 extern struct vfsmount *clone_private_mount(const struct path *path);
+extern int __mnt_want_write(struct vfsmount *);
+extern void __mnt_drop_write(struct vfsmount *);
 
 struct file_system_type;
 extern struct vfsmount *fc_mount(struct fs_context *fc);
diff --git a/kernel/acct.c b/kernel/acct.c
index addf7732fb56..81f9831a7859 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -227,7 +227,7 @@ static int acct_on(struct filename *pathname)
 		filp_close(file, NULL);
 		return PTR_ERR(internal);
 	}
-	err = mnt_want_write(internal);
+	err = __mnt_want_write(internal);
 	if (err) {
 		mntput(internal);
 		kfree(acct);
@@ -252,7 +252,7 @@ static int acct_on(struct filename *pathname)
 	old = xchg(&ns->bacct, &acct->pin);
 	mutex_unlock(&acct->lock);
 	pin_kill(old);
-	mnt_drop_write(mnt);
+	__mnt_drop_write(mnt);
 	mntput(mnt);
 	return 0;
 }
Al Viro April 4, 2019, 7:19 p.m. UTC | #4
On Thu, Apr 04, 2019 at 01:52:55PM +0300, Amir Goldstein wrote:

> If new file is on the same fs as old file, acct_pin_kill(old) fail to
> file_start_write_trylock() and skip writing the old file, because
> sb_writers (of new) is already taken by acct_on().

	The above is BS, BTW.  sb_start_write() does *not*
make file_start_write_trylock() to fail.  It's basically
percpu_down_read() vs. percpu_down_read_trylock().

sb_wait_write() would have file_start_write_trylock() fail (as it
should - its caller is freeze_super(); we want an exclusion with
attempts to start extra writes there).  sb_start_write() sure as
hell doesn't - if it would have, we would get its failures from
things like e.g. truncate(2) somewhere on the same fs.

We don't want to mess with anything freeze-related in acct_on(), but
the bug you are refering to in this part really doesn't exist.
Amir Goldstein April 4, 2019, 7:33 p.m. UTC | #5
On Thu, Apr 4, 2019 at 10:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
...
> OK, so...  My first reaction had been complete BS.  However, the
> same goes for your analysis - it's not an ordering problem at all.
> What happens is that we are replacing file->path.mnt with a clone
> and we want the write count contribution (file is opened for write)
> to be transferred.  That's it.  We do *NOT* want any kind of
> freeze protection for the duration of switchover.
>
> IOW, the solution is to switch to __mnt_{want,drop}_write() for that
> switchover; we don't want to mess with freeze protection at all.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Cool. That works for me.
Thanks for setting this straight.

You may add:
Tested-by: Amir Goldstein <amir73il@gmail.com>
Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com

Thanks,
Amir.

> ---
> diff --git a/fs/internal.h b/fs/internal.h
> index 6a8b71643af4..2e7362837a6e 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -89,9 +89,7 @@ extern int sb_prepare_remount_readonly(struct super_block *);
>
>  extern void __init mnt_init(void);
>
> -extern int __mnt_want_write(struct vfsmount *);
>  extern int __mnt_want_write_file(struct file *);
> -extern void __mnt_drop_write(struct vfsmount *);
>  extern void __mnt_drop_write_file(struct file *);
>
>  /*
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 9197ddbf35fb..bf8cc4108b8f 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -87,6 +87,8 @@ extern bool mnt_may_suid(struct vfsmount *mnt);
>
>  struct path;
>  extern struct vfsmount *clone_private_mount(const struct path *path);
> +extern int __mnt_want_write(struct vfsmount *);
> +extern void __mnt_drop_write(struct vfsmount *);
>
>  struct file_system_type;
>  extern struct vfsmount *fc_mount(struct fs_context *fc);
> diff --git a/kernel/acct.c b/kernel/acct.c
> index addf7732fb56..81f9831a7859 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -227,7 +227,7 @@ static int acct_on(struct filename *pathname)
>                 filp_close(file, NULL);
>                 return PTR_ERR(internal);
>         }
> -       err = mnt_want_write(internal);
> +       err = __mnt_want_write(internal);
>         if (err) {
>                 mntput(internal);
>                 kfree(acct);
> @@ -252,7 +252,7 @@ static int acct_on(struct filename *pathname)
>         old = xchg(&ns->bacct, &acct->pin);
>         mutex_unlock(&acct->lock);
>         pin_kill(old);
> -       mnt_drop_write(mnt);
> +       __mnt_drop_write(mnt);
>         mntput(mnt);
>         return 0;
>  }
Amir Goldstein April 11, 2019, 7:10 p.m. UTC | #6
On Thu, Apr 4, 2019 at 10:33 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Apr 4, 2019 at 10:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> ...
> > OK, so...  My first reaction had been complete BS.  However, the
> > same goes for your analysis - it's not an ordering problem at all.
> > What happens is that we are replacing file->path.mnt with a clone
> > and we want the write count contribution (file is opened for write)
> > to be transferred.  That's it.  We do *NOT* want any kind of
> > freeze protection for the duration of switchover.
> >
> > IOW, the solution is to switch to __mnt_{want,drop}_write() for that
> > switchover; we don't want to mess with freeze protection at all.
> >
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> Cool. That works for me.
> Thanks for setting this straight.
>
> You may add:
> Tested-by: Amir Goldstein <amir73il@gmail.com>
> Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
>
> Thanks,
> Amir.
>
> > ---
> > diff --git a/fs/internal.h b/fs/internal.h
> > index 6a8b71643af4..2e7362837a6e 100644
> > --- a/fs/internal.h
> > +++ b/fs/internal.h
> > @@ -89,9 +89,7 @@ extern int sb_prepare_remount_readonly(struct super_block *);
> >
> >  extern void __init mnt_init(void);
> >
> > -extern int __mnt_want_write(struct vfsmount *);
> >  extern int __mnt_want_write_file(struct file *);
> > -extern void __mnt_drop_write(struct vfsmount *);
> >  extern void __mnt_drop_write_file(struct file *);
> >
> >  /*
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 9197ddbf35fb..bf8cc4108b8f 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -87,6 +87,8 @@ extern bool mnt_may_suid(struct vfsmount *mnt);
> >
> >  struct path;
> >  extern struct vfsmount *clone_private_mount(const struct path *path);
> > +extern int __mnt_want_write(struct vfsmount *);
> > +extern void __mnt_drop_write(struct vfsmount *);
> >

Al,

One minor nit.

If you place these function declarations by their definition order
above their wrappers, it would be nicer + patch should apply
cleanly to stable v4.4+. As it is, it applies with a minor conflict.

Just a suggestion.

Thanks,
Amir.

Patch
diff mbox series

diff --git a/kernel/acct.c b/kernel/acct.c
index addf7732fb56..c09355a7ae46 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -251,8 +251,8 @@  static int acct_on(struct filename *pathname)
 	rcu_read_lock();
 	old = xchg(&ns->bacct, &acct->pin);
 	mutex_unlock(&acct->lock);
-	pin_kill(old);
 	mnt_drop_write(mnt);
+	pin_kill(old);
 	mntput(mnt);
 	return 0;
 }