[v2] Btrfs: fix missing delayed iputs on unmount

Message ID 5d98091d3e089b4f74cb61fb2ed691e1f4dd1d6b.1541005462.git.osandov@fb.com

Commit Message

Omar Sandoval Oct. 31, 2018, 5:06 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes from v1:

- Add a comment explaining why it needs to be a kthread_park(), not
  kthread_stop()
- Update later comment now that the cleaner thread is definitely stopped

 fs/btrfs/disk-io.c | 51 ++++++++++++++--------------------------------
 1 file changed, 15 insertions(+), 36 deletions(-)

Comments

David Sterba Nov. 1, 2018, 10:15 a.m. UTC | #1
On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
> 
> Fix it by parking the cleaner

Ouch, that's IMO the wrong way to fix it. The bug is on a higher level:
we're missing a commit or a cleanup of data structures. Messing with the
state of a thread would be the last thing I'd try, after proving that it's
not possible to fix in the logic of btrfs itself.

The shutdown sequence in close_ctree() is quite tricky and we've had bugs
there. The interdependencies between threads, data structures, and other
subsystems must not form loops for which we cannot find an ordering that
avoids leaking something.

It's not a big problem if some step is done more than once, like
committing or cleaning up some other structures, if we know that an
earlier step could create new work.
Chris Mason Nov. 1, 2018, 1:31 p.m. UTC | #2
On 1 Nov 2018, at 6:15, David Sterba wrote:

> On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> There's a race between close_ctree() and cleaner_kthread().
>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>> sees it set, but this is racy; the cleaner might have already checked
>> the bit and could be cleaning stuff. In particular, if it deletes 
>> unused
>> block groups, it will create delayed iputs for the free space cache
>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>> longer running delayed iputs after a commit. Therefore, if the 
>> cleaner
>> creates more delayed iputs after delayed iputs are run in
>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>> inode crash from the VFS.
>>
>> Fix it by parking the cleaner
>
> Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> we're missing a commit or clean up data structures. Messing with state
> of a thread would be the last thing I'd try after proving that it's 
> not
> possible to fix in the logic of btrfs itself.
>
> The shutdown sequence in close_tree is quite tricky and we've had bugs
> there. The interdependencies of thread and data structures and other
> subsystems cannot have loops that could not find an ordering that will
> not leak something.
>
> It's not a big problem if some step is done more than once, like
> committing or cleaning up some other structures if we know that
> it could create new.

The problem is the cleaner thread needs to be told to stop doing new 
work, and we need to wait for the work it's already doing to be 
finished.  We're getting "stop doing new work" already because the 
cleaner thread checks to see if the FS is closing, but we don't have a 
way today to wait for him to finish what he's already doing.

kthread_park() is basically the same as adding another mutex or 
synchronization point.  I'm not sure how we could change close_ctree() or 
the final commit to pick this up more effectively?

-chris
Nikolay Borisov Nov. 1, 2018, 2:35 p.m. UTC | #3
On 31.10.18 г. 19:06 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
> 
> Fix it by parking the cleaner before we actually close anything. Then,
> any remaining delayed iputs will always be handled in
> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> is really the last commit, so we can get rid of the commit in
> cleaner_kthread().
> 
> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Also I believe this patch renders the wake_up_process() in
btrfs_commit_super() a no-op, so it can also be removed, which leaves a
single place that could wake up the cleaner - transaction_kthread.

So can't we stop the transaction and cleaner threads right after setting
CLOSING_FS, and commit the transaction in close_ctree() whenever we deem
necessary (in btrfs_commit_super(), for example)?

> ---
> Changes from v1:
> 
> - Add a comment explaining why it needs to be a kthread_park(), not
>   kthread_stop()
> - Update later comment now that the cleaner thread is definitely stopped
> 
>  fs/btrfs/disk-io.c | 51 ++++++++++++++--------------------------------
>  1 file changed, 15 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b0ab41da91d1..40bcc45d827d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	int again;
> -	struct btrfs_trans_handle *trans;
>  
> -	do {
> +	while (1) {
>  		again = 0;
>  
>  		/* Make the cleaner go to sleep early. */
> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>  		 */
>  		btrfs_delete_unused_bgs(fs_info);
>  sleep:
> +		if (kthread_should_park())
> +			kthread_parkme();
> +		if (kthread_should_stop())
> +			return 0;
>  		if (!again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
> -			if (!kthread_should_stop())
> -				schedule();
> +			schedule();
>  			__set_current_state(TASK_RUNNING);
>  		}
> -	} while (!kthread_should_stop());
> -
> -	/*
> -	 * Transaction kthread is stopped before us and wakes us up.
> -	 * However we might have started a new transaction and COWed some
> -	 * tree blocks when deleting unused block groups for example. So
> -	 * make sure we commit the transaction we started to have a clean
> -	 * shutdown when evicting the btree inode - if it has dirty pages
> -	 * when we do the final iput() on it, eviction will trigger a
> -	 * writeback for it which will fail with null pointer dereferences
> -	 * since work queues and other resources were already released and
> -	 * destroyed by the time the iput/eviction/writeback is made.
> -	 */
> -	trans = btrfs_attach_transaction(root);
> -	if (IS_ERR(trans)) {
> -		if (PTR_ERR(trans) != -ENOENT)
> -			btrfs_err(fs_info,
> -				  "cleaner transaction attach returned %ld",
> -				  PTR_ERR(trans));
> -	} else {
> -		int ret;
> -
> -		ret = btrfs_commit_transaction(trans);
> -		if (ret)
> -			btrfs_err(fs_info,
> -				  "cleaner open transaction commit returned %d",
> -				  ret);
>  	}
> -
> -	return 0;
>  }
>  
>  static int transaction_kthread(void *arg)
> @@ -3931,6 +3904,13 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  	int ret;
>  
>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> +	/*
> +	 * We don't want the cleaner to start new transactions, add more delayed
> +	 * iputs, etc. while we're closing. We can't use kthread_stop() yet
> +	 * because that frees the task_struct, and the transaction kthread might
> +	 * still try to wake up the cleaner.
> +	 */
> +	kthread_park(fs_info->cleaner_kthread);
>  
>  	/* wait for the qgroup rescan worker to stop */
>  	btrfs_qgroup_wait_for_completion(fs_info, false);
> @@ -3958,9 +3938,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  
>  	if (!sb_rdonly(fs_info->sb)) {
>  		/*
> -		 * If the cleaner thread is stopped and there are
> -		 * block groups queued for removal, the deletion will be
> -		 * skipped when we quit the cleaner thread.
> +		 * The cleaner kthread is stopped, so do one final pass over
> +		 * unused block groups.
>  		 */
>  		btrfs_delete_unused_bgs(fs_info);
>  
>
David Sterba Nov. 1, 2018, 3:08 p.m. UTC | #4
On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> On 1 Nov 2018, at 6:15, David Sterba wrote:
> 
> > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> >> From: Omar Sandoval <osandov@fb.com>
> >>
> >> There's a race between close_ctree() and cleaner_kthread().
> >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >> sees it set, but this is racy; the cleaner might have already checked
> >> the bit and could be cleaning stuff. In particular, if it deletes 
> >> unused
> >> block groups, it will create delayed iputs for the free space cache
> >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >> longer running delayed iputs after a commit. Therefore, if the 
> >> cleaner
> >> creates more delayed iputs after delayed iputs are run in
> >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> >> inode crash from the VFS.
> >>
> >> Fix it by parking the cleaner
> >
> > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > we're missing a commit or clean up data structures. Messing with state
> > of a thread would be the last thing I'd try after proving that it's 
> > not
> > possible to fix in the logic of btrfs itself.
> >
> > The shutdown sequence in close_tree is quite tricky and we've had bugs
> > there. The interdependencies of thread and data structures and other
> > subsystems cannot have loops that could not find an ordering that will
> > not leak something.
> >
> > It's not a big problem if some step is done more than once, like
> > committing or cleaning up some other structures if we know that
> > it could create new.
> 
> The problem is the cleaner thread needs to be told to stop doing new 
> work, and we need to wait for the work it's already doing to be 
> finished.  We're getting "stop doing new work" already because the 
> cleaner thread checks to see if the FS is closing, but we don't have a 
> way today to wait for him to finish what he's already doing.
> 
> kthread_park() is basically the same as adding another mutex or 
> synchronization point.  I'm not sure how we could change close_tree() or 
> the final commit to pick this up more effectively?

The idea is:

cleaner                                     close_ctree thread

                                            tell cleaner to stop
                                            wait
start work
if should_stop, then exit
                                            cleaner is stopped

[does not run: finish work]
[does not run: loop]
                                            pick up the work or make
                                            sure there's nothing in
                                            progress anymore


A simplified version in code:

  set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);

  wait for defrag - could be started from cleaner but next iteration will
                    see the fs closed and will not continue

  kthread_stop(transaction_kthread)

  kthread_stop(cleaner_kthread)

  /* next, everything that could be left from cleaner should be finished */

  btrfs_delete_unused_bgs();
  assert there are no defrags
  assert there are no delayed iputs
  commit if necessary

IOW the unused block groups are removed from close_ctree() too early;
moving that after the threads stop AND making sure that it does not need
either of them should work.

The "AND" above is not currently implemented, as btrfs_delete_unused_bgs()
calls plain btrfs_end_transaction(), which wakes up the transaction
kthread, so an argument would need to be passed to tell it to do a full
commit.
Nikolay Borisov Nov. 1, 2018, 3:10 p.m. UTC | #5
On 1.11.18 г. 16:35 ч., Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 19:06 ч., Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> There's a race between close_ctree() and cleaner_kthread().
>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>> sees it set, but this is racy; the cleaner might have already checked
>> the bit and could be cleaning stuff. In particular, if it deletes unused
>> block groups, it will create delayed iputs for the free space cache
>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>> longer running delayed iputs after a commit. Therefore, if the cleaner
>> creates more delayed iputs after delayed iputs are run in
>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>> inode crash from the VFS.
>>
>> Fix it by parking the cleaner before we actually close anything. Then,
>> any remaining delayed iputs will always be handled in
>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>> is really the last commit, so we can get rid of the commit in
>> cleaner_kthread().
>>
>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>> Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> Also I believe this patch renders the wake_up_process in
> btrfs_commit_super a null op so it can also be removed, which leaves a
> single place that could wake up the cleaner - transaction_kthread.
> 
> So can't we stop transaction and cleaner thread right after setting
> CLOSING_FS. And commit the transaction in close_ctree whenever we deem
> necessary (in btrfs_commit_super for example) ?

Ok, that won't work because commit_super is called in other contexts
where it can genuinely wake up the trans kthread, blimey. Why can't we
stop the transaction kthread first thing in close_ctree()?



> 
>> ---
>> Changes from v1:
>>
>> - Add a comment explaining why it needs to be a kthread_park(), not
>>   kthread_stop()
>> - Update later comment now that the cleaner thread is definitely stopped
>>
>>  fs/btrfs/disk-io.c | 51 ++++++++++++++--------------------------------
>>  1 file changed, 15 insertions(+), 36 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index b0ab41da91d1..40bcc45d827d 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>>  	struct btrfs_root *root = arg;
>>  	struct btrfs_fs_info *fs_info = root->fs_info;
>>  	int again;
>> -	struct btrfs_trans_handle *trans;
>>  
>> -	do {
>> +	while (1) {
>>  		again = 0;
>>  
>>  		/* Make the cleaner go to sleep early. */
>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>>  		 */
>>  		btrfs_delete_unused_bgs(fs_info);
>>  sleep:
>> +		if (kthread_should_park())
>> +			kthread_parkme();
>> +		if (kthread_should_stop())
>> +			return 0;
>>  		if (!again) {
>>  			set_current_state(TASK_INTERRUPTIBLE);
>> -			if (!kthread_should_stop())
>> -				schedule();
>> +			schedule();
>>  			__set_current_state(TASK_RUNNING);
>>  		}
>> -	} while (!kthread_should_stop());
>> -
>> -	/*
>> -	 * Transaction kthread is stopped before us and wakes us up.
>> -	 * However we might have started a new transaction and COWed some
>> -	 * tree blocks when deleting unused block groups for example. So
>> -	 * make sure we commit the transaction we started to have a clean
>> -	 * shutdown when evicting the btree inode - if it has dirty pages
>> -	 * when we do the final iput() on it, eviction will trigger a
>> -	 * writeback for it which will fail with null pointer dereferences
>> -	 * since work queues and other resources were already released and
>> -	 * destroyed by the time the iput/eviction/writeback is made.
>> -	 */
>> -	trans = btrfs_attach_transaction(root);
>> -	if (IS_ERR(trans)) {
>> -		if (PTR_ERR(trans) != -ENOENT)
>> -			btrfs_err(fs_info,
>> -				  "cleaner transaction attach returned %ld",
>> -				  PTR_ERR(trans));
>> -	} else {
>> -		int ret;
>> -
>> -		ret = btrfs_commit_transaction(trans);
>> -		if (ret)
>> -			btrfs_err(fs_info,
>> -				  "cleaner open transaction commit returned %d",
>> -				  ret);
>>  	}
>> -
>> -	return 0;
>>  }
>>  
>>  static int transaction_kthread(void *arg)
>> @@ -3931,6 +3904,13 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>>  	int ret;
>>  
>>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>> +	/*
>> +	 * We don't want the cleaner to start new transactions, add more delayed
>> +	 * iputs, etc. while we're closing. We can't use kthread_stop() yet
>> +	 * because that frees the task_struct, and the transaction kthread might
>> +	 * still try to wake up the cleaner.
>> +	 */
>> +	kthread_park(fs_info->cleaner_kthread);
>>  
>>  	/* wait for the qgroup rescan worker to stop */
>>  	btrfs_qgroup_wait_for_completion(fs_info, false);
>> @@ -3958,9 +3938,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>>  
>>  	if (!sb_rdonly(fs_info->sb)) {
>>  		/*
>> -		 * If the cleaner thread is stopped and there are
>> -		 * block groups queued for removal, the deletion will be
>> -		 * skipped when we quit the cleaner thread.
>> +		 * The cleaner kthread is stopped, so do one final pass over
>> +		 * unused block groups.
>>  		 */
>>  		btrfs_delete_unused_bgs(fs_info);
>>  
>>
David Sterba Nov. 1, 2018, 3:22 p.m. UTC | #6
On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > 
> > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > >> From: Omar Sandoval <osandov@fb.com>
> > >>
> > >> There's a race between close_ctree() and cleaner_kthread().
> > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > >> sees it set, but this is racy; the cleaner might have already checked
> > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > >> unused
> > >> block groups, it will create delayed iputs for the free space cache
> > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > >> longer running delayed iputs after a commit. Therefore, if the 
> > >> cleaner
> > >> creates more delayed iputs after delayed iputs are run in
> > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > >> inode crash from the VFS.
> > >>
> > >> Fix it by parking the cleaner
> > >
> > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > we're missing a commit or clean up data structures. Messing with state
> > > of a thread would be the last thing I'd try after proving that it's 
> > > not
> > > possible to fix in the logic of btrfs itself.
> > >
> > > The shutdown sequence in close_tree is quite tricky and we've had bugs
> > > there. The interdependencies of thread and data structures and other
> > > subsystems cannot have loops that could not find an ordering that will
> > > not leak something.
> > >
> > > It's not a big problem if some step is done more than once, like
> > > committing or cleaning up some other structures if we know that
> > > it could create new.
> > 
> > The problem is the cleaner thread needs to be told to stop doing new 
> > work, and we need to wait for the work it's already doing to be 
> > finished.  We're getting "stop doing new work" already because the 
> > cleaner thread checks to see if the FS is closing, but we don't have a 
> > way today to wait for him to finish what he's already doing.
> > 
> > kthread_park() is basically the same as adding another mutex or 
> > synchronization point.  I'm not sure how we could change close_tree() or 
> > the final commit to pick this up more effectively?
> 
> The idea is:
> 
> cleaner                                     close_ctree thread
> 
>                                             tell cleaner to stop
> 					    wait
> start work
> if should_stop, then exit
>                                             cleaner is stopped
> 
> [does not run: finish work]
> [does not run: loop]
>                                             pick up the work or make
> 					    sure there's nothing in
> 					    progress anymore
> 
> 
> A simplified version in code:
> 
>   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> 
>   wait for defrag - could be started from cleaner but next iteration will
> 		    see the fs closed and will not continue
> 
>   kthread_stop(transaction_kthread)
> 
>   kthread_stop(cleaner_kthread)
> 
>   /* next, everything that could be left from cleaner should be finished */
> 
>   btrfs_delete_unused_bgs();
>   assert there are no defrags
>   assert there are no delayed iputs
>   commit if necessary
> 
> IOW the unused block groups are removed from close_ctree too early,
> moving that after the threads stop AND making sure that it does not need
> either of them should work.
> 
> The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> there would need to be an argument passed to tell it to do full commit.

Not perfect: it relies on the fact that wake_up_process(thread) on a
stopped thread is a no-op; arguments would need to be added to skip that
in btrfs_delete_unused_bgs and btrfs_commit_super.

--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3956,11 +3956,18 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
        cancel_work_sync(&fs_info->async_reclaim_work);
 
+       if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
+           test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
+               btrfs_error_commit_super(fs_info);
+
+       kthread_stop(fs_info->transaction_kthread);
+       kthread_stop(fs_info->cleaner_kthread);
+
        if (!sb_rdonly(fs_info->sb)) {
                /*
-                * If the cleaner thread is stopped and there are
-                * block groups queued for removal, the deletion will be
-                * skipped when we quit the cleaner thread.
+                * The cleaner thread is now stopped and if there are block
+                * groups queued for removal, we have to pick up the work here
+                * so there are no delayed iputs triggered.
                 */
                btrfs_delete_unused_bgs(fs_info);
 
@@ -3969,13 +3976,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
                        btrfs_err(fs_info, "commit super ret %d", ret);
        }
 
-       if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
-           test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
-               btrfs_error_commit_super(fs_info);
-
-       kthread_stop(fs_info->transaction_kthread);
-       kthread_stop(fs_info->cleaner_kthread);
-
        ASSERT(list_empty(&fs_info->delayed_iputs));
        set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
Omar Sandoval Nov. 1, 2018, 3:23 p.m. UTC | #7
On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > 
> > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > >> From: Omar Sandoval <osandov@fb.com>
> > >>
> > >> There's a race between close_ctree() and cleaner_kthread().
> > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > >> sees it set, but this is racy; the cleaner might have already checked
> > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > >> unused
> > >> block groups, it will create delayed iputs for the free space cache
> > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > >> longer running delayed iputs after a commit. Therefore, if the 
> > >> cleaner
> > >> creates more delayed iputs after delayed iputs are run in
> > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > >> inode crash from the VFS.
> > >>
> > >> Fix it by parking the cleaner
> > >
> > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > we're missing a commit or clean up data structures. Messing with state
> > > of a thread would be the last thing I'd try after proving that it's 
> > > not
> > > possible to fix in the logic of btrfs itself.
> > >
> > > The shutdown sequence in close_tree is quite tricky and we've had bugs
> > > there. The interdependencies of thread and data structures and other
> > > subsystems cannot have loops that could not find an ordering that will
> > > not leak something.
> > >
> > > It's not a big problem if some step is done more than once, like
> > > committing or cleaning up some other structures if we know that
> > > it could create new.
> > 
> > The problem is the cleaner thread needs to be told to stop doing new 
> > work, and we need to wait for the work it's already doing to be 
> > finished.  We're getting "stop doing new work" already because the 
> > cleaner thread checks to see if the FS is closing, but we don't have a 
> > way today to wait for him to finish what he's already doing.
> > 
> > kthread_park() is basically the same as adding another mutex or 
> > synchronization point.  I'm not sure how we could change close_tree() or 
> > the final commit to pick this up more effectively?
> 
> The idea is:
> 
> cleaner                                     close_ctree thread
> 
>                                             tell cleaner to stop
> 					    wait
> start work
> if should_stop, then exit
>                                             cleaner is stopped
> 
> [does not run: finish work]
> [does not run: loop]
>                                             pick up the work or make
> 					    sure there's nothing in
> 					    progress anymore
> 
> 
> A simplified version in code:
> 
>   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> 
>   wait for defrag - could be started from cleaner but next iteration will
> 		    see the fs closed and will not continue
> 
>   kthread_stop(transaction_kthread)
> 
>   kthread_stop(cleaner_kthread)
> 
>   /* next, everything that could be left from cleaner should be finished */
> 
>   btrfs_delete_unused_bgs();
>   assert there are no defrags
>   assert there are no delayed iputs
>   commit if necessary
> 
> IOW the unused block groups are removed from close_ctree too early,
> moving that after the threads stop AND making sure that it does not need
> either of them should work.
> 
> The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> there would need to be an argument passed to tell it to do full commit.

It's too fragile to run around in the filesystem with these threads
freed. We can probably make it work now, but I'm worried that we'll add
another wakeup somewhere and blow up.
Omar Sandoval Nov. 1, 2018, 3:24 p.m. UTC | #8
On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > 
> > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > >> From: Omar Sandoval <osandov@fb.com>
> > > >>
> > > >> There's a race between close_ctree() and cleaner_kthread().
> > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > >> sees it set, but this is racy; the cleaner might have already checked
> > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > >> unused
> > > >> block groups, it will create delayed iputs for the free space cache
> > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > >> cleaner
> > > >> creates more delayed iputs after delayed iputs are run in
> > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > >> inode crash from the VFS.
> > > >>
> > > >> Fix it by parking the cleaner
> > > >
> > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > we're missing a commit or clean up data structures. Messing with state
> > > > of a thread would be the last thing I'd try after proving that it's 
> > > > not
> > > > possible to fix in the logic of btrfs itself.
> > > >
> > > > The shutdown sequence in close_tree is quite tricky and we've had bugs
> > > > there. The interdependencies of thread and data structures and other
> > > > subsystems cannot have loops that could not find an ordering that will
> > > > not leak something.
> > > >
> > > > It's not a big problem if some step is done more than once, like
> > > > committing or cleaning up some other structures if we know that
> > > > it could create new.
> > > 
> > > The problem is the cleaner thread needs to be told to stop doing new 
> > > work, and we need to wait for the work it's already doing to be 
> > > finished.  We're getting "stop doing new work" already because the 
> > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > way today to wait for him to finish what he's already doing.
> > > 
> > > kthread_park() is basically the same as adding another mutex or 
> > > synchronization point.  I'm not sure how we could change close_tree() or 
> > > the final commit to pick this up more effectively?
> > 
> > The idea is:
> > 
> > cleaner                                     close_ctree thread
> > 
> >                                             tell cleaner to stop
> > 					    wait
> > start work
> > if should_stop, then exit
> >                                             cleaner is stopped
> > 
> > [does not run: finish work]
> > [does not run: loop]
> >                                             pick up the work or make
> > 					    sure there's nothing in
> > 					    progress anymore
> > 
> > 
> > A simplified version in code:
> > 
> >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > 
> >   wait for defrag - could be started from cleaner but next iteration will
> > 		    see the fs closed and will not continue
> > 
> >   kthread_stop(transaction_kthread)
> > 
> >   kthread_stop(cleaner_kthread)
> > 
> >   /* next, everything that could be left from cleaner should be finished */
> > 
> >   btrfs_delete_unused_bgs();
> >   assert there are no defrags
> >   assert there are no delayed iputs
> >   commit if necessary
> > 
> > IOW the unused block groups are removed from close_ctree too early,
> > moving that after the threads stop AND making sure that it does not need
> > either of them should work.
> > 
> > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> > there would need to be an argument passed to tell it to do full commit.
> 
> Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> thread is a no-op,

How is that? kthread_stop() frees the task struct, so wake_up_process()
would be a use-after-free.
David Sterba Nov. 1, 2018, 3:28 p.m. UTC | #9
On Thu, Nov 01, 2018 at 08:23:17AM -0700, Omar Sandoval wrote:
> On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > 
> > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > >> From: Omar Sandoval <osandov@fb.com>
> > > >>
> > > >> There's a race between close_ctree() and cleaner_kthread().
> > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > >> sees it set, but this is racy; the cleaner might have already checked
> > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > >> unused
> > > >> block groups, it will create delayed iputs for the free space cache
> > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > >> cleaner
> > > >> creates more delayed iputs after delayed iputs are run in
> > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > >> inode crash from the VFS.
> > > >>
> > > >> Fix it by parking the cleaner
> > > >
> > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > we're missing a commit or clean up data structures. Messing with state
> > > > of a thread would be the last thing I'd try after proving that it's 
> > > > not
> > > > possible to fix in the logic of btrfs itself.
> > > >
> > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > there. The interdependencies of threads, data structures and other
> > > > subsystems cannot have loops, otherwise we could not find an ordering
> > > > that will not leak something.
> > > >
> > > > It's not a big problem if some step is done more than once, like
> > > > committing or cleaning up some other structures if we know that
> > > > it could create new.
> > > 
> > > The problem is the cleaner thread needs to be told to stop doing new 
> > > work, and we need to wait for the work it's already doing to be 
> > > finished.  We're getting "stop doing new work" already because the 
> > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > way today to wait for him to finish what he's already doing.
> > > 
> > > kthread_park() is basically the same as adding another mutex or 
> > > synchronization point.  I'm not sure how we could change close_ctree() or 
> > > the final commit to pick this up more effectively?
> > 
> > The idea is:
> > 
> > cleaner                                     close_ctree thread
> > 
> >                                             tell cleaner to stop
> > 					    wait
> > start work
> > if should_stop, then exit
> >                                             cleaner is stopped
> > 
> > [does not run: finish work]
> > [does not run: loop]
> >                                             pick up the work or make
> > 					    sure there's nothing in
> > 					    progress anymore
> > 
> > 
> > A simplified version in code:
> > 
> >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > 
> >   wait for defrag - could be started from cleaner but next iteration will
> > 		    see the fs closed and will not continue
> > 
> >   kthread_stop(transaction_kthread)
> > 
> >   kthread_stop(cleaner_kthread)
> > 
> >   /* next, everything that could be left from cleaner should be finished */
> > 
> >   btrfs_delete_unused_bgs();
> >   assert there are no defrags
> >   assert there are no delayed iputs
> >   commit if necessary
> > 
> > IOW the unused block groups are removed from close_ctree too early,
> > moving that after the threads stop AND making sure that it does not need
> > either of them should work.
> > 
> > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> > there would need to be an argument passed to tell it to do full commit.
> 
> It's too fragile to run around in the filesystem with these threads
> freed. We can probably make it now, but I'm worried that we'll add
> another wakeup somewhere and blow up.

So functions that are called from close_ctree will be careful to either
not wake up or check before trying to wake them up. We can also wrap the
wakeups like

wake_up_cleaner(fs_info) {
	if (!fs_info->state & CLOSING)
		wake_up_process(cleaner_kthread)
	else
		BUG_ON
}

Same for kthread.
Omar Sandoval Nov. 1, 2018, 3:28 p.m. UTC | #10
On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote:
> On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > > 
> > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > > >> From: Omar Sandoval <osandov@fb.com>
> > > > >>
> > > > >> There's a race between close_ctree() and cleaner_kthread().
> > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > > >> sees it set, but this is racy; the cleaner might have already checked
> > > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > > >> unused
> > > > >> block groups, it will create delayed iputs for the free space cache
> > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > > >> cleaner
> > > > >> creates more delayed iputs after delayed iputs are run in
> > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > > >> inode crash from the VFS.
> > > > >>
> > > > >> Fix it by parking the cleaner
> > > > >
> > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > > we're missing a commit or clean up data structures. Messing with state
> > > > > of a thread would be the last thing I'd try after proving that it's 
> > > > > not
> > > > > possible to fix in the logic of btrfs itself.
> > > > >
> > > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > > there. The interdependencies of threads, data structures and other
> > > > > subsystems cannot have loops, otherwise we could not find an ordering
> > > > > that will not leak something.
> > > > >
> > > > > It's not a big problem if some step is done more than once, like
> > > > > committing or cleaning up some other structures if we know that
> > > > > it could create new.
> > > > 
> > > > The problem is the cleaner thread needs to be told to stop doing new 
> > > > work, and we need to wait for the work it's already doing to be 
> > > > finished.  We're getting "stop doing new work" already because the 
> > > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > > way today to wait for him to finish what he's already doing.
> > > > 
> > > > kthread_park() is basically the same as adding another mutex or 
> > > > synchronization point.  I'm not sure how we could change close_ctree() or 
> > > > the final commit to pick this up more effectively?
> > > 
> > > The idea is:
> > > 
> > > cleaner                                     close_ctree thread
> > > 
> > >                                             tell cleaner to stop
> > > 					    wait
> > > start work
> > > if should_stop, then exit
> > >                                             cleaner is stopped
> > > 
> > > [does not run: finish work]
> > > [does not run: loop]
> > >                                             pick up the work or make
> > > 					    sure there's nothing in
> > > 					    progress anymore
> > > 
> > > 
> > > A simplified version in code:
> > > 
> > >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > > 
> > >   wait for defrag - could be started from cleaner but next iteration will
> > > 		    see the fs closed and will not continue
> > > 
> > >   kthread_stop(transaction_kthread)
> > > 
> > >   kthread_stop(cleaner_kthread)
> > > 
> > >   /* next, everything that could be left from cleaner should be finished */
> > > 
> > >   btrfs_delete_unused_bgs();
> > >   assert there are no defrags
> > >   assert there are no delayed iputs
> > >   commit if necessary
> > > 
> > > IOW the unused block groups are removed from close_ctree too early,
> > > moving that after the threads stop AND making sure that it does not need
> > > either of them should work.
> > > 
> > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> > > there would need to be an argument passed to tell it to do full commit.
> > 
> > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > thread is a no-op,
> 
> How is that? kthread_stop() frees the task struct, so wake_up_process()
> would be a use-after-free.

(Indirectly, through the kthread calling do_exit() -> do_task_dead() ->
being put in the scheduler)
David Sterba Nov. 1, 2018, 3:29 p.m. UTC | #11
On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote:
> On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > > 
> > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > > >> From: Omar Sandoval <osandov@fb.com>
> > > > >>
> > > > >> There's a race between close_ctree() and cleaner_kthread().
> > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > > >> sees it set, but this is racy; the cleaner might have already checked
> > > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > > >> unused
> > > > >> block groups, it will create delayed iputs for the free space cache
> > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > > >> cleaner
> > > > >> creates more delayed iputs after delayed iputs are run in
> > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > > >> inode crash from the VFS.
> > > > >>
> > > > >> Fix it by parking the cleaner
> > > > >
> > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > > we're missing a commit or clean up data structures. Messing with state
> > > > > of a thread would be the last thing I'd try after proving that it's 
> > > > > not
> > > > > possible to fix in the logic of btrfs itself.
> > > > >
> > > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > > there. The interdependencies of threads, data structures and other
> > > > > subsystems cannot have loops, otherwise we could not find an ordering
> > > > > that will not leak something.
> > > > >
> > > > > It's not a big problem if some step is done more than once, like
> > > > > committing or cleaning up some other structures if we know that
> > > > > it could create new.
> > > > 
> > > > The problem is the cleaner thread needs to be told to stop doing new 
> > > > work, and we need to wait for the work it's already doing to be 
> > > > finished.  We're getting "stop doing new work" already because the 
> > > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > > way today to wait for him to finish what he's already doing.
> > > > 
> > > > kthread_park() is basically the same as adding another mutex or 
> > > > synchronization point.  I'm not sure how we could change close_ctree() or 
> > > > the final commit to pick this up more effectively?
> > > 
> > > The idea is:
> > > 
> > > cleaner                                     close_ctree thread
> > > 
> > >                                             tell cleaner to stop
> > > 					    wait
> > > start work
> > > if should_stop, then exit
> > >                                             cleaner is stopped
> > > 
> > > [does not run: finish work]
> > > [does not run: loop]
> > >                                             pick up the work or make
> > > 					    sure there's nothing in
> > > 					    progress anymore
> > > 
> > > 
> > > A simplified version in code:
> > > 
> > >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > > 
> > >   wait for defrag - could be started from cleaner but next iteration will
> > > 		    see the fs closed and will not continue
> > > 
> > >   kthread_stop(transaction_kthread)
> > > 
> > >   kthread_stop(cleaner_kthread)
> > > 
> > >   /* next, everything that could be left from cleaner should be finished */
> > > 
> > >   btrfs_delete_unused_bgs();
> > >   assert there are no defrags
> > >   assert there are no delayed iputs
> > >   commit if necessary
> > > 
> > > IOW the unused block groups are removed from close_ctree too early,
> > > moving that after the threads stop AND making sure that it does not need
> > > either of them should work.
> > > 
> > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> > > there would need to be an argument passed to tell it to do full commit.
> > 
> > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > thread is a no-op,
> 
> How is that? kthread_stop() frees the task struct, so wake_up_process()
> would be a use-after-free.

That was an assumption for demonstration purposes; the wording was
confusing, sorry.
Omar Sandoval Nov. 1, 2018, 4 p.m. UTC | #12
On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote:
> > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> > > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > > > > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > > > > 
> > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > > > > >> From: Omar Sandoval <osandov@fb.com>
> > > > > >>
> > > > > >> There's a race between close_ctree() and cleaner_kthread().
> > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > > > > >> sees it set, but this is racy; the cleaner might have already checked
> > > > > >> the bit and could be cleaning stuff. In particular, if it deletes 
> > > > > >> unused
> > > > > >> block groups, it will create delayed iputs for the free space cache
> > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > > > > >> longer running delayed iputs after a commit. Therefore, if the 
> > > > > >> cleaner
> > > > > >> creates more delayed iputs after delayed iputs are run in
> > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > > > > >> inode crash from the VFS.
> > > > > >>
> > > > > >> Fix it by parking the cleaner
> > > > > >
> > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> > > > > > we're missing a commit or clean up data structures. Messing with state
> > > > > > of a thread would be the last thing I'd try after proving that it's 
> > > > > > not
> > > > > > possible to fix in the logic of btrfs itself.
> > > > > >
> > > > > > The shutdown sequence in close_ctree is quite tricky and we've had bugs
> > > > > > there. The interdependencies of threads, data structures and other
> > > > > > subsystems cannot have loops, otherwise we could not find an ordering
> > > > > > that will not leak something.
> > > > > >
> > > > > > It's not a big problem if some step is done more than once, like
> > > > > > committing or cleaning up some other structures if we know that
> > > > > > it could create new.
> > > > > 
> > > > > The problem is the cleaner thread needs to be told to stop doing new 
> > > > > work, and we need to wait for the work it's already doing to be 
> > > > > finished.  We're getting "stop doing new work" already because the 
> > > > > cleaner thread checks to see if the FS is closing, but we don't have a 
> > > > > way today to wait for him to finish what he's already doing.
> > > > > 
> > > > > kthread_park() is basically the same as adding another mutex or 
> > > > > synchronization point.  I'm not sure how we could change close_ctree() or 
> > > > > the final commit to pick this up more effectively?
> > > > 
> > > > The idea is:
> > > > 
> > > > cleaner                                     close_ctree thread
> > > > 
> > > >                                             tell cleaner to stop
> > > > 					    wait
> > > > start work
> > > > if should_stop, then exit
> > > >                                             cleaner is stopped
> > > > 
> > > > [does not run: finish work]
> > > > [does not run: loop]
> > > >                                             pick up the work or make
> > > > 					    sure there's nothing in
> > > > 					    progress anymore
> > > > 
> > > > 
> > > > A simplified version in code:
> > > > 
> > > >   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > > > 
> > > >   wait for defrag - could be started from cleaner but next iteration will
> > > > 		    see the fs closed and will not continue
> > > > 
> > > >   kthread_stop(transaction_kthread)
> > > > 
> > > >   kthread_stop(cleaner_kthread)
> > > > 
> > > >   /* next, everything that could be left from cleaner should be finished */
> > > > 
> > > >   btrfs_delete_unused_bgs();
> > > >   assert there are no defrags
> > > >   assert there are no delayed iputs
> > > >   commit if necessary
> > > > 
> > > > IOW the unused block groups are removed from close_ctree too early,
> > > > moving that after the threads stop AND making sure that it does not need
> > > > either of them should work.
> > > > 
> > > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > > calls plain btrfs_end_transaction that wakes up the transaction kthread, so
> > > > there would need to be an argument passed to tell it to do full commit.
> > > 
> > > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > > thread is a no-op,
> > 
> > How is that? kthread_stop() frees the task struct, so wake_up_process()
> > would be a use-after-free.
> 
> That was an assumption for the demonstration purposes, the wording was
> confusing sorry.

Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
thread and make wake_up a noop, and then we don't need to add special
cases everywhere else.
David Sterba Nov. 1, 2018, 4:44 p.m. UTC | #13
On Thu, Nov 01, 2018 at 09:00:56AM -0700, Omar Sandoval wrote:
> > That was an assumption for the demonstration purposes, the wording was
> > confusing sorry.
> 
> Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
> thread and make wake_up a noop, and then we don't need to add special
> cases everywhere else.

The end result is and should be equivalent; I'm now weighing both
approaches. Explicit state tracking outside of the thread structures
seems clearer, and if we want to query the thread status later, we can't
after kthread_stop is called. My current version below.

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68ca41dbbef3..cc076ae45ced 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -715,6 +715,7 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BARRIER			1
 #define BTRFS_FS_CLOSING_START			2
 #define BTRFS_FS_CLOSING_DONE			3
+#define BTRFS_FS_CLOSING_SINGLE			19
 #define BTRFS_FS_LOG_RECOVERING			4
 #define BTRFS_FS_OPEN				5
 #define BTRFS_FS_QUOTA_ENABLED			6
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..570096fb3f35 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1801,7 +1801,7 @@ static int transaction_kthread(void *arg)
 			btrfs_end_transaction(trans);
 		}
 sleep:
-		wake_up_process(fs_info->cleaner_kthread);
+		btrfs_wake_up_cleaner(fs_info);
 		mutex_unlock(&fs_info->transaction_kthread_mutex);
 
 		if (unlikely(test_bit(BTRFS_FS_STATE_ERROR,
@@ -3914,7 +3914,7 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 	mutex_lock(&fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(fs_info);
 	mutex_unlock(&fs_info->cleaner_mutex);
-	wake_up_process(fs_info->cleaner_kthread);
+	btrfs_wake_up_cleaner(fs_info);
 
 	/* wait until ongoing cleanup work done */
 	down_write(&fs_info->cleanup_work_sem);
@@ -3956,11 +3956,24 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
+	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
+	    test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
+		btrfs_error_commit_super(fs_info);
+
+	kthread_stop(fs_info->transaction_kthread);
+	kthread_stop(fs_info->cleaner_kthread);
+
+	/*
+	 * btrfs_delete_unused_bgs or btrfs_commit_super can still try to wake
+	 * up the threads but will become a no-op
+	 */
+	set_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags);
+
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
-		 * If the cleaner thread is stopped and there are
-		 * block groups queued for removal, the deletion will be
-		 * skipped when we quit the cleaner thread.
+		 * The cleaner thread is now stopped and if there are block
+		 * groups queued for removal, we have to pick up the work here
+		 * so there are no delayed iputs triggered.
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 
@@ -3969,13 +3982,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 			btrfs_err(fs_info, "commit super ret %d", ret);
 	}
 
-	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
-	    test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
-		btrfs_error_commit_super(fs_info);
-
-	kthread_stop(fs_info->transaction_kthread);
-	kthread_stop(fs_info->cleaner_kthread);
-
 	ASSERT(list_empty(&fs_info->delayed_iputs));
 	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a990a9045139..a5786b6e8d37 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5915,7 +5915,7 @@ long btrfs_ioctl(struct file *file, unsigned int
 		 * namely it pokes the cleaner kthread that will start
 		 * processing uncleaned subvols.
 		 */
-		wake_up_process(fs_info->transaction_kthread);
+		btrfs_wake_up_transaction(fs_info);
 		return ret;
 	}
 	case BTRFS_IOC_START_SYNC:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b362b45dd757..9e73e35b94ea 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1896,7 +1896,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		set_bit(BTRFS_FS_OPEN, &fs_info->flags);
 	}
 out:
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 	btrfs_remount_cleanup(fs_info, old_opts);
 	return 0;
 
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 3717c864ba23..d00311dc76d2 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -154,7 +154,7 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	 * We don't want to do full transaction commit from inside sysfs
 	 */
 	btrfs_set_pending(fs_info, COMMIT);
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 
 	return count;
 }
@@ -427,7 +427,7 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 	 * We don't want to do full transaction commit from inside sysfs
 	 */
 	btrfs_set_pending(fs_info, COMMIT);
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 
 	return len;
 }
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d1eeef9ec5da..3ffed32c680e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -867,7 +867,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		if (throttle)
 			return btrfs_commit_transaction(trans);
 		else
-			wake_up_process(info->transaction_kthread);
+			btrfs_wake_up_transaction(info);
 	}
 
 	if (trans->type & __TRANS_FREEZABLE)
@@ -889,7 +889,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 
 	if (trans->aborted ||
 	    test_bit(BTRFS_FS_STATE_ERROR, &info->fs_state)) {
-		wake_up_process(info->transaction_kthread);
+		btrfs_wake_up_transaction(info);
 		err = -EIO;
 	}
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 4cbb1b55387d..64a38c4af565 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -199,6 +199,25 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
 int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
 				   int wait_for_unblock);
 
+static inline void btrfs_wake_up_cleaner(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {
+		wake_up_process(fs_info->cleaner_kthread);
+	} else {
+		btrfs_debug(fs_info,
+			"attempt to wake up cleaner kthread during shutdown");
+	}
+}
+static inline void btrfs_wake_up_transaction(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {
+		wake_up_process(fs_info->transaction_kthread);
+	} else {
+		btrfs_debug(fs_info,
+			"attempt to wake up transaction kthread during shutdown");
+	}
+}
+
 /*
  * Try to commit transaction asynchronously, so this is safe to call
  * even holding a spinlock.
@@ -210,7 +229,7 @@ static inline void btrfs_commit_transaction_locksafe(
 		struct btrfs_fs_info *fs_info)
 {
 	set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 }
 int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
Nikolay Borisov Nov. 1, 2018, 4:50 p.m. UTC | #14
On 1.11.18 г. 18:44 ч., David Sterba wrote:
> On Thu, Nov 01, 2018 at 09:00:56AM -0700, Omar Sandoval wrote:
>>> That was an assumption for the demonstration purposes, the wording was
>>> confusing sorry.
>>
>> Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
>> thread and make wake_up a noop, and then we don't need to add special
>> cases everywhere else.
> 
> The end result is equivalent and should be, I'm now weighing both
> approaches. Explicit state tracking outside of the thread structures
> seems more clear and in case we want to query the status, we can't after
> kthread_stop is called. My current version below.
> 

<snip>

> +static inline void btrfs_wake_up_transaction(struct btrfs_fs_info *fs_info)
> +{
> +	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {

Why do we need a separate flag for a predicate? If we use
FS_CLOSING_START then waking up becomes a noop from the moment
close_ctree is called.

> +		wake_up_process(fs_info->transaction_kthread);
> +	} else {
> +		btrfs_debug(fs_info,
> +			"attempt to wake up transaction kthread during shutdown");
> +	}
> +}
> +
>  /*
>   * Try to commit transaction asynchronously, so this is safe to call
>   * even holding a spinlock.
> @@ -210,7 +229,7 @@ static inline void btrfs_commit_transaction_locksafe(
>  		struct btrfs_fs_info *fs_info)
>  {
>  	set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
> -	wake_up_process(fs_info->transaction_kthread);
> +	btrfs_wake_up_transaction(fs_info);
>  }
>  int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
>
David Sterba Nov. 1, 2018, 5:15 p.m. UTC | #15
On Thu, Nov 01, 2018 at 06:50:03PM +0200, Nikolay Borisov wrote:
> 
> 
> On 1.11.18 г. 18:44 ч., David Sterba wrote:
> > On Thu, Nov 01, 2018 at 09:00:56AM -0700, Omar Sandoval wrote:
> >>> That was an assumption for the demonstration purposes, the wording was
> >>> confusing sorry.
> >>
> >> Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
> >> thread and make wake_up a noop, and then we don't need to add special
> >> cases everywhere else.
> > 
> > The end result is equivalent and should be, I'm now weighing both
> > approaches. Explicit state tracking outside of the thread structures
> > seems more clear and in case we want to query the status, we can't after
> > kthread_stop is called. My current version below.
> > 
> 
> <snip>
> 
> > +static inline void btrfs_wake_up_transaction(struct btrfs_fs_info *fs_info)
> > +{
> > +	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {
> 
> Why do we need a separate flag for a predicate? If we use
> FS_CLOSING_START then waking up becomes a noop from the moment
> close_ctree is called.

Yeah, it's not strictly necessary; it's a leftover from when I had a
WARN_ON in the wakeup helpers that triggered too often, so another flag
was added just after the threads were stopped.
Chris Mason Nov. 1, 2018, 5:36 p.m. UTC | #16
On 1 Nov 2018, at 12:00, Omar Sandoval wrote:

> On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
>>>
>>> How is that? kthread_stop() frees the task struct, so 
>>> wake_up_process()
>>> would be a use-after-free.
>>
>> That was an assumption for the demonstration purposes, the wording 
>> was
>> confusing sorry.
>
> Oh, well in that case, that's exactly what kthread_park() is ;) Stop 
> the
> thread and make wake_up a noop, and then we don't need to add special
> cases everywhere else.

This is how Omar won me over to kthread_park().  I think Dave's 
alternatives can be made correct, but Omar's way solves a whole class of 
potential problems.  The park() makes the thread safely ignore future 
work, and we won't have to chase down weird bugs where people are 
sending future work at the wrong time.

-chris
David Sterba Nov. 7, 2018, 4:01 p.m. UTC | #17
On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
> 
> Fix it by parking the cleaner before we actually close anything. Then,
> any remaining delayed iputs will always be handled in
> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> is really the last commit, so we can get rid of the commit in
> cleaner_kthread().
> 
> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

I'll queue this patch for rc2 as it fixes crashes I see during testing.
My version does more changes and would be more suitable for a series,
that could actually document the shutdown sequence and add a few
assertions on top.

Reviewed-by: David Sterba <dsterba@suse.com>
Omar Sandoval Nov. 10, 2018, 4:07 a.m. UTC | #18
On Wed, Nov 07, 2018 at 05:01:19PM +0100, David Sterba wrote:
> On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There's a race between close_ctree() and cleaner_kthread().
> > close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > sees it set, but this is racy; the cleaner might have already checked
> > the bit and could be cleaning stuff. In particular, if it deletes unused
> > block groups, it will create delayed iputs for the free space cache
> > inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > longer running delayed iputs after a commit. Therefore, if the cleaner
> > creates more delayed iputs after delayed iputs are run in
> > btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > inode crash from the VFS.
> > 
> > Fix it by parking the cleaner before we actually close anything. Then,
> > any remaining delayed iputs will always be handled in
> > btrfs_commit_super(). This also ensures that the commit in close_ctree()
> > is really the last commit, so we can get rid of the commit in
> > cleaner_kthread().
> > 
> > Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> I'll queue this patch for rc2 as it fixes crashes I see during testing.
> My version does more changes and would be more suitable for a series,
> that could actually document the shutdown sequence and add a few
> assertions on top.

Thanks, Dave! I'll keep an eye out for the further cleanups.

> Reviewed-by: David Sterba <dsterba@suse.com>

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..40bcc45d827d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1664,9 +1664,8 @@  static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int again;
-	struct btrfs_trans_handle *trans;
 
-	do {
+	while (1) {
 		again = 0;
 
 		/* Make the cleaner go to sleep early. */
@@ -1715,42 +1714,16 @@  static int cleaner_kthread(void *arg)
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 sleep:
+		if (kthread_should_park())
+			kthread_parkme();
+		if (kthread_should_stop())
+			return 0;
 		if (!again) {
 			set_current_state(TASK_INTERRUPTIBLE);
-			if (!kthread_should_stop())
-				schedule();
+			schedule();
 			__set_current_state(TASK_RUNNING);
 		}
-	} while (!kthread_should_stop());
-
-	/*
-	 * Transaction kthread is stopped before us and wakes us up.
-	 * However we might have started a new transaction and COWed some
-	 * tree blocks when deleting unused block groups for example. So
-	 * make sure we commit the transaction we started to have a clean
-	 * shutdown when evicting the btree inode - if it has dirty pages
-	 * when we do the final iput() on it, eviction will trigger a
-	 * writeback for it which will fail with null pointer dereferences
-	 * since work queues and other resources were already released and
-	 * destroyed by the time the iput/eviction/writeback is made.
-	 */
-	trans = btrfs_attach_transaction(root);
-	if (IS_ERR(trans)) {
-		if (PTR_ERR(trans) != -ENOENT)
-			btrfs_err(fs_info,
-				  "cleaner transaction attach returned %ld",
-				  PTR_ERR(trans));
-	} else {
-		int ret;
-
-		ret = btrfs_commit_transaction(trans);
-		if (ret)
-			btrfs_err(fs_info,
-				  "cleaner open transaction commit returned %d",
-				  ret);
 	}
-
-	return 0;
 }
 
 static int transaction_kthread(void *arg)
@@ -3931,6 +3904,13 @@  void close_ctree(struct btrfs_fs_info *fs_info)
 	int ret;
 
 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
+	/*
+	 * We don't want the cleaner to start new transactions, add more delayed
+	 * iputs, etc. while we're closing. We can't use kthread_stop() yet
+	 * because that frees the task_struct, and the transaction kthread might
+	 * still try to wake up the cleaner.
+	 */
+	kthread_park(fs_info->cleaner_kthread);
 
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
@@ -3958,9 +3938,8 @@  void close_ctree(struct btrfs_fs_info *fs_info)
 
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
-		 * If the cleaner thread is stopped and there are
-		 * block groups queued for removal, the deletion will be
-		 * skipped when we quit the cleaner thread.
+		 * The cleaner kthread is stopped, so do one final pass over
+		 * unused block groups.
 		 */
 		btrfs_delete_unused_bgs(fs_info);