Btrfs: fix missing delayed iputs on unmount

Message ID	9b8c7d1ce662d216cf29ffcb756d177b0bf60b64.1540944854.git.osandov@fb.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@kernel.org> From: Omar Sandoval <osandov@osandov.com> To: linux-btrfs@vger.kernel.org Cc: kernel-team@fb.com Subject: [PATCH] Btrfs: fix missing delayed iputs on unmount Date: Tue, 30 Oct 2018 17:14:42 -0700 Message-Id: <9b8c7d1ce662d216cf29ffcb756d177b0bf60b64.1540944854.git.osandov@fb.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk
Series	Btrfs: fix missing delayed iputs on unmount \| expand Btrfs: fix missing delayed iputs on unmount

Omar Sandoval Oct. 31, 2018, 12:14 a.m. UTC

From: Omar Sandoval <osandov@fb.com>

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
We found this with a stress test that our containers team runs. I'm
wondering if this same race could have caused any other issues other
than this new iput thing, but I couldn't identify any.

 fs/btrfs/disk-io.c | 40 +++++++---------------------------------
 1 file changed, 7 insertions(+), 33 deletions(-)

Lu Fengqi Oct. 31, 2018, 7:14 a.m. UTC | #1

On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>From: Omar Sandoval <osandov@fb.com>
>
>There's a race between close_ctree() and cleaner_kthread().
>close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>sees it set, but this is racy; the cleaner might have already checked
>the bit and could be cleaning stuff. In particular, if it deletes unused
>block groups, it will create delayed iputs for the free space cache
>inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>longer running delayed iputs after a commit. Therefore, if the cleaner
>creates more delayed iputs after delayed iputs are run in
>btrfs_commit_super(), we will leak inodes on unmount and get a busy
>inode crash from the VFS.
>
>Fix it by parking the cleaner before we actually close anything. Then,
>any remaining delayed iputs will always be handled in
>btrfs_commit_super(). This also ensures that the commit in close_ctree()
>is really the last commit, so we can get rid of the commit in
>cleaner_kthread().
>
>Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>Signed-off-by: Omar Sandoval <osandov@fb.com>
>---
>We found this with a stress test that our containers team runs. I'm
>wondering if this same race could have caused any other issues other
>than this new iput thing, but I couldn't identify any.

I noticed an inode leak issue in generic/475, but whether dropping commit
30928e9baac2 ("btrfs: don't run delayed_iputs in commit") or applying
this patch, the issue still exists.

I have attached the dmesg.

>
> fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> 1 file changed, 7 insertions(+), 33 deletions(-)
>
>diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>index b0ab41da91d1..7c17284ae3c2 100644
>--- a/fs/btrfs/disk-io.c
>+++ b/fs/btrfs/disk-io.c
>@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> 	struct btrfs_root *root = arg;
> 	struct btrfs_fs_info *fs_info = root->fs_info;
> 	int again;
>-	struct btrfs_trans_handle *trans;
> 
>-	do {
>+	while (1) {
> 		again = 0;
> 
> 		/* Make the cleaner go to sleep early. */
>@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> 		 */
> 		btrfs_delete_unused_bgs(fs_info);
> sleep:
>+		if (kthread_should_park())
>+			kthread_parkme();
>+		if (kthread_should_stop())
>+			return 0;
> 		if (!again) {
> 			set_current_state(TASK_INTERRUPTIBLE);
>-			if (!kthread_should_stop())
>-				schedule();
>+			schedule();
> 			__set_current_state(TASK_RUNNING);
> 		}
>-	} while (!kthread_should_stop());
>-
>-	/*
>-	 * Transaction kthread is stopped before us and wakes us up.
>-	 * However we might have started a new transaction and COWed some
>-	 * tree blocks when deleting unused block groups for example. So
>-	 * make sure we commit the transaction we started to have a clean
>-	 * shutdown when evicting the btree inode - if it has dirty pages
>-	 * when we do the final iput() on it, eviction will trigger a
>-	 * writeback for it which will fail with null pointer dereferences
>-	 * since work queues and other resources were already released and
>-	 * destroyed by the time the iput/eviction/writeback is made.
>-	 */
>-	trans = btrfs_attach_transaction(root);
>-	if (IS_ERR(trans)) {
>-		if (PTR_ERR(trans) != -ENOENT)
>-			btrfs_err(fs_info,
>-				  "cleaner transaction attach returned %ld",
>-				  PTR_ERR(trans));
>-	} else {
>-		int ret;
>-
>-		ret = btrfs_commit_transaction(trans);
>-		if (ret)
>-			btrfs_err(fs_info,
>-				  "cleaner open transaction commit returned %d",
>-				  ret);
> 	}
>-
>-	return 0;
> }
> 
> static int transaction_kthread(void *arg)
>@@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> 	int ret;
> 
> 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>+	kthread_park(fs_info->cleaner_kthread);

Since we are not going to call kthread_unpark, I am not sure why
kthread_park is used instead of kthread_stop here. It looks like there
is no significant difference between stopping instantly and delayed stop.

Lu Fengqi Oct. 31, 2018, 7:41 a.m. UTC | #2

On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>From: Omar Sandoval <osandov@fb.com>
>
>There's a race between close_ctree() and cleaner_kthread().
>close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>sees it set, but this is racy; the cleaner might have already checked
>the bit and could be cleaning stuff. In particular, if it deletes unused
>block groups, it will create delayed iputs for the free space cache
>inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>longer running delayed iputs after a commit. Therefore, if the cleaner
>creates more delayed iputs after delayed iputs are run in
>btrfs_commit_super(), we will leak inodes on unmount and get a busy

Since the assert added via commit e187831e1875 ("btrfs: assert on non-empty
delayed iputs") wasn't triggered, it doesn't seem to be the cause of inode
leak.

Nikolay Borisov Oct. 31, 2018, 8:26 a.m. UTC | #3

On 31.10.18 г. 9:14 ч., Lu Fengqi wrote:
> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval <osandov@fb.com>
>>
>> There's a race between close_ctree() and cleaner_kthread().
>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>> sees it set, but this is racy; the cleaner might have already checked
>> the bit and could be cleaning stuff. In particular, if it deletes unused
>> block groups, it will create delayed iputs for the free space cache
>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>> longer running delayed iputs after a commit. Therefore, if the cleaner
>> creates more delayed iputs after delayed iputs are run in
>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>> inode crash from the VFS.
>>
>> Fix it by parking the cleaner before we actually close anything. Then,
>> any remaining delayed iputs will always be handled in
>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>> is really the last commit, so we can get rid of the commit in
>> cleaner_kthread().
>>
>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>> ---
>> We found this with a stress test that our containers team runs. I'm
>> wondering if this same race could have caused any other issues other
>> than this new iput thing, but I couldn't identify any.
> 
> I noticed an inode leak issue in generic/475, but whether dropping commit
> 30928e9baac2 ("btrfs: don't run delayed_iputs in commit") or applying
> this patch, the issue still exists.
> 
> I have attached the dmesg.

Are you able to trigger this reliably i.e 100% or just, say, 80-90% of
the time? If it's sporadic (but frequent) then it is likely an issue
with error handling. Also looking at the log I see:

[  367.977998] BTRFS info (device dm-3): at unmount delalloc count 8192

Meaning we are leaking 8k of delalloc space which should have been
cleaned up. And the warn is caused because we have an inode in the inode
rb tree. I will suggest that you patch btrfs_free_fs_root to print the
ino's if root->inode_tree is not empty. That way you can see if the inos
correspond to a data file or a freespace ino. But given btrfs free_fs
also complains of delaloc bytes I'd be willing to say it's data inodes.

<snip>

Nikolay Borisov Oct. 31, 2018, 8:38 a.m. UTC | #4

On 31.10.18 г. 10:26 ч., Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 9:14 ч., Lu Fengqi wrote:
>> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> There's a race between close_ctree() and cleaner_kthread().
>>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>>> sees it set, but this is racy; the cleaner might have already checked
>>> the bit and could be cleaning stuff. In particular, if it deletes unused
>>> block groups, it will create delayed iputs for the free space cache
>>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>>> longer running delayed iputs after a commit. Therefore, if the cleaner
>>> creates more delayed iputs after delayed iputs are run in
>>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>>> inode crash from the VFS.
>>>
>>> Fix it by parking the cleaner before we actually close anything. Then,
>>> any remaining delayed iputs will always be handled in
>>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>>> is really the last commit, so we can get rid of the commit in
>>> cleaner_kthread().
>>>
>>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> We found this with a stress test that our containers team runs. I'm
>>> wondering if this same race could have caused any other issues other
>>> than this new iput thing, but I couldn't identify any.
>>
>> I noticed an inode leak issue in generic/475, but whether dropping commit
>> 30928e9baac2 ("btrfs: don't run delayed_iputs in commit") or applying
>> this patch, the issue still exists.
>>
>> I have attached the dmesg.
> 
> Are you able to trigger this reliably i.e 100% or just, say, 80-90% of
> the time? If it's sporadic (but frequent) then it is likely an issue
> with error handling. Also looking at the log I see:
> 
> [  367.977998] BTRFS info (device dm-3): at unmount delalloc count 8192
> 
> Meaning we are leaking 8k of delalloc space which should have been
> cleaned up. And the warn is caused because we have an inode in the inode
> rb tree. I will suggest that you patch btrfs_free_fs_root to print the
> ino's if root->inode_tree is not empty. That way you can see if the inos
> correspond to a data file or a freespace ino. But given btrfs free_fs
> also complains of delaloc bytes I'd be willing to say it's data inodes.

As a matter of fact this should have been fixed by :

fe816d0f1d4c ("btrfs: Fix delalloc inodes invalidation during
transaction abort")


Perhaps this commit is not a full fix ?

> 
> <snip>
>

Nikolay Borisov Oct. 31, 2018, 8:43 a.m. UTC | #5

On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
> 
> Fix it by parking the cleaner before we actually close anything. Then,
> any remaining delayed iputs will always be handled in
> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> is really the last commit, so we can get rid of the commit in
> cleaner_kthread().
> 
> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> We found this with a stress test that our containers team runs. I'm
> wondering if this same race could have caused any other issues other
> than this new iput thing, but I couldn't identify any.
> 
>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
>  1 file changed, 7 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b0ab41da91d1..7c17284ae3c2 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	int again;
> -	struct btrfs_trans_handle *trans;
>  
> -	do {
> +	while (1) {
>  		again = 0;
>  
>  		/* Make the cleaner go to sleep early. */
> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>  		 */
>  		btrfs_delete_unused_bgs(fs_info);
>  sleep:
> +		if (kthread_should_park())
> +			kthread_parkme();
> +		if (kthread_should_stop())
> +			return 0;
>  		if (!again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
> -			if (!kthread_should_stop())
> -				schedule();
> +			schedule();
>  			__set_current_state(TASK_RUNNING);
>  		}
> -	} while (!kthread_should_stop());
> -
> -	/*
> -	 * Transaction kthread is stopped before us and wakes us up.
> -	 * However we might have started a new transaction and COWed some
> -	 * tree blocks when deleting unused block groups for example. So
> -	 * make sure we commit the transaction we started to have a clean
> -	 * shutdown when evicting the btree inode - if it has dirty pages
> -	 * when we do the final iput() on it, eviction will trigger a
> -	 * writeback for it which will fail with null pointer dereferences
> -	 * since work queues and other resources were already released and
> -	 * destroyed by the time the iput/eviction/writeback is made.
> -	 */
> -	trans = btrfs_attach_transaction(root);
> -	if (IS_ERR(trans)) {
> -		if (PTR_ERR(trans) != -ENOENT)
> -			btrfs_err(fs_info,
> -				  "cleaner transaction attach returned %ld",
> -				  PTR_ERR(trans));
> -	} else {
> -		int ret;
> -
> -		ret = btrfs_commit_transaction(trans);
> -		if (ret)
> -			btrfs_err(fs_info,
> -				  "cleaner open transaction commit returned %d",
> -				  ret);
>  	}
> -
> -	return 0;
>  }
>  
>  static int transaction_kthread(void *arg)
> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  	int ret;
>  
>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> +	kthread_park(fs_info->cleaner_kthread);

Can't you directly call kthread_stop here? When you park the thread it
will sleep and then when you call kthread_stop that function will unpark
the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
just return 0. So the from the moment the thread is parked until it's
stopped it doesn't have a chance to do useful work.

>  
>  	/* wait for the qgroup rescan worker to stop */
>  	btrfs_qgroup_wait_for_completion(fs_info, false);
>

Omar Sandoval Oct. 31, 2018, 4:35 p.m. UTC | #6

On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There's a race between close_ctree() and cleaner_kthread().
> > close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > sees it set, but this is racy; the cleaner might have already checked
> > the bit and could be cleaning stuff. In particular, if it deletes unused
> > block groups, it will create delayed iputs for the free space cache
> > inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > longer running delayed iputs after a commit. Therefore, if the cleaner
> > creates more delayed iputs after delayed iputs are run in
> > btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > inode crash from the VFS.
> > 
> > Fix it by parking the cleaner before we actually close anything. Then,
> > any remaining delayed iputs will always be handled in
> > btrfs_commit_super(). This also ensures that the commit in close_ctree()
> > is really the last commit, so we can get rid of the commit in
> > cleaner_kthread().
> > 
> > Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> > We found this with a stress test that our containers team runs. I'm
> > wondering if this same race could have caused any other issues other
> > than this new iput thing, but I couldn't identify any.
> > 
> >  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> >  1 file changed, 7 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index b0ab41da91d1..7c17284ae3c2 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> >  	struct btrfs_root *root = arg;
> >  	struct btrfs_fs_info *fs_info = root->fs_info;
> >  	int again;
> > -	struct btrfs_trans_handle *trans;
> >  
> > -	do {
> > +	while (1) {
> >  		again = 0;
> >  
> >  		/* Make the cleaner go to sleep early. */
> > @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >  		 */
> >  		btrfs_delete_unused_bgs(fs_info);
> >  sleep:
> > +		if (kthread_should_park())
> > +			kthread_parkme();
> > +		if (kthread_should_stop())
> > +			return 0;
> >  		if (!again) {
> >  			set_current_state(TASK_INTERRUPTIBLE);
> > -			if (!kthread_should_stop())
> > -				schedule();
> > +			schedule();
> >  			__set_current_state(TASK_RUNNING);
> >  		}
> > -	} while (!kthread_should_stop());
> > -
> > -	/*
> > -	 * Transaction kthread is stopped before us and wakes us up.
> > -	 * However we might have started a new transaction and COWed some
> > -	 * tree blocks when deleting unused block groups for example. So
> > -	 * make sure we commit the transaction we started to have a clean
> > -	 * shutdown when evicting the btree inode - if it has dirty pages
> > -	 * when we do the final iput() on it, eviction will trigger a
> > -	 * writeback for it which will fail with null pointer dereferences
> > -	 * since work queues and other resources were already released and
> > -	 * destroyed by the time the iput/eviction/writeback is made.
> > -	 */
> > -	trans = btrfs_attach_transaction(root);
> > -	if (IS_ERR(trans)) {
> > -		if (PTR_ERR(trans) != -ENOENT)
> > -			btrfs_err(fs_info,
> > -				  "cleaner transaction attach returned %ld",
> > -				  PTR_ERR(trans));
> > -	} else {
> > -		int ret;
> > -
> > -		ret = btrfs_commit_transaction(trans);
> > -		if (ret)
> > -			btrfs_err(fs_info,
> > -				  "cleaner open transaction commit returned %d",
> > -				  ret);
> >  	}
> > -
> > -	return 0;
> >  }
> >  
> >  static int transaction_kthread(void *arg)
> > @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> >  	int ret;
> >  
> >  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> > +	kthread_park(fs_info->cleaner_kthread);
> 
> Can't you directly call kthread_stop here? When you park the thread it
> will sleep and then when you call kthread_stop that function will unpark
> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
> just return 0. So the from the moment the thread is parked until it's
> stopped it doesn't have a chance to do useful work.

kthread_stop() frees the task_struct, but we might still try to wake up
the cleaner kthread from somewhere (e.g., from the transaction kthread).
So we really need to keep the cleaner alive but not doing work.

Omar Sandoval Oct. 31, 2018, 4:37 p.m. UTC | #7

On Wed, Oct 31, 2018 at 03:41:47PM +0800, Lu Fengqi wrote:
> On Tue, Oct 30, 2018 at 05:14:42PM -0700, Omar Sandoval wrote:
> >From: Omar Sandoval <osandov@fb.com>
> >
> >There's a race between close_ctree() and cleaner_kthread().
> >close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >sees it set, but this is racy; the cleaner might have already checked
> >the bit and could be cleaning stuff. In particular, if it deletes unused
> >block groups, it will create delayed iputs for the free space cache
> >inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >longer running delayed iputs after a commit. Therefore, if the cleaner
> >creates more delayed iputs after delayed iputs are run in
> >btrfs_commit_super(), we will leak inodes on unmount and get a busy
> 
> Since the assert added via commit e187831e1875 ("btrfs: assert on non-empty
> delayed iputs") wasn't triggered, it doesn't seem to be the cause of inode
> leak.

This was in our production build without CONFIG_BTRFS_ASSERT.

Nikolay Borisov Oct. 31, 2018, 4:40 p.m. UTC | #8

On 31.10.18 г. 18:35 ч., Omar Sandoval wrote:
> On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> There's a race between close_ctree() and cleaner_kthread().
>>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
>>> sees it set, but this is racy; the cleaner might have already checked
>>> the bit and could be cleaning stuff. In particular, if it deletes unused
>>> block groups, it will create delayed iputs for the free space cache
>>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
>>> longer running delayed iputs after a commit. Therefore, if the cleaner
>>> creates more delayed iputs after delayed iputs are run in
>>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
>>> inode crash from the VFS.
>>>
>>> Fix it by parking the cleaner before we actually close anything. Then,
>>> any remaining delayed iputs will always be handled in
>>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
>>> is really the last commit, so we can get rid of the commit in
>>> cleaner_kthread().
>>>
>>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>> We found this with a stress test that our containers team runs. I'm
>>> wondering if this same race could have caused any other issues other
>>> than this new iput thing, but I couldn't identify any.
>>>
>>>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
>>>  1 file changed, 7 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index b0ab41da91d1..7c17284ae3c2 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>>>  	struct btrfs_root *root = arg;
>>>  	struct btrfs_fs_info *fs_info = root->fs_info;
>>>  	int again;
>>> -	struct btrfs_trans_handle *trans;
>>>  
>>> -	do {
>>> +	while (1) {
>>>  		again = 0;
>>>  
>>>  		/* Make the cleaner go to sleep early. */
>>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>>>  		 */
>>>  		btrfs_delete_unused_bgs(fs_info);
>>>  sleep:
>>> +		if (kthread_should_park())
>>> +			kthread_parkme();
>>> +		if (kthread_should_stop())
>>> +			return 0;
>>>  		if (!again) {
>>>  			set_current_state(TASK_INTERRUPTIBLE);
>>> -			if (!kthread_should_stop())
>>> -				schedule();
>>> +			schedule();
>>>  			__set_current_state(TASK_RUNNING);
>>>  		}
>>> -	} while (!kthread_should_stop());
>>> -
>>> -	/*
>>> -	 * Transaction kthread is stopped before us and wakes us up.
>>> -	 * However we might have started a new transaction and COWed some
>>> -	 * tree blocks when deleting unused block groups for example. So
>>> -	 * make sure we commit the transaction we started to have a clean
>>> -	 * shutdown when evicting the btree inode - if it has dirty pages
>>> -	 * when we do the final iput() on it, eviction will trigger a
>>> -	 * writeback for it which will fail with null pointer dereferences
>>> -	 * since work queues and other resources were already released and
>>> -	 * destroyed by the time the iput/eviction/writeback is made.
>>> -	 */
>>> -	trans = btrfs_attach_transaction(root);
>>> -	if (IS_ERR(trans)) {
>>> -		if (PTR_ERR(trans) != -ENOENT)
>>> -			btrfs_err(fs_info,
>>> -				  "cleaner transaction attach returned %ld",
>>> -				  PTR_ERR(trans));
>>> -	} else {
>>> -		int ret;
>>> -
>>> -		ret = btrfs_commit_transaction(trans);
>>> -		if (ret)
>>> -			btrfs_err(fs_info,
>>> -				  "cleaner open transaction commit returned %d",
>>> -				  ret);
>>>  	}
>>> -
>>> -	return 0;
>>>  }
>>>  
>>>  static int transaction_kthread(void *arg)
>>> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>>>  	int ret;
>>>  
>>>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
>>> +	kthread_park(fs_info->cleaner_kthread);
>>
>> Can't you directly call kthread_stop here? When you park the thread it
>> will sleep and then when you call kthread_stop that function will unpark
>> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
>> just return 0. So the from the moment the thread is parked until it's
>> stopped it doesn't have a chance to do useful work.
> 
> kthread_stop() frees the task_struct, but we might still try to wake up
> the cleaner kthread from somewhere (e.g., from the transaction kthread).
> So we really need to keep the cleaner alive but not doing work.

This dependency then needs to be documented via a comment or at the very
least mentioned in the changelog. Is it possible to refactor the code
(in a different patch) to actually ensure that transaction is stopped
and then kthread as well to remove this dependency ?

>

Omar Sandoval Oct. 31, 2018, 4:48 p.m. UTC | #9

On Wed, Oct 31, 2018 at 06:40:29PM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 18:35 ч., Omar Sandoval wrote:
> > On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> >>> From: Omar Sandoval <osandov@fb.com>
> >>>
> >>> There's a race between close_ctree() and cleaner_kthread().
> >>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >>> sees it set, but this is racy; the cleaner might have already checked
> >>> the bit and could be cleaning stuff. In particular, if it deletes unused
> >>> block groups, it will create delayed iputs for the free space cache
> >>> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> >>> longer running delayed iputs after a commit. Therefore, if the cleaner
> >>> creates more delayed iputs after delayed iputs are run in
> >>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> >>> inode crash from the VFS.
> >>>
> >>> Fix it by parking the cleaner before we actually close anything. Then,
> >>> any remaining delayed iputs will always be handled in
> >>> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> >>> is really the last commit, so we can get rid of the commit in
> >>> cleaner_kthread().
> >>>
> >>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> >>> Signed-off-by: Omar Sandoval <osandov@fb.com>
> >>> ---
> >>> We found this with a stress test that our containers team runs. I'm
> >>> wondering if this same race could have caused any other issues other
> >>> than this new iput thing, but I couldn't identify any.
> >>>
> >>>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> >>>  1 file changed, 7 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index b0ab41da91d1..7c17284ae3c2 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> >>>  	struct btrfs_root *root = arg;
> >>>  	struct btrfs_fs_info *fs_info = root->fs_info;
> >>>  	int again;
> >>> -	struct btrfs_trans_handle *trans;
> >>>  
> >>> -	do {
> >>> +	while (1) {
> >>>  		again = 0;
> >>>  
> >>>  		/* Make the cleaner go to sleep early. */
> >>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >>>  		 */
> >>>  		btrfs_delete_unused_bgs(fs_info);
> >>>  sleep:
> >>> +		if (kthread_should_park())
> >>> +			kthread_parkme();
> >>> +		if (kthread_should_stop())
> >>> +			return 0;
> >>>  		if (!again) {
> >>>  			set_current_state(TASK_INTERRUPTIBLE);
> >>> -			if (!kthread_should_stop())
> >>> -				schedule();
> >>> +			schedule();
> >>>  			__set_current_state(TASK_RUNNING);
> >>>  		}
> >>> -	} while (!kthread_should_stop());
> >>> -
> >>> -	/*
> >>> -	 * Transaction kthread is stopped before us and wakes us up.
> >>> -	 * However we might have started a new transaction and COWed some
> >>> -	 * tree blocks when deleting unused block groups for example. So
> >>> -	 * make sure we commit the transaction we started to have a clean
> >>> -	 * shutdown when evicting the btree inode - if it has dirty pages
> >>> -	 * when we do the final iput() on it, eviction will trigger a
> >>> -	 * writeback for it which will fail with null pointer dereferences
> >>> -	 * since work queues and other resources were already released and
> >>> -	 * destroyed by the time the iput/eviction/writeback is made.
> >>> -	 */
> >>> -	trans = btrfs_attach_transaction(root);
> >>> -	if (IS_ERR(trans)) {
> >>> -		if (PTR_ERR(trans) != -ENOENT)
> >>> -			btrfs_err(fs_info,
> >>> -				  "cleaner transaction attach returned %ld",
> >>> -				  PTR_ERR(trans));
> >>> -	} else {
> >>> -		int ret;
> >>> -
> >>> -		ret = btrfs_commit_transaction(trans);
> >>> -		if (ret)
> >>> -			btrfs_err(fs_info,
> >>> -				  "cleaner open transaction commit returned %d",
> >>> -				  ret);
> >>>  	}
> >>> -
> >>> -	return 0;
> >>>  }
> >>>  
> >>>  static int transaction_kthread(void *arg)
> >>> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> >>>  	int ret;
> >>>  
> >>>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> >>> +	kthread_park(fs_info->cleaner_kthread);
> >>
> >> Can't you directly call kthread_stop here? When you park the thread it
> >> will sleep and then when you call kthread_stop that function will unpark
> >> the thread and the cleaner kthread will see KTHREAD_SHOULD_STOP bit and
> >> just return 0. So the from the moment the thread is parked until it's
> >> stopped it doesn't have a chance to do useful work.
> > 
> > kthread_stop() frees the task_struct, but we might still try to wake up
> > the cleaner kthread from somewhere (e.g., from the transaction kthread).
> > So we really need to keep the cleaner alive but not doing work.
> 
> This dependency then needs to be documented via a comment or at the very
> least mentioned in the changelog. Is it possible to refactor the code
> (in a different patch) to actually ensure that transaction is stopped
> and then kthread as well to remove this dependency ?

Then we get the same issue with things trying to wake up the transaction
kthread after it's been freed (__btrfs_end_transaction(), in
particular). Maybe we could make it work, but it seems very fragile.

I'll send a v2 with a comment.

Btrfs: fix missing delayed iputs on unmount

Commit Message

Comments

Patch