diff mbox series

[v3] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag

Message ID cc2a0c81-ee9e-d2bd-9cc0-025873f394c0@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v3] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag | expand

Commit Message

Kaixu Xia Aug. 21, 2019, 4:46 a.m. UTC
When performing rename operation with RENAME_WHITEOUT flag, we will
hold AGF lock to allocate or free extents in manipulating the dirents
firstly, and then doing the xfs_iunlink_remove() call last to hold
AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.

The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI. Hence
the ordering that is imposed by other parts of the code is AGI before
AGF. So we get the ABBA agi&agf deadlock here.

Process A:
Call trace:
  ? __schedule+0x2bd/0x620
  schedule+0x33/0x90
  schedule_timeout+0x17d/0x290
  __down_common+0xef/0x125
  ? xfs_buf_find+0x215/0x6c0 [xfs]
  down+0x3b/0x50
  xfs_buf_lock+0x34/0xf0 [xfs]
  xfs_buf_find+0x215/0x6c0 [xfs]
  xfs_buf_get_map+0x37/0x230 [xfs]
  xfs_buf_read_map+0x29/0x190 [xfs]
  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
  xfs_read_agf+0xa6/0x180 [xfs]
  ? schedule_timeout+0x17d/0x290
  xfs_alloc_read_agf+0x52/0x1f0 [xfs]
  xfs_alloc_fix_freelist+0x432/0x590 [xfs]
  ? down+0x3b/0x50
  ? xfs_buf_lock+0x34/0xf0 [xfs]
  ? xfs_buf_find+0x215/0x6c0 [xfs]
  xfs_alloc_vextent+0x301/0x6c0 [xfs]
  xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
  ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
  xfs_dialloc+0x116/0x290 [xfs]
  xfs_ialloc+0x6d/0x5e0 [xfs]
  ? xfs_log_reserve+0x165/0x280 [xfs]
  xfs_dir_ialloc+0x8c/0x240 [xfs]
  xfs_create+0x35a/0x610 [xfs]
  xfs_generic_create+0x1f1/0x2f0 [xfs]
  ...

Process B:
Call trace:
  ? __schedule+0x2bd/0x620
  ? xfs_bmapi_allocate+0x245/0x380 [xfs]
  schedule+0x33/0x90
  schedule_timeout+0x17d/0x290
  ? xfs_buf_find+0x1fd/0x6c0 [xfs]
  __down_common+0xef/0x125
  ? xfs_buf_get_map+0x37/0x230 [xfs]
  ? xfs_buf_find+0x215/0x6c0 [xfs]
  down+0x3b/0x50
  xfs_buf_lock+0x34/0xf0 [xfs]
  xfs_buf_find+0x215/0x6c0 [xfs]
  xfs_buf_get_map+0x37/0x230 [xfs]
  xfs_buf_read_map+0x29/0x190 [xfs]
  xfs_trans_read_buf_map+0x13d/0x520 [xfs]
  xfs_read_agi+0xa8/0x160 [xfs]
  xfs_iunlink_remove+0x6f/0x2a0 [xfs]
  ? current_time+0x46/0x80
  ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
  xfs_rename+0x57a/0xae0 [xfs]
  xfs_vn_rename+0xe4/0x150 [xfs]
  ...

In this patch we move the xfs_iunlink_remove() call to
before acquiring the AGF lock to preserve correct AGI/AGF locking
order.

Signed-off-by: kaixuxia <kaixuxia@tencent.com>
---
  fs/xfs/xfs_inode.c | 61 ++++++++++++++++++++++++++++++++++--------------------
  1 file changed, 38 insertions(+), 23 deletions(-)

Comments

Brian Foster Aug. 21, 2019, 11:25 a.m. UTC | #1
On Wed, Aug 21, 2019 at 12:46:18PM +0800, kaixuxia wrote:
> When performing rename operation with RENAME_WHITEOUT flag, we will
> hold AGF lock to allocate or free extents in manipulating the dirents
> firstly, and then doing the xfs_iunlink_remove() call last to hold
> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> 
> The big problem here is that we have an ordering constraint on AGF
> and AGI locking - inode allocation locks the AGI, then can allocate
> a new extent for new inodes, locking the AGF after the AGI. Hence
> the ordering that is imposed by other parts of the code is AGI before
> AGF. So we get the ABBA agi&agf deadlock here.
> 
...
> 
> In this patch we move the xfs_iunlink_remove() call to
> before acquiring the AGF lock to preserve correct AGI/AGF locking
> order.
> 
> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
> ---

FYI, I see this when I pull in this patch:

warning: Patch sent with format=flowed; space at the end of lines might be lost.

Not sure what it means or if it matters. :P

Otherwise this looks much better to me generally. Just some nits..

>  fs/xfs/xfs_inode.c | 61 ++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 38 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 6467d5e..cf06568 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3282,7 +3282,8 @@ struct xfs_iunlink {
>  					spaceres);
> 
>  	/*
> -	 * Set up the target.
> +	 * Error checks before we dirty the transaction, return
> +	 * the error code if check failed and the filesystem is clean.

I'm not sure what "filesystem is clean" refers to here. I think you mean
transaction, but I'm wondering if something like the following is a bit
more clear:

"Check for expected errors before we dirty the transaction so we can
return an error without a transaction abort."

>  	 */
>  	if (target_ip == NULL) {
>  		/*
> @@ -3294,6 +3295,40 @@ struct xfs_iunlink {
>  			if (error)
>  				goto out_trans_cancel;
>  		}
> +	} else {
> +		/*
> +		 * If target exists and it's a directory, check that both
> +		 * target and source are directories and that target can be
> +		 * destroyed, or that neither is a directory.
> +		 */

Interesting that the existing comment refers to checking the source
inode, but that doesn't happen in the code. That's not a bug in this
patch, but are we missing a check here or is the comment stale?

> +		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
> +			/*
> +			 * Make sure target dir is empty.
> +			 */
> +			if (!(xfs_dir_isempty(target_ip)) ||
> +			    (VFS_I(target_ip)->i_nlink > 2)) {
> +				error = -EEXIST;
> +				goto out_trans_cancel;
> +			}
> +		}
> +	}

Code seems fine, but I think we could save some lines by condensing the
logic a bit. For example:

	/*
	 * ...
	 */
	if (!target_ip && !spaceres) {
		/* check for a no res dentry creation */
		error = xfs_dir_canenter();
		...
	} else if (target_ip && S_ISDIR(VFS_I(target_ip)->i_mode) &&
		   (!(xfs_dir_isempty(target_ip)) || 
		    (VFS_I(target_ip)->i_nlink > 2)))
		/* can't rename over a non-empty directory */
		error = -EEXIST;
		goto out_trans_cancel;
	}

Hm? Note that we use an 80 column limit, but we also want to expand
short lines to that limit as much as possible and use alignment to make
logic easier to read.

> +
> +	/*
> +	 * Directory entry creation below may acquire the AGF. Remove
> +	 * the whiteout from the unlinked list first to preserve correct
> +	 * AGI/AGF locking order.
> +	 */
> +	if (wip) {
> +		ASSERT(VFS_I(wip)->i_nlink == 0);
> +		error = xfs_iunlink_remove(tp, wip);
> +		if (error)
> +			goto out_trans_cancel;
> +	}
> +
> +	/*
> +	 * Set up the target.
> +	 */
> +	if (target_ip == NULL) {
>  		/*
>  		 * If target does not exist and the rename crosses
>  		 * directories, adjust the target directory link count
> @@ -3312,22 +3347,6 @@ struct xfs_iunlink {
>  		}
>  	} else { /* target_ip != NULL */
>  		/*
> -		 * If target exists and it's a directory, check that both
> -		 * target and source are directories and that target can be
> -		 * destroyed, or that neither is a directory.
> -		 */
> -		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
> -			/*
> -			 * Make sure target dir is empty.
> -			 */
> -			if (!(xfs_dir_isempty(target_ip)) ||
> -			    (VFS_I(target_ip)->i_nlink > 2)) {
> -				error = -EEXIST;
> -				goto out_trans_cancel;
> -			}
> -		}
> -
> -		/*
>  		 * Link the source inode under the target name.
>  		 * If the source inode is a directory and we are moving
>  		 * it across directories, its ".." entry will be
> @@ -3421,16 +3440,12 @@ struct xfs_iunlink {
>  	 * For whiteouts, we need to bump the link count on the whiteout inode.
>  	 * This means that failures all the way up to this point leave the inode
>  	 * on the unlinked list and so cleanup is a simple matter of dropping
> -	 * the remaining reference to it. If we fail here after bumping the link
> -	 * count, we're shutting down the filesystem so we'll never see the
> -	 * intermediate state on disk.
> +	 * the remaining reference to it. Move the xfs_iunlink_remove() call to
> +	 * before acquiring the AGF lock to preserve correct AGI/AGF locking order.

With this change, the earlier part of this comment about failures up
this point leaving the whiteout on the unlinked list is no longer true.
We've already removed it earlier in the function. Also, the new bit
about "moving" the call is confusing because it describes more what this
patch does vs the current code.

I'd suggest a new comment that combines with the one within this branch
(not shown in the patch). For example:

        /*
         * For whiteouts, we need to bump the link count on the whiteout inode.
         * This means that failures all the way up to this point leave the inode
         * on the unlinked list and so cleanup is a simple matter of dropping
         * the remaining reference to it. If we fail here after bumping the link
         * count, we're shutting down the filesystem so we'll never see the
         * intermediate state on disk.
         */

And then remove the comment inside the branch. FWIW, you could also add
a sentence to the earlier comment where the wip is removed like: "This
dirties the transaction so failures after this point will abort and log
recovery will clean up the mess."

Brian

>  	 */
>  	if (wip) {
>  		ASSERT(VFS_I(wip)->i_nlink == 0);
>  		xfs_bumplink(tp, wip);
> -		error = xfs_iunlink_remove(tp, wip);
> -		if (error)
> -			goto out_trans_cancel;
>  		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
> 
>  		/*
> -- 
> 1.8.3.1
> 
> -- 
> kaixuxia
Kaixu Xia Aug. 22, 2019, 2:07 a.m. UTC | #2
On 2019/8/21 19:25, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 12:46:18PM +0800, kaixuxia wrote:
>> When performing rename operation with RENAME_WHITEOUT flag, we will
>> hold AGF lock to allocate or free extents in manipulating the dirents
>> firstly, and then doing the xfs_iunlink_remove() call last to hold
>> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
>>
>> The big problem here is that we have an ordering constraint on AGF
>> and AGI locking - inode allocation locks the AGI, then can allocate
>> a new extent for new inodes, locking the AGF after the AGI. Hence
>> the ordering that is imposed by other parts of the code is AGI before
>> AGF. So we get the ABBA agi&agf deadlock here.
>>
> ...
>>
>> In this patch we move the xfs_iunlink_remove() call to
>> before acquiring the AGF lock to preserve correct AGI/AGF locking
>> order.
>>
>> Signed-off-by: kaixuxia <kaixuxia@tencent.com>
>> ---
> 
> FYI, I see this when I pull in this patch:
> 
> warning: Patch sent with format=flowed; space at the end of lines might be lost.
> 
> Not sure what it means or if it matters. :P

This is my Thunderbird edit config problem, will fix it. :) 
> 
> Otherwise this looks much better to me generally. Just some nits..
> 
>>  fs/xfs/xfs_inode.c | 61 ++++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 38 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 6467d5e..cf06568 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -3282,7 +3282,8 @@ struct xfs_iunlink {
>>  					spaceres);
>>
>>  	/*
>> -	 * Set up the target.
>> +	 * Error checks before we dirty the transaction, return
>> +	 * the error code if check failed and the filesystem is clean.
> 
> I'm not sure what "filesystem is clean" refers to here. I think you mean
> transaction, but I'm wondering if something like the following is a bit
> more clear:
> 
> "Check for expected errors before we dirty the transaction so we can
> return an error without a transaction abort."
> 
>>  	 */
>>  	if (target_ip == NULL) {
>>  		/*
>> @@ -3294,6 +3295,40 @@ struct xfs_iunlink {
>>  			if (error)
>>  				goto out_trans_cancel;
>>  		}
>> +	} else {
>> +		/*
>> +		 * If target exists and it's a directory, check that both
>> +		 * target and source are directories and that target can be
>> +		 * destroyed, or that neither is a directory.
>> +		 */
> 
> Interesting that the existing comment refers to checking the source
> inode, but that doesn't happen in the code. That's not a bug in this
> patch, but are we missing a check here or is the comment stale?
> 
>> +		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
>> +			/*
>> +			 * Make sure target dir is empty.
>> +			 */
>> +			if (!(xfs_dir_isempty(target_ip)) ||
>> +			    (VFS_I(target_ip)->i_nlink > 2)) {
>> +				error = -EEXIST;
>> +				goto out_trans_cancel;
>> +			}
>> +		}
>> +	}
> 
> Code seems fine, but I think we could save some lines by condensing the
> logic a bit. For example:
> 
> 	/*
> 	 * ...
> 	 */
> 	if (!target_ip && !spaceres) {
> 		/* check for a no res dentry creation */
> 		error = xfs_dir_canenter();
> 		...
> 	} else if (target_ip && S_ISDIR(VFS_I(target_ip)->i_mode) &&
> 		   (!(xfs_dir_isempty(target_ip)) || 
> 		    (VFS_I(target_ip)->i_nlink > 2)))
> 		/* can't rename over a non-empty directory */
> 		error = -EEXIST;
> 		goto out_trans_cancel;
> 	}
> 
> Hm? Note that we use an 80 column limit, but we also want to expand
> short lines to that limit as much as possible and use alignment to make
> logic easier to read.
> 
>> +
>> +	/*
>> +	 * Directory entry creation below may acquire the AGF. Remove
>> +	 * the whiteout from the unlinked list first to preserve correct
>> +	 * AGI/AGF locking order.
>> +	 */
>> +	if (wip) {
>> +		ASSERT(VFS_I(wip)->i_nlink == 0);
>> +		error = xfs_iunlink_remove(tp, wip);
>> +		if (error)
>> +			goto out_trans_cancel;
>> +	}
>> +
>> +	/*
>> +	 * Set up the target.
>> +	 */
>> +	if (target_ip == NULL) {
>>  		/*
>>  		 * If target does not exist and the rename crosses
>>  		 * directories, adjust the target directory link count
>> @@ -3312,22 +3347,6 @@ struct xfs_iunlink {
>>  		}
>>  	} else { /* target_ip != NULL */
>>  		/*
>> -		 * If target exists and it's a directory, check that both
>> -		 * target and source are directories and that target can be
>> -		 * destroyed, or that neither is a directory.
>> -		 */
>> -		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
>> -			/*
>> -			 * Make sure target dir is empty.
>> -			 */
>> -			if (!(xfs_dir_isempty(target_ip)) ||
>> -			    (VFS_I(target_ip)->i_nlink > 2)) {
>> -				error = -EEXIST;
>> -				goto out_trans_cancel;
>> -			}
>> -		}
>> -
>> -		/*
>>  		 * Link the source inode under the target name.
>>  		 * If the source inode is a directory and we are moving
>>  		 * it across directories, its ".." entry will be
>> @@ -3421,16 +3440,12 @@ struct xfs_iunlink {
>>  	 * For whiteouts, we need to bump the link count on the whiteout inode.
>>  	 * This means that failures all the way up to this point leave the inode
>>  	 * on the unlinked list and so cleanup is a simple matter of dropping
>> -	 * the remaining reference to it. If we fail here after bumping the link
>> -	 * count, we're shutting down the filesystem so we'll never see the
>> -	 * intermediate state on disk.
>> +	 * the remaining reference to it. Move the xfs_iunlink_remove() call to
>> +	 * before acquiring the AGF lock to preserve correct AGI/AGF locking order.
> 
> With this change, the earlier part of this comment about failures up
> this point leaving the whiteout on the unlinked list is no longer true.
> We've already removed it earlier in the function. Also, the new bit
> about "moving" the call is confusing because it describes more what this
> patch does vs the current code.
> 
> I'd suggest a new comment that combines with the one within this branch
> (not shown in the patch). For example:
> 
>         /*
>          * For whiteouts, we need to bump the link count on the whiteout inode.
>          * This means that failures all the way up to this point leave the inode
>          * on the unlinked list and so cleanup is a simple matter of dropping
>          * the remaining reference to it. If we fail here after bumping the link
>          * count, we're shutting down the filesystem so we'll never see the
>          * intermediate state on disk.
>          */
> 
> And then remove the comment inside the branch. FWIW, you could also add
> a sentence to the earlier comment where the wip is removed like: "This
> dirties the transaction so failures after this point will abort and log
> recovery will clean up the mess."

Thanks a lot for all the comments. I will address them and send
the new patch.
> 
> Brian
> 
>>  	 */
>>  	if (wip) {
>>  		ASSERT(VFS_I(wip)->i_nlink == 0);
>>  		xfs_bumplink(tp, wip);
>> -		error = xfs_iunlink_remove(tp, wip);
>> -		if (error)
>> -			goto out_trans_cancel;
>>  		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>>
>>  		/*
>> -- 
>> 1.8.3.1
>>
>> -- 
>> kaixuxia
diff mbox series

Patch

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6467d5e..cf06568 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3282,7 +3282,8 @@  struct xfs_iunlink {
  					spaceres);

  	/*
-	 * Set up the target.
+	 * Error checks before we dirty the transaction, return
+	 * the error code if check failed and the filesystem is clean.
  	 */
  	if (target_ip == NULL) {
  		/*
@@ -3294,6 +3295,40 @@  struct xfs_iunlink {
  			if (error)
  				goto out_trans_cancel;
  		}
+	} else {
+		/*
+		 * If target exists and it's a directory, check that both
+		 * target and source are directories and that target can be
+		 * destroyed, or that neither is a directory.
+		 */
+		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
+			/*
+			 * Make sure target dir is empty.
+			 */
+			if (!(xfs_dir_isempty(target_ip)) ||
+			    (VFS_I(target_ip)->i_nlink > 2)) {
+				error = -EEXIST;
+				goto out_trans_cancel;
+			}
+		}
+	}
+
+	/*
+	 * Directory entry creation below may acquire the AGF. Remove
+	 * the whiteout from the unlinked list first to preserve correct
+	 * AGI/AGF locking order.
+	 */
+	if (wip) {
+		ASSERT(VFS_I(wip)->i_nlink == 0);
+		error = xfs_iunlink_remove(tp, wip);
+		if (error)
+			goto out_trans_cancel;
+	}
+
+	/*
+	 * Set up the target.
+	 */
+	if (target_ip == NULL) {
  		/*
  		 * If target does not exist and the rename crosses
  		 * directories, adjust the target directory link count
@@ -3312,22 +3347,6 @@  struct xfs_iunlink {
  		}
  	} else { /* target_ip != NULL */
  		/*
-		 * If target exists and it's a directory, check that both
-		 * target and source are directories and that target can be
-		 * destroyed, or that neither is a directory.
-		 */
-		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
-			/*
-			 * Make sure target dir is empty.
-			 */
-			if (!(xfs_dir_isempty(target_ip)) ||
-			    (VFS_I(target_ip)->i_nlink > 2)) {
-				error = -EEXIST;
-				goto out_trans_cancel;
-			}
-		}
-
-		/*
  		 * Link the source inode under the target name.
  		 * If the source inode is a directory and we are moving
  		 * it across directories, its ".." entry will be
@@ -3421,16 +3440,12 @@  struct xfs_iunlink {
  	 * For whiteouts, we need to bump the link count on the whiteout inode.
  	 * This means that failures all the way up to this point leave the inode
  	 * on the unlinked list and so cleanup is a simple matter of dropping
-	 * the remaining reference to it. If we fail here after bumping the link
-	 * count, we're shutting down the filesystem so we'll never see the
-	 * intermediate state on disk.
+	 * the remaining reference to it. Move the xfs_iunlink_remove() call to
+	 * before acquiring the AGF lock to preserve correct AGI/AGF locking order.
  	 */
  	if (wip) {
  		ASSERT(VFS_I(wip)->i_nlink == 0);
  		xfs_bumplink(tp, wip);
-		error = xfs_iunlink_remove(tp, wip);
-		if (error)
-			goto out_trans_cancel;
  		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);

  		/*