xfs: fix transaction allocation deadlock in IO path

Message ID 20180305041120.4224-1-david@fromorbit.com (mailing list archive)
State Accepted

Commit Message

Dave Chinner March 5, 2018, 4:11 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are held locked in writeback by the
caller.

This affects both pre-IO submission and post-IO submission paths,
namely xfs_setfilesize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c    | 3 ++-
 fs/xfs/xfs_reflink.c | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)
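
The deadlock described in the commit message comes down to which allocation mask the transaction allocation ends up using. A minimal userspace sketch of that idea follows, assuming a simplified model: every `MODEL_*` constant and both `model_*` helpers are hypothetical names with made-up values, not the kernel's definitions. It only illustrates that a NOFS-style transaction flag removes the permission to re-enter the filesystem while leaving the rest of the mask alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for allocation permissions (values are made up). */
#define MODEL_GFP_IO     0x1u	/* reclaim may start block I/O */
#define MODEL_GFP_FS     0x2u	/* reclaim may call back into the filesystem */
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical stand-ins for transaction flags (values are made up). */
#define MODEL_TRANS_RESERVE 0x1u
#define MODEL_TRANS_NOFS    0x2u

/* Map transaction flags to the mask used for the transaction allocation. */
static unsigned int model_trans_gfp(unsigned int tflags)
{
	unsigned int gfp = MODEL_GFP_KERNEL;

	if (tflags & MODEL_TRANS_NOFS)
		gfp &= ~MODEL_GFP_FS;

	/* NOFS only removes filesystem re-entry; I/O is still permitted. */
	assert(gfp & MODEL_GFP_IO);
	return gfp;
}

/* True when the allocation could end up waiting on the filesystem itself. */
static bool model_may_wait_on_fs(unsigned int tflags)
{
	return (model_trans_gfp(tflags) & MODEL_GFP_FS) != 0;
}
```

In this model, `model_may_wait_on_fs(0)` is true and `model_may_wait_on_fs(MODEL_TRANS_RESERVE | MODEL_TRANS_NOFS)` is false, which mirrors the intent of the three call-site changes in the patch.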

Comments

Brian Foster March 5, 2018, 2:40 p.m. UTC | #1
On Mon, Mar 05, 2018 at 03:11:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
> while holding pages locked for writeback in the ->writepages path.
> The memory allocation is allowed to wait on pages under writeback,
> and so can wait on pages that are held locked in writeback by the
> caller.
> 

It looks like xfs_start_page_writeback() sets the page as writeback and
unlocks it. I suspect this doesn't affect the actual problem if
allocation waits on writeback, but rather something like "tagged as
writeback" might be more clear than "held locked in writeback."
Otherwise this seems fine to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> This affects both pre-IO submission and post-IO submission paths.
> Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
> xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
> xfs_iomap_write_unwritten() already does the right thing, but the
> others don't. Fix them.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c    | 3 ++-
>  fs/xfs/xfs_reflink.c | 4 ++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 9c6a830da0ee..a0afb6411417 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -209,7 +209,8 @@ xfs_setfilesize_trans_alloc(
>  	struct xfs_trans	*tp;
>  	int			error;
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0,
> +				XFS_TRANS_NOFS, &tp);

Was there another reason we preallocate this transaction where we don't
seem to for others?

Brian

>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 270246943a06..8c16177b33d4 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -668,7 +668,7 @@ xfs_reflink_cancel_cow_range(
>  
>  	/* Start a rolling transaction to remove the mappings */
>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> -			0, 0, 0, &tp);
> +			0, 0, XFS_TRANS_NOFS, &tp);
>  	if (error)
>  		goto out;
>  
> @@ -741,7 +741,7 @@ xfs_reflink_end_cow(
>  			(unsigned int)(end_fsb - offset_fsb),
>  			XFS_DATA_FORK);
>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> -			resblks, 0, XFS_TRANS_RESERVE, &tp);
> +			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
>  	if (error)
>  		goto out;
>  
> -- 
> 2.16.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luis Chamberlain March 5, 2018, 5:44 p.m. UTC | #2
On Mon, Mar 05, 2018 at 03:11:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
> while holding pages locked for writeback in the ->writepages path.
> The memory allocation is allowed to wait on pages under writeback,
> and so can wait on pages that are held locked in writeback by the
> caller.
> 
> This affects both pre-IO submission and post-IO submission paths.
> Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
> xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
> xfs_iomap_write_unwritten() already does the right thing, but the
> others don't. Fix them.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

I believe these are two separate regressions, though, introduced in separate
kernels. Can we treat them as such and use the respective Fixes tags for them?

> ---
>  fs/xfs/xfs_aops.c    | 3 ++-
>  fs/xfs/xfs_reflink.c | 4 ++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 9c6a830da0ee..a0afb6411417 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -209,7 +209,8 @@ xfs_setfilesize_trans_alloc(
>  	struct xfs_trans	*tp;
>  	int			error;
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0,
> +				XFS_TRANS_NOFS, &tp);
>  	if (error)
>  		return error;
>  

Fixes: 253f4911f297b ("xfs: better xfs_trans_alloc interface")

Introduced on v4.7

> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 270246943a06..8c16177b33d4 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -668,7 +668,7 @@ xfs_reflink_cancel_cow_range(
>  
>  	/* Start a rolling transaction to remove the mappings */
>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> -			0, 0, 0, &tp);
> +			0, 0, XFS_TRANS_NOFS, &tp);
>  	if (error)
>  		goto out;
>  
> @@ -741,7 +741,7 @@ xfs_reflink_end_cow(
>  			(unsigned int)(end_fsb - offset_fsb),
>  			XFS_DATA_FORK);
>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> -			resblks, 0, XFS_TRANS_RESERVE, &tp);
> +			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
>  	if (error)
>  		goto out;

For both of the above:

Fixes: 43caeb187deb9 ("xfs: move mappings from cow fork to data fork after copy-write")

Introduced on v4.9.

  Luis
Dave Chinner March 5, 2018, 9:32 p.m. UTC | #3
On Mon, Mar 05, 2018 at 05:44:55PM +0000, Luis R. Rodriguez wrote:
> On Mon, Mar 05, 2018 at 03:11:20PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
> > while holding pages locked for writeback in the ->writepages path.
> > The memory allocation is allowed to wait on pages under writeback,
> > and so can wait on pages that are held locked in writeback by the
> > caller.
> > 
> > This affects both pre-IO submission and post-IO submission paths.
> > Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
> > xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
> > xfs_iomap_write_unwritten() already does the right thing, but the
> > others don't. Fix them.
> > 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> 
> I believe these are two separate regressions though, introduced on separate
> kernels Can we treat them as such and use respective Fixes tag for them?

Neither are regressions - they are effectively zero-day bugs. In
general, I don't use Fixes tags for things that are not regressions
and are easily discoverable from the published git history...

> 
> > ---
> >  fs/xfs/xfs_aops.c    | 3 ++-
> >  fs/xfs/xfs_reflink.c | 4 ++--
> >  2 files changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 9c6a830da0ee..a0afb6411417 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -209,7 +209,8 @@ xfs_setfilesize_trans_alloc(
> >  	struct xfs_trans	*tp;
> >  	int			error;
> >  
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0,
> > +				XFS_TRANS_NOFS, &tp);
> >  	if (error)
> >  		return error;
> >  
> 
> Fixes: 253f4911f297b ("xfs: better xfs_trans_alloc interface")

No, that's wrong - that commit didn't change any behaviour. The
original commit:

281627df3eb5 ("xfs: log file size updates at I/O completion time")

called:

	tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);


which resulted in a GFP_KERNEL allocation via:

	tp = _xfs_trans_alloc(mp, type, KM_SLEEP);

So this is a zero-day bug in logging file size updates at IO
completion.
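
The argument that the interface rework left behaviour unchanged can be sketched as a small userspace model. The `model_*` functions and `MODEL_*` values below are hypothetical and only mirror the shape of the calls quoted above; in particular, the way the newer interface picks its allocation mode from a caller flag is an assumption for illustration, not a copy of the kernel code.

```c
#include <assert.h>

/* Hypothetical allocation modes, standing in for KM_SLEEP and KM_NOFS. */
enum model_km { MODEL_KM_SLEEP, MODEL_KM_NOFS };

/* Hypothetical flag value; the real constant may differ. */
#define MODEL_TRANS_NOFS 0x1u

/* Old-style interface: the wrapper always asks for a sleeping allocation. */
static enum model_km model_old_trans_alloc(int type)
{
	(void)type;	/* the transaction type never influenced the mode */
	return MODEL_KM_SLEEP;
}

/* New-style interface: a caller-supplied flag selects the mode. */
static enum model_km model_new_trans_alloc(unsigned int flags)
{
	return (flags & MODEL_TRANS_NOFS) ? MODEL_KM_NOFS : MODEL_KM_SLEEP;
}

/* A caller that passes no flags sees the same mode before and after. */
static void model_check_rework(void)
{
	assert(model_old_trans_alloc(0) == model_new_trans_alloc(0));
}
```

Under these assumptions, a caller passing `0` for the flags gets a sleeping allocation from both interfaces, so the unsafe allocation predates the rework.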

> Introduced on v4.7
> 
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 270246943a06..8c16177b33d4 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -668,7 +668,7 @@ xfs_reflink_cancel_cow_range(
> >  
> >  	/* Start a rolling transaction to remove the mappings */
> >  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> > -			0, 0, 0, &tp);
> > +			0, 0, XFS_TRANS_NOFS, &tp);
> >  	if (error)
> >  		goto out;
> >  
> > @@ -741,7 +741,7 @@ xfs_reflink_end_cow(
> >  			(unsigned int)(end_fsb - offset_fsb),
> >  			XFS_DATA_FORK);
> >  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
> > -			resblks, 0, XFS_TRANS_RESERVE, &tp);
> > +			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
> >  	if (error)
> >  		goto out;
> 
> For both of the above:
> 
> Fixes: 43caeb187deb9 ("xfs: move mappings from cow fork to data fork after copy-write)"

And that's a zero-day, too. So neither are regressions.

-Dave.
Dave Chinner March 5, 2018, 9:38 p.m. UTC | #4
On Mon, Mar 05, 2018 at 09:40:23AM -0500, Brian Foster wrote:
> On Mon, Mar 05, 2018 at 03:11:20PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
> > while holding pages locked for writeback in the ->writepages path.
> > The memory allocation is allowed to wait on pages under writeback,
> > and so can wait on pages that are held locked in writeback by the
> > caller.
> > 
> 
> It looks like xfs_start_page_writeback() sets the page as writeback and
> unlocks it. I suspect this doesn't affect the actual problem if
> allocation waits on writeback, but rather something like "tagged as
> writeback" might be more clear than "held locked in writeback."

Yeah, that's what I meant.

> Otherwise this seems fine to me:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > This affects both pre-IO submission and post-IO submission paths.
> > Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
> > xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
> > xfs_iomap_write_unwritten() already does the right thing, but the
> > others don't. Fix them.
> > 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c    | 3 ++-
> >  fs/xfs/xfs_reflink.c | 4 ++--
> >  2 files changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 9c6a830da0ee..a0afb6411417 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -209,7 +209,8 @@ xfs_setfilesize_trans_alloc(
> >  	struct xfs_trans	*tp;
> >  	int			error;
> >  
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0,
> > +				XFS_TRANS_NOFS, &tp);
> 
> Was there another reason we preallocate this transaction where we don't
> seem to for others?

Because it happens all the time for buffered IO and this prevents
stalling the completion workqueue for a simple size update.
Unwritten conversion and COW completion are less frequently done,
are much larger pieces of work, can require multiple transactions
and have many more potential blocking points, and so such an
optimisation is much less beneficial in those cases.

Cheers,

Dave.
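
The preallocation trade-off described in this reply can be illustrated with a small userspace model: do the allocation that may block at submission time, so that the completion handler only consumes what was set aside. All names here (`model_trans`, `model_ioend`, `model_submit`, `model_complete`) are hypothetical, and `malloc()` stands in for the transaction allocation; this is a sketch of the pattern, not the XFS implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a transaction handle. */
struct model_trans {
	int unused;
};

/* Hypothetical stand-in for an I/O completion context. */
struct model_ioend {
	struct model_trans *append_trans;	/* set aside at submission */
	long new_size;				/* size to record on completion */
};

/* Submission side: the allocation that may block happens here. */
static int model_submit(struct model_ioend *io, long new_size)
{
	io->append_trans = malloc(sizeof(*io->append_trans));
	if (!io->append_trans)
		return -ENOMEM;
	io->new_size = new_size;
	return 0;
}

/* Completion side: uses the preallocated handle and allocates nothing. */
static void model_complete(struct model_ioend *io, long *isize)
{
	assert(io->append_trans != NULL);

	if (io->new_size > *isize)
		*isize = io->new_size;

	free(io->append_trans);
	io->append_trans = NULL;
}
```

The completion function here has no allocation that could stall, which is the property the reply attributes to the size-update path. The larger completion paths would not fit this shape as neatly, since they may need several transactions.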
Eric Sandeen March 6, 2018, 4:53 p.m. UTC | #5
On 3/5/18 11:44 AM, Luis R. Rodriguez wrote:
> On Mon, Mar 05, 2018 at 03:11:20PM +1100, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
>> while holding pages locked for writeback in the ->writepages path.
>> The memory allocation is allowed to wait on pages under writeback,
>> and so can wait on pages that are held locked in writeback by the
>> caller.
>>
>> This affects both pre-IO submission and post-IO submission paths.
>> Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
>> xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
>> xfs_iomap_write_unwritten() already does the right thing, but the
>> others don't. Fix them.
>>
>> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> 
> I believe these are two separate regressions though, introduced on separate
> kernels Can we treat them as such and use respective Fixes tag for them?
> 
>> ---
>>  fs/xfs/xfs_aops.c    | 3 ++-
>>  fs/xfs/xfs_reflink.c | 4 ++--
>>  2 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
>> index 9c6a830da0ee..a0afb6411417 100644
>> --- a/fs/xfs/xfs_aops.c
>> +++ b/fs/xfs/xfs_aops.c
>> @@ -209,7 +209,8 @@ xfs_setfilesize_trans_alloc(
>>  	struct xfs_trans	*tp;
>>  	int			error;
>>  
>> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
>> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0,
>> +				XFS_TRANS_NOFS, &tp);
>>  	if (error)
>>  		return error;
>>  
> 
> Fixes: 253f4911f297b ("xfs: better xfs_trans_alloc interface")
> 
> Introduced on v4.7

I don't think so - prior to that commit, this allocation could still
recurse into the filesystem, the only allocation flag used was KM_SLEEP,
AFAICT.
 
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 270246943a06..8c16177b33d4 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -668,7 +668,7 @@ xfs_reflink_cancel_cow_range(
>>  
>>  	/* Start a rolling transaction to remove the mappings */
>>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
>> -			0, 0, 0, &tp);
>> +			0, 0, XFS_TRANS_NOFS, &tp);
>>  	if (error)
>>  		goto out;
>>  
>> @@ -741,7 +741,7 @@ xfs_reflink_end_cow(
>>  			(unsigned int)(end_fsb - offset_fsb),
>>  			XFS_DATA_FORK);
>>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
>> -			resblks, 0, XFS_TRANS_RESERVE, &tp);
>> +			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
>>  	if (error)
>>  		goto out;
> 
> For both of the above:
> 
> Fixes: 43caeb187deb9 ("xfs: move mappings from cow fork to data fork after copy-write)"
> 
> Introduced on v4.9.

Perhaps, as that commit did add the new allocations which could recurse.

-Eric
Eric Sandeen March 7, 2018, 3:52 p.m. UTC | #6
On 3/4/18 10:11 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
> while holding pages locked for writeback in the ->writepages path.
> The memory allocation is allowed to wait on pages under writeback,
> and so can wait on pages that are held locked in writeback by the
> caller.
> 
> This affects both pre-IO submission and post-IO submission paths.
> Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
> xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
> xfs_iomap_write_unwritten() already does the right thing, but the
> others don't. Fix them.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

Christoph Hellwig March 8, 2018, 8:08 a.m. UTC | #7
Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

Patch

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..a0afb6411417 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -209,7 +209,8 @@  xfs_setfilesize_trans_alloc(
 	struct xfs_trans	*tp;
 	int			error;
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0,
+				XFS_TRANS_NOFS, &tp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 270246943a06..8c16177b33d4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -668,7 +668,7 @@  xfs_reflink_cancel_cow_range(
 
 	/* Start a rolling transaction to remove the mappings */
 	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
-			0, 0, 0, &tp);
+			0, 0, XFS_TRANS_NOFS, &tp);
 	if (error)
 		goto out;
 
@@ -741,7 +741,7 @@  xfs_reflink_end_cow(
 			(unsigned int)(end_fsb - offset_fsb),
 			XFS_DATA_FORK);
 	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
-			resblks, 0, XFS_TRANS_RESERVE, &tp);
+			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
 	if (error)
 		goto out;