diff mbox series

[2/3] xfs: rework insert range into an atomic operation

Message ID 20191213171258.36934-3-bfoster@redhat.com (mailing list archive)
State Accepted
Headers show
Series xfs: hold ilock across insert and collapse range | expand

Commit Message

Brian Foster Dec. 13, 2019, 5:12 p.m. UTC
The insert range operation uses a unique transaction and ilock cycle
for the extent split and each extent shift iteration of the overall
operation. While this works, it risks racing with other
operations in subtle ways such as COW writeback modifying an extent
tree in the middle of a shift operation.

To avoid this problem, make insert range atomic with respect to
ilock. Hold the ilock across the entire operation, replace the
individual transactions with a single rolling transaction sequence
and relog the inode to keep it moving in the log. This guarantees
that nothing else can change the extent mapping of an inode while
an insert range operation is in progress.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
 1 file changed, 13 insertions(+), 19 deletions(-)

Comments

Allison Henderson Dec. 17, 2019, 5:04 p.m. UTC | #1
On 12/13/19 10:12 AM, Brian Foster wrote:
> The insert range operation uses a unique transaction and ilock cycle
> for the extent split and each extent shift iteration of the overall
> operation. While this works, it is risks racing with other
> operations in subtle ways such as COW writeback modifying an extent
> tree in the middle of a shift operation.
> 
> To avoid this problem, make insert range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and relog the inode to keep it moving in the log. This guarantees
> that nothing else can change the extent mapping of an inode while
> an insert range operation is in progress.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Ok, this looks good to me.  Thanks!

Reviewed-by: Allison Collins <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
>   1 file changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 829ab1a804c9..555c8b49a223 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
>   	if (error)
>   		return error;
>   
> -	/*
> -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> -	 * is not the starting block of extent, we need to split the extent at
> -	 * stop_fsb.
> -	 */
>   	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
>   			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
>   	if (error)
>   		return error;
>   
>   	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
>   
> +	/*
> +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> +	 * is not the starting block of extent, we need to split the extent at
> +	 * stop_fsb.
> +	 */
>   	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
>   	if (error)
>   		goto out_trans_cancel;
>   
> -	error = xfs_trans_commit(tp);
> -	if (error)
> -		return error;
> -
> -	while (!error && !done) {
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> -					&tp);
> +	do {
> +		error = xfs_trans_roll_inode(&tp, ip);
>   		if (error)
> -			break;
> +			goto out_trans_cancel;
>   
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>   		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
>   				&done, stop_fsb);
>   		if (error)
>   			goto out_trans_cancel;
> +	} while (!done);
>   
> -		error = xfs_trans_commit(tp);
> -	}
> -
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   	return error;
>   
>   out_trans_cancel:
>   	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   	return error;
>   }
>   
>
Darrick J. Wong Dec. 18, 2019, 2:37 a.m. UTC | #2
On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> The insert range operation uses a unique transaction and ilock cycle
> for the extent split and each extent shift iteration of the overall
> operation. While this works, it is risks racing with other
> operations in subtle ways such as COW writeback modifying an extent
> tree in the middle of a shift operation.
> 
> To avoid this problem, make insert range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and relog the inode to keep it moving in the log. This guarantees
> that nothing else can change the extent mapping of an inode while
> an insert range operation is in progress.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
>  1 file changed, 13 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 829ab1a804c9..555c8b49a223 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> -	 * is not the starting block of extent, we need to split the extent at
> -	 * stop_fsb.
> -	 */
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
>  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
>  	if (error)
>  		return error;
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
>  
> +	/*
> +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> +	 * is not the starting block of extent, we need to split the extent at
> +	 * stop_fsb.
> +	 */
>  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
>  	if (error)
>  		goto out_trans_cancel;
>  
> -	error = xfs_trans_commit(tp);
> -	if (error)
> -		return error;
> -
> -	while (!error && !done) {
> -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> -					&tp);

I'm a little concerned about the livelock potential here, if there are a lot of
other threads that have eaten all the transaction reservation and are trying to
get our ILOCK, while at the same time this thread has the ILOCK and is trying
to roll the transaction to move another extent, having already rolled the
transaction more than logcount times.

I think the extent shifting loop starts with the highest offset mapping and
shifts it up and continues in order of decreasing offset until it gets to
 @stop_fsb, correct?

Can we use "alloc trans; ilock; move; commit" for every extent higher than the
one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
tr_write pre-reserves enough space so that the roll won't need to get more,
which would eliminate that potential problem, I think.

--D

> +	do {
> +		error = xfs_trans_roll_inode(&tp, ip);
>  		if (error)
> -			break;
> +			goto out_trans_cancel;
>  
> -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
>  				&done, stop_fsb);
>  		if (error)
>  			goto out_trans_cancel;
> +	} while (!done);
>  
> -		error = xfs_trans_commit(tp);
> -	}
> -
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  
>  out_trans_cancel:
>  	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
>  
> -- 
> 2.20.1
>
Brian Foster Dec. 18, 2019, 12:10 p.m. UTC | #3
On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > The insert range operation uses a unique transaction and ilock cycle
> > for the extent split and each extent shift iteration of the overall
> > operation. While this works, it is risks racing with other
> > operations in subtle ways such as COW writeback modifying an extent
> > tree in the middle of a shift operation.
> > 
> > To avoid this problem, make insert range atomic with respect to
> > ilock. Hold the ilock across the entire operation, replace the
> > individual transactions with a single rolling transaction sequence
> > and relog the inode to keep it moving in the log. This guarantees
> > that nothing else can change the extent mapping of an inode while
> > an insert range operation is in progress.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> >  1 file changed, 13 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 829ab1a804c9..555c8b49a223 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> >  	if (error)
> >  		return error;
> >  
> > -	/*
> > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > -	 * is not the starting block of extent, we need to split the extent at
> > -	 * stop_fsb.
> > -	 */
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> >  	if (error)
> >  		return error;
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > +	xfs_trans_ijoin(tp, ip, 0);
> >  
> > +	/*
> > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > +	 * is not the starting block of extent, we need to split the extent at
> > +	 * stop_fsb.
> > +	 */
> >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> >  	if (error)
> >  		goto out_trans_cancel;
> >  
> > -	error = xfs_trans_commit(tp);
> > -	if (error)
> > -		return error;
> > -
> > -	while (!error && !done) {
> > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > -					&tp);
> 
> I'm a little concerned about the livelock potential here, if there are a lot of
> other threads that have eaten all the transaction reservation and are trying to
> get our ILOCK, while at the same time this thread has the ILOCK and is trying
> to roll the transaction to move another extent, having already rolled the
> transaction more than logcount times.
> 

My understanding is that the regrant mechanism is intended to deal with
that scenario. Even after the initial (res * logcount) reservation is
consumed, a regrant is different from an initial reservation in that the
reservation head is unconditionally updated with a new reservation unit.
We do wait on the write head in regrant, but IIUC that should be limited
to the pool of already allocated transactions (and is woken before the
reserve head waiters IIRC). I suppose something like this might be
possible in theory if we were blocked on regrant and the entirety of
remaining log space was consumed by transactions waiting on our ilock,
but I think that is highly unlikely since we also hold the iolock here.

Also note that this approach is based on the current truncate algorithm,
which is probably a better barometer of potential for this kind of issue
as it is a less specialized operation. I'd argue that if this is safe
enough for truncate, it should be safe enough for range shifting.

> I think the extent shifting loop starts with the highest offset mapping and
> shifts it up and continues in order of decreasing offset until it gets to
>  @stop_fsb, correct?
> 

Yep, for insert range at least.

> Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> tr_write pre-reserves enough space to that the roll won't need to get more,
> which would eliminate that potential problem, I think.
> 

We'd have to reorder the extent split for that kind of approach, which I
think you've noted in the sequence above, as the race window is between
the split and subsequent shift. Otherwise I think that would work.

That said, I'd prefer not to introduce the extra complexity and
functional variance unless it were absolutely necessary, and it's not
clear to me that it is. If it is, we'd probably have seen similar issues
in truncate and should target a fix there before worrying about range
shift.

Brian

> --D
> 
> > +	do {
> > +		error = xfs_trans_roll_inode(&tp, ip);
> >  		if (error)
> > -			break;
> > +			goto out_trans_cancel;
> >  
> > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> >  				&done, stop_fsb);
> >  		if (error)
> >  			goto out_trans_cancel;
> > +	} while (!done);
> >  
> > -		error = xfs_trans_commit(tp);
> > -	}
> > -
> > +	error = xfs_trans_commit(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	return error;
> >  
> >  out_trans_cancel:
> >  	xfs_trans_cancel(tp);
> > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	return error;
> >  }
> >  
> > -- 
> > 2.20.1
> > 
>
Darrick J. Wong Dec. 18, 2019, 9:15 p.m. UTC | #4
On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > The insert range operation uses a unique transaction and ilock cycle
> > > for the extent split and each extent shift iteration of the overall
> > > operation. While this works, it is risks racing with other
> > > operations in subtle ways such as COW writeback modifying an extent
> > > tree in the middle of a shift operation.
> > > 
> > > To avoid this problem, make insert range atomic with respect to
> > > ilock. Hold the ilock across the entire operation, replace the
> > > individual transactions with a single rolling transaction sequence
> > > and relog the inode to keep it moving in the log. This guarantees
> > > that nothing else can change the extent mapping of an inode while
> > > an insert range operation is in progress.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > index 829ab1a804c9..555c8b49a223 100644
> > > --- a/fs/xfs/xfs_bmap_util.c
> > > +++ b/fs/xfs/xfs_bmap_util.c
> > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > >  	if (error)
> > >  		return error;
> > >  
> > > -	/*
> > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > -	 * is not the starting block of extent, we need to split the extent at
> > > -	 * stop_fsb.
> > > -	 */
> > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > >  	if (error)
> > >  		return error;
> > >  
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > >  
> > > +	/*
> > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > +	 * is not the starting block of extent, we need to split the extent at
> > > +	 * stop_fsb.
> > > +	 */
> > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > >  	if (error)
> > >  		goto out_trans_cancel;
> > >  
> > > -	error = xfs_trans_commit(tp);
> > > -	if (error)
> > > -		return error;
> > > -
> > > -	while (!error && !done) {
> > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > -					&tp);
> > 
> > I'm a little concerned about the livelock potential here, if there are a lot of
> > other threads that have eaten all the transaction reservation and are trying to
> > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > to roll the transaction to move another extent, having already rolled the
> > transaction more than logcount times.
> > 
> 
> My understanding is that the regrant mechanism is intended to deal with
> that scenario. Even after the initial (res * logcount) reservation is
> consumed, a regrant is different from an initial reservation in that the
> reservation head is unconditionally updated with a new reservation unit.
> We do wait on the write head in regrant, but IIUC that should be limited
> to the pool of already allocated transactions (and is woken before the
> reserve head waiters IIRC). I suppose something like this might be
> possible in theory if we were blocked on regrant and the entirety of
> remaining log space was consumed by transactions waiting on our ilock,
> but I think that is highly unlikely since we also hold the iolock here.

True.  The only time I saw this happen was with buffered COW writeback
completions (which hold no lock other than ILOCK), which should have been
fixed by the patch I made to put all the writeback items to a single
inode queue and run /one/ worker thread to process them all.  So maybe
my fears are unfounded nowadays. :)

The only other place I can think of that does a lot of transaction
rolling on a single inode is online repair, and it always holds all
three exclusive locks.

> Also note that this approach is based on the current truncate algorithm,
> which is probably a better barometer of potential for this kind of issue
> as it is a less specialized operation. I'd argue that if this is safe
> enough for truncate, it should be safe enough for range shifting.

Hehehe.

> > I think the extent shifting loop starts with the highest offset mapping and
> > shifts it up and continues in order of decreasing offset until it gets to
> >  @stop_fsb, correct?
> > 
> 
> Yep, for insert range at least.
> 
> > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > tr_write pre-reserves enough space to that the roll won't need to get more,
> > which would eliminate that potential problem, I think.
> > 
> 
> We'd have to reorder the extent split for that kind of approach, which I
> think you've noted in the sequence above, as the race window is between
> the split and subsequent shift. Otherwise I think that would work.
> 
> That said, I'd prefer not to introduce the extra complexity and
> functional variance unless it were absolutely necessary, and it's not
> clear to me that it is. If it is, we'd probably have seen similar issues
> in truncate and should target a fix there before worrying about range
> shift.

Ok.  Looking back through lore I don't see any complaints about insert
range, so I guess it's fine.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> Brian
> 
> > --D
> > 
> > > +	do {
> > > +		error = xfs_trans_roll_inode(&tp, ip);
> > >  		if (error)
> > > -			break;
> > > +			goto out_trans_cancel;
> > >  
> > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > >  				&done, stop_fsb);
> > >  		if (error)
> > >  			goto out_trans_cancel;
> > > +	} while (!done);
> > >  
> > > -		error = xfs_trans_commit(tp);
> > > -	}
> > > -
> > > +	error = xfs_trans_commit(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  	return error;
> > >  
> > >  out_trans_cancel:
> > >  	xfs_trans_cancel(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > >  	return error;
> > >  }
> > >  
> > > -- 
> > > 2.20.1
> > > 
> > 
>
Brian Foster Dec. 19, 2019, 11:55 a.m. UTC | #5
On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote:
> On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > > The insert range operation uses a unique transaction and ilock cycle
> > > > for the extent split and each extent shift iteration of the overall
> > > > operation. While this works, it is risks racing with other
> > > > operations in subtle ways such as COW writeback modifying an extent
> > > > tree in the middle of a shift operation.
> > > > 
> > > > To avoid this problem, make insert range atomic with respect to
> > > > ilock. Hold the ilock across the entire operation, replace the
> > > > individual transactions with a single rolling transaction sequence
> > > > and relog the inode to keep it moving in the log. This guarantees
> > > > that nothing else can change the extent mapping of an inode while
> > > > an insert range operation is in progress.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > index 829ab1a804c9..555c8b49a223 100644
> > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > > -	/*
> > > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > -	 * is not the starting block of extent, we need to split the extent at
> > > > -	 * stop_fsb.
> > > > -	 */
> > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > >  
> > > > +	/*
> > > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > +	 * is not the starting block of extent, we need to split the extent at
> > > > +	 * stop_fsb.
> > > > +	 */
> > > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > > >  	if (error)
> > > >  		goto out_trans_cancel;
> > > >  
> > > > -	error = xfs_trans_commit(tp);
> > > > -	if (error)
> > > > -		return error;
> > > > -
> > > > -	while (!error && !done) {
> > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > > -					&tp);
> > > 
> > > I'm a little concerned about the livelock potential here, if there are a lot of
> > > other threads that have eaten all the transaction reservation and are trying to
> > > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > > to roll the transaction to move another extent, having already rolled the
> > > transaction more than logcount times.
> > > 
> > 
> > My understanding is that the regrant mechanism is intended to deal with
> > that scenario. Even after the initial (res * logcount) reservation is
> > consumed, a regrant is different from an initial reservation in that the
> > reservation head is unconditionally updated with a new reservation unit.
> > We do wait on the write head in regrant, but IIUC that should be limited
> > to the pool of already allocated transactions (and is woken before the
> > reserve head waiters IIRC). I suppose something like this might be
> > possible in theory if we were blocked on regrant and the entirety of
> > remaining log space was consumed by transactions waiting on our ilock,
> > but I think that is highly unlikely since we also hold the iolock here.
> 
> True.  The only time I saw this happen was with buffered COW writeback
> completions (which hold lock other than ILOCK), which should have been
> fixed by the patch I made to put all the writeback items to a single
> inode queue and run /one/ worker thread to process them all.  So maybe
> my fears are unfounded nowadays. :)
> 

Ok. If I'm following correctly, that goes back to this[1] commit. The
commit log describes a situation where we basically have unbound,
concurrent attempts to do COW remappings on writeback completion.
That leads to heavy ilock contention and the potential for deadlocks
between log reservation and inode locks. Out of curiosity, was that old
scheme bad enough to reproduce that kind of deadlock or was the deadlock
description based on code analysis?

I am a bit curious how "deadlock avoidant" the current transaction
rolling/regrant mechanism is intended to be on its own vs. how much
responsibility is borne by the implementation of certain operations
(like truncate holding IOLOCK and assuming that sufficiently restricts
object access). ISTM that it's technically possible for this lockup
state to occur, but it relies on something unusual like the above where
we allow enough unbound (open reservation -> lock) concurrency on a
single metadata object to exhaust all log space.

[1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues")

> The only other place I can think of that does a lot of transaction
> rolling on a single inode is online repair, and it always holds all
> three exclusive locks.
> 

So I guess that should similarly hold off transaction reservation from
userspace intended to modify the particular inode, provided the XFS
internals play nice (re: your workqueue fix above).

> > Also note that this approach is based on the current truncate algorithm,
> > which is probably a better barometer of potential for this kind of issue
> > as it is a less specialized operation. I'd argue that if this is safe
> > enough for truncate, it should be safe enough for range shifting.
> 
> Hehehe.
> 
> > > I think the extent shifting loop starts with the highest offset mapping and
> > > shifts it up and continues in order of decreasing offset until it gets to
> > >  @stop_fsb, correct?
> > > 
> > 
> > Yep, for insert range at least.
> > 
> > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > > tr_write pre-reserves enough space to that the roll won't need to get more,
> > > which would eliminate that potential problem, I think.
> > > 
> > 
> > We'd have to reorder the extent split for that kind of approach, which I
> > think you've noted in the sequence above, as the race window is between
> > the split and subsequent shift. Otherwise I think that would work.
> > 
> > That said, I'd prefer not to introduce the extra complexity and
> > functional variance unless it were absolutely necessary, and it's not
> > clear to me that it is. If it is, we'd probably have seen similar issues
> > in truncate and should target a fix there before worrying about range
> > shift.
> 
> Ok.  Looking back through lore I don't see any complaints about insert
> range, so I guess it's fine.
> 

Ok, thanks.

Brian

> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > +	do {
> > > > +		error = xfs_trans_roll_inode(&tp, ip);
> > > >  		if (error)
> > > > -			break;
> > > > +			goto out_trans_cancel;
> > > >  
> > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > > >  				&done, stop_fsb);
> > > >  		if (error)
> > > >  			goto out_trans_cancel;
> > > > +	} while (!done);
> > > >  
> > > > -		error = xfs_trans_commit(tp);
> > > > -	}
> > > > -
> > > > +	error = xfs_trans_commit(tp);
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  	return error;
> > > >  
> > > >  out_trans_cancel:
> > > >  	xfs_trans_cancel(tp);
> > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > >  	return error;
> > > >  }
> > > >  
> > > > -- 
> > > > 2.20.1
> > > > 
> > > 
> > 
>
Darrick J. Wong Dec. 20, 2019, 8:17 p.m. UTC | #6
On Thu, Dec 19, 2019 at 06:55:50AM -0500, Brian Foster wrote:
> On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > > > The insert range operation uses a unique transaction and ilock cycle
> > > > > for the extent split and each extent shift iteration of the overall
> > > > > operation. While this works, it is risks racing with other
> > > > > operations in subtle ways such as COW writeback modifying an extent
> > > > > tree in the middle of a shift operation.
> > > > > 
> > > > > To avoid this problem, make insert range atomic with respect to
> > > > > ilock. Hold the ilock across the entire operation, replace the
> > > > > individual transactions with a single rolling transaction sequence
> > > > > and relog the inode to keep it moving in the log. This guarantees
> > > > > that nothing else can change the extent mapping of an inode while
> > > > > an insert range operation is in progress.
> > > > > 
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > > > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > > index 829ab1a804c9..555c8b49a223 100644
> > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > > > >  	if (error)
> > > > >  		return error;
> > > > >  
> > > > > -	/*
> > > > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > -	 * is not the starting block of extent, we need to split the extent at
> > > > > -	 * stop_fsb.
> > > > > -	 */
> > > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > > >  	if (error)
> > > > >  		return error;
> > > > >  
> > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > >  
> > > > > +	/*
> > > > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > +	 * is not the starting block of extent, we need to split the extent at
> > > > > +	 * stop_fsb.
> > > > > +	 */
> > > > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > > > >  	if (error)
> > > > >  		goto out_trans_cancel;
> > > > >  
> > > > > -	error = xfs_trans_commit(tp);
> > > > > -	if (error)
> > > > > -		return error;
> > > > > -
> > > > > -	while (!error && !done) {
> > > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > > > -					&tp);
> > > > 
> > > > I'm a little concerned about the livelock potential here, if there are a lot of
> > > > other threads that have eaten all the transaction reservation and are trying to
> > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > > > to roll the transaction to move another extent, having already rolled the
> > > > transaction more than logcount times.
> > > > 
> > > 
> > > My understanding is that the regrant mechanism is intended to deal with
> > > that scenario. Even after the initial (res * logcount) reservation is
> > > consumed, a regrant is different from an initial reservation in that the
> > > reservation head is unconditionally updated with a new reservation unit.
> > > We do wait on the write head in regrant, but IIUC that should be limited
> > > to the pool of already allocated transactions (and is woken before the
> > > reserve head waiters IIRC). I suppose something like this might be
> > > possible in theory if we were blocked on regrant and the entirety of
> > > remaining log space was consumed by transactions waiting on our ilock,
> > > but I think that is highly unlikely since we also hold the iolock here.
> > 
> > True.  The only time I saw this happen was with buffered COW writeback
> > completions (which hold lock other than ILOCK), which should have been
> > fixed by the patch I made to put all the writeback items to a single
> > inode queue and run /one/ worker thread to process them all.  So maybe
> > my fears are unfounded nowadays. :)
> > 
> 
> Ok. If I'm following correctly, that goes back to this[1] commit. The
> commit log describes a situation where we basically have unbound,
> concurrent attempts to do COW remappings on writeback completion.
> That leads to heavy ilock contention and the potential for deadlocks
> between log reservation and inode locks. Out of curiosity, was that old
> scheme bad enough to reproduce that kind of deadlock or was the deadlock
> description based on code analysis?

It was bad enough to cause an internal complaint about log livelock. :(

I think directio completions might suffer from the same class of problem
though, since we allow concurrent dio writes and dio doesn't do any of
the ioend batching that we do with buffered write ioends.

I think fixing directio is going to be a tougher nut to crack because it
means that in the non-overwrite case we can't really do the ioend
completion processing from the dio submitter thread.

It might also be nice to find a way to unify the ioend paths since they
both do "convert unwritten and do cow remapping" on the entire range,
and diverge only once that's done.

> I am a bit curious how "deadlock avoidant" the current transaction
> rolling/regrant mechanism is intended to be on its own vs. how much
> responsibility is borne by the implementation of certain operations
> (like truncate holding IOLOCK and assuming that sufficiently restricts
> object access). ISTM that it's technically possible for this lockup
> state to occur, but it relies on something unusual like the above where
> we allow enough unbound (open reservation -> lock) concurrency on a
> single metadata object to exhaust all log space.
> 
> [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues")

<nod> So far I haven't seen anyone in practice managing to hit a
thread-happy directio cow remap deadlock, though I think triggering it
should be as simple as turning on alwayscow and coaxing an aio stress
tester into issuing a lot of random writes.

> > The only other place I can think of that does a lot of transaction
> > rolling on a single inode is online repair, and it always holds all
> > three exclusive locks.
> > 
> 
> So I guess that should similarly hold off transaction reservation from
> userspace intended to modify the particular inode, provided the XFS
> internals play nice (re: your workqueue fix above).

Yeah.  ISTR the scrub setup functions flush dirty pages and drain
directios before taking the ILOCK, so it should never be picking any
fights with writeback. :)

> > > Also note that this approach is based on the current truncate algorithm,
> > > which is probably a better barometer of potential for this kind of issue
> > > as it is a less specialized operation. I'd argue that if this is safe
> > > enough for truncate, it should be safe enough for range shifting.
> > 
> > Hehehe.
> > 
> > > > I think the extent shifting loop starts with the highest offset mapping and
> > > > shifts it up and continues in order of decreasing offset until it gets to
> > > >  @stop_fsb, correct?
> > > > 
> > > 
> > > Yep, for insert range at least.
> > > 
> > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > > > > tr_write pre-reserves enough space so that the roll won't need to get more,
> > > > which would eliminate that potential problem, I think.
> > > > 
> > > 
> > > We'd have to reorder the extent split for that kind of approach, which I
> > > think you've noted in the sequence above, as the race window is between
> > > the split and subsequent shift. Otherwise I think that would work.
> > > 
> > > That said, I'd prefer not to introduce the extra complexity and
> > > functional variance unless it were absolutely necessary, and it's not
> > > clear to me that it is. If it is, we'd probably have seen similar issues
> > > in truncate and should target a fix there before worrying about range
> > > shift.
> > 
> > Ok.  Looking back through lore I don't see any complaints about insert
> > range, so I guess it's fine.
> > 
> 
> Ok, thanks.
> 
> Brian
> 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > +	do {
> > > > > +		error = xfs_trans_roll_inode(&tp, ip);
> > > > >  		if (error)
> > > > > -			break;
> > > > > +			goto out_trans_cancel;
> > > > >  
> > > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > > > >  				&done, stop_fsb);
> > > > >  		if (error)
> > > > >  			goto out_trans_cancel;
> > > > > +	} while (!done);
> > > > >  
> > > > > -		error = xfs_trans_commit(tp);
> > > > > -	}
> > > > > -
> > > > > +	error = xfs_trans_commit(tp);
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  	return error;
> > > > >  
> > > > >  out_trans_cancel:
> > > > >  	xfs_trans_cancel(tp);
> > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  	return error;
> > > > >  }
> > > > >  
> > > > > -- 
> > > > > 2.20.1
> > > > > 
> > > > 
> > > 
> > 
>
Brian Foster Dec. 23, 2019, 12:12 p.m. UTC | #7
On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 19, 2019 at 06:55:50AM -0500, Brian Foster wrote:
> > On Wed, Dec 18, 2019 at 01:15:40PM -0800, Darrick J. Wong wrote:
> > > On Wed, Dec 18, 2019 at 07:10:33AM -0500, Brian Foster wrote:
> > > > On Tue, Dec 17, 2019 at 06:37:26PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> > > > > > The insert range operation uses a unique transaction and ilock cycle
> > > > > > for the extent split and each extent shift iteration of the overall
> > > > > > operation. While this works, it risks racing with other
> > > > > > operations in subtle ways such as COW writeback modifying an extent
> > > > > > tree in the middle of a shift operation.
> > > > > > 
> > > > > > To avoid this problem, make insert range atomic with respect to
> > > > > > ilock. Hold the ilock across the entire operation, replace the
> > > > > > individual transactions with a single rolling transaction sequence
> > > > > > and relog the inode to keep it moving in the log. This guarantees
> > > > > > that nothing else can change the extent mapping of an inode while
> > > > > > an insert range operation is in progress.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_bmap_util.c | 32 +++++++++++++-------------------
> > > > > >  1 file changed, 13 insertions(+), 19 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > > > > > index 829ab1a804c9..555c8b49a223 100644
> > > > > > --- a/fs/xfs/xfs_bmap_util.c
> > > > > > +++ b/fs/xfs/xfs_bmap_util.c
> > > > > > @@ -1134,47 +1134,41 @@ xfs_insert_file_space(
> > > > > >  	if (error)
> > > > > >  		return error;
> > > > > >  
> > > > > > -	/*
> > > > > > -	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > > -	 * is not the starting block of extent, we need to split the extent at
> > > > > > -	 * stop_fsb.
> > > > > > -	 */
> > > > > >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
> > > > > >  			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
> > > > > >  	if (error)
> > > > > >  		return error;
> > > > > >  
> > > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > > -	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > > > +	xfs_trans_ijoin(tp, ip, 0);
> > > > > >  
> > > > > > +	/*
> > > > > > +	 * The extent shifting code works on extent granularity. So, if stop_fsb
> > > > > > +	 * is not the starting block of extent, we need to split the extent at
> > > > > > +	 * stop_fsb.
> > > > > > +	 */
> > > > > >  	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
> > > > > >  	if (error)
> > > > > >  		goto out_trans_cancel;
> > > > > >  
> > > > > > -	error = xfs_trans_commit(tp);
> > > > > > -	if (error)
> > > > > > -		return error;
> > > > > > -
> > > > > > -	while (!error && !done) {
> > > > > > -		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
> > > > > > -					&tp);
> > > > > 
> > > > > I'm a little concerned about the livelock potential here, if there are a lot of
> > > > > other threads that have eaten all the transaction reservation and are trying to
> > > > > get our ILOCK, while at the same time this thread has the ILOCK and is trying
> > > > > to roll the transaction to move another extent, having already rolled the
> > > > > transaction more than logcount times.
> > > > > 
> > > > 
> > > > My understanding is that the regrant mechanism is intended to deal with
> > > > that scenario. Even after the initial (res * logcount) reservation is
> > > > consumed, a regrant is different from an initial reservation in that the
> > > > reservation head is unconditionally updated with a new reservation unit.
> > > > We do wait on the write head in regrant, but IIUC that should be limited
> > > > to the pool of already allocated transactions (and is woken before the
> > > > reserve head waiters IIRC). I suppose something like this might be
> > > > possible in theory if we were blocked on regrant and the entirety of
> > > > remaining log space was consumed by transactions waiting on our ilock,
> > > > but I think that is highly unlikely since we also hold the iolock here.
> > > 
> > > True.  The only time I saw this happen was with buffered COW writeback
> > > completions (which hold no lock other than ILOCK), which should have been
> > > fixed by the patch I made to put all the writeback items to a single
> > > inode queue and run /one/ worker thread to process them all.  So maybe
> > > my fears are unfounded nowadays. :)
> > > 
> > 
> > Ok. If I'm following correctly, that goes back to this[1] commit. The
> > commit log describes a situation where we basically have unbound,
> > concurrent attempts to do COW remappings on writeback completion.
> > That leads to heavy ilock contention and the potential for deadlocks
> > between log reservation and inode locks. Out of curiosity, was that old
> > scheme bad enough to reproduce that kind of deadlock or was the deadlock
> > description based on code analysis?
> 
> It was bad enough to cause an internal complaint about log livelock. :(
> 

Interesting..

> I think directio completions might suffer from the same class of problem
> though, since we allow concurrent dio writes and dio doesn't do any of
> the ioend batching that we do with buffered write ioends.
> 

Indeed, that makes sense since it sounds like a similar pattern.

> I think fixing directio is going to be a tougher nut to crack because it
> means that in the non-overwrite case we can't really do the ioend
> completion processing from the dio submitter thread.
> 

Hm.. yeah, but it might be a little strange to have highly concurrent
dio writers to the same inode, right? IOW, it seems like the risk there
more depends on foolishness of userspace rather than of our backend
implementation deciding to allocate more tasks in the event of
contention. That isn't to say we shouldn't think about ways to protect
ourselves from the former case..

> It might also be nice to find a way to unify the ioend paths since they
> both do "convert unwritten and do cow remapping" on the entire range,
> and diverge only once that's done.
> 
> > I am a bit curious how "deadlock avoidant" the current transaction
> > rolling/regrant mechanism is intended to be on its own vs. how much
> > responsibility is borne by the implementation of certain operations
> > (like truncate holding IOLOCK and assuming that sufficiently restricts
> > object access). ISTM that it's technically possible for this lockup
> > state to occur, but it relies on something unusual like the above where
> > we allow enough unbound (open reservation -> lock) concurrency on a
> > single metadata object to exhaust all log space.
> > 
> > [1] cb357bf3d1 ("xfs: implement per-inode writeback completion queues")
> 
> <nod> So far I haven't seen anyone in practice managing to hit a
> thread-happy directio cow remap deadlock, though I think triggering it
> should be as simple as turning on alwayscow and coaxing an aio stress
> tester into issuing a lot of random writes.
> 
> > > The only other place I can think of that does a lot of transaction
> > > rolling on a single inode is online repair, and it always holds all
> > > three exclusive locks.
> > > 
> > 
> > So I guess that should similarly hold off transaction reservation from
> > userspace intended to modify the particular inode, provided the XFS
> > internals play nice (re: your workqueue fix above).
> 
> Yeah.  ISTR the scrub setup functions flush dirty pages and drain
> directios before taking the ILOCK, so it should never be picking any
> fights with writeback. :)
> 

That seems to be the right thing to do, at least for now. It doesn't
look like truncate currently waits on dio. Extent shifting might by
virtue of eofblocks trimming I guess, but that isn't guaranteed.
Regardless, I think either operation could wait on dio without a problem
given that they already isolate the inode for I/O, force writeback, etc.
I suspect it's just not been a likely enough situation to have a heavy
enough I/O workload concurrent with truncating (or shifting) a file to
see this problem in traditional (i.e. unlike the "always COW" mode)
codepaths.

Brian

> > > > Also note that this approach is based on the current truncate algorithm,
> > > > which is probably a better barometer of potential for this kind of issue
> > > > as it is a less specialized operation. I'd argue that if this is safe
> > > > enough for truncate, it should be safe enough for range shifting.
> > > 
> > > Hehehe.
> > > 
> > > > > I think the extent shifting loop starts with the highest offset mapping and
> > > > > shifts it up and continues in order of decreasing offset until it gets to
> > > > >  @stop_fsb, correct?
> > > > > 
> > > > 
> > > > Yep, for insert range at least.
> > > > 
> > > > > Can we use "alloc trans; ilock; move; commit" for every extent higher than the
> > > > > one that crosses @stop_fsb, and use "alloc trans; ilock; split; roll;
> > > > > insert_extents; commit" to deal with that one extent that crosses @stop_fsb?
> > > > > > tr_write pre-reserves enough space so that the roll won't need to get more,
> > > > > which would eliminate that potential problem, I think.
> > > > > 
> > > > 
> > > > We'd have to reorder the extent split for that kind of approach, which I
> > > > think you've noted in the sequence above, as the race window is between
> > > > the split and subsequent shift. Otherwise I think that would work.
> > > > 
> > > > That said, I'd prefer not to introduce the extra complexity and
> > > > functional variance unless it were absolutely necessary, and it's not
> > > > clear to me that it is. If it is, we'd probably have seen similar issues
> > > > in truncate and should target a fix there before worrying about range
> > > > shift.
> > > 
> > > Ok.  Looking back through lore I don't see any complaints about insert
> > > range, so I guess it's fine.
> > > 
> > 
> > Ok, thanks.
> > 
> > Brian
> > 
> > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > +	do {
> > > > > > +		error = xfs_trans_roll_inode(&tp, ip);
> > > > > >  		if (error)
> > > > > > -			break;
> > > > > > +			goto out_trans_cancel;
> > > > > >  
> > > > > > -		xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > > -		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> > > > > >  		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
> > > > > >  				&done, stop_fsb);
> > > > > >  		if (error)
> > > > > >  			goto out_trans_cancel;
> > > > > > +	} while (!done);
> > > > > >  
> > > > > > -		error = xfs_trans_commit(tp);
> > > > > > -	}
> > > > > > -
> > > > > > +	error = xfs_trans_commit(tp);
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > >  	return error;
> > > > > >  
> > > > > >  out_trans_cancel:
> > > > > >  	xfs_trans_cancel(tp);
> > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > >  	return error;
> > > > > >  }
> > > > > >  
> > > > > > -- 
> > > > > > 2.20.1
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
>
Christoph Hellwig Dec. 24, 2019, 11:26 a.m. UTC | #8
On Fri, Dec 13, 2019 at 12:12:57PM -0500, Brian Foster wrote:
> The insert range operation uses a unique transaction and ilock cycle
> for the extent split and each extent shift iteration of the overall
> operation. While this works, it risks racing with other
> operations in subtle ways such as COW writeback modifying an extent
> tree in the middle of a shift operation.
> 
> To avoid this problem, make insert range atomic with respect to
> ilock. Hold the ilock across the entire operation, replace the
> individual transactions with a single rolling transaction sequence
> and relog the inode to keep it moving in the log. This guarantees
> that nothing else can change the extent mapping of an inode while
> an insert range operation is in progress.

This looks good, and similar to our usual truncate sequence:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig Dec. 24, 2019, 11:28 a.m. UTC | #9
On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote:
> I think directio completions might suffer from the same class of problem
> though, since we allow concurrent dio writes and dio doesn't do any of
> the ioend batching that we do with buffered write ioends.

OTOH direct I/O completions are per-I/O, and not per-extent like
buffered I/O completions.  Moreover for the case where we don't update
i_size and don't need a separate log force (overwrites without O_SYNC
or using fua) we could actually avoid the workqueue entirely with just
a little work.

> It might also be nice to find a way to unify the ioend paths since they
> both do "convert unwritten and do cow remapping" on the entire range,
> and diverge only once that's done.

They were common a while ago and it was a complete mess.  That is why
I split them.
Darrick J. Wong Dec. 24, 2019, 4:45 p.m. UTC | #10
On Tue, Dec 24, 2019 at 03:28:45AM -0800, Christoph Hellwig wrote:
> On Fri, Dec 20, 2019 at 12:17:17PM -0800, Darrick J. Wong wrote:
> > I think directio completions might suffer from the same class of problem
> > though, since we allow concurrent dio writes and dio doesn't do any of
> > the ioend batching that we do with buffered write ioends.
> 
> OTOH direct I/O completions are per-I/O, and not per-extent like
> buffered I/O completions.  Moreover for the case where we don't update
> i_size and don't need a separate log force (overwrites without O_SYNC
> or using fua) we could actually avoid the workqueue entirely with just
> a little work.

<nod>

> > It might also be nice to find a way to unify the ioend paths since they
> > both do "convert unwritten and do cow remapping" on the entire range,
> > and diverge only once that's done.
> 
> They were common a while ago and it was a complete mess.  That is why
> I split them.

And I couldn't figure out a sane way to make them work together so I
guess it's just as well. :)

--D
diff mbox series

Patch

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 829ab1a804c9..555c8b49a223 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1134,47 +1134,41 @@  xfs_insert_file_space(
 	if (error)
 		return error;
 
-	/*
-	 * The extent shifting code works on extent granularity. So, if stop_fsb
-	 * is not the starting block of extent, we need to split the extent at
-	 * stop_fsb.
-	 */
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write,
 			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0, 0, &tp);
 	if (error)
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
 
+	/*
+	 * The extent shifting code works on extent granularity. So, if stop_fsb
+	 * is not the starting block of extent, we need to split the extent at
+	 * stop_fsb.
+	 */
 	error = xfs_bmap_split_extent(tp, ip, stop_fsb);
 	if (error)
 		goto out_trans_cancel;
 
-	error = xfs_trans_commit(tp);
-	if (error)
-		return error;
-
-	while (!error && !done) {
-		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0,
-					&tp);
+	do {
+		error = xfs_trans_roll_inode(&tp, ip);
 		if (error)
-			break;
+			goto out_trans_cancel;
 
-		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 		error = xfs_bmap_insert_extents(tp, ip, &next_fsb, shift_fsb,
 				&done, stop_fsb);
 		if (error)
 			goto out_trans_cancel;
+	} while (!done);
 
-		error = xfs_trans_commit(tp);
-	}
-
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 
 out_trans_cancel:
 	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }