
xfs: stabilize insert range start boundary to avoid COW writeback race

Message ID 20191210132340.11330-1-bfoster@redhat.com (mailing list archive)
State Accepted
Series xfs: stabilize insert range start boundary to avoid COW writeback race

Commit Message

Brian Foster Dec. 10, 2019, 1:23 p.m. UTC
generic/522 (fsx) occasionally fails with a file corruption due to
an insert range operation. The primary characteristic of the
corruption is a misplaced insert range operation that differs from
the requested target offset. The reason for this behavior is a race
between the extent shift sequence of an insert range and a COW
writeback completion that causes a front merge with the first extent
in the shift.

The shift preparation function flushes and unmaps from the target
offset of the operation to the end of the file to ensure no
modifications can be made and page cache is invalidated before file
data is shifted. An insert range operation then splits the extent at
the target offset, if necessary, and begins to shift the start
offset of each extent starting from the end of the file to the start
offset. The shift sequence operates at extent level and so depends
on the preparation sequence to guarantee no changes can be made to
the target range during the shift. If the block immediately prior to
the target offset was dirty and shared, however, it can undergo
writeback and move from the COW fork to the data fork at any point
during the shift. If the block is contiguous with the block at the
start offset of the insert range, it can front merge and alter the
start offset of the extent. Once the shift sequence reaches the
target offset, it shifts based on the latest start offset and
silently changes the target offset of the operation and corrupts the
file.

To address this problem, update the shift preparation code to
stabilize the start boundary along with the full range of the
insert. Also update the existing corruption check to fail if any
extent is shifted with a start offset behind the target offset of
the insert range. This prevents insert from racing with COW
writeback completion and fails loudly in the event of an unexpected
extent shift.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This has survived a couple fstests runs (upstream) so far as well as an
overnight loop test of generic/522 (on RHEL). The RHEL based kernel just
happened to be where this was originally diagnosed and provides a fairly
reliable failure rate of within 30 iterations or so. The current test is
at almost 70 iterations and still running.

Brian

 fs/xfs/libxfs/xfs_bmap.c |  3 +--
 fs/xfs/xfs_bmap_util.c   | 12 ++++++++++++
 2 files changed, 13 insertions(+), 2 deletions(-)
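To make the failure mode concrete, here is a small user-space toy model of the race described in the commit message. This is not kernel code: `toy_extent` and `shift_first_extent()` are illustrative stand-ins for the real bmap records and shift loop, showing only how a front merge moves the start offset of the first shifted extent behind the requested target.

```c
#include <assert.h>

/*
 * Toy model only. An "extent" is a logical start offset plus a length
 * in blocks. An insert range shifts each extent right, starting from
 * EOF and working back toward the target offset.
 */
struct toy_extent {
	unsigned long long startoff;	/* logical start block */
	unsigned long long blockcount;	/* length in blocks */
};

/*
 * What the shift loop effectively does to the first (lowest) extent.
 * Returns the offset the data was actually moved from, which should
 * equal the target offset of the insert range.
 */
static unsigned long long shift_first_extent(struct toy_extent *ext,
					     unsigned long long shift)
{
	ext->startoff += shift;
	return ext->startoff - shift;
}
```

With a target offset of 100, a racing COW writeback completion for block 99 that front merges into the extent (`startoff -= 1; blockcount += 1;` in the model) causes the final shift to move data from offset 99 instead of 100, i.e. the misplaced insert the commit message describes.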

Comments

Carlos Maiolino Dec. 10, 2019, 2:02 p.m. UTC | #1
On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> generic/522 (fsx) occasionally fails with a file corruption due to
> an insert range operation. The primary characteristic of the
> corruption is a misplaced insert range operation that differs from
> the requested target offset. The reason for this behavior is a race
> between the extent shift sequence of an insert range and a COW
> writeback completion that causes a front merge with the first extent
> in the shift.
> 
> The shift preparation function flushes and unmaps from the target
> offset of the operation to the end of the file to ensure no
> modifications can be made and page cache is invalidated before file
> data is shifted. An insert range operation then splits the extent at
> the target offset, if necessary, and begins to shift the start
> offset of each extent starting from the end of the file to the start
> offset. The shift sequence operates at extent level and so depends
> on the preparation sequence to guarantee no changes can be made to
> the target range during the shift. If the block immediately prior to
> the target offset was dirty and shared, however, it can undergo
> writeback and move from the COW fork to the data fork at any point
> during the shift. If the block is contiguous with the block at the
> start offset of the insert range, it can front merge and alter the
> start offset of the extent. Once the shift sequence reaches the
> target offset, it shifts based on the latest start offset and
> silently changes the target offset of the operation and corrupts the
> file.
> 
> To address this problem, update the shift preparation code to
> stabilize the start boundary along with the full range of the
> insert. Also update the existing corruption check to fail if any
> extent is shifted with a start offset behind the target offset of
> the insert range. This prevents insert from racing with COW
> writeback completion and fails loudly in the event of an unexpected
> extent shift.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

This looks ok.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>


> ---
> 
> This has survived a couple fstests runs (upstream) so far as well as an
> overnight loop test of generic/522 (on RHEL). The RHEL based kernel just
> happened to be where this was originally diagnosed and provides a fairly
> reliable failure rate of within 30 iterations or so. The current test is
> at almost 70 iterations and still running.
> 
> Brian
> 
>  fs/xfs/libxfs/xfs_bmap.c |  3 +--
>  fs/xfs/xfs_bmap_util.c   | 12 ++++++++++++
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index a9ad1f991ba3..4a802b3abe77 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5972,8 +5972,7 @@ xfs_bmap_insert_extents(
>  		goto del_cursor;
>  	}
>  
> -	if (XFS_IS_CORRUPT(mp,
> -			   stop_fsb >= got.br_startoff + got.br_blockcount)) {
> +	if (XFS_IS_CORRUPT(mp, stop_fsb > got.br_startoff)) {
>  		error = -EFSCORRUPTED;
>  		goto del_cursor;
>  	}
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 2efd78a9719e..e62fb5216341 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -992,6 +992,7 @@ xfs_prepare_shift(
>  	struct xfs_inode	*ip,
>  	loff_t			offset)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
>  
>  	/*
> @@ -1004,6 +1005,17 @@ xfs_prepare_shift(
>  			return error;
>  	}
>  
> +	/*
> +	 * Shift operations must stabilize the start block offset boundary along
> +	 * with the full range of the operation. If we don't, a COW writeback
> +	 * completion could race with an insert, front merge with the start
> +	 * extent (after split) during the shift and corrupt the file. Start
> +	 * with the block just prior to the start to stabilize the boundary.
> +	 */
> +	offset = round_down(offset, 1 << mp->m_sb.sb_blocklog);
> +	if (offset)
> +		offset -= (1 << mp->m_sb.sb_blocklog);
> +
>  	/*
>  	 * Writeback and invalidate cache for the remainder of the file as we're
>  	 * about to shift down every extent from offset to EOF.
> -- 
> 2.20.1
>
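The arithmetic added to xfs_prepare_shift() above can be modeled in user space as follows. `round_down()` is reproduced here with the usual power-of-two mask; `stabilize_start()` is an illustrative name for this sketch, not a kernel function. The point is that the flush/invalidate range is backed up to the start of the block *before* the target offset, so that block is stabilized along with the rest of the range.

```c
#include <assert.h>

/* Model of the kernel's round_down() for power-of-two alignments. */
static unsigned long long round_down_pow2(unsigned long long x,
					  unsigned long long align)
{
	return x & ~(align - 1);	/* align must be a power of two */
}

/*
 * Given the target offset of the shift and the filesystem block size
 * (1 << blocklog), compute where the flush/unmap range must begin:
 * the start of the block immediately prior to the target, so a racing
 * COW writeback of that block cannot front merge into the first
 * shifted extent.
 */
static unsigned long long stabilize_start(unsigned long long offset,
					  unsigned int blocklog)
{
	unsigned long long blocksize = 1ULL << blocklog;

	offset = round_down_pow2(offset, blocksize);
	if (offset)
		offset -= blocksize;	/* include the prior block */
	return offset;
}
```

For a 4k block size (blocklog 12), an offset of 10000 rounds down to 8192 and backs up to 4096; an offset inside the first block stays at 0.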
Dave Chinner Dec. 10, 2019, 9:41 p.m. UTC | #2
On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> generic/522 (fsx) occasionally fails with a file corruption due to
> an insert range operation. The primary characteristic of the
> corruption is a misplaced insert range operation that differs from
> the requested target offset. The reason for this behavior is a race
> between the extent shift sequence of an insert range and a COW
> writeback completion that causes a front merge with the first extent
> in the shift.

How is the COW writeback completion modifying the extent list while
an extent shift is modifying the extent list?  Both should be
running under XFS_ILOCK_EXCL contexts so there shouldn't be a race
condition here unless we've screwed up the extent list modification
atomicity...

> 
> The shift preparation function flushes and unmaps from the target
> offset of the operation to the end of the file to ensure no
> modifications can be made and page cache is invalidated before file
> data is shifted. An insert range operation then splits the extent at
> the target offset, if necessary, and begins to shift the start
> offset of each extent starting from the end of the file to the start
> offset. The shift sequence operates at extent level and so depends
> on the preparation sequence to guarantee no changes can be made to
> the target range during the shift.

Oh... shifting extents is not an atomic operation w.r.t. other
inode modifications - both insert and collapse run individual
modification transactions and lock/unlock the inode around each
transaction. So, essentially, they aren't atomic when faced with
other *metadata* modifications to the inode.

> If the block immediately prior to
> the target offset was dirty and shared, however, it can undergo
> writeback and move from the COW fork to the data fork at any point
> during the shift. If the block is contiguous with the block at the
> start offset of the insert range, it can front merge and alter the
> start offset of the extent. Once the shift sequence reaches the
> target offset, it shifts based on the latest start offset and
> silently changes the target offset of the operation and corrupts the
> file.

Yup, that's exactly the landmine that non-atomic, multi-transaction
extent range operations have. It might be a COW operation, it might
be something else that ends up manipulating the extent list. But
because the ILOCK is not held across the entire extent shift,
insert/collapse are susceptible to corruption when any other XFS
code concurrently modifies the extent list.

I think insert/collapse need to be converted to work like a
truncate operation instead of a series of individual write
operations. That is, they are a permanent transaction that locks the
inode once and is rolled repeatedly until the entire extent list
modification is done and then the inode is unlocked.

> To address this problem, update the shift preparation code to
> stabilize the start boundary along with the full range of the
> insert. Also update the existing corruption check to fail if any
> extent is shifted with a start offset behind the target offset of
> the insert range. This prevents insert from racing with COW
> writeback completion and fails loudly in the event of an unexpected
> extent shift.

It looks ok to avoid this particular symptom (backportable point
fix), but I really think we should convert insert/collapse to be
atomic w.r.t other extent list modifications....

Cheers,

Dave.
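[The structure Dave is proposing mirrors xfs_itruncate_extents(): one lock acquisition, one permanent transaction rolled per step. A rough pseudocode sketch of that shape follows; the step names are illustrative assumptions, not interfaces from this series.]

```
xfs_ilock(ip, XFS_ILOCK_EXCL);
tp = allocate permanent transaction;
join inode to tp;

split extent at the target offset;        /* insert range only */
while (extents remain between EOF and the target offset) {
        shift one extent;
        roll tp;    /* commits the step; ILOCK and tp stay held */
}

commit tp;
xfs_iunlock(ip, XFS_ILOCK_EXCL);
```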
Brian Foster Dec. 11, 2019, 12:47 p.m. UTC | #3
On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > generic/522 (fsx) occasionally fails with a file corruption due to
> > an insert range operation. The primary characteristic of the
> > corruption is a misplaced insert range operation that differs from
> > the requested target offset. The reason for this behavior is a race
> > between the extent shift sequence of an insert range and a COW
> > writeback completion that causes a front merge with the first extent
> > in the shift.
> 
> How is the COW writeback completion modifying the extent list while
> an extent shift is modifying the extent list?  Both should be
> running under XFS_ILOCK_EXCL contexts so there shouldn't be a race
> condition here unless we've screwed up the extent list modification
> atomicity...
> 
> > 
> > The shift preparation function flushes and unmaps from the target
> > offset of the operation to the end of the file to ensure no
> > modifications can be made and page cache is invalidated before file
> > data is shifted. An insert range operation then splits the extent at
> > the target offset, if necessary, and begins to shift the start
> > offset of each extent starting from the end of the file to the start
> > offset. The shift sequence operates at extent level and so depends
> > on the preparation sequence to guarantee no changes can be made to
> > the target range during the shift.
> 
> Oh... shifting extents is not an atomic operation w.r.t. other
> inode modifications - both insert and collapse run individual
> modification transactions and lock/unlock the inode around each
> transaction. So, essentially, they aren't atomic when faced with
> other *metadata* modifications to the inode.
> 

Right..

> > If the block immediately prior to
> > the target offset was dirty and shared, however, it can undergo
> > writeback and move from the COW fork to the data fork at any point
> > during the shift. If the block is contiguous with the block at the
> > start offset of the insert range, it can front merge and alter the
> > start offset of the extent. Once the shift sequence reaches the
> > target offset, it shifts based on the latest start offset and
> > silently changes the target offset of the operation and corrupts the
> > file.
> 
> Yup, that's exactly the landmine that non-atomic, multi-transaction
> extent range operations have. It might be a COW operation, it might
> be something else that ends up manipulating the extent list. But
> because the ILOCK is not held across the entire extent shift,
> insert/collapse are susceptible to corruption when any other XFS
> code concurrently modifies the extent list.
> 
> I think insert/collapse need to be converted to work like a
> truncate operation instead of a series of individual write
> operations. That is, they are a permanent transaction that locks the
> inode once and is rolled repeatedly until the entire extent list
> modification is done and then the inode is unlocked.
> 

Note that I don't think it's sufficient to hold the inode locked only
across the shift. For the insert case, I think we'd need to grab it
before the extent split at the target offset and roll from there.
Otherwise the same problem could be reintroduced if we eventually
replaced the xfs_prepare_shift() tweak made by this patch. Of course,
that doesn't look like a big problem. The locking is already elevated
and split and shift even use the same transaction type, so it's mostly a
refactor from a complexity standpoint. 

For the collapse case, we do have a per-shift quota reservation for some
reason. If that is required, we'd have to somehow replace it with a
worst case calculation. That said, it's not clear to me why that
reservation even exists. The pre-shift hole punch is already a separate
transaction with its own such reservation. The shift can merge extents
after that point (though most likely only on the first shift), but that
would only ever remove extent records. Any thoughts or objections if I
just killed that off?

> > To address this problem, update the shift preparation code to
> > stabilize the start boundary along with the full range of the
> > insert. Also update the existing corruption check to fail if any
> > extent is shifted with a start offset behind the target offset of
> > the insert range. This prevents insert from racing with COW
> > writeback completion and fails loudly in the event of an unexpected
> > extent shift.
> 
> It looks ok to avoid this particular symptom (backportable point
> fix), but I really think we should convert insert/collapse to be
> atomic w.r.t other extent list modifications....
> 

Ok, I think that approach is reasonable so long as we do it in two
phases as such to minimize backport churn and separate bug fix from
behavior change.

Unless there is other feedback on this patch, is there any objection to
getting this one reviewed/merged independently? I can start looking into
the shift rework today, but that is ultimately going to require more
involved testing than I'd prefer to block the bug fix on (whereas this
patch has now seen multiple days of fsx testing..).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Dec. 11, 2019, 8:52 p.m. UTC | #4
On Wed, Dec 11, 2019 at 07:47:12AM -0500, Brian Foster wrote:
> On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> > On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > I think insert/collapse need to be converted to work like a
> > truncate operation instead of a series of individual write
> > operations. That is, they are a permanent transaction that locks the
> > inode once and is rolled repeatedly until the entire extent list
> > modification is done and then the inode is unlocked.
> > 
> 
> Note that I don't think it's sufficient to hold the inode locked only
> across the shift. For the insert case, I think we'd need to grab it
> before the extent split at the target offset and roll from there.
> Otherwise the same problem could be reintroduced if we eventually
> replaced the xfs_prepare_shift() tweak made by this patch. Of course,
> that doesn't look like a big problem. The locking is already elevated
> and split and shift even use the same transaction type, so it's mostly a
> refactor from a complexity standpoint. 

*nod*

> For the collapse case, we do have a per-shift quota reservation for some
> reason. If that is required, we'd have to somehow replace it with a
> worst case calculation. That said, it's not clear to me why that
> reservation even exists.

I'm not 100% sure, either, but....

> The pre-shift hole punch is already a separate
> transaction with its own such reservation. The shift can merge extents
> after that point (though most likely only on the first shift), but that
> would only ever remove extent records. Any thoughts or objections if I
> just killed that off?

Yeah, I suspect that it is the xfs_bmse_merge() case freeing blocks
the reservation is for, and I agree that it should only happen on
the first shift because all the others that are moved are identical
in size and shape and would have otherwise been merged at creation.

Hence I think we can probably kill the xfs_bmse_merge() case,
though it might be worth checking first how often it gets called...

> > > To address this problem, update the shift preparation code to
> > > stabilize the start boundary along with the full range of the
> > > insert. Also update the existing corruption check to fail if any
> > > extent is shifted with a start offset behind the target offset of
> > > the insert range. This prevents insert from racing with COW
> > > writeback completion and fails loudly in the event of an unexpected
> > > extent shift.
> > 
> > It looks ok to avoid this particular symptom (backportable point
> > fix), but I really think we should convert insert/collapse to be
> > atomic w.r.t other extent list modifications....
> > 
> 
> Ok, I think that approach is reasonable so long as we do it in two
> phases as such to minimize backport churn and separate bug fix from
> behavior change.
> 
> Unless there is other feedback on this patch, is there any objection to
> getting this one reviewed/merged independently?

Not here. Seems like the right approach to me. So for the original
patch:

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Brian Foster Dec. 12, 2019, 2:16 p.m. UTC | #5
On Thu, Dec 12, 2019 at 07:52:30AM +1100, Dave Chinner wrote:
> On Wed, Dec 11, 2019 at 07:47:12AM -0500, Brian Foster wrote:
> > On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > > I think insert/collapse need to be converted to work like a
> > > truncate operation instead of a series of individual write
> > > operations. That is, they are a permanent transaction that locks the
> > > inode once and is rolled repeatedly until the entire extent list
> > > modification is done and then the inode is unlocked.
> > > 
> > 
> > Note that I don't think it's sufficient to hold the inode locked only
> > across the shift. For the insert case, I think we'd need to grab it
> > before the extent split at the target offset and roll from there.
> > Otherwise the same problem could be reintroduced if we eventually
> > replaced the xfs_prepare_shift() tweak made by this patch. Of course,
> > that doesn't look like a big problem. The locking is already elevated
> > and split and shift even use the same transaction type, so it's mostly a
> > refactor from a complexity standpoint. 
> 
> *nod*
> 
> > For the collapse case, we do have a per-shift quota reservation for some
> > reason. If that is required, we'd have to somehow replace it with a
> > worst case calculation. That said, it's not clear to me why that
> > reservation even exists.
> 
> I'm not 100% sure, either, but....
> 
> > The pre-shift hole punch is already a separate
> > transaction with its own such reservation. The shift can merge extents
> > after that point (though most likely only on the first shift), but that
> > would only ever remove extent records. Any thoughts or objections if I
> > just killed that off?
> 
> Yeah, I suspect that it is the xfs_bmse_merge() case freeing blocks
> the reservation is for, and I agree that it should only happen on
> the first shift because all the others that are moved are identical
> in size and shape and would have otherwise been merged at creation.
> 
> Hence I think we can probably kill the xfs_bmse_merge() case,
> though it might be worth checking first how often it gets called...
> 

Ok, but do we need an up-front quota reservation for freeing blocks out
of the bmapbt? ISTM the reservation could be removed regardless of the
merging behavior. This is what my current patch does, at least, so we'll
see if anything explodes. :P

I agree on the xfs_bmse_merge() bit. I was planning to leave that as is
however because IIRC, even though it is quite rare, I thought we had a
few corner cases where it was possible for physically and logically
contiguous extents to track separately in a mapping tree. Max sized
extents that are subsequently punched out or truncated might be one
example. I thought we had others, but I can't quite put my finger on it
atm..

> > > > To address this problem, update the shift preparation code to
> > > > stabilize the start boundary along with the full range of the
> > > > insert. Also update the existing corruption check to fail if any
> > > > extent is shifted with a start offset behind the target offset of
> > > > the insert range. This prevents insert from racing with COW
> > > > writeback completion and fails loudly in the event of an unexpected
> > > > extent shift.
> > > 
> > > It looks ok to avoid this particular symptom (backportable point
> > > fix), but I really think we should convert insert/collapse to be
> > > atomic w.r.t other extent list modifications....
> > > 
> > 
> > Ok, I think that approach is reasonable so long as we do it in two
> > phases as such to minimize backport churn and separate bug fix from
> > behavior change.
> > 
> > Unless there is other feedback on this patch, is there any objection to
> > getting this one reviewed/merged independently?
> 
> Not here. Seems like the right approach to me. So for the original
> patch:
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 

Thanks. FWIW, I whipped up a few patches to hold ilock across insert and
collapse yesterday that survive cursory testing. I'll get them on the
list after more thorough testing..

Brian

> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Dec. 12, 2019, 8:48 p.m. UTC | #6
On Thu, Dec 12, 2019 at 09:16:34AM -0500, Brian Foster wrote:
> On Thu, Dec 12, 2019 at 07:52:30AM +1100, Dave Chinner wrote:
> > On Wed, Dec 11, 2019 at 07:47:12AM -0500, Brian Foster wrote:
> > > On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > > > I think insert/collapse need to be converted to work like a
> > > > truncate operation instead of a series of individual write
> > > > operations. That is, they are a permanent transaction that locks the
> > > > inode once and is rolled repeatedly until the entire extent list
> > > > modification is done and then the inode is unlocked.
> > > > 
> > > 
> > > Note that I don't think it's sufficient to hold the inode locked only
> > > across the shift. For the insert case, I think we'd need to grab it
> > > before the extent split at the target offset and roll from there.
> > > Otherwise the same problem could be reintroduced if we eventually
> > > replaced the xfs_prepare_shift() tweak made by this patch. Of course,
> > > that doesn't look like a big problem. The locking is already elevated
> > > and split and shift even use the same transaction type, so it's mostly a
> > > refactor from a complexity standpoint. 
> > 
> > *nod*
> > 
> > > For the collapse case, we do have a per-shift quota reservation for some
> > > reason. If that is required, we'd have to somehow replace it with a
> > > worst case calculation. That said, it's not clear to me why that
> > > reservation even exists.
> > 
> > I'm not 100% sure, either, but....
> > 
> > > The pre-shift hole punch is already a separate
> > > transaction with its own such reservation. The shift can merge extents
> > > after that point (though most likely only on the first shift), but that
> > > would only ever remove extent records. Any thoughts or objections if I
> > > just killed that off?
> > 
> > Yeah, I suspect that it is the xfs_bmse_merge() case freeing blocks
> > the reservation is for, and I agree that it should only happen on
> > the first shift because all the others that are moved are identical
> > in size and shape and would have otherwise been merged at creation.
> > 
> > Hence I think we can probably kill the xfs_bmse_merge() case,
> > though it might be wrth checking first how often it gets called...
> > 
> 
> Ok, but do we need an up-front quota reservation for freeing blocks out
> of the bmapbt? ISTM the reservation could be removed regardless of the
> merging behavior. This is what my current patch does, at least, so we'll
> see if anything explodes. :P

xfs_itruncate_extents() doesn't require an up front block
reservation for quotas or transaction allocation, so I don't really
see how the collapse would require it, even in the merge case...

> I agree on the xfs_bmse_merge() bit. I was planning to leave that as is
> however because IIRC, even though it is quite rare, I thought we had a
> few corner cases where it was possible for physically and logically
> contiguous extents to track separately in a mapping tree. Max sized
> extents that are subsequently punched out or truncated might be one
> example. I thought we had others, but I can't quite put my finger on it
> atm..

True, but is it common enough that we really need to care about it?
If it's bad, just run xfs_fsr on the file/filesystem....

Cheers,

Dave.
Brian Foster Dec. 13, 2019, 11:37 a.m. UTC | #7
On Fri, Dec 13, 2019 at 07:48:51AM +1100, Dave Chinner wrote:
> On Thu, Dec 12, 2019 at 09:16:34AM -0500, Brian Foster wrote:
> > On Thu, Dec 12, 2019 at 07:52:30AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 11, 2019 at 07:47:12AM -0500, Brian Foster wrote:
> > > > On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > > > > I think insert/collapse need to be converted to work like a
> > > > > truncate operation instead of a series of individual write
> > > > > operations. That is, they are a permanent transaction that locks the
> > > > > inode once and is rolled repeatedly until the entire extent list
> > > > > modification is done and then the inode is unlocked.
> > > > > 
> > > > 
> > > > Note that I don't think it's sufficient to hold the inode locked only
> > > > across the shift. For the insert case, I think we'd need to grab it
> > > > before the extent split at the target offset and roll from there.
> > > > Otherwise the same problem could be reintroduced if we eventually
> > > > replaced the xfs_prepare_shift() tweak made by this patch. Of course,
> > > > that doesn't look like a big problem. The locking is already elevated
> > > > and split and shift even use the same transaction type, so it's mostly a
> > > > refactor from a complexity standpoint. 
> > > 
> > > *nod*
> > > 
> > > > For the collapse case, we do have a per-shift quota reservation for some
> > > > reason. If that is required, we'd have to somehow replace it with a
> > > > worst case calculation. That said, it's not clear to me why that
> > > > reservation even exists.
> > > 
> > > I'm not 100% sure, either, but....
> > > 
> > > > The pre-shift hole punch is already a separate
> > > > transaction with its own such reservation. The shift can merge extents
> > > > after that point (though most likely only on the first shift), but that
> > > > would only ever remove extent records. Any thoughts or objections if I
> > > > just killed that off?
> > > 
> > > Yeah, I suspect that it is the xfs_bmse_merge() case freeing blocks
> > > the reservation is for, and I agree that it should only happen on
> > > the first shift because all the others that are moved are identical
> > > in size and shape and would have otherwise been merged at creation.
> > > 
> > > Hence I think we can probably kill the xfs_bmse_merge() case,
> > > though it might be worth checking first how often it gets called...
> > > 
> > 
> > Ok, but do we need an up-front quota reservation for freeing blocks out
> > of the bmapbt? ISTM the reservation could be removed regardless of the
> > merging behavior. This is what my current patch does, at least, so we'll
> > see if anything explodes. :P
> 
> xfs_itruncate_extents() doesn't require an up front block
> reservation for quotas or transaction allocation, so I don't really
> see how the collapse would require it, even in the merge case...
> 

Ok, so I'm not completely crazy. :) I've seen no issues so far with the
reservation removed, anyways.

> > I agree on the xfs_bmse_merge() bit. I was planning to leave that as is
> > however because IIRC, even though it is quite rare, I thought we had a
> > few corner cases where it was possible for physically and logically
> > contiguous extents to track separately in a mapping tree. Max sized
> > extents that are subsequently punched out or truncated might be one
> > example. I thought we had others, but I can't quite put my finger on it
> > atm..
> 
> True, but is it common enough that we really need to care about it?
> If it's bad, just run xfs_fsr on the file/filesystem....
> 

Yeah probably not. Extent shift shouldn't really be a path out of that
problem if it exists. As you suggest, fsr or online repair should be
expected to fix up things like that. My thinking was more that it wasn't
necessary to change this code to achieve atomicity. IOW, I don't object
to changing it, but I'd probably do it in a separate patch.

In looking at this code again, it also looks like filtering out the
merge check for everything but the first left shift would result in more
code than what we have now. To me, the current code seems
straightforward and harmless in that it's implemented as sort of an
optimization of how to perform the shift. I.e., do we shift the existing
extent or can we fold it into an existing one? So I think I've kind of
lost track of what the suggestion was wrt the merge code. What did
you mean exactly by "kill off the merge case" and what was the goal?
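[The "shift or fold into an existing one" decision Brian describes boils down to a contiguity test. Here is a user-space sketch of that predicate; `toy_irec` and `can_merge_left()` are simplified models loosely following the `struct xfs_bmbt_irec` field names, not the actual xfs_bmse_merge() code, which also enforces limits such as maximum extent length.]

```c
#include <stdbool.h>

/* Simplified mapping record: logical offset, physical block, length,
 * and written/unwritten state. */
struct toy_irec {
	unsigned long long br_startoff;
	unsigned long long br_startblock;
	unsigned long long br_blockcount;
	bool		   br_unwritten;
};

/*
 * After shifting 'got' left by 'shift' blocks, can it be folded into
 * its left neighbor instead of remaining a separate record? Only if
 * the two would be logically and physically contiguous and share the
 * same written/unwritten state.
 */
static bool can_merge_left(const struct toy_irec *left,
			   const struct toy_irec *got,
			   unsigned long long shift)
{
	return left->br_startoff + left->br_blockcount ==
			got->br_startoff - shift &&
	       left->br_startblock + left->br_blockcount ==
			got->br_startblock &&
	       left->br_unwritten == got->br_unwritten;
}
```

This also illustrates the corner case Brian mentions: two records can be physically and logically contiguous (the predicate holds with shift 0) yet still tracked separately in the mapping tree.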

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong Dec. 17, 2019, 5:59 a.m. UTC | #8
On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> generic/522 (fsx) occasionally fails with a file corruption due to
> an insert range operation. The primary characteristic of the
> corruption is a misplaced insert range operation that differs from
> the requested target offset. The reason for this behavior is a race
> between the extent shift sequence of an insert range and a COW
> writeback completion that causes a front merge with the first extent
> in the shift.
> 
> The shift preparation function flushes and unmaps from the target
> offset of the operation to the end of the file to ensure no
> modifications can be made and page cache is invalidated before file
> data is shifted. An insert range operation then splits the extent at
> the target offset, if necessary, and begins to shift the start
> offset of each extent starting from the end of the file to the start
> offset. The shift sequence operates at extent level and so depends
> on the preparation sequence to guarantee no changes can be made to
> the target range during the shift. If the block immediately prior to
> the target offset was dirty and shared, however, it can undergo
> writeback and move from the COW fork to the data fork at any point
> during the shift. If the block is contiguous with the block at the
> start offset of the insert range, it can front merge and alter the
> start offset of the extent. Once the shift sequence reaches the
> target offset, it shifts based on the latest start offset and
> silently changes the target offset of the operation and corrupts the
> file.
> 
> To address this problem, update the shift preparation code to
> stabilize the start boundary along with the full range of the
> insert. Also update the existing corruption check to fail if any
> extent is shifted with a start offset behind the target offset of
> the insert range. This prevents insert from racing with COW
> writeback completion and fails loudly in the event of an unexpected
> extent shift.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> This has survived a couple fstests runs (upstream) so far as well as an
> overnight loop test of generic/522 (on RHEL). The RHEL based kernel just
> happened to be where this was originally diagnosed and provides a fairly
> reliable failure rate of within 30 iterations or so. The current test is
> at almost 70 iterations and still running.
> 
> Brian

After a week of testing I declare that 2.5 billion fsxops should be
enough for anyone. :P

This /also/ seems to have fixed the separate corruption problems I was
seeing, so...

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> 
>  fs/xfs/libxfs/xfs_bmap.c |  3 +--
>  fs/xfs/xfs_bmap_util.c   | 12 ++++++++++++
>  2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index a9ad1f991ba3..4a802b3abe77 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5972,8 +5972,7 @@ xfs_bmap_insert_extents(
>  		goto del_cursor;
>  	}
>  
> -	if (XFS_IS_CORRUPT(mp,
> -			   stop_fsb >= got.br_startoff + got.br_blockcount)) {
> +	if (XFS_IS_CORRUPT(mp, stop_fsb > got.br_startoff)) {
>  		error = -EFSCORRUPTED;
>  		goto del_cursor;
>  	}
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 2efd78a9719e..e62fb5216341 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -992,6 +992,7 @@ xfs_prepare_shift(
>  	struct xfs_inode	*ip,
>  	loff_t			offset)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
>  
>  	/*
> @@ -1004,6 +1005,17 @@ xfs_prepare_shift(
>  			return error;
>  	}
>  
> +	/*
> +	 * Shift operations must stabilize the start block offset boundary along
> +	 * with the full range of the operation. If we don't, a COW writeback
> +	 * completion could race with an insert, front merge with the start
> +	 * extent (after split) during the shift and corrupt the file. Start
> +	 * with the block just prior to the start to stabilize the boundary.
> +	 */
> +	offset = round_down(offset, 1 << mp->m_sb.sb_blocklog);
> +	if (offset)
> +		offset -= (1 << mp->m_sb.sb_blocklog);
> +
>  	/*
>  	 * Writeback and invalidate cache for the remainder of the file as we're
>  	 * about to shift down every extent from offset to EOF.
> -- 
> 2.20.1
>