[v2] xfs: fix transaction block reservation in xfs_reflink_end_cow

Message ID 20181127161652.GE6792@magnolia
State New

Commit Message

Darrick J. Wong Nov. 27, 2018, 4:16 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

In xfs_reflink_end_cow, we have to swap written extents from the CoW
fork into the data fork, which can require extensive block map updates.
The block calculation has an off-by-one underflow, which can lead to the
following shutdown:

XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
<machine registers snipped>
Call Trace:
 xfs_trans_dup+0x211/0x250 [xfs]
 xfs_trans_roll+0x6d/0x180 [xfs]
 xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
 xfs_defer_finish_noroll+0xdf/0x740 [xfs]
 xfs_defer_finish+0x13/0x70 [xfs]
 xfs_reflink_end_cow+0x2c6/0x680 [xfs]
 xfs_dio_write_end_io+0x115/0x220 [xfs]
 iomap_dio_complete+0x3f/0x130
 iomap_dio_rw+0x3c3/0x420
 xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
 xfs_file_write_iter+0x8b/0xc0 [xfs]
 __vfs_write+0x193/0x1f0
 vfs_write+0xba/0x1c0
 ksys_write+0x52/0xc0
 do_syscall_64+0x50/0x160
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Darrick J. Wong Nov. 28, 2018, 8:37 p.m. UTC | #1
On Tue, Nov 27, 2018 at 08:16:52AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In xfs_reflink_end_cow, we have to swap written extents from the CoW
> fork into the data fork, which can require extensive block map updates.
> The block calculation has an off-by-one underflow, which can lead to the
> following shutdown:
> 
> XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
> <machine registers snipped>
> Call Trace:
>  xfs_trans_dup+0x211/0x250 [xfs]
>  xfs_trans_roll+0x6d/0x180 [xfs]
>  xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
>  xfs_defer_finish_noroll+0xdf/0x740 [xfs]
>  xfs_defer_finish+0x13/0x70 [xfs]
>  xfs_reflink_end_cow+0x2c6/0x680 [xfs]
>  xfs_dio_write_end_io+0x115/0x220 [xfs]
>  iomap_dio_complete+0x3f/0x130
>  iomap_dio_rw+0x3c3/0x420
>  xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
>  xfs_file_write_iter+0x8b/0xc0 [xfs]
>  __vfs_write+0x193/0x1f0
>  vfs_write+0xba/0x1c0
>  ksys_write+0x52/0xc0
>  do_syscall_64+0x50/0x160
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 322a852ce284..d7a451e8b0b9 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -657,14 +657,14 @@ xfs_reflink_end_cow(
>  	 * Stick a warning in just in case, and avoid 64-bit division.
>  	 */
>  	BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX);
> -	if (end_fsb - offset_fsb > UINT_MAX) {
> +	if (end_fsb - offset_fsb >= UINT_MAX) {
>  		error = -EFSCORRUPTED;
>  		xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
>  		ASSERT(0);
>  		goto out;
>  	}
>  	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
> -			(unsigned int)(end_fsb - offset_fsb),
> +			(unsigned int)(end_fsb - offset_fsb + 1),

This isn't it either.  I managed to reproduce the ASSERT with some
debugging enabled, and noticed that just prior to the directio write the
data fork looked like this:

D: ABCDEFGH

where A-H are each single-block mappings.  The COW fork for whatever
reason was pretty fragmented too:

C: IJKLMNOP

where I-P are also single block mappings.  The log showed that there was
a chain of transactions with EFIs and block allocations, and I observed
that the number of extents was just enough that the mappings wouldn't
fit in an extents format data fork.  I surmised that the end_cow loop
would punch out the last block of the range:

D: ABCDEFG-
C: IJKLMNOP

which causes the bmap code to collapse the bmbt block into extents
format, freeing the bmbt block.  Then, we remap out of the COW fork:

D: ABCDEFGP
C: IJKLMNO-

which causes the bmap code to convert the data fork from extents format
back into bmbt format, which allocates a block.  We then repeat this
process to replace block G with block O, which causes yet another
collapse and convert cycle.  The NEXTENTADD block reservation macro only
reserves enough blocks to add I-P (8 blocks) to a data fork where A-H have
*already* been cleared out, which means that we assume 1 bmbt split.

Therefore, we only reserve 5 blocks for that split (max bmbt height for
this fs), and we use up all 5 of them mapping blocks P-L into the data
fork.  The extents -> btree conversion for remapping block K overflows
the transaction block reservation and down goes the filesystem.  Note
that in the vast majority of cases the extents are bigger or we don't
ping-pong the reservation, so we've never hit this until now.
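
The accounting described above can be sketched with a toy model. This is
not kernel code: struct toy_trans, toy_charge_block() and toy_remap_all()
are hypothetical names, and the model assumes, as in the trace analysis,
that every single-block remap triggers one extents -> btree conversion
costing exactly one block from the transaction's reservation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two counters checked by the failing ASSERT. */
struct toy_trans {
	unsigned int blk_res;      /* blocks reserved up front */
	unsigned int blk_res_used; /* blocks consumed so far */
};

/*
 * Charge one block to the reservation.  Returns false instead of
 * overrunning, which is where the real code would trip
 * tp->t_blk_res >= tp->t_blk_res_used.
 */
static bool toy_charge_block(struct toy_trans *tp)
{
	assert(tp != NULL);
	if (tp->blk_res_used >= tp->blk_res)
		return false;
	tp->blk_res_used++;
	return true;
}

/*
 * Remap nr_extents single-block extents under ONE reservation,
 * assuming each remap costs one block (the collapse/convert cycle).
 * Returns the number of remaps that fit.
 */
static unsigned int toy_remap_all(struct toy_trans *tp,
				  unsigned int nr_extents)
{
	unsigned int done = 0;

	while (done < nr_extents && toy_charge_block(tp))
		done++;
	return done;
}
```

With blk_res = 5 and nr_extents = 8, toy_remap_all() returns 5: blocks P
through L fit, and the sixth remap (block K) is the one that would
overrun the reservation.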

I /think/ the solution is to push the transaction allocation into the
loop so that each transaction roll-chain only moves one extent and
therefore we only have to reserve enough blocks for a single btree
split, which should be enough for us.  The downside is that we drop the
ilock during end_cow, which I think(?) is fine since all CoW write paths
go through _reflink_end_cow, and it isn't picky about holes.  As a
bonus, this will also remove the restriction on the number of bytes you
can _reflink_end_cow in a single call.  Not that anyone's complained
about not being able to CoW 16T in a single operation...
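
A rough sketch of the proposed shape, again as a toy model rather than
the eventual patch: toy_remap_per_extent() is a hypothetical name, and
the one-block-per-remap cost is the same assumption as in the analysis
above.  Each extent gets its own fresh reservation sized for a single
btree split, so the per-transaction usage never accumulates.

```c
#include <assert.h>

/* Toy stand-in for the transaction's block counters. */
struct toy_trans {
	unsigned int blk_res;      /* blocks reserved for this transaction */
	unsigned int blk_res_used; /* blocks consumed by this transaction */
};

/*
 * Remap nr_extents extents, allocating a new toy transaction for each
 * one.  split_res is the reservation for a single btree split (5 in
 * the failing case).  Returns the number of extents remapped.
 */
static unsigned int toy_remap_per_extent(unsigned int nr_extents,
					 unsigned int split_res)
{
	unsigned int done;

	if (split_res == 0)
		return 0;	/* nothing can be charged at all */

	for (done = 0; done < nr_extents; done++) {
		/* "allocate" a transaction inside the loop */
		struct toy_trans tp = {
			.blk_res = split_res,
			.blk_res_used = 0,
		};

		/* assumed cost: one extents -> btree conversion */
		tp.blk_res_used++;

		/* the invariant from xfs_trans.c holds per transaction */
		assert(tp.blk_res >= tp.blk_res_used);
		/* "commit": the reservation is dropped with tp */
	}
	return done;
}
```

In this model toy_remap_per_extent(8, 5) remaps all 8 extents, since no
single transaction is ever charged more than one block.  The model says
nothing about the locking question (dropping the ilock between
transactions), which is the part that needs real review.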

--D

>  			XFS_DATA_FORK);
>  	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
>  			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);

Patch

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 322a852ce284..d7a451e8b0b9 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -657,14 +657,14 @@  xfs_reflink_end_cow(
 	 * Stick a warning in just in case, and avoid 64-bit division.
 	 */
 	BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX);
-	if (end_fsb - offset_fsb > UINT_MAX) {
+	if (end_fsb - offset_fsb >= UINT_MAX) {
 		error = -EFSCORRUPTED;
 		xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
 		ASSERT(0);
 		goto out;
 	}
 	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
-			(unsigned int)(end_fsb - offset_fsb),
+			(unsigned int)(end_fsb - offset_fsb + 1),
 			XFS_DATA_FORK);
 	error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
 			resblks, 0, XFS_TRANS_RESERVE | XFS_TRANS_NOFS, &tp);
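
For reference, the arithmetic reason the posted diff pairs the "+ 1" with
a ">=" check can be shown in isolation.  toy_range_len() is a
hypothetical helper, not part of XFS, and the sketch assumes a 32-bit
unsigned int.  It only illustrates the cast: if the difference were
allowed to equal UINT_MAX, adding one before the (unsigned int) cast
would wrap to zero.  As the follow-up comment notes, this change does
not address the actual cause of the assertion.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute (end_fsb - offset_fsb + 1) as an unsigned int, refusing
 * ranges where the result would not fit.  Returns 0 on success and
 * -1 if the range is too large.
 */
static int toy_range_len(uint64_t offset_fsb, uint64_t end_fsb,
			 unsigned int *len)
{
	assert(len != NULL);

	/* ">=" rather than ">": the "+ 1" below needs the headroom */
	if (end_fsb - offset_fsb >= UINT_MAX)
		return -1;

	*len = (unsigned int)(end_fsb - offset_fsb + 1);
	return 0;
}
```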