xfs: handle large CoW remapping requests
diff mbox

Message ID 20170427212754.GB19158@birch.djwong.org
State New
Headers show

Commit Message

Darrick J. Wong April 27, 2017, 9:27 p.m. UTC
XFS transactions are constrained both by space and block reservation
limits and the fact that we have to avoid doing 64-bit divisions.  This
means that we can't remap more than 2^32 blocks at a time.  However,
file logical blocks are 64-bit in size, so if we encounter a huge remap
request we have to break it up into smaller pieces.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_reflink.c |   83 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 49 insertions(+), 34 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Christoph Hellwig May 2, 2017, 7:50 a.m. UTC | #1
On Thu, Apr 27, 2017 at 02:27:54PM -0700, Darrick J. Wong wrote:
> XFS transactions are constrained both by space and block reservation
> limits and the fact that we have to avoid doing 64-bit divisions.  This
> means that we can't remap more than 2^32 blocks at a time.  However,
> file logical blocks are 64-bit in size, so if we encounter a huge remap
> request we have to break it up into smaller pieces.

But where would we get that huge remap request from?  We already did the
BUILD_BUG_ON for the max read/write size at least.  Also the remaps
would now not be atomic, which would be a problem for my O_ATOMIC
implementation at least.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong May 2, 2017, 6:02 p.m. UTC | #2
On Tue, May 02, 2017 at 12:50:21AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 27, 2017 at 02:27:54PM -0700, Darrick J. Wong wrote:
> > XFS transactions are constrained both by space and block reservation
> > limits and the fact that we have to avoid doing 64-bit divisions.  This
> > means that we can't remap more than 2^32 blocks at a time.  However,
> > file logical blocks are 64-bit in size, so if we encounter a huge remap
> > request we have to break it up into smaller pieces.
> 
> But where would we get that huge remap request from?

Nowhere, at the moment.  I had O_ATOMIC in mind for this though, since
it'll call end_cow on the entire file at fsync time.  What if you've
written 8GB to a file that you've opened with ATOMIC and then fsync it?
That would trigger a remap longer than MAX_RW_COUNT which will blow the
assert, right?

> We already did the BUILD_BUG_ON for the max read/write size at least.
> Also the remaps would now not be atomic, which would be a problem for
> my O_ATOMIC implementation at least.

Hm... you're right, if we crash midway through the remap then ideally
we'd recover by finishing whatever remapping steps we didn't get to.

The current remapping mechanism only guarantees that whatever little
part of the data fork we've bunmapi'd for each cow fork extent will also
get remapped.  There isn't anything in there that guarantees a remap of
the parts we haven't touched yet.  If one CoW fork extent maps to 2000
data fork extents, we'll atomically remap each of the 2000 extents.  If
we fail at extent 900, the remaining 1100 extents are fed to the CoW
cleanup at the next mount time.  This patch doesn't try to change that
behavior.

For O_ATOMIC I think we'll have to put in some extra log intent items
to help us track all the extents we intend to remap so that we can
pick up where we left off during recovery.  Hm.  It would be difficult
to avoid running into log space problems if there are a lot of extents.

Second half-baked idea: play games with a shadow inode -- allocate an
unlinked inode, persist all the written CoW fork extents to the shadow
inode, and reflink the extents from the shadow to the original inode.
If we crash then we can just re-reflink everything in the shadow inode.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig May 4, 2017, 11:54 a.m. UTC | #3
On Tue, May 02, 2017 at 11:02:20AM -0700, Darrick J. Wong wrote:
> Nowhere, at the moment.  I had O_ATOMIC in mind for this though, since
> it'll call end_cow on the entire file at fsync time.  What if you've
> written 8GB to a file that you've opened with ATOMIC and then fsync it?
> That would trigger a remap longer than MAX_RW_COUNT which will blow the
> assert, right?

Yeah, good point.

> The current remapping mechanism only guarantees that whatever little
> part of the data fork we've bunmapi'd for each cow fork extent will also
> get remapped.  There isn't anything in there that guarantees a remap of
> the parts we haven't touched yet.  If one CoW fork extent maps to 2000
> data fork extents, we'll atomically remap each of the 2000 extents.  If
> we fail at extent 900, the remaining 1100 extents are fed to the CoW
> cleanup at the next mount time.  This patch doesn't try to change that
> behavior.

And that's perfectly fine, as long as we ensure after the mount time
cleanup everything is either remapped or not.

> Second half-baked idea: play games with a shadow inode -- allocate an
> unlinked inode, persist all the written CoW fork extents to the shadow
> inode, and reflink the extents from the shadow to the original inode.
> If we crash then we can just re-reflink everything in the shadow inode.

That was actually my very first implementation before reflink, and
using a swapext variant instead of reflinks.  Maybe it's actually better
in the end.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index ffe6fe7..bb61bb2 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -675,49 +675,29 @@  xfs_reflink_cancel_cow_range(
 	return error;
 }
 
-/*
- * Remap parts of a file's data fork after a successful CoW.
- */
-int
-xfs_reflink_end_cow(
+/* Remap a particular subrange of a file. */
+STATIC int
+xfs_reflink_end_cow_range(
 	struct xfs_inode		*ip,
-	xfs_off_t			offset,
-	xfs_off_t			count)
+	struct xfs_ifork		*ifp,
+	xfs_fileoff_t			offset_fsb,
+	xfs_fileoff_t			end_fsb)
 {
-	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	struct xfs_bmbt_irec		got, del;
+	struct xfs_bmbt_irec		got;
+	struct xfs_bmbt_irec		del;
+	struct xfs_defer_ops		dfops;
 	struct xfs_trans		*tp;
-	xfs_fileoff_t			offset_fsb;
-	xfs_fileoff_t			end_fsb;
 	xfs_fsblock_t			firstfsb;
-	struct xfs_defer_ops		dfops;
-	int				error;
-	unsigned int			resblks;
 	xfs_filblks_t			rlen;
+	unsigned int			resblks;
 	xfs_extnum_t			idx;
-
-	trace_xfs_reflink_end_cow(ip, offset, count);
-
-	/* No COW extents?  That's easy! */
-	if (ifp->if_bytes == 0)
-		return 0;
-
-	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
-	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+	int				error;
 
 	/*
-	 * Start a rolling transaction to switch the mappings.  We're
-	 * unlikely ever to have to remap 16T worth of single-block
-	 * extents, so just cap the worst case extent count to 2^32-1.
-	 * Stick a warning in just in case, and avoid 64-bit division.
+	 * Start a rolling transaction to switch the mappings.
+	 * Avoid 64-bit division.
 	 */
-	BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX);
-	if (end_fsb - offset_fsb > UINT_MAX) {
-		error = -EFSCORRUPTED;
-		xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
-		ASSERT(0);
-		goto out;
-	}
+	ASSERT(end_fsb - offset_fsb <= UINT_MAX);
 	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
 			(unsigned int)(end_fsb - offset_fsb),
 			XFS_DATA_FORK);
@@ -812,6 +792,41 @@  xfs_reflink_end_cow(
 }
 
 /*
+ * Remap parts of a file's data fork after a successful CoW.
+ */
+int
+xfs_reflink_end_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	struct xfs_ifork		*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	xfs_fileoff_t			opend;
+	int				error = 0;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	/* No COW extents?  That's easy! */
+	if (ifp->if_bytes == 0)
+		return 0;
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+
+	while (offset_fsb < end_fsb) {
+		opend = min_t(xfs_fileoff_t, offset_fsb + UINT_MAX, end_fsb);
+		error = xfs_reflink_end_cow_range(ip, ifp, offset_fsb, opend);
+		if (error)
+			break;
+		offset_fsb = opend;
+	}
+
+	return error;
+}
+
+/*
  * Free leftover CoW reservations that didn't get cleaned out.
  */
 int