diff mbox series

[v3,6/6] xfs: use the latest extent at writeback delalloc conversion time

Message ID 20190123184131.46188-7-bfoster@redhat.com (mailing list archive)
State Superseded
Headers show
Series xfs: properly invalidate cached writeback mapping | expand

Commit Message

Brian Foster Jan. 23, 2019, 6:41 p.m. UTC
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping outside of the current
page. This is because the ilock is cycled between the time the
caller originally looked up the mapping and across each real
allocation of the provided file range. This code has collected
various hacks over the years to help combat the symptoms of these
races (i.e., truncate race detection, allocation into hole
detection, etc.), but none address the fundamental problem that the
imap may not be valid at allocation time.

Rather than continue to use race detection hacks, update writeback
delalloc conversion to a model that explicitly converts the delalloc
extent backing the current file offset being processed. The current
file offset is the only block we can trust to remain once the ilock
is dropped because any operation that can remove the block
(truncate, hole punch, etc.) must flush and discard pagecache pages
first.

Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
mechanism to request allocation of the entire delalloc extent
backing the current offset instead of assuming the extent passed by
the caller is unchanged. Record the range specified by the caller
and apply it to the resulting allocated extent so previous checks by
the caller for COW fork overlap are not lost. Finally, overload the
bmapi delalloc flag with the range reval flag behavior since this is
the only use case for both.

This ensures that writeback always picks up the correct
and current extent associated with the page, regardless of races
with other extent modifying operations. If operating on a data fork
and the COW overlap state has changed since the ilock was cycled,
the caller revalidates against the COW fork sequence number before
using the imap for the next block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  15 ++--
 fs/xfs/libxfs/xfs_bmap.h |   2 -
 fs/xfs/xfs_iomap.c       | 174 ++++++++++++++-------------------------
 3 files changed, 71 insertions(+), 120 deletions(-)

Comments

Christoph Hellwig Jan. 24, 2019, 8:52 a.m. UTC | #1
And this I really don't like.  I much prefer the full version
I've done here, which has evolved a bit from the work I've been
posting to the list for months now:

http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2

Which also fixes up some layering issues we have in that code path.
Brian Foster Jan. 24, 2019, 2:01 p.m. UTC | #2
On Thu, Jan 24, 2019 at 12:52:30AM -0800, Christoph Hellwig wrote:
> And this I really don't like.  I much prefer the full version
> I've done here, which has evolved a bit from the work I've been
> posting to the list for months now:
> 

This patch is just an incremental step in that direction to fix the
underlying problem. The functionality with regard to imap race
mitigation in the delalloc conversion path is essentially equivalent.
The difference is you've cleaned up the API further by lifting the
lookup into the caller and implementing a more proper conversion
implemention.

> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
> 
> Which also fixes up some layering issues we have in that code path.

Ok... I've not reviewed the details but as noted above, the approach to
dealing with imap validity looks sane to me. Essentially all I'm asking
for here is isolated patches for fixes vs. refactoring vs. whatever
other cleanups/changes were related to the always cow stuff and offering
this (or the v2 patch) as an already tested way to do that.

If you'd rather just put this all in your series, then please split this
all up into a patch per logical change (i.e., separate patches to
move/rename functions first, introduce the new bmapi delalloc conversion
helper, modify writeback to use said helper, the layering fixups you're
referring to, etc.). At a glance, the commit above looks like it could
easily be split into at least 3 or 4 patches.

Brian
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 856de22439a3..3229a82de1fb 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4296,15 +4296,14 @@  xfs_bmapi_write(
 	bma.datatype = 0;
 
 	/*
-	 * The reval flag means the caller wants to allocate the entire delalloc
-	 * extent backing bno where bno may not necessarily match the startoff.
-	 * Now that we've looked up the extent, reset the range to map based on
-	 * the extent in the file. If we're in a hole, this may be an error so
-	 * don't adjust anything.
+	 * The delalloc flag means the caller wants to allocate the entire
+	 * delalloc extent backing bno where bno may not necessarily match the
+	 * startoff. Now that we've looked up the extent, reset the range to
+	 * map based on the extent in the file. If we're in a hole, this may be
+	 * an error so don't adjust anything.
 	 */
-	if ((flags & XFS_BMAPI_REVALRANGE) &&
+	if ((flags & XFS_BMAPI_DELALLOC) &&
 	    !eof && bno >= bma.got.br_startoff) {
-		ASSERT(flags & XFS_BMAPI_DELALLOC);
 		bno = bma.got.br_startoff;
 		len = bma.got.br_blockcount;
 #ifdef DEBUG
@@ -4491,7 +4490,7 @@  xfs_bmapi_delalloc(
 	 * The reval flag means to allocate the entire extent; pass a dummy
 	 * length of 1.
 	 */
-	flags |= XFS_BMAPI_REVALRANGE;
+	flags |= XFS_BMAPI_DELALLOC;
 	return xfs_bmapi_write(tp, ip, bno, 1, flags, total, imap, nimaps);
 }
 
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index a53eb3d527e2..4e8bd2837cb0 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -107,8 +107,6 @@  struct xfs_extent_free_item
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	0x2000
 
-#define XFS_BMAPI_REVALRANGE	0x4000
-
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ab69caa685b4..066c2120f0ba 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -677,25 +677,26 @@  xfs_file_iomap_begin_delay(
  */
 int
 xfs_iomap_write_allocate(
-	xfs_inode_t	*ip,
-	int		whichfork,
-	xfs_off_t	offset,
-	xfs_bmbt_irec_t *imap,
-	unsigned int	*seq)
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_off_t		offset,
+	struct xfs_bmbt_irec	*imap,
+	unsigned int		*seq)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
-	xfs_fileoff_t	offset_fsb, last_block;
-	xfs_fileoff_t	end_fsb, map_start_fsb;
-	xfs_filblks_t	count_fsb;
-	xfs_trans_t	*tp;
-	int		nimaps;
-	int		error = 0;
-	int		flags = XFS_BMAPI_DELALLOC;
-	int		nres;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		map_start_fsb;
+	xfs_extlen_t		map_count_fsb;
+	struct xfs_trans	*tp;
+	int			nimaps;
+	int			error = 0;
+	int			flags = XFS_BMAPI_DELALLOC;
+	int			nres;
 
 	if (whichfork == XFS_COW_FORK)
 		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
+	nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
 
 	/*
 	 * Make sure that the dquots are there.
@@ -704,106 +705,63 @@  xfs_iomap_write_allocate(
 	if (error)
 		return error;
 
+	/*
+	 * Store the file range the caller is interested in because it encodes
+	 * state such as potential overlap with COW fork blocks. We must trim
+	 * the allocated extent down to this range to maintain consistency with
+	 * what the caller expects. Revalidation of the range itself is the
+	 * responsibility of the caller.
+	 */
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
-	count_fsb = imap->br_blockcount;
 	map_start_fsb = imap->br_startoff;
+	map_count_fsb = imap->br_blockcount;
 
-	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, count_fsb));
+	XFS_STATS_ADD(mp, xs_xstrat_bytes,
+		      XFS_FSB_TO_B(mp, imap->br_blockcount));
 
-	while (count_fsb != 0) {
+	while (true) {
 		/*
-		 * Set up a transaction with which to allocate the
-		 * backing store for the file.  Do allocations in a
-		 * loop until we get some space in the range we are
-		 * interested in.  The other space that might be allocated
-		 * is in the delayed allocation extent on which we sit
-		 * but before our buffer starts.
+		 * Allocate in a loop because it may take several attempts to
+		 * allocate real blocks for a contiguous delalloc extent if free
+		 * space is sufficiently fragmented. Note that space for the
+		 * extent and indirect blocks was reserved when the delalloc
+		 * extent was created so there's no need to do so here.
 		 */
-		nimaps = 0;
-		while (nimaps == 0) {
-			nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-			/*
-			 * We have already reserved space for the extent and any
-			 * indirect blocks when creating the delalloc extent,
-			 * there is no need to reserve space in this transaction
-			 * again.
-			 */
-			error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0,
-					0, XFS_TRANS_RESERVE, &tp);
-			if (error)
-				return error;
-
-			xfs_ilock(ip, XFS_ILOCK_EXCL);
-			xfs_trans_ijoin(tp, ip, 0);
-
-			/*
-			 * it is possible that the extents have changed since
-			 * we did the read call as we dropped the ilock for a
-			 * while. We have to be careful about truncates or hole
-			 * punchs here - we are not allowed to allocate
-			 * non-delalloc blocks here.
-			 *
-			 * The only protection against truncation is the pages
-			 * for the range we are being asked to convert are
-			 * locked and hence a truncate will block on them
-			 * first.
-			 *
-			 * As a result, if we go beyond the range we really
-			 * need and hit an delalloc extent boundary followed by
-			 * a hole while we have excess blocks in the map, we
-			 * will fill the hole incorrectly and overrun the
-			 * transaction reservation.
-			 *
-			 * Using a single map prevents this as we are forced to
-			 * check each map we look for overlap with the desired
-			 * range and abort as soon as we find it. Also, given
-			 * that we only return a single map, having one beyond
-			 * what we can return is probably a bit silly.
-			 *
-			 * We also need to check that we don't go beyond EOF;
-			 * this is a truncate optimisation as a truncate sets
-			 * the new file size before block on the pages we
-			 * currently have locked under writeback. Because they
-			 * are about to be tossed, we don't need to write them
-			 * back....
-			 */
-			nimaps = 1;
-			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-			error = xfs_bmap_last_offset(ip, &last_block,
-							XFS_DATA_FORK);
-			if (error)
-				goto trans_cancel;
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
+					XFS_TRANS_RESERVE, &tp);
+		if (error)
+			return error;
 
-			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-			if ((map_start_fsb + count_fsb) > last_block) {
-				count_fsb = last_block - map_start_fsb;
-				if (count_fsb == 0) {
-					error = -EAGAIN;
-					goto trans_cancel;
-				}
-			}
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, ip, 0);
 
-			/*
-			 * From this point onwards we overwrite the imap
-			 * pointer that the caller gave to us.
-			 */
-			error = xfs_bmapi_write(tp, ip, map_start_fsb,
-						count_fsb, flags, nres, imap,
-						&nimaps);
-			if (error)
-				goto trans_cancel;
+		/*
+		 * ilock was dropped since imap was populated which means it
+		 * might no longer be valid. The current page is held locked so
+		 * nothing could have removed the block backing offset_fsb.
+		 * Attempt to allocate whatever delalloc extent currently backs
+		 * offset_fsb and put the result in the imap pointer from the
+		 * caller. We'll trim it down to the caller's most recently
+		 * validated range before we return.
+		 */
+		nimaps = 1;
+		error = xfs_bmapi_delalloc(tp, ip, offset_fsb, flags, nres,
+					   imap, &nimaps);
+		if (nimaps == 0)
+			error = -EFSCORRUPTED;
+		if (error)
+			goto trans_cancel;
 
-			error = xfs_trans_commit(tp);
-			if (error)
-				goto error0;
+		error = xfs_trans_commit(tp);
+		if (error)
+			goto error0;
 
-			*seq = READ_ONCE(ifp->if_seq);
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		}
+		*seq = READ_ONCE(ifp->if_seq);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 		/*
-		 * See if we were able to allocate an extent that
-		 * covers at least part of the callers request
+		 * See if we were able to allocate an extent that covers at
+		 * least part of the callers request.
 		 */
 		if (!(imap->br_startblock || XFS_IS_REALTIME_INODE(ip)))
 			return xfs_alert_fsblock_zero(ip, imap);
@@ -812,15 +770,11 @@  xfs_iomap_write_allocate(
 		    (offset_fsb < (imap->br_startoff +
 				   imap->br_blockcount))) {
 			XFS_STATS_INC(mp, xs_xstrat_quick);
+			xfs_trim_extent(imap, map_start_fsb, map_count_fsb);
+			ASSERT(offset_fsb >= imap->br_startoff &&
+			       offset_fsb < imap->br_startoff + imap->br_blockcount);
 			return 0;
 		}
-
-		/*
-		 * So far we have not mapped the requested part of the
-		 * file, just surrounding data, try again.
-		 */
-		count_fsb -= imap->br_blockcount;
-		map_start_fsb = imap->br_startoff + imap->br_blockcount;
 	}
 
 trans_cancel: