diff mbox

[RFC,1/4] xfs: clean up cow fork reservation and tag inodes correctly

Message ID 1478636856-7590-2-git-send-email-bfoster@redhat.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Brian Foster Nov. 8, 2016, 8:27 p.m. UTC
COW fork reservation is implemented via delayed allocation. The code is
modeled after the traditional delalloc allocation code, but is slightly
different in terms of how preallocation occurs. Rather than post-eof
speculative preallocation, COW fork preallocation is implemented via a
COW extent size hint that is designed to minimize fragmentation as a
reflinked file is split over time.

xfs_reflink_reserve_cow() still uses logic that is oriented towards
dealing with post-eof speculative preallocation, however, and is stale
or not necessarily correct. First, the EOF alignment to the COW extent
size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so
correctly by aligning the start and end offsets) and so is not necessary
in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is
also ineffective for the same reason, as xfs_bmapi_reserve_delalloc()
will simply perform the same allocation request on the retry. Finally,
since the COW extent size hint aligns the start and end offset of the
range to allocate, the end_fsb != orig_end_fsb logic is not sufficient.
Indeed, if a write request happens to end on an aligned offset, it is
possible that we do not tag the inode for COW preallocation even though
xfs_bmapi_reserve_delalloc() may have preallocated at the start offset.

Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow() and
update the preallocation tag logic to check whether the bmap allocation
actually matches the range that was requested.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_reflink.c | 28 ++++------------------------
 1 file changed, 4 insertions(+), 24 deletions(-)

Comments

Christoph Hellwig Nov. 15, 2016, 2:16 p.m. UTC | #1
> +	if (imap->br_startoff != got.br_startoff ||
> +	    imap->br_blockcount != got.br_blockcount)
>  		xfs_inode_set_cowblocks_tag(ip);

Can't got.br_blockcount be smaller than imap->br_blockcount if we have
an existing COW fork reservation lying around behind the whole we're
filling?  Also they way xfs_bmapi_reserve_delalloc works the startoff
will be the same.  E.g. this check should probably be:

	if (got.br_blockcount > imap->br_blockcount)

Except for that the patch looks good and is a nice cleanup.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Nov. 15, 2016, 6:11 p.m. UTC | #2
On Tue, Nov 15, 2016 at 06:16:21AM -0800, Christoph Hellwig wrote:
> > +	if (imap->br_startoff != got.br_startoff ||
> > +	    imap->br_blockcount != got.br_blockcount)
> >  		xfs_inode_set_cowblocks_tag(ip);
> 
> Can't got.br_blockcount be smaller than imap->br_blockcount if we have
> an existing COW fork reservation lying around behind the whole we're
> filling?  Also they way xfs_bmapi_reserve_delalloc works the startoff
> will be the same.  E.g. this check should probably be:
> 

Good point, though I think it can be smaller or larger without
necessarily having preallocation due to being merged with surrounding
extents. I'm not quite sure what the right answer for that is with
regard to tagging, but I do think it's better to have false positive
tagging than false negatives.

startoff can actually change due to merges as well, but merging aside,
I'm not following how you expect startoff to always be the same in the
event of the extent size hint. E.g., I see the following on a quick test
(file is a 1m reflinked file):

# xfs_io -c fiemap -c "fiemap -c" /mnt/file 
/mnt/file:
        0: [0..2047]: 160..2207
/mnt/file:
# xfs_io -c "pwrite 32k 4k" /mnt/file 
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (16.910 MiB/sec and 4329.0043 ops/sec)
# xfs_io -c fiemap -c "fiemap -c" /mnt/file 
/mnt/file:
        0: [0..63]: 160..223
        1: [64..71]: 2272..2279
        2: [72..2047]: 232..2207
/mnt/file:
        0: [0..63]: 2208..2271
        1: [64..71]: hole
        2: [72..255]: 2280..2463
        3: [256..2047]: hole

So the cow extent size hint rounds out the allocation in the cow fork to
the aligned start/end offsets. In fact, I think it would do so
regardless of whether those covered blocks are even shared, but that's a
separate issue.

Given all of that, I could tweak the check to:

	if (got.br_startoff < imap->br_startoff ||
	    got.br_blockcount > imap->br_blockcount)
		...

Thoughts?

Brian

> 	if (got.br_blockcount > imap->br_blockcount)
> 
> Except for that the patch looks good and is a nice cleanup.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 18, 2016, 8:11 a.m. UTC | #3
On Tue, Nov 15, 2016 at 01:11:01PM -0500, Brian Foster wrote:
> On Tue, Nov 15, 2016 at 06:16:21AM -0800, Christoph Hellwig wrote:
> > > +	if (imap->br_startoff != got.br_startoff ||
> > > +	    imap->br_blockcount != got.br_blockcount)
> > >  		xfs_inode_set_cowblocks_tag(ip);
> > 
> > Can't got.br_blockcount be smaller than imap->br_blockcount if we have
> > an existing COW fork reservation lying around behind the whole we're
> > filling?  Also they way xfs_bmapi_reserve_delalloc works the startoff
> > will be the same.  E.g. this check should probably be:
> > 
> 
> Good point, though I think it can be smaller or larger without
> necessarily having preallocation due to being merged with surrounding
> extents. I'm not quite sure what the right answer for that is with
> regard to tagging, but I do think it's better to have false positive
> tagging than false negatives.

Good point, merging can change both the start and length.  Based on
that I think tagging in the caller of xfs_bmapi_reserve_delalloc is
wrong, and we need to do it inside xfs_bmapi_reserve_delalloc where
we know if we did speculative preallocation.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Nov. 18, 2016, 3:10 p.m. UTC | #4
On Fri, Nov 18, 2016 at 12:11:46AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 15, 2016 at 01:11:01PM -0500, Brian Foster wrote:
> > On Tue, Nov 15, 2016 at 06:16:21AM -0800, Christoph Hellwig wrote:
> > > > +	if (imap->br_startoff != got.br_startoff ||
> > > > +	    imap->br_blockcount != got.br_blockcount)
> > > >  		xfs_inode_set_cowblocks_tag(ip);
> > > 
> > > Can't got.br_blockcount be smaller than imap->br_blockcount if we have
> > > an existing COW fork reservation lying around behind the whole we're
> > > filling?  Also they way xfs_bmapi_reserve_delalloc works the startoff
> > > will be the same.  E.g. this check should probably be:
> > > 
> > 
> > Good point, though I think it can be smaller or larger without
> > necessarily having preallocation due to being merged with surrounding
> > extents. I'm not quite sure what the right answer for that is with
> > regard to tagging, but I do think it's better to have false positive
> > tagging than false negatives.
> 
> Good point, merging can change both the start and length.  Based on
> that I think tagging in the caller of xfs_bmapi_reserve_delalloc is
> wrong, and we need to do it inside xfs_bmapi_reserve_delalloc where
> we know if we did speculative preallocation.

xfs_bmapi_reserve_delalloc() doesn't really know if it's doing
preallocation outside of the extent size hint alignment. I don't think
we want to try and bury all of the prealloc stuff down in the bmapi
code, though we may be able to add a prealloc parameter to
bmapi_reserve_delalloc() such that it can add the prealloc itself and
also tell the difference between prealloc and merge cases. I'll play
around with it, thanks.

Brian
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index a279b4e..b19a7e0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -244,11 +244,9 @@  xfs_reflink_reserve_cow(
 	bool			*shared)
 {
 	struct xfs_bmbt_irec	got, prev;
-	xfs_fileoff_t		end_fsb, orig_end_fsb;
 	int			eof = 0, error = 0;
 	bool			trimmed;
 	xfs_extnum_t		idx;
-	xfs_extlen_t		align;
 
 	/*
 	 * Search the COW fork extent list first.  This serves two purposes:
@@ -285,32 +283,14 @@  xfs_reflink_reserve_cow(
 	if (error)
 		return error;
 
-	end_fsb = orig_end_fsb = imap->br_startoff + imap->br_blockcount;
 
-	align = xfs_eof_alignment(ip, xfs_get_cowextsz_hint(ip));
-	if (align)
-		end_fsb = roundup_64(end_fsb, align);
-
-retry:
 	error = xfs_bmapi_reserve_delalloc(ip, XFS_COW_FORK, imap->br_startoff,
-			end_fsb - imap->br_startoff, &got, &prev, &idx, eof);
-	switch (error) {
-	case 0:
-		break;
-	case -ENOSPC:
-	case -EDQUOT:
-		/* retry without any preallocation */
-		trace_xfs_reflink_cow_enospc(ip, imap);
-		if (end_fsb != orig_end_fsb) {
-			end_fsb = orig_end_fsb;
-			goto retry;
-		}
-		/*FALLTHRU*/
-	default:
+			imap->br_blockcount, &got, &prev, &idx, eof);
+	if (error)
 		return error;
-	}
 
-	if (end_fsb != orig_end_fsb)
+	if (imap->br_startoff != got.br_startoff ||
+	    imap->br_blockcount != got.br_blockcount)
 		xfs_inode_set_cowblocks_tag(ip);
 
 	trace_xfs_reflink_cow_alloc(ip, &got);