diff mbox series

[v2,5/5] xfs: revalidate imap properly before writeback delalloc conversion

Message ID 20190117192004.49346-6-bfoster@redhat.com (mailing list archive)
State New, archived
Headers show
Series xfs: properly invalidate cached writeback mapping | expand

Commit Message

Brian Foster Jan. 17, 2019, 7:20 p.m. UTC
The writeback delalloc conversion code has a couple historical hacks
to help deal with racing with other operations on the file under
writeback. For example, xfs_iomap_write_allocate() checks i_size
once under ilock to determine whether a truncate may have
invalidated some portion of the range we're about to map. This helps
prevent the bmapi call from doing physical allocation where only
delalloc conversion was expected (which is a transaction reservation
overrun vector). XFS_BMAPI_DELALLOC helps similarly protect us from
this problem as it ensures that the bmapi call will only convert
existing blocks.

Neither of these actually ensure we actually pass a valid range to
convert to xfs_bmapi_write(). The bmapi flag turns the
aforementioned physical allocation over a hole problem into an
explicit error, which means that any particular instances of that
problem change from being a transaction overrun error to a writeback
error. The latter means page discard, which is an ominous error even
if due to the underlying block(s) being punched out.

Now that we have a mechanism in place to track inode fork changes,
use it in xfs_iomap_write_allocate() to properly handle potential
changes to the cached writeback mapping and establish a valid range
to map. Check the already provided fork seqno once under ilock. If
it has changed, look up the extent again in the associated fork and
trim the caller's mapping to the latest version. We can expect to
still find the blocks backing the current file offset because the
page is held locked during writeback and page cache truncation
acquires the page lock and waits on writeback to complete before it
can toss the page. This ensures that writeback never attempts to map
a range of a file that has been punched out and never submits I/O to
blocks that are no longer mapped to the file.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iomap.c | 102 +++++++++++++++++++--------------------------
 1 file changed, 44 insertions(+), 58 deletions(-)

Comments

Allison Henderson Jan. 18, 2019, 6:58 a.m. UTC | #1
On 1/17/19 12:20 PM, Brian Foster wrote:
> The writeback delalloc conversion code has a couple historical hacks
> to help deal with racing with other operations on the file under
> writeback. For example, xfs_iomap_write_allocate() checks i_size
> once under ilock to determine whether a truncate may have
> invalidated some portion of the range we're about to map. This helps
> prevent the bmapi call from doing physical allocation where only
> delalloc conversion was expected (which is a transaction reservation
> overrun vector). XFS_BMAPI_DELALLOC helps similarly protect us from
> this problem as it ensures that the bmapi call will only convert
> existing blocks.
> 
> Neither of these actually ensure we actually pass a valid range to
> convert to xfs_bmapi_write(). The bmapi flag turns the
> aforementioned physical allocation over a hole problem into an
> explicit error, which means that any particular instances of that
> problem change from being a transaction overrun error to a writeback
> error. The latter means page discard, which is an ominous error even
> if due to the underlying block(s) being punched out.
> 
> Now that we have a mechanism in place to track inode fork changes,
> use it in xfs_iomap_write_allocate() to properly handle potential
> changes to the cached writeback mapping and establish a valid range
> to map. Check the already provided fork seqno once under ilock. If
> it has changed, look up the extent again in the associated fork and
> trim the caller's mapping to the latest version. We can expect to
> still find the blocks backing the current file offset because the
> page is held locked during writeback and page cache truncation
> acquires the page lock and waits on writeback to complete before it
> can toss the page. This ensures that writeback never attempts to map
> a range of a file that has been punched out and never submits I/O to
> blocks that are no longer mapped to the file.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_iomap.c | 102 +++++++++++++++++++--------------------------
>   1 file changed, 44 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ab69caa685b4..80da465ebf7a 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -677,22 +677,24 @@ xfs_file_iomap_begin_delay(
>    */
>   int
>   xfs_iomap_write_allocate(
> -	xfs_inode_t	*ip,
> -	int		whichfork,
> -	xfs_off_t	offset,
> -	xfs_bmbt_irec_t *imap,
> -	unsigned int	*seq)
> +	struct xfs_inode	*ip,
> +	int			whichfork,
> +	xfs_off_t		offset,
> +	struct xfs_bmbt_irec	*imap,
> +	unsigned int		*seq)
>   {
> -	xfs_mount_t	*mp = ip->i_mount;
> -	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> -	xfs_fileoff_t	offset_fsb, last_block;
> -	xfs_fileoff_t	end_fsb, map_start_fsb;
> -	xfs_filblks_t	count_fsb;
> -	xfs_trans_t	*tp;
> -	int		nimaps;
> -	int		error = 0;
> -	int		flags = XFS_BMAPI_DELALLOC;
> -	int		nres;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> +	struct xfs_trans	*tp;
> +	struct xfs_iext_cursor	icur;
> +	struct xfs_bmbt_irec	timap;
> +	xfs_fileoff_t		offset_fsb;
> +	xfs_fileoff_t		map_start_fsb;
> +	xfs_filblks_t		count_fsb;
> +	int			nimaps;
> +	int			error = 0;
> +	int			flags = XFS_BMAPI_DELALLOC;
> +	int			nres;
>   
>   	if (whichfork == XFS_COW_FORK)
>   		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
> @@ -737,56 +739,40 @@ xfs_iomap_write_allocate(
>   			xfs_trans_ijoin(tp, ip, 0);
>   
>   			/*
> -			 * it is possible that the extents have changed since
> -			 * we did the read call as we dropped the ilock for a
> -			 * while. We have to be careful about truncates or hole
> -			 * punchs here - we are not allowed to allocate
> -			 * non-delalloc blocks here.
> -			 *
> -			 * The only protection against truncation is the pages
> -			 * for the range we are being asked to convert are
> -			 * locked and hence a truncate will block on them
> -			 * first.
> -			 *
> -			 * As a result, if we go beyond the range we really
> -			 * need and hit an delalloc extent boundary followed by
> -			 * a hole while we have excess blocks in the map, we
> -			 * will fill the hole incorrectly and overrun the
> -			 * transaction reservation.
> +			 * Now that we have ILOCK we must account for the fact
> +			 * that the fork (and thus our mapping) could have
> +			 * changed while the inode was unlocked. If the fork
> +			 * has changed, trim the caller's mapping to the
> +			 * current extent in the fork.
>   			 *
> -			 * Using a single map prevents this as we are forced to
> -			 * check each map we look for overlap with the desired
> -			 * range and abort as soon as we find it. Also, given
> -			 * that we only return a single map, having one beyond
> -			 * what we can return is probably a bit silly.
> +			 * If the external change did not modify the current
> +			 * mapping (or just grew it) this will have no effect.
> +			 * If the current mapping shrunk, we expect to at
> +			 * minimum still have blocks backing the current page as
> +			 * the page has remained locked since writeback first
> +			 * located delalloc block(s) at the page offset. A
> +			 * racing truncate, hole punch or even reflink must wait
> +			 * on page writeback before it can modify our page and
> +			 * underlying block(s).
>   			 *
> -			 * We also need to check that we don't go beyond EOF;
> -			 * this is a truncate optimisation as a truncate sets
> -			 * the new file size before block on the pages we
> -			 * currently have locked under writeback. Because they
> -			 * are about to be tossed, we don't need to write them
> -			 * back....
> +			 * We'll update *seq before we drop ilock for the next
> +			 * iteration.
>   			 */
> -			nimaps = 1;
> -			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> -			error = xfs_bmap_last_offset(ip, &last_block,
> -							XFS_DATA_FORK);
> -			if (error)
> -				goto trans_cancel;
> -
> -			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
> -			if ((map_start_fsb + count_fsb) > last_block) {
> -				count_fsb = last_block - map_start_fsb;
> -				if (count_fsb == 0) {
> -					error = -EAGAIN;
> +			if (*seq != READ_ONCE(ifp->if_seq)) {
> +				if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
> +							    &icur, &timap) ||
> +				    timap.br_startoff > offset_fsb) {
> +					ASSERT(0);
> +					error = -EFSCORRUPTED;
>   					goto trans_cancel;
>   				}
> +				xfs_trim_extent(imap, timap.br_startoff,
> +						timap.br_blockcount);
> +				count_fsb = imap->br_blockcount;
> +				map_start_fsb = imap->br_startoff;
>   			}
>   
> -			/*
> -			 * From this point onwards we overwrite the imap
> -			 * pointer that the caller gave to us.
> -			 */
> +			nimaps = 1;
>   			error = xfs_bmapi_write(tp, ip, map_start_fsb,
>   						count_fsb, flags, nres, imap,
>   						&nimaps);
> 
Alrighty, it seems reasonable to me.  You can add my review. Thanks!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Christoph Hellwig Jan. 18, 2019, 11:59 a.m. UTC | #2
> +			if (*seq != READ_ONCE(ifp->if_seq)) {
> +				if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
> +							    &icur, &timap) ||
> +				    timap.br_startoff > offset_fsb) {
> +					ASSERT(0);
> +					error = -EFSCORRUPTED;
>  					goto trans_cancel;
>  				}
> +				xfs_trim_extent(imap, timap.br_startoff,
> +						timap.br_blockcount);
> +				count_fsb = imap->br_blockcount;
> +				map_start_fsb = imap->br_startoff;

I very much disagree with this logic.

xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
passed in blocks anyway.  So this trimming logic should move into
xfs_bmapi_write to ensure it does the right thing, instead of papering
over the logic in xfs_bmapi_write in the caller with another lookup.

I think the improved XFS_BMAPI_DELALLOC handling in this patch:

   http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06

is the right answer, as writeback really should only ever convert
existing delalloc extents, and I'd much rather rely on that rather than
piling hacks of hacks to feed the right range into a function that must
do a lookup in the extent tree anyway.
Christoph Hellwig Jan. 18, 2019, 4:31 p.m. UTC | #3
On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote:
> xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
> passed in blocks anyway.  So this trimming logic should move into
> xfs_bmapi_write to ensure it does the right thing, instead of papering
> over the logic in xfs_bmapi_write in the caller with another lookup.
> 
> I think the improved XFS_BMAPI_DELALLOC handling in this patch:
> 
>    http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06
> 
> is the right answer, as writeback really should only ever convert
> existing delalloc extents, and I'd much rather rely on that rather than
> piling hacks of hacks to feed the right range into a function that must
> do a lookup in the extent tree anyway.

FYI, here is a tree that rebases my patches on top of your (minus this
one) and also drops the internal i_size checks, although it still keeps
the initial one for the truncate fast path.  Testing is still ongoing,
and for the writeback issue you probably only need the first three
of my patches:

	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation
Brian Foster Jan. 18, 2019, 6:39 p.m. UTC | #4
On Fri, Jan 18, 2019 at 08:31:32AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote:
> > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
> > passed in blocks anyway.  So this trimming logic should move into
> > xfs_bmapi_write to ensure it does the right thing, instead of papering
> > over the logic in xfs_bmapi_write in the caller with another lookup.
> > 
> > I think the improved XFS_BMAPI_DELALLOC handling in this patch:
> > 
> >    http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06
> > 
> > is the right answer, as writeback really should only ever convert
> > existing delalloc extents, and I'd much rather rely on that rather than
> > piling hacks of hacks to feed the right range into a function that must
> > do a lookup in the extent tree anyway.
> 
> FYI, here is a tree that rebases my patches on top of your (minus this
> one) and also drops the internal i_size checks, although it still keeps
> the initial one for the truncate fast path.  Testing is still ongoing,
> and for the writeback issue you probably only need the first three
> of my patches:
> 
> 	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation

This doesn't really seem all that different to me. Rather than firm up
the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
seemingly behaves a bit more like CONVERT_ONLY.

A few notes/thoughts:

1. What's the purpose of the nimaps == 0 check in
xfs_iomap_write_allocate()? If we got to this point with the page lock
held, shouldn't we always expect to find something backing the current
page?

2. If so, then it also seems that the whole "eof:" thing in
xfs_map_blocks() should never happen for data forks. If that's the case,
the use of the eof label there seems gratuitous.

3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
to racing with a hole punch for example) and it finds some subsequent
delalloc extent to convert in the requested range, the arithmetic in
xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
portion of the range relative to the startblock of the converted
delalloc extent. I don't think that causes a problem with the writeback
code because the fabricated blocks are before the page offset, but those
semantics are insane. I think you need to do something like fix up
xfs_bmapi_write() to return the actual converted mapping and make sure
xfs_iomap_write_allocate() handles that it might start beyond
map_start_fsb.

Brian
Brian Foster Jan. 20, 2019, 12:45 p.m. UTC | #5
On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote:
> On Fri, Jan 18, 2019 at 08:31:32AM -0800, Christoph Hellwig wrote:
> > On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote:
> > > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the
> > > passed in blocks anyway.  So this trimming logic should move into
> > > xfs_bmapi_write to ensure it does the right thing, instead of papering
> > > over the logic in xfs_bmapi_write in the caller with another lookup.
> > > 
> > > I think the improved XFS_BMAPI_DELALLOC handling in this patch:
> > > 
> > >    http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06
> > > 
> > > is the right answer, as writeback really should only ever convert
> > > existing delalloc extents, and I'd much rather rely on that rather than
> > > piling hacks of hacks to feed the right range into a function that must
> > > do a lookup in the extent tree anyway.
> > 
> > FYI, here is a tree that rebases my patches on top of your (minus this
> > one) and also drops the internal i_size checks, although it still keeps
> > the initial one for the truncate fast path.  Testing is still ongoing,
> > and for the writeback issue you probably only need the first three
> > of my patches:
> > 
> > 	http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation
> 
> This doesn't really seem all that different to me. Rather than firm up
> the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
> seemingly behaves a bit more like CONVERT_ONLY.
> 
> A few notes/thoughts:
> 
> 1. What's the purpose of the nimaps == 0 check in
> xfs_iomap_write_allocate()? If we got to this point with the page lock
> held, shouldn't we always expect to find something backing the current
> page?
> 
> 2. If so, then it also seems that the whole "eof:" thing in
> xfs_map_blocks() should never happen for data forks. If that's the case,
> the use of the eof label there seems gratuitous.
> 
> 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
> to racing with a hole punch for example) and it finds some subsequent
> delalloc extent to convert in the requested range, the arithmetic in
> xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
> portion of the range relative to the startblock of the converted
> delalloc extent. I don't think that causes a problem with the writeback
> code because the fabricated blocks are before the page offset, but those
> semantics are insane. I think you need to do something like fix up
> xfs_bmapi_write() to return the actual converted mapping and make sure
> xfs_iomap_write_allocate() handles that it might start beyond
> map_start_fsb.
> 

A couple more thoughts now that I've had more time to think about this:

4. Kind of a nit, but the comment update in xfs_bmapi_write() that
describes the ilock and associated race window and whatnot should really
be split between there and xfs_iomap_write_allocate(). The former should
just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes,
real extents, over a range..). The latter should explain how
the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the
writeback code.

5. A more serious point.. with this approach, what prevents writeback
from thrashing with buffered writes in the event of the hole punch race
scenario I mentioned previously? One of the things I explicitly wanted
to avoid in this patch is the potential to bounce back and forth between
writes and writeback over extent updates. There is the performance
aspect of that, which is what I was initially concerned about, but your
patch introduces a functional aspect of that we need to consider as
well.

For (the extreme/worst case) example, suppose again we're at writeback
in the middle of a fairly large delalloc extent. Somehow or another we
race and lose a good portion at the start of the currently cached extent
and then see a buffered write at the start offset of the (now invalid)
cached extent. The now invalid range is passed to xfs_bmapi_write(),
which finds and allocates that first block in the range, commits the
transaction and cycles the ilock. If we now contend with another
buffered write at the next offset in the original extent range,
writeback finds and allocates at that offset next and the cycle repeats
until we finally allocate an extent that backs our current page.

Unless I'm missing something here, there's not only opportunity to spend
an awkward amount of time processing unrelated delalloc extents for a
single page, there's potential to allocate bits of what should be a
large contiguous delalloc extent a block at a time, which can defeat the
purpose of delayed allocation (assuming the block allocator doesn't save
our ass).

Hmm, I'm not totally sure what the right answer to that is. On one hand,
I think the approach of using XFS_BMAPI_DELALLOC is fairly elegant. On
the other, I don't think a solution open to this behavior is robust
enough.

One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC
from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that
facilitates new semantics (I'm not terribly comfortable with overloading
the semantics of xfs_bmapi_write()). Instead of passing a range to
xfs_bmapi_delalloc(), just pass the offset we care about and expect that
this function will attempt to allocate the entire extent currently
backing offset. (Alternatively, we could perhaps pass a range by value
and allow xfs_bmapi_delalloc() to update the range in the event of
delalloc discontiguity.) Either way, the extent returned may not cover
the offset (due to fragmentation, as today) and thus the caller needs to
iterate until that happens. The larger point is that we'd lookup the
current extent _at offset_ on each iteration and thus shouldn't ever
contend with new delalloc reservations. Thoughts?

Brian

> Brian
Christoph Hellwig Jan. 21, 2019, 3:48 p.m. UTC | #6
On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote:
> This doesn't really seem all that different to me. Rather than firm up
> the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
> seemingly behaves a bit more like CONVERT_ONLY.

The differences are:

 (a) we save one lookup in the extent tree
 (b) we have a less fragile API

> A few notes/thoughts:
> 
> 1. What's the purpose of the nimaps == 0 check in
> xfs_iomap_write_allocate()? If we got to this point with the page lock
> held, shouldn't we always expect to find something backing the current
> page?

It protects against the case where we migrate a COW fork mapping to the
data fork, which is not protected by the page lock.  But I guess the
check warrants a comment and an assert.

> 2. If so, then it also seems that the whole "eof:" thing in
> xfs_map_blocks() should never happen for data forks. If that's the case,
> the use of the eof label there seems gratuitous.

Let me try with asserts enabled.

> 
> 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
> to racing with a hole punch for example) and it finds some subsequent
> delalloc extent to convert in the requested range, the arithmetic in
> xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
> portion of the range relative to the startblock of the converted
> delalloc extent. I don't think that causes a problem with the writeback
> code because the fabricated blocks are before the page offset, but those
> semantics are insane. I think you need to do something like fix up
> xfs_bmapi_write() to return the actual converted mapping and make sure
> xfs_iomap_write_allocate() handles that it might start beyond
> map_start_fsb.

Sounds good.  To be honest I think the whole idea of converting the
mapping before the requested offset is a rather bad idea to start
with, just increasin our window that exposes stale data.  But to fix
this properly we need to create everything as unwritten first, and
I didn't have time to get back to that series.
Christoph Hellwig Jan. 21, 2019, 3:51 p.m. UTC | #7
On Sun, Jan 20, 2019 at 07:45:02AM -0500, Brian Foster wrote:
> 4. Kind of a nit, but the comment update in xfs_bmapi_write() that
> describes the ilock and associated race window and whatnot should really
> be split between there and xfs_iomap_write_allocate(). The former should
> just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes,
> real extents, over a range..). The latter should explain how
> the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the
> writeback code.

Ok.

> One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC
> from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that
> facilitates new semantics (I'm not terribly comfortable with overloading
> the semantics of xfs_bmapi_write()). Instead of passing a range to
> xfs_bmapi_delalloc(), just pass the offset we care about and expect that
> this function will attempt to allocate the entire extent currently
> backing offset. (Alternatively, we could perhaps pass a range by value
> and allow xfs_bmapi_delalloc() to update the range in the event of
> delalloc discontiguity.) Either way, the extent returned may not cover
> the offset (due to fragmentation, as today) and thus the caller needs to
> iterate until that happens. The larger point is that we'd lookup the
> current extent _at offset_ on each iteration and thus shouldn't ever
> contend with new delalloc reservations. Thoughts?

I considered splitting it off and even had an early prototype.  I
got something wrong and it didn't work, and it created a little too
much duplication for my taste so I gave up on it for now.  But
fundamentally having the delalloc conversion separate from
xfs_bmapi_write is the right thing.  I'll just have to find some
time for it or pass this work off to you..
Brian Foster Jan. 21, 2019, 5:42 p.m. UTC | #8
On Mon, Jan 21, 2019 at 07:48:33AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote:
> > This doesn't really seem all that different to me. Rather than firm up
> > the range in the caller, we turn XFS_BMAPI_DELALLOC into something that
> > seemingly behaves a bit more like CONVERT_ONLY.
> 
> The differences are:
> 
>  (a) we save one lookup in the extent tree
>  (b) we have a less fragile API
> 
> > A few notes/thoughts:
> > 
> > 1. What's the purpose of the nimaps == 0 check in
> > xfs_iomap_write_allocate()? If we got to this point with the page lock
> > held, shouldn't we always expect to find something backing the current
> > page?
> 
> It protects against the case where we migrate a COW fork mapping to the
> data fork, which is not protected by the page lock.  But I guess the
> check warrants a comment and an assert.
> 

Yeah, probably. It's not really clear to me what that means.

> > 2. If so, then it also seems that the whole "eof:" thing in
> > xfs_map_blocks() should never happen for data forks. If that's the case,
> > the use of the eof label there seems gratuitous.
> 
> Let me try with asserts enabled.
> 

Ok, thanks.

> > 
> > 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due
> > to racing with a hole punch for example) and it finds some subsequent
> > delalloc extent to convert in the requested range, the arithmetic in
> > xfs_bmapi_trim_map() actually fabricates physical blocks for the hole
> > portion of the range relative to the startblock of the converted
> > delalloc extent. I don't think that causes a problem with the writeback
> > code because the fabricated blocks are before the page offset, but those
> > semantics are insane. I think you need to do something like fix up
> > xfs_bmapi_write() to return the actual converted mapping and make sure
> > xfs_iomap_write_allocate() handles that it might start beyond
> > map_start_fsb.
> 
> Sounds good.  To be honest I think the whole idea of converting the
> mapping before the requested offset is a rather bad idea to start
> with, just increasin our window that exposes stale data.  But to fix
> this properly we need to create everything as unwritten first, and
> I didn't have time to get back to that series.

Eh, I think that's a separate problem that re: your other series, we
already know how to fix properly. I don't think we should mess around
with something as fairly fundamental as delayed block allocation
behavior for an approach that doesn't address the underlying problem.

Brian
Brian Foster Jan. 21, 2019, 5:43 p.m. UTC | #9
On Mon, Jan 21, 2019 at 07:51:10AM -0800, Christoph Hellwig wrote:
> On Sun, Jan 20, 2019 at 07:45:02AM -0500, Brian Foster wrote:
> > 4. Kind of a nit, but the comment update in xfs_bmapi_write() that
> > describes the ilock and associated race window and whatnot should really
> > be split between there and xfs_iomap_write_allocate(). The former should
> > just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes,
> > real extents, over a range..). The latter should explain how
> > the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the
> > writeback code.
> 
> Ok.
> 
> > One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC
> > from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that
> > facilitates new semantics (I'm not terribly comfortable with overloading
> > the semantics of xfs_bmapi_write()). Instead of passing a range to
> > xfs_bmapi_delalloc(), just pass the offset we care about and expect that
> > this function will attempt to allocate the entire extent currently
> > backing offset. (Alternatively, we could perhaps pass a range by value
> > and allow xfs_bmapi_delalloc() to update the range in the event of
> > delalloc discontiguity.) Either way, the extent returned may not cover
> > the offset (due to fragmentation, as today) and thus the caller needs to
> > iterate until that happens. The larger point is that we'd lookup the
> > current extent _at offset_ on each iteration and thus shouldn't ever
> > contend with new delalloc reservations. Thoughts?
> 
> I considered splitting it off and even had an early prototype.  I
> got something wrong and it didn't work, and it created a little too
> much duplication for my taste so I gave up on it for now.  But
> fundamentally having the delalloc conversion separate from
> xfs_bmapi_write is the right thing.  I'll just have to find some
> time for it or pass this work off to you..

Ok, well I don't mind looking into how to refactor that code, but my
priority for this series is to fix the underlying problem. As a
temporary compromise, I think there might be a couple simple options to
create an xfs_bmapi_delalloc() wrapper over xfs_bmapi_write(). I'm
curious whether we could just pass bno and len == 1 from the delalloc
wrapper and get the behavior we want. Alternatively, perhaps we could
just factor out the bma.got lookup from xfs_bmapi_write() and use that
to handle the range properly in the delalloc case (i.e., pass *got and
*eof into a __xfs_bmapi_write() internal function that does most of
everything else).

If either of those don't work, a temporary fallback may be to just bury
the seqno lookup logic from this patch into xfs_bmapi_delalloc() and
document that further refactoring is required. That would retain the
extra lookup (for now), but TBH the testing I've already down wrt to
excessive higher level revalidations kind of shows that this has no real
impact on writeback performance, it's just a bit ugly.

Brian
Christoph Hellwig Jan. 22, 2019, 5:14 p.m. UTC | #10
On Mon, Jan 21, 2019 at 12:42:56PM -0500, Brian Foster wrote:
> > It protects against the case where we migrate a COW fork mapping to the
> > data fork, which is not protected by the page lock.  But I guess the
> > check warrants a comment and an assert.
> > 
> 
> Yeah, probably. It's not really clear to me what that means.
> 
> > > 2. If so, then it also seems that the whole "eof:" thing in
> > > xfs_map_blocks() should never happen for data forks. If that's the case,
> > > the use of the eof label there seems gratuitous.
> > 
> > Let me try with asserts enabled.
> > 
> 
> Ok, thanks.

So, the nimaps == 0 case hits even for the data fork when running
xfs/442 on a 1k block size file system.  That test has generally been
a fun source of bugs in the always_cow series.
Brian Foster Jan. 22, 2019, 6:13 p.m. UTC | #11
On Tue, Jan 22, 2019 at 09:14:45AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 21, 2019 at 12:42:56PM -0500, Brian Foster wrote:
> > > It protects against the case where we migrate a COW fork mapping to the
> > > data fork, which is not protected by the page lock.  But I guess the
> > > check warrants a comment and an assert.
> > > 
> > 
> > Yeah, probably. It's not really clear to me what that means.
> > 
> > > > 2. If so, then it also seems that the whole "eof:" thing in
> > > > xfs_map_blocks() should never happen for data forks. If that's the case,
> > > > the use of the eof label there seems gratuitous.
> > > 
> > > Let me try with asserts enabled.
> > > 
> > 
> > Ok, thanks.
> 
> So, the nimaps == 0 case hits even for the data fork when running
> xfs/442 on a 1k block size file system.  That test has generally been
> a fun source of bugs in the always_cow series.

Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact
of this code being racy (i.e., nimaps == 0 occurs somehow on a range
external to the page) or there is some legitimate means to lose a
delalloc block under the locked page. If the latter, we should at least
document it in the code..

FWIW, I have some refactoring in place that basically turns nimaps == 0
into an error and I don't see an xfs/442 failure with 1k or 4k blocks,
though that's only from a couple quick runs and I'm still working
through some changes. The difference of course is that this code uses
the range of the delalloc extent in the data fork rather than the
(potentially already stale) range from wpc->imap. I'll let it spin for a
bit I suppose and see if it explodes..

Brian
Christoph Hellwig Jan. 23, 2019, 6:14 p.m. UTC | #12
On Tue, Jan 22, 2019 at 01:13:28PM -0500, Brian Foster wrote:
> > So, the nimaps == 0 case hits even for the data fork when running
> > xfs/442 on a 1k block size file system.  That test has generally been
> > a fun source of bugs in the always_cow series.
> 
> Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact
> of this code being racy (i.e., nimaps == 0 occurs somehow on a range
> external to the page) or there is some legitimate means to lose a
> delalloc block under the locked page. If the latter, we should at least
> document it in the code..

I haven't managed to pinpint it mostly because xfs_bmapi_write is
such an unreadable mess.  Based on that I decided to implement the idea
of a separate xfs_bmapi_delalloc_convert helper again, mostly to try
to understand what actuall is going on.  This seems to work pretty
reasonable after initial testing, except that it seems to unhide an
issue in xfs/109 which doesn't look directly related.  The current
status is here:

http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
Brian Foster Jan. 23, 2019, 6:40 p.m. UTC | #13
On Wed, Jan 23, 2019 at 10:14:12AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 22, 2019 at 01:13:28PM -0500, Brian Foster wrote:
> > > So, the nimaps == 0 case hits even for the data fork when running
> > > xfs/442 on a 1k block size file system.  That test has generally been
> > > a fun source of bugs in the always_cow series.
> > 
> > Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact
> > of this code being racy (i.e., nimaps == 0 occurs somehow on a range
> > external to the page) or there is some legitimate means to lose a
> > delalloc block under the locked page. If the latter, we should at least
> > document it in the code..
> 
> I haven't managed to pinpint it mostly because xfs_bmapi_write is
> such an unreadable mess.  Based on that I decided to implement the idea
> of a separate xfs_bmapi_delalloc_convert helper again, mostly to try
> to understand what actuall is going on.  This seems to work pretty
> reasonable after initial testing, except that it seems to unhide an
> issue in xfs/109 which doesn't look directly related.  The current
> status is here:
> 
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
> 

I have a v3 of this series that does something similar, but doesn't
quite take the refactoring as far. I just created an
xfs_bmapi_delalloc() wrapper over xfs_bmapi_write(). The changes to the
latter are minimal based on a tweak of XFS_BMAPI_DELALLOC behavior. The
change to xfs_iomap_write_allocate() is basically just to use the new
helper instead of xfs_bmapi_write().

The refactoring you have looks like the right direction to me, but I'm
wondering why we need to change so much all at once, particularly since
it's not clear that some of this -EAGAIN stuff is really necessary.
Pulling bits out of xfs_bmapi_write() is also inherently more difficult
to review than reusing it, IMO.

Hmm, I'm wondering if we could rebase your refactoring onto the simple
wrapper in my patch. You'd basically just reimplement the helper, kill
off the XFS_BMAPI_DELALLOC stuff as you already have, and the changes to
xfs_iomap_write_allocate() (notwithstanding moving/renaming the
function, which should really be split up into separate patches) would
probably be minimal (i.e., do the extent lookup).

I was still running some tests but I've got through a decent amount
already and have everything formatted to send out so I'll just post
it...

Brian
diff mbox series

Patch

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ab69caa685b4..80da465ebf7a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -677,22 +677,24 @@  xfs_file_iomap_begin_delay(
  */
 int
 xfs_iomap_write_allocate(
-	xfs_inode_t	*ip,
-	int		whichfork,
-	xfs_off_t	offset,
-	xfs_bmbt_irec_t *imap,
-	unsigned int	*seq)
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_off_t		offset,
+	struct xfs_bmbt_irec	*imap,
+	unsigned int		*seq)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
-	xfs_fileoff_t	offset_fsb, last_block;
-	xfs_fileoff_t	end_fsb, map_start_fsb;
-	xfs_filblks_t	count_fsb;
-	xfs_trans_t	*tp;
-	int		nimaps;
-	int		error = 0;
-	int		flags = XFS_BMAPI_DELALLOC;
-	int		nres;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_trans	*tp;
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	timap;
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		map_start_fsb;
+	xfs_filblks_t		count_fsb;
+	int			nimaps;
+	int			error = 0;
+	int			flags = XFS_BMAPI_DELALLOC;
+	int			nres;
 
 	if (whichfork == XFS_COW_FORK)
 		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
@@ -737,56 +739,40 @@  xfs_iomap_write_allocate(
 			xfs_trans_ijoin(tp, ip, 0);
 
 			/*
-			 * it is possible that the extents have changed since
-			 * we did the read call as we dropped the ilock for a
-			 * while. We have to be careful about truncates or hole
-			 * punchs here - we are not allowed to allocate
-			 * non-delalloc blocks here.
-			 *
-			 * The only protection against truncation is the pages
-			 * for the range we are being asked to convert are
-			 * locked and hence a truncate will block on them
-			 * first.
-			 *
-			 * As a result, if we go beyond the range we really
-			 * need and hit an delalloc extent boundary followed by
-			 * a hole while we have excess blocks in the map, we
-			 * will fill the hole incorrectly and overrun the
-			 * transaction reservation.
+			 * Now that we have ILOCK we must account for the fact
+			 * that the fork (and thus our mapping) could have
+			 * changed while the inode was unlocked. If the fork
+			 * has changed, trim the caller's mapping to the
+			 * current extent in the fork.
 			 *
-			 * Using a single map prevents this as we are forced to
-			 * check each map we look for overlap with the desired
-			 * range and abort as soon as we find it. Also, given
-			 * that we only return a single map, having one beyond
-			 * what we can return is probably a bit silly.
+			 * If the external change did not modify the current
+			 * mapping (or just grew it) this will have no effect.
+			 * If the current mapping shrunk, we expect to at
+			 * minimum still have blocks backing the current page as
+			 * the page has remained locked since writeback first
+			 * located delalloc block(s) at the page offset. A
+			 * racing truncate, hole punch or even reflink must wait
+			 * on page writeback before it can modify our page and
+			 * underlying block(s).
 			 *
-			 * We also need to check that we don't go beyond EOF;
-			 * this is a truncate optimisation as a truncate sets
-			 * the new file size before block on the pages we
-			 * currently have locked under writeback. Because they
-			 * are about to be tossed, we don't need to write them
-			 * back....
+			 * We'll update *seq before we drop ilock for the next
+			 * iteration.
 			 */
-			nimaps = 1;
-			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-			error = xfs_bmap_last_offset(ip, &last_block,
-							XFS_DATA_FORK);
-			if (error)
-				goto trans_cancel;
-
-			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-			if ((map_start_fsb + count_fsb) > last_block) {
-				count_fsb = last_block - map_start_fsb;
-				if (count_fsb == 0) {
-					error = -EAGAIN;
+			if (*seq != READ_ONCE(ifp->if_seq)) {
+				if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
+							    &icur, &timap) ||
+				    timap.br_startoff > offset_fsb) {
+					ASSERT(0);
+					error = -EFSCORRUPTED;
 					goto trans_cancel;
 				}
+				xfs_trim_extent(imap, timap.br_startoff,
+						timap.br_blockcount);
+				count_fsb = imap->br_blockcount;
+				map_start_fsb = imap->br_startoff;
 			}
 
-			/*
-			 * From this point onwards we overwrite the imap
-			 * pointer that the caller gave to us.
-			 */
+			nimaps = 1;
 			error = xfs_bmapi_write(tp, ip, map_start_fsb,
 						count_fsb, flags, nres, imap,
 						&nimaps);