Message ID | 20190117192004.49346-6-bfoster@redhat.com (mailing list archive)
---|---
State | New, archived
Series | xfs: properly invalidate cached writeback mapping
On 1/17/19 12:20 PM, Brian Foster wrote: > The writeback delalloc conversion code has a couple historical hacks > to help deal with racing with other operations on the file under > writeback. For example, xfs_iomap_write_allocate() checks i_size > once under ilock to determine whether a truncate may have > invalidated some portion of the range we're about to map. This helps > prevent the bmapi call from doing physical allocation where only > delalloc conversion was expected (which is a transaction reservation > overrun vector). XFS_BMAPI_DELALLOC helps similarly protect us from > this problem as it ensures that the bmapi call will only convert > existing blocks. > > Neither of these actually ensure we actually pass a valid range to > convert to xfs_bmapi_write(). The bmapi flag turns the > aforementioned physical allocation over a hole problem into an > explicit error, which means that any particular instances of that > problem change from being a transaction overrun error to a writeback > error. The latter means page discard, which is an ominous error even > if due to the underlying block(s) being punched out. > > Now that we have a mechanism in place to track inode fork changes, > use it in xfs_iomap_write_allocate() to properly handle potential > changes to the cached writeback mapping and establish a valid range > to map. Check the already provided fork seqno once under ilock. If > it has changed, look up the extent again in the associated fork and > trim the caller's mapping to the latest version. We can expect to > still find the blocks backing the current file offset because the > page is held locked during writeback and page cache truncation > acquires the page lock and waits on writeback to complete before it > can toss the page. This ensures that writeback never attempts to map > a range of a file that has been punched out and never submits I/O to > blocks that are no longer mapped to the file. 
> > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_iomap.c | 102 +++++++++++++++++++-------------------------- > 1 file changed, 44 insertions(+), 58 deletions(-) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index ab69caa685b4..80da465ebf7a 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -677,22 +677,24 @@ xfs_file_iomap_begin_delay( > */ > int > xfs_iomap_write_allocate( > - xfs_inode_t *ip, > - int whichfork, > - xfs_off_t offset, > - xfs_bmbt_irec_t *imap, > - unsigned int *seq) > + struct xfs_inode *ip, > + int whichfork, > + xfs_off_t offset, > + struct xfs_bmbt_irec *imap, > + unsigned int *seq) > { > - xfs_mount_t *mp = ip->i_mount; > - struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); > - xfs_fileoff_t offset_fsb, last_block; > - xfs_fileoff_t end_fsb, map_start_fsb; > - xfs_filblks_t count_fsb; > - xfs_trans_t *tp; > - int nimaps; > - int error = 0; > - int flags = XFS_BMAPI_DELALLOC; > - int nres; > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork); > + struct xfs_trans *tp; > + struct xfs_iext_cursor icur; > + struct xfs_bmbt_irec timap; > + xfs_fileoff_t offset_fsb; > + xfs_fileoff_t map_start_fsb; > + xfs_filblks_t count_fsb; > + int nimaps; > + int error = 0; > + int flags = XFS_BMAPI_DELALLOC; > + int nres; > > if (whichfork == XFS_COW_FORK) > flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC; > @@ -737,56 +739,40 @@ xfs_iomap_write_allocate( > xfs_trans_ijoin(tp, ip, 0); > > /* > - * it is possible that the extents have changed since > - * we did the read call as we dropped the ilock for a > - * while. We have to be careful about truncates or hole > - * punchs here - we are not allowed to allocate > - * non-delalloc blocks here. > - * > - * The only protection against truncation is the pages > - * for the range we are being asked to convert are > - * locked and hence a truncate will block on them > - * first. 
> - * > - * As a result, if we go beyond the range we really > - * need and hit an delalloc extent boundary followed by > - * a hole while we have excess blocks in the map, we > - * will fill the hole incorrectly and overrun the > - * transaction reservation. > + * Now that we have ILOCK we must account for the fact > + * that the fork (and thus our mapping) could have > + * changed while the inode was unlocked. If the fork > + * has changed, trim the caller's mapping to the > + * current extent in the fork. > * > - * Using a single map prevents this as we are forced to > - * check each map we look for overlap with the desired > - * range and abort as soon as we find it. Also, given > - * that we only return a single map, having one beyond > - * what we can return is probably a bit silly. > + * If the external change did not modify the current > + * mapping (or just grew it) this will have no effect. > + * If the current mapping shrunk, we expect to at > + * minimum still have blocks backing the current page as > + * the page has remained locked since writeback first > + * located delalloc block(s) at the page offset. A > + * racing truncate, hole punch or even reflink must wait > + * on page writeback before it can modify our page and > + * underlying block(s). > * > - * We also need to check that we don't go beyond EOF; > - * this is a truncate optimisation as a truncate sets > - * the new file size before block on the pages we > - * currently have locked under writeback. Because they > - * are about to be tossed, we don't need to write them > - * back.... > + * We'll update *seq before we drop ilock for the next > + * iteration. 
> */ > - nimaps = 1; > - end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip)); > - error = xfs_bmap_last_offset(ip, &last_block, > - XFS_DATA_FORK); > - if (error) > - goto trans_cancel; > - > - last_block = XFS_FILEOFF_MAX(last_block, end_fsb); > - if ((map_start_fsb + count_fsb) > last_block) { > - count_fsb = last_block - map_start_fsb; > - if (count_fsb == 0) { > - error = -EAGAIN; > + if (*seq != READ_ONCE(ifp->if_seq)) { > + if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, > + &icur, &timap) || > + timap.br_startoff > offset_fsb) { > + ASSERT(0); > + error = -EFSCORRUPTED; > goto trans_cancel; > } > + xfs_trim_extent(imap, timap.br_startoff, > + timap.br_blockcount); > + count_fsb = imap->br_blockcount; > + map_start_fsb = imap->br_startoff; > } > > - /* > - * From this point onwards we overwrite the imap > - * pointer that the caller gave to us. > - */ > + nimaps = 1; > error = xfs_bmapi_write(tp, ip, map_start_fsb, > count_fsb, flags, nres, imap, > &nimaps); > Alrighty, it seems reasonable to me. You can add my review. Thanks! Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
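The revalidation step the patch adds — check the fork sequence number under ilock and, if it changed, trim the cached mapping to the current extent — can be modeled in a few lines. The following is a simplified userspace sketch, not kernel code: `struct irec` stands in for `xfs_bmbt_irec`, `trim_extent()` approximates `xfs_trim_extent()`, and the current fork extent is passed in directly rather than coming from `xfs_iext_lookup_extent()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct xfs_bmbt_irec. */
struct irec {
	uint64_t br_startoff;	/* file offset, in fs blocks */
	uint64_t br_blockcount;	/* length, in fs blocks */
};

/* Trim *imap to the range [start, start + len); overlap is assumed. */
static void trim_extent(struct irec *imap, uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t imap_end = imap->br_startoff + imap->br_blockcount;

	if (imap->br_startoff < start)
		imap->br_startoff = start;
	if (imap_end > end)
		imap_end = end;
	imap->br_blockcount = imap_end - imap->br_startoff;
}

/*
 * If the fork sequence number changed while the inode was unlocked,
 * trim the cached mapping to the current extent backing offset_fsb.
 * Returns 0 on success, -1 when offset_fsb is no longer mapped
 * (modeling the ASSERT/-EFSCORRUPTED branch in the patch).
 */
static int revalidate_imap(struct irec *imap, unsigned int seq,
			   unsigned int cur_seq, const struct irec *cur,
			   uint64_t offset_fsb)
{
	if (seq == cur_seq)
		return 0;	/* fork unchanged: mapping still valid */
	if (!cur || cur->br_startoff > offset_fsb)
		return -1;	/* hole at our offset: should be impossible */
	trim_extent(imap, cur->br_startoff, cur->br_blockcount);
	return 0;
}
```

Per the commit message, the `-1` branch is expected to be unreachable for the data fork because the locked page pins the delalloc block(s) at the page offset.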
> + if (*seq != READ_ONCE(ifp->if_seq)) { > + if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, > + &icur, &timap) || > + timap.br_startoff > offset_fsb) { > + ASSERT(0); > + error = -EFSCORRUPTED; > goto trans_cancel; > } > + xfs_trim_extent(imap, timap.br_startoff, > + timap.br_blockcount); > + count_fsb = imap->br_blockcount; > + map_start_fsb = imap->br_startoff; I very much disagree with this logic. xfs_bmapi_write will always do a xfs_iext_lookup_extent for the passed in blocks anyway. So this trimming logic should move into xfs_bmapi_write to ensure it does the right thing, instead of papering over the logic in xfs_bmapi_write in the caller with another lookup. I think the improved XFS_BMAPI_DELALLOC handling in this patch: http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06 is the right answer, as writeback really should only ever convert existing delalloc extents, and I'd much rather rely on that rather than piling hacks of hacks to feed the right range into a function that must do a lookup in the extent tree anyway.
On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote: > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the > passed in blocks anyway. So this trimming logic should move into > xfs_bmapi_write to ensure it does the right thing, instead of papering > over the logic in xfs_bmapi_write in the caller with another lookup. > > I think the improved XFS_BMAPI_DELALLOC handling in this patch: > > http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06 > > is the right answer, as writeback really should only ever convert > existing delalloc extents, and I'd much rather rely on that rather than > piling hacks of hacks to feed the right range into a function that must > do a lookup in the extent tree anyway. FYI, here is a tree that rebases my patches on top of your (minus this one) and also drops the internal i_size checks, although it still keeps the initial one for the truncate fast path. Testing is still ongoing, and for the writeback issue you probably only need the first three of my patches: http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation
On Fri, Jan 18, 2019 at 08:31:32AM -0800, Christoph Hellwig wrote: > On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote: > > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the > > passed in blocks anyway. So this trimming logic should move into > > xfs_bmapi_write to ensure it does the right thing, instead of papering > > over the logic in xfs_bmapi_write in the caller with another lookup. > > > > I think the improved XFS_BMAPI_DELALLOC handling in this patch: > > > > http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06 > > > > is the right answer, as writeback really should only ever convert > > existing delalloc extents, and I'd much rather rely on that rather than > > piling hacks of hacks to feed the right range into a function that must > > do a lookup in the extent tree anyway. > > FYI, here is a tree that rebases my patches on top of your (minus this > one) and also drops the internal i_size checks, although it still keeps > the initial one for the truncate fast path. Testing is still ongoing, > and for the writeback issue you probably only need the first three > of my patches: > > http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation This doesn't really seem all that different to me. Rather than firm up the range in the caller, we turn XFS_BMAPI_DELALLOC into something that seemingly behaves a bit more like CONVERT_ONLY. A few notes/thoughts: 1. What's the purpose of the nimaps == 0 check in xfs_iomap_write_allocate()? If we got to this point with the page lock held, shouldn't we always expect to find something backing the current page? 2. If so, then it also seems that the whole "eof:" thing in xfs_map_blocks() should never happen for data forks. If that's the case, the use of the eof label there seems gratuitous. 3. 
If the starting bno param to xfs_bmapi_write() lands in a hole (due to racing with a hole punch for example) and it finds some subsequent delalloc extent to convert in the requested range, the arithmetic in xfs_bmapi_trim_map() actually fabricates physical blocks for the hole portion of the range relative to the startblock of the converted delalloc extent. I don't think that causes a problem with the writeback code because the fabricated blocks are before the page offset, but those semantics are insane. I think you need to do something like fix up xfs_bmapi_write() to return the actual converted mapping and make sure xfs_iomap_write_allocate() handles that it might start beyond map_start_fsb. Brian
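Point (3) above is easier to see with numbers. The toy function below is one plausible reading of the behavior Brian describes, not the kernel's actual `xfs_bmapi_trim_map()` code: if a returned mapping is made to start at the requested `bno` while its startblock is derived from an extent that really begins at `got_startoff`/`got_startblock`, the "physical" blocks covering the hole `[bno, got_startoff)` are fabricated by extrapolation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical illustration only: the startblock for a mapping that
 * starts at bno is extrapolated backwards across the hole from the
 * real extent at (got_startoff, got_startblock).
 */
static uint64_t fabricated_startblock(uint64_t bno, uint64_t got_startoff,
				      uint64_t got_startblock)
{
	/* blocks in [got_startblock - (got_startoff - bno), got_startblock)
	 * never actually backed the file range [bno, got_startoff) */
	return got_startblock - (got_startoff - bno);
}
```

For example, a request at file block 7 against an extent starting at file block 10, physical block 100, yields a claimed startblock of 97 — three physical blocks that were never mapped to offsets 7 through 9.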
On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote: > On Fri, Jan 18, 2019 at 08:31:32AM -0800, Christoph Hellwig wrote: > > On Fri, Jan 18, 2019 at 03:59:37AM -0800, Christoph Hellwig wrote: > > > xfs_bmapi_write will always do a xfs_iext_lookup_extent for the > > > passed in blocks anyway. So this trimming logic should move into > > > xfs_bmapi_write to ensure it does the right thing, instead of papering > > > over the logic in xfs_bmapi_write in the caller with another lookup. > > > > > > I think the improved XFS_BMAPI_DELALLOC handling in this patch: > > > > > > http://git.infradead.org/users/hch/xfs.git/commitdiff/553fed9ce7695df081779c6a9a205af4142b2a06 > > > > > > is the right answer, as writeback really should only ever convert > > > existing delalloc extents, and I'd much rather rely on that rather than > > > piling hacks of hacks to feed the right range into a function that must > > > do a lookup in the extent tree anyway. > > > > FYI, here is a tree that rebases my patches on top of your (minus this > > one) and also drops the internal i_size checks, although it still keeps > > the initial one for the truncate fast path. Testing is still ongoing, > > and for the writeback issue you probably only need the first three > > of my patches: > > > > http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation > > This doesn't really seem all that different to me. Rather than firm up > the range in the caller, we turn XFS_BMAPI_DELALLOC into something that > seemingly behaves a bit more like CONVERT_ONLY. > > A few notes/thoughts: > > 1. What's the purpose of the nimaps == 0 check in > xfs_iomap_write_allocate()? If we got to this point with the page lock > held, shouldn't we always expect to find something backing the current > page? > > 2. If so, then it also seems that the whole "eof:" thing in > xfs_map_blocks() should never happen for data forks. If that's the case, > the use of the eof label there seems gratuitous. > > 3. 
If the starting bno param to xfs_bmapi_write() lands in a hole (due > to racing with a hole punch for example) and it finds some subsequent > delalloc extent to convert in the requested range, the arithmetic in > xfs_bmapi_trim_map() actually fabricates physical blocks for the hole > portion of the range relative to the startblock of the converted > delalloc extent. I don't think that causes a problem with the writeback > code because the fabricated blocks are before the page offset, but those > semantics are insane. I think you need to do something like fix up > xfs_bmapi_write() to return the actual converted mapping and make sure > xfs_iomap_write_allocate() handles that it might start beyond > map_start_fsb. > A couple more thoughts now that I've had more time to think about this: 4. Kind of a nit, but the comment update in xfs_bmapi_write() that describes the ilock and associated race window and whatnot should really be split between there and xfs_iomap_write_allocate(). The former should just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes, real extents, over a range..). The latter should explain how the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the writeback code. 5. A more serious point.. with this approach, what prevents writeback from thrashing with buffered writes in the event of the hole punch race scenario I mentioned previously? One of the things I explicitly wanted to avoid in this patch is the potential to bounce back and forth between writes and writeback over extent updates. There is the performance aspect of that, which is what I was initially concerned about, but your patch introduces a functional aspect of that we need to consider as well. For (the extreme/worst case) example, suppose again we're at writeback in the middle of a fairly large delalloc extent. 
Somehow or another we race and lose a good portion at the start of the currently cached extent and then see a buffered write at the start offset of the (now invalid) cached extent. The now invalid range is passed to xfs_bmapi_write(), which finds and allocates that first block in the range, commits the transaction and cycles the ilock. If we now contend with another buffered write at the next offset in the original extent range, writeback finds and allocates at that offset next and the cycle repeats until we finally allocate an extent that backs our current page. Unless I'm missing something here, there's not only opportunity to spend an awkward amount of time processing unrelated delalloc extents for a single page, there's potential to allocate bits of what should be a large contiguous delalloc extent a block at a time, which can defeat the purpose of delayed allocation (assuming the block allocator doesn't save our ass). Hmm, I'm not totally sure what the right answer to that is. On one hand, I think the approach of using XFS_BMAPI_DELALLOC is fairly elegant. On the other, I don't think a solution open to this behavior is robust enough. One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that facilitates new semantics (I'm not terribly comfortable with overloading the semantics of xfs_bmapi_write()). Instead of passing a range to xfs_bmapi_delalloc(), just pass the offset we care about and expect that this function will attempt to allocate the entire extent currently backing offset. (Alternatively, we could perhaps pass a range by value and allow xfs_bmapi_delalloc() to update the range in the event of delalloc discontiguity.) Either way, the extent returned may not cover the offset (due to fragmentation, as today) and thus the caller needs to iterate until that happens. 
The larger point is that we'd lookup the current extent _at offset_ on each iteration and thus shouldn't ever contend with new delalloc reservations. Thoughts? Brian > Brian
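The calling convention proposed above — look up and convert the extent at the offset each iteration, and loop until the converted range covers it — can be sketched as follows. This is a hypothetical model, not kernel code: `chunk` stands in for allocator fragmentation limiting how much a single `xfs_bmapi_delalloc()` call could convert.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the proposed loop: each iteration converts blocks from the
 * delalloc extent backing 'offset' (which must lie inside
 * [ext_start, ext_start + ext_len)), and the caller repeats until the
 * converted range covers the offset.  Returns the iteration count,
 * or -1 if the extent is exhausted first (a bug in this model).
 */
static int iterations_until_covered(uint64_t ext_start, uint64_t ext_len,
				    uint64_t offset, uint64_t chunk)
{
	uint64_t converted = 0;
	int iters = 0;

	while (ext_start + converted <= offset) {
		uint64_t n = ext_len - converted;

		if (n > chunk)
			n = chunk;
		if (n == 0)
			return -1;	/* ran past the extent */
		converted += n;
		iters++;
	}
	return iters;
}
```

The key property is that every lookup happens at the offset writeback actually cares about, so the loop makes forward progress toward the page under writeback instead of contending with new delalloc reservations elsewhere in the (now stale) cached range.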
On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote: > This doesn't really seem all that different to me. Rather than firm up > the range in the caller, we turn XFS_BMAPI_DELALLOC into something that > seemingly behaves a bit more like CONVERT_ONLY. The differences are: (a) we save one lookup in the extent tree (b) we have a less fragile API > A few notes/thoughts: > > 1. What's the purpose of the nimaps == 0 check in > xfs_iomap_write_allocate()? If we got to this point with the page lock > held, shouldn't we always expect to find something backing the current > page? It protects against the case where we migrate a COW fork mapping to the data fork, which is not protected by the page lock. But I guess the check warrants a comment and an assert. > 2. If so, then it also seems that the whole "eof:" thing in > xfs_map_blocks() should never happen for data forks. If that's the case, > the use of the eof label there seems gratuitous. Let me try with asserts enabled. > > 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due > to racing with a hole punch for example) and it finds some subsequent > delalloc extent to convert in the requested range, the arithmetic in > xfs_bmapi_trim_map() actually fabricates physical blocks for the hole > portion of the range relative to the startblock of the converted > delalloc extent. I don't think that causes a problem with the writeback > code because the fabricated blocks are before the page offset, but those > semantics are insane. I think you need to do something like fix up > xfs_bmapi_write() to return the actual converted mapping and make sure > xfs_iomap_write_allocate() handles that it might start beyond > map_start_fsb. Sounds good. To be honest I think the whole idea of converting the mapping before the requested offset is a rather bad idea to start with, just increasing our window that exposes stale data.
But to fix this properly we need to create everything as unwritten first, and I didn't have time to get back to that series.
On Sun, Jan 20, 2019 at 07:45:02AM -0500, Brian Foster wrote: > 4. Kind of a nit, but the comment update in xfs_bmapi_write() that > describes the ilock and associated race window and whatnot should really > be split between there and xfs_iomap_write_allocate(). The former should > just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes, > real extents, over a range..). The latter should explain how > the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the > writeback code. Ok. > One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC > from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that > facilitates new semantics (I'm not terribly comfortable with overloading > the semantics of xfs_bmapi_write()). Instead of passing a range to > xfs_bmapi_delalloc(), just pass the offset we care about and expect that > this function will attempt to allocate the entire extent currently > backing offset. (Alternatively, we could perhaps pass a range by value > and allow xfs_bmapi_delalloc() to update the range in the event of > delalloc discontiguity.) Either way, the extent returned may not cover > the offset (due to fragmentation, as today) and thus the caller needs to > iterate until that happens. The larger point is that we'd lookup the > current extent _at offset_ on each iteration and thus shouldn't ever > contend with new delalloc reservations. Thoughts? I considered splitting it off and even had an early prototype. I got something wrong and it didn't work, and it created a little too much duplication for my taste so I gave up on it for now. But fundamentally having the delalloc conversion separate from xfs_bmapi_write is the right thing. I'll just have to find some time for it or pass this work off to you..
On Mon, Jan 21, 2019 at 07:48:33AM -0800, Christoph Hellwig wrote: > On Fri, Jan 18, 2019 at 01:39:15PM -0500, Brian Foster wrote: > > This doesn't really seem all that different to me. Rather than firm up > > the range in the caller, we turn XFS_BMAPI_DELALLOC into something that > > seemingly behaves a bit more like CONVERT_ONLY. > > The differences are: > > (a) we save one lookup in the extent tree > (b) we have a less fragile API > > > A few notes/thoughts: > > > > 1. What's the purpose of the nimaps == 0 check in > > xfs_iomap_write_allocate()? If we got to this point with the page lock > > held, shouldn't we always expect to find something backing the current > > page? > > It protects against the case where we migrate a COW fork mapping to the > data fork, which is not protected by the page lock. But I guess the > check warrants a comment and an assert. > Yeah, probably. It's not really clear to me what that means. > > 2. If so, then it also seems that the whole "eof:" thing in > > xfs_map_blocks() should never happen for data forks. If that's the case, > > the use of the eof label there seems gratuitous. > > Let me try with asserts enabled. > Ok, thanks. > > > > 3. If the starting bno param to xfs_bmapi_write() lands in a hole (due > > to racing with a hole punch for example) and it finds some subsequent > > delalloc extent to convert in the requested range, the arithmetic in > > xfs_bmapi_trim_map() actually fabricates physical blocks for the hole > > portion of the range relative to the startblock of the converted > > delalloc extent. I don't think that causes a problem with the writeback > > code because the fabricated blocks are before the page offset, but those > > semantics are insane. I think you need to do something like fix up > > xfs_bmapi_write() to return the actual converted mapping and make sure > > xfs_iomap_write_allocate() handles that it might start beyond > > map_start_fsb. > > Sounds good. 
To be honest I think the whole idea of converting the > mapping before the requested offset is a rather bad idea to start > with, just increasin our window that exposes stale data. But to fix > this properly we need to create everything as unwritten first, and > I didn't have time to get back to that series. Eh, I think that's a separate problem that re: your other series, we already know how to fix properly. I don't think we should mess around with something as fairly fundamental as delayed block allocation behavior for an approach that doesn't address the underlying problem. Brian
On Mon, Jan 21, 2019 at 07:51:10AM -0800, Christoph Hellwig wrote: > On Sun, Jan 20, 2019 at 07:45:02AM -0500, Brian Foster wrote: > > 4. Kind of a nit, but the comment update in xfs_bmapi_write() that > > describes the ilock and associated race window and whatnot should really > > be split between there and xfs_iomap_write_allocate(). The former should > > just explain what exactly XFS_BMAPI_DELALLOC does (i.e., skip holes, > > real extents, over a range..). The latter should explain how > > the use of XFS_BMAPI_DELALLOC helps us deal with the race window in the > > writeback code. > > Ok. > > > One option that comes to mind is to perhaps split off XFS_BMAPI_DELALLOC > > from xfs_bmapi_write() into a new xfs_bmapi_delalloc() function that > > facilitates new semantics (I'm not terribly comfortable with overloading > > the semantics of xfs_bmapi_write()). Instead of passing a range to > > xfs_bmapi_delalloc(), just pass the offset we care about and expect that > > this function will attempt to allocate the entire extent currently > > backing offset. (Alternatively, we could perhaps pass a range by value > > and allow xfs_bmapi_delalloc() to update the range in the event of > > delalloc discontiguity.) Either way, the extent returned may not cover > > the offset (due to fragmentation, as today) and thus the caller needs to > > iterate until that happens. The larger point is that we'd lookup the > > current extent _at offset_ on each iteration and thus shouldn't ever > > contend with new delalloc reservations. Thoughts? > > I considered splitting it off and even had an early prototype. I > got something wrong and it didn't work, and it created a little too > much duplication for my taste so I gave up on it for now. But > fundamentally having the delalloc conversion separate from > xfs_bmapi_write is the right thing. I'll just have to find some > time for it or pass this work off to you.. 
Ok, well I don't mind looking into how to refactor that code, but my priority for this series is to fix the underlying problem. As a temporary compromise, I think there might be a couple simple options to create an xfs_bmapi_delalloc() wrapper over xfs_bmapi_write(). I'm curious whether we could just pass bno and len == 1 from the delalloc wrapper and get the behavior we want. Alternatively, perhaps we could just factor out the bma.got lookup from xfs_bmapi_write() and use that to handle the range properly in the delalloc case (i.e., pass *got and *eof into a __xfs_bmapi_write() internal function that does most of everything else). If neither of those works, a temporary fallback may be to just bury the seqno lookup logic from this patch into xfs_bmapi_delalloc() and document that further refactoring is required. That would retain the extra lookup (for now), but TBH the testing I've already done wrt excessive higher level revalidations kind of shows that this has no real impact on writeback performance, it's just a bit ugly. Brian
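The "pass bno and len == 1" idea can be modeled as a thin wrapper over the existing allocation path. Everything below is a hypothetical sketch: `stub_bmapi_write()` stands in for `xfs_bmapi_write()` (which takes far more arguments), and the fork is just a sorted array of delalloc extents.

```c
#include <assert.h>
#include <stdint.h>

struct map {
	uint64_t start;		/* file offset, in blocks */
	uint64_t count;		/* length, in blocks */
};

/*
 * Stub standing in for xfs_bmapi_write(): asked to convert 'len'
 * blocks at 'bno', it returns the whole delalloc extent containing
 * bno, modeling the fact that the allocator may convert more than
 * was requested.  Returns -1 for a hole, where XFS_BMAPI_DELALLOC
 * would error out rather than allocate.
 */
static int stub_bmapi_write(uint64_t bno, uint64_t len,
			    const struct map *fork, int nents,
			    struct map *out)
{
	(void)len;
	for (int i = 0; i < nents; i++) {
		if (bno >= fork[i].start &&
		    bno < fork[i].start + fork[i].count) {
			*out = fork[i];
			return 0;
		}
	}
	return -1;
}

/* The proposed wrapper: only the block backing bno is requested. */
static int bmapi_delalloc(uint64_t bno, const struct map *fork, int nents,
			  struct map *out)
{
	return stub_bmapi_write(bno, 1, fork, nents, out);
}
```

The design point is that the wrapper pins the request to the single block writeback needs, leaving it to the allocator to grow the conversion, rather than trusting a possibly stale range from the caller.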
On Mon, Jan 21, 2019 at 12:42:56PM -0500, Brian Foster wrote: > > It protects against the case where we migrate a COW fork mapping to the > > data fork, which is not protected by the page lock. But I guess the > > check warrants a comment and an assert. > > > > Yeah, probably. It's not really clear to me what that means. > > > > 2. If so, then it also seems that the whole "eof:" thing in > > > xfs_map_blocks() should never happen for data forks. If that's the case, > > > the use of the eof label there seems gratuitous. > > > > Let me try with asserts enabled. > > > > Ok, thanks. So, the nimaps == 0 case hits even for the data fork when running xfs/442 on a 1k block size file system. That test has generally been a fun source of bugs in the always_cow series.
On Tue, Jan 22, 2019 at 09:14:45AM -0800, Christoph Hellwig wrote: > On Mon, Jan 21, 2019 at 12:42:56PM -0500, Brian Foster wrote: > > > It protects against the case where we migrate a COW fork mapping to the > > > data fork, which is not protected by the page lock. But I guess the > > > check warrants a comment and an assert. > > > > > > > Yeah, probably. It's not really clear to me what that means. > > > > > > 2. If so, then it also seems that the whole "eof:" thing in > > > > xfs_map_blocks() should never happen for data forks. If that's the case, > > > > the use of the eof label there seems gratuitous. > > > > > > Let me try with asserts enabled. > > > > > > > Ok, thanks. > > So, the nimaps == 0 case hits even for the data fork when running > xfs/442 on a 1k block size file system. That test has generally been > a fun source of bugs in the always_cow series. Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact of this code being racy (i.e., nimaps == 0 occurs somehow on a range external to the page) or there is some legitimate means to lose a delalloc block under the locked page. If the latter, we should at least document it in the code.. FWIW, I have some refactoring in place that basically turns nimaps == 0 into an error and I don't see an xfs/442 failure with 1k or 4k blocks, though that's only from a couple quick runs and I'm still working through some changes. The difference of course is that this code uses the range of the delalloc extent in the data fork rather than the (potentially already stale) range from wpc->imap. I'll let it spin for a bit I suppose and see if it explodes.. Brian
On Tue, Jan 22, 2019 at 01:13:28PM -0500, Brian Foster wrote: > > So, the nimaps == 0 case hits even for the data fork when running > > xfs/442 on a 1k block size file system. That test has generally been > > a fun source of bugs in the always_cow series. > > Hmm, Ok.. any idea how/why that is? I'm curious if this is an artifact > of this code being racy (i.e., nimaps == 0 occurs somehow on a range > external to the page) or there is some legitimate means to lose a > delalloc block under the locked page. If the latter, we should at least > document it in the code.. I haven't managed to pinpoint it, mostly because xfs_bmapi_write is such an unreadable mess. Based on that I decided to implement the idea of a separate xfs_bmapi_delalloc_convert helper again, mostly to try to understand what actually is going on. This seems to work pretty reasonably after initial testing, except that it seems to unhide an issue in xfs/109 which doesn't look directly related. The current status is here: http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
On Wed, Jan 23, 2019 at 10:14:12AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 22, 2019 at 01:13:28PM -0500, Brian Foster wrote:
> > > So, the nimaps == 0 case hits even for the data fork when running
> > > xfs/442 on a 1k block size file system. That test has generally
> > > been a fun source of bugs in the always_cow series.
> >
> > Hmm, Ok.. any idea how/why that is? I'm curious if this is an
> > artifact of this code being racy (i.e., nimaps == 0 occurs somehow
> > on a range external to the page) or there is some legitimate means
> > to lose a delalloc block under the locked page. If the latter, we
> > should at least document it in the code..
>
> I haven't managed to pinpoint it, mostly because xfs_bmapi_write is
> such an unreadable mess. Based on that I decided to implement the
> idea of a separate xfs_bmapi_delalloc_convert helper again, mostly to
> try to understand what actually is going on. This seems to work
> pretty reasonably after initial testing, except that it seems to
> unhide an issue in xfs/109 which doesn't look directly related. The
> current status is here:
>
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-mapping-validation.2
>

I have a v3 of this series that does something similar but doesn't
quite take the refactoring as far. I just created an
xfs_bmapi_delalloc() wrapper over xfs_bmapi_write(). The changes to the
latter are minimal, based on a tweak of XFS_BMAPI_DELALLOC behavior,
and the change to xfs_iomap_write_allocate() is basically just to use
the new helper instead of xfs_bmapi_write().

The refactoring you have looks like the right direction to me, but I'm
wondering why we need to change so much all at once, particularly since
it's not clear that some of this -EAGAIN stuff is really necessary.
Pulling bits out of xfs_bmapi_write() is also inherently more difficult
to review than reusing it, IMO.

Hmm, I'm wondering if we could rebase your refactoring onto the simple
wrapper in my patch. You'd basically just reimplement the helper, kill
off the XFS_BMAPI_DELALLOC stuff as you already have, and the changes
to xfs_iomap_write_allocate() (notwithstanding moving/renaming the
function, which should really be split up into separate patches) would
probably be minimal (i.e., do the extent lookup).

I was still running some tests, but I've gotten through a decent amount
already and have everything formatted to send out, so I'll just post
it...

Brian
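The XFS_BMAPI_DELALLOC semantics under discussion — only convert blocks that already exist as delalloc, and fail loudly instead of physically allocating over a hole — can be illustrated with a small userspace model. This is a hypothetical sketch, not the kernel code: the extent representation, the bmapi_delalloc() name, and the error value are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative extent states; real XFS tracks far more than this. */
enum ext_state { HOLE, DELALLOC, REAL };

struct extent {
	long		start;	/* first file-offset block of the extent */
	long		count;	/* number of blocks */
	enum ext_state	state;
};

#define EINVAL_MODEL 22	/* stand-in errno for "nothing to convert here" */

/*
 * Convert the delalloc extent covering @off to a real allocation.
 * Unlike a plain allocator, a hole or already-real extent is an
 * explicit error rather than a trigger for new physical allocation,
 * which is the behavior the XFS_BMAPI_DELALLOC flag enforces.
 */
static int bmapi_delalloc(struct extent *map, size_t nr, long off)
{
	for (size_t i = 0; i < nr; i++) {
		if (off < map[i].start || off >= map[i].start + map[i].count)
			continue;
		if (map[i].state != DELALLOC)
			return -EINVAL_MODEL;	/* refuse: not delalloc */
		map[i].state = REAL;		/* "convert" the extent */
		return 0;
	}
	return -EINVAL_MODEL;			/* @off lands in a hole */
}
```

The explicit error is exactly the trade-off the commit message describes: an invalid range no longer overruns the transaction reservation, but instead surfaces as a conversion failure.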
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ab69caa685b4..80da465ebf7a 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -677,22 +677,24 @@ xfs_file_iomap_begin_delay(
  */
 int
 xfs_iomap_write_allocate(
-	xfs_inode_t	*ip,
-	int		whichfork,
-	xfs_off_t	offset,
-	xfs_bmbt_irec_t *imap,
-	unsigned int	*seq)
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_off_t		offset,
+	struct xfs_bmbt_irec	*imap,
+	unsigned int		*seq)
 {
-	xfs_mount_t	*mp = ip->i_mount;
-	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
-	xfs_fileoff_t	offset_fsb, last_block;
-	xfs_fileoff_t	end_fsb, map_start_fsb;
-	xfs_filblks_t	count_fsb;
-	xfs_trans_t	*tp;
-	int		nimaps;
-	int		error = 0;
-	int		flags = XFS_BMAPI_DELALLOC;
-	int		nres;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	struct xfs_trans	*tp;
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	timap;
+	xfs_fileoff_t		offset_fsb;
+	xfs_fileoff_t		map_start_fsb;
+	xfs_filblks_t		count_fsb;
+	int			nimaps;
+	int			error = 0;
+	int			flags = XFS_BMAPI_DELALLOC;
+	int			nres;
 
 	if (whichfork == XFS_COW_FORK)
 		flags |= XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC;
@@ -737,56 +739,40 @@ xfs_iomap_write_allocate(
 		xfs_trans_ijoin(tp, ip, 0);
 
 		/*
-		 * it is possible that the extents have changed since
-		 * we did the read call as we dropped the ilock for a
-		 * while. We have to be careful about truncates or hole
-		 * punchs here - we are not allowed to allocate
-		 * non-delalloc blocks here.
-		 *
-		 * The only protection against truncation is the pages
-		 * for the range we are being asked to convert are
-		 * locked and hence a truncate will block on them
-		 * first.
-		 *
-		 * As a result, if we go beyond the range we really
-		 * need and hit an delalloc extent boundary followed by
-		 * a hole while we have excess blocks in the map, we
-		 * will fill the hole incorrectly and overrun the
-		 * transaction reservation.
+		 * Now that we have ILOCK we must account for the fact
+		 * that the fork (and thus our mapping) could have
+		 * changed while the inode was unlocked. If the fork
+		 * has changed, trim the caller's mapping to the
+		 * current extent in the fork.
 		 *
-		 * Using a single map prevents this as we are forced to
-		 * check each map we look for overlap with the desired
-		 * range and abort as soon as we find it. Also, given
-		 * that we only return a single map, having one beyond
-		 * what we can return is probably a bit silly.
+		 * If the external change did not modify the current
+		 * mapping (or just grew it) this will have no effect.
+		 * If the current mapping shrunk, we expect to at
+		 * minimum still have blocks backing the current page as
+		 * the page has remained locked since writeback first
+		 * located delalloc block(s) at the page offset. A
+		 * racing truncate, hole punch or even reflink must wait
+		 * on page writeback before it can modify our page and
+		 * underlying block(s).
 		 *
-		 * We also need to check that we don't go beyond EOF;
-		 * this is a truncate optimisation as a truncate sets
-		 * the new file size before block on the pages we
-		 * currently have locked under writeback. Because they
-		 * are about to be tossed, we don't need to write them
-		 * back....
+		 * We'll update *seq before we drop ilock for the next
+		 * iteration.
 		 */
-		nimaps = 1;
-		end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-		error = xfs_bmap_last_offset(ip, &last_block,
-						XFS_DATA_FORK);
-		if (error)
-			goto trans_cancel;
-
-		last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-		if ((map_start_fsb + count_fsb) > last_block) {
-			count_fsb = last_block - map_start_fsb;
-			if (count_fsb == 0) {
-				error = -EAGAIN;
+		if (*seq != READ_ONCE(ifp->if_seq)) {
+			if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb,
+						    &icur, &timap) ||
+			    timap.br_startoff > offset_fsb) {
+				ASSERT(0);
+				error = -EFSCORRUPTED;
 				goto trans_cancel;
 			}
+			xfs_trim_extent(imap, timap.br_startoff,
+					timap.br_blockcount);
+			count_fsb = imap->br_blockcount;
+			map_start_fsb = imap->br_startoff;
 		}
 
-		/*
-		 * From this point onwards we overwrite the imap
-		 * pointer that the caller gave to us.
-		 */
+		nimaps = 1;
 		error = xfs_bmapi_write(tp, ip, map_start_fsb,
 					count_fsb, flags, nres, imap,
 					&nimaps);
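The revalidation step in the patch above — recheck the fork sequence number under ilock, and on a mismatch re-look-up the extent at the page's offset and trim the cached mapping to it — can be modeled in a standalone sketch. This is an illustrative userspace approximation, not kernel code: revalidate_map(), trim_extent(), and struct irec are made-up stand-ins for the seqno check, xfs_trim_extent(), and struct xfs_bmbt_irec.

```c
#include <assert.h>
#include <stdbool.h>

struct irec {
	long	startoff;	/* first file-offset block of the mapping */
	long	blockcount;	/* length of the mapping in blocks */
};

/* Trim @irec to its intersection with [start, start + count). */
static void trim_extent(struct irec *irec, long start, long count)
{
	long end = irec->startoff + irec->blockcount;
	long nend = start + count;

	if (irec->startoff < start) {
		irec->blockcount -= start - irec->startoff;
		irec->startoff = start;
	}
	if (end > nend)
		irec->blockcount -= end - nend;
}

/*
 * Revalidate a cached writeback mapping. @cached_seq is the fork seqno
 * sampled when @imap was created, @cur_seq the fork's seqno now (held
 * stable by the "ilock"), and @fork the extent currently backing
 * @offset (NULL if none). Returns false when the blocks at @offset are
 * gone, which the patch treats as corruption since the locked page
 * under writeback should pin its backing blocks.
 */
static bool revalidate_map(struct irec *imap, unsigned cached_seq,
			   unsigned cur_seq, const struct irec *fork,
			   long offset)
{
	if (cached_seq == cur_seq)
		return true;		/* fork unchanged; mapping valid */
	if (!fork || fork->startoff > offset)
		return false;		/* hole at offset: unexpected */
	trim_extent(imap, fork->startoff, fork->blockcount);
	return true;
}
```

If the fork change merely grew or left alone the extent backing the offset, the trim is a no-op; if the extent shrunk, the caller ends up converting only blocks that still exist, which is the whole point of the patch.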
The writeback delalloc conversion code has a couple of historical hacks
to help deal with racing against other operations on the file under
writeback. For example, xfs_iomap_write_allocate() checks i_size once
under ilock to determine whether a truncate may have invalidated some
portion of the range we're about to map. This helps prevent the bmapi
call from doing physical allocation where only delalloc conversion was
expected (which is a transaction reservation overrun vector).
XFS_BMAPI_DELALLOC similarly helps protect us from this problem, as it
ensures that the bmapi call will only convert existing blocks.

Neither of these actually ensures that we pass a valid range to convert
to xfs_bmapi_write(). The bmapi flag turns the aforementioned physical
allocation over a hole problem into an explicit error, which means that
any particular instance of that problem changes from being a
transaction overrun error to a writeback error. The latter means page
discard, which is an ominous error even if due to the underlying
block(s) being punched out.

Now that we have a mechanism in place to track inode fork changes, use
it in xfs_iomap_write_allocate() to properly handle potential changes
to the cached writeback mapping and establish a valid range to map.
Check the already provided fork seqno once under ilock. If it has
changed, look up the extent again in the associated fork and trim the
caller's mapping to the latest version. We can expect to still find the
blocks backing the current file offset because the page is held locked
during writeback, and page cache truncation acquires the page lock and
waits on writeback to complete before it can toss the page. This
ensures that writeback never attempts to map a range of a file that has
been punched out and never submits I/O to blocks that are no longer
mapped to the file.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iomap.c | 102 +++++++++++++++++++--------------------------
 1 file changed, 44 insertions(+), 58 deletions(-)