diff mbox

xfs: fix eofblocks race with file extending async dio writes

Message ID 1484156571-65403-1-git-send-email-bfoster@redhat.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Brian Foster Jan. 11, 2017, 5:42 p.m. UTC
It's possible for post-eof blocks to end up being used for direct I/O
writes. dio write performs an upfront unwritten extent allocation, sends
the dio and then updates the inode size (if necessary) on write
completion. If a file release occurs while a file extending dio write is
in flight, it is possible to mistake the post-eof blocks for speculative
preallocation and incorrectly truncate them from the inode. This means
that the resulting dio write completion can discover a hole and allocate
new blocks rather than perform unwritten extent conversion.

This requires a strange mix of I/O and is thus not likely to reproduce
in real world workloads. It is intermittently reproduced by generic/299.
The error manifests as an assert failure due to transaction overrun
because the aforementioned write completion transaction has only
reserved enough blocks for btree operations:

  XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
   file: fs/xfs//xfs_trans.c, line: 309

The root cause is that xfs_free_eofblocks() uses i_size to truncate
post-eof blocks from the inode, but async, file extending direct writes
do not update i_size until write completion, long after inode locks are
dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
to the incorrect size.

Update xfs_free_eoflbocks() to serialize against dio similar to how
extending writes are serialized against i_size updates before post-eof
block zeroing. Specifically, wait on dio once the iolock is acquired.
This ensures that dio write completions have updated i_size before
post-eof blocks are processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig Jan. 12, 2017, 1:56 p.m. UTC | #1
>  		if (need_iolock) {
>  			if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
>  				return -EAGAIN;
>  		}
> +		inode_dio_wait(VFS_I(ip));

inode_dio_wait generally is only safe to call with i_rwsem held
exclsuively, so if we'd need the call for the !need_iolock this would
be broken.  Fortunately we don't even need the call in that case, so
this should be safe.  I'd still prefer to move the inode_dio_wait call
into the need_iolock block to make that clear, though.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Jan. 12, 2017, 2:05 p.m. UTC | #2
On Thu, Jan 12, 2017 at 05:56:03AM -0800, Christoph Hellwig wrote:
> >  		if (need_iolock) {
> >  			if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
> >  				return -EAGAIN;
> >  		}
> > +		inode_dio_wait(VFS_I(ip));
> 
> inode_dio_wait generally is only safe to call with i_rwsem held
> exclsuively, so if we'd need the call for the !need_iolock this would
> be broken.  Fortunately we don't even need the call in that case, so
> this should be safe.  I'd still prefer to move the inode_dio_wait call
> into the need_iolock block to make that clear, though.

!need_iolock means the iolock is already held. I guess the name is kind
of confusing. !need_iolock doesn't mean that the lock is unnecessary, it
just means that we're calling from a context where it's already held.
See the xfs_icache_free_eofblocks() call from
xfs_file_buffered_aio_write() for reference.

I suppose I could add an ASSERT(xfs_isilocked()) after that block to
better document that..

Brian
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Jan. 13, 2017, 7:33 a.m. UTC | #3
On Thu, Jan 12, 2017 at 09:05:59AM -0500, Brian Foster wrote:
> !need_iolock means the iolock is already held. I guess the name is kind
> of confusing. !need_iolock doesn't mean that the lock is unnecessary, it
> just means that we're calling from a context where it's already held.
> See the xfs_icache_free_eofblocks() call from
> xfs_file_buffered_aio_write() for reference.
> 
> I suppose I could add an ASSERT(xfs_isilocked()) after that block to
> better document that..

Yeah.  In fact I'd prefer to kill that parameter at all, it's horrible.
Instead we should always expect the lock and assert that it's held,
and have the two current need_iolock = false callers take it manually.

This will increase lock hold times a bit for the current need_iolock
callers, but the most important one, xfs_release already does a
previous unlocked xfs_can_free_eofblocks check, and
xfs_inode_free_eofblocks is only called it the inode is tagged, so this
should not be an issue (and if it is a xfs_can_free_eofblocks call
comes to rescue).

Btw, there is a comment incorrectly referring to xfs_free_eofblocks
in xfs_inode_free_cowblocks while we're at it.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Jan. 13, 2017, 2:25 p.m. UTC | #4
On Thu, Jan 12, 2017 at 11:33:15PM -0800, Christoph Hellwig wrote:
> On Thu, Jan 12, 2017 at 09:05:59AM -0500, Brian Foster wrote:
> > !need_iolock means the iolock is already held. I guess the name is kind
> > of confusing. !need_iolock doesn't mean that the lock is unnecessary, it
> > just means that we're calling from a context where it's already held.
> > See the xfs_icache_free_eofblocks() call from
> > xfs_file_buffered_aio_write() for reference.
> > 
> > I suppose I could add an ASSERT(xfs_isilocked()) after that block to
> > better document that..
> 
> Yeah.  In fact I'd prefer to kill that parameter at all, it's horrible.
> Instead we should always expect the lock and assert that it's held,
> and have the two current need_iolock = false callers take it manually.
> 

I may have lied about iolock holders.. I don't think we have the iolock
in the xfs_inactive() case. That said, I don't think there's any harm in
doing so there. It may have just been that way since we're breaking down
the inode in that context.

I agree that this is ugly. I'll try to kill off that param as suggested.

> This will increase lock hold times a bit for the current need_iolock
> callers, but the most important one, xfs_release already does a
> previous unlocked xfs_can_free_eofblocks check, and
> xfs_inode_free_eofblocks is only called it the inode is tagged, so this
> should not be an issue (and if it is a xfs_can_free_eofblocks call
> comes to rescue).
> 

The need_iolock case is also currently a trylock, so I think we can
retain that logic in the release case without being disruptive.

> Btw, there is a comment incorrectly referring to xfs_free_eofblocks
> in xfs_inode_free_cowblocks while we're at it.

I would send that as a standalone patch because it's an entirely
separate feature (and causes unnecessary backport churn, even though
it's just a comment), so I'm not sure it's really worth fixing with this
series.

Brian

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index b9abce5..5844549 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -936,6 +936,10 @@  xfs_free_eofblocks(
 	error = xfs_bmapi_read(ip, end_fsb, map_len, &imap, &nimaps, 0);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
+	/*
+	 * If there are blocks after the end of file, truncate the file to its
+	 * current size to free them up.
+	 */
 	if (!error && (nimaps != 0) &&
 	    (imap.br_startblock != HOLESTARTBLOCK ||
 	     ip->i_delayed_blks)) {
@@ -947,14 +951,14 @@  xfs_free_eofblocks(
 			return error;
 
 		/*
-		 * There are blocks after the end of file.
-		 * Free them up now by truncating the file to
-		 * its current size.
+		 * Grab the iolock if the caller hasn't done so and wait on
+		 * direct I/O to ensure i_size has settled.
 		 */
 		if (need_iolock) {
 			if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
 				return -EAGAIN;
 		}
+		inode_dio_wait(VFS_I(ip));
 
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
 				&tp);