[01/14] iomap: iomap that extends beyond EOF should be marked dirty

Message ID 20191017175624.30305-2-hch@lst.de
State New

Commit Message

Christoph Hellwig Oct. 17, 2019, 5:56 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When doing a direct IO that spans the current EOF, and there are
written blocks beyond EOF that extend beyond the current write, the
only metadata update that needs to be done is a file size extension.

However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
there are IO completion metadata updates required, and hence we may
fail to correctly sync file size extensions made in IO completion
when O_DSYNC writes are being used and the hardware supports FUA.

Hence when setting IOMAP_F_DIRTY, we need to also take into account
whether the iomap spans the current EOF. If it does, then we need to
mark it dirty so that IO completion will call generic_write_sync()
to flush the inode size update to stable storage correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/inode.c       | 9 ++++++++-
 fs/xfs/xfs_iomap.c    | 7 +++++++
 include/linux/iomap.h | 2 ++
 3 files changed, 17 insertions(+), 1 deletion(-)
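
For readers following the commit message, the dirty decision it describes
can be sketched as a small userspace model. This is only an illustration,
not the kernel code: struct model_inode and model_iomap_flags() are
hypothetical names, the is_write check follows the XFS hunk (the ext4 hunk
tests only the range), and the flag values are copied from the
include/linux/iomap.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as shown in the include/linux/iomap.h hunk of this patch. */
#define IOMAP_F_NEW	0x01	/* blocks have been newly allocated */
#define IOMAP_F_DIRTY	0x02	/* uncommitted metadata */

/* Hypothetical stand-in for the inode state the filesystems consult. */
struct model_inode {
	int64_t	i_size;		/* current EOF, what i_size_read() would return */
	bool	datasync_dirty;	/* metadata already pending for fdatasync */
};

/*
 * Model of the flag decision: the mapping is dirty if the inode already
 * has datasync-relevant metadata pending, or if a write covers a range
 * that ends beyond the current EOF, because IO completion may then have
 * to extend the file size.
 */
static inline unsigned int model_iomap_flags(const struct model_inode *inode,
					     int64_t offset, int64_t length,
					     bool is_write)
{
	unsigned int flags = 0;

	assert(offset >= 0 && length > 0);

	if (inode->datasync_dirty)
		flags |= IOMAP_F_DIRTY;
	if (is_write && offset + length > inode->i_size)
		flags |= IOMAP_F_DIRTY;
	return flags;
}
```

With i_size at 4096, a write of 4096 bytes at offset 0 ends exactly at EOF
and stays clean in this model, while the same length at offset 2048 ends
at 6144 and is marked IOMAP_F_DIRTY.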

Comments

Darrick J. Wong Oct. 17, 2019, 6:39 p.m. UTC | #1
On Thu, Oct 17, 2019 at 07:56:11PM +0200, Christoph Hellwig wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When doing a direct IO that spans the current EOF, and there are
> written blocks beyond EOF that extend beyond the current write, the
> only metadata update that needs to be done is a file size extension.
> 
> However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> there is IO completion metadata updates required, and hence we may
> fail to correctly sync file size extensions made in IO completion
> when O_DSYNC writes are being used and the hardware supports FUA.
> 
> Hence when setting IOMAP_F_DIRTY, we need to also take into account
> whether the iomap spans the current EOF. If it does, then we need to
> mark it dirty so that IO completion will call generic_write_sync()
> to flush the inode size update to stable storage correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok, but need fixes tag.  Also, might it be wise to split off the
ext4 section into a separate patch so that it can be backported
separately?

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/ext4/inode.c       | 9 ++++++++-
>  fs/xfs/xfs_iomap.c    | 7 +++++++
>  include/linux/iomap.h | 2 ++
>  3 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 516faa280ced..e9dc52537e5b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3523,9 +3523,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  			return ret;
>  	}
>  
> +	/*
> +	 * Writes that span EOF might trigger an IO size update on completion,
> +	 * so consider them to be dirty for the purposes of O_DSYNC even if
> +	 * there is no other metadata changes being made or are pending here.
> +	 */
>  	iomap->flags = 0;
> -	if (ext4_inode_datasync_dirty(inode))
> +	if (ext4_inode_datasync_dirty(inode) ||
> +	    offset + length > i_size_read(inode))
>  		iomap->flags |= IOMAP_F_DIRTY;
> +
>  	iomap->bdev = inode->i_sb->s_bdev;
>  	iomap->dax_dev = sbi->s_daxdev;
>  	iomap->offset = (u64)first_block << blkbits;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index f780e223b118..32993c2acbd9 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1049,6 +1049,13 @@ xfs_file_iomap_begin(
>  	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
>  
>  out_finish:
> +	/*
> +	 * Writes that span EOF might trigger an IO size update on completion,
> +	 * so consider them to be dirty for the purposes of O_DSYNC even if
> +	 * there is no other metadata changes pending or have been made here.
> +	 */
> +	if ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode))
> +		iomap->flags |= IOMAP_F_DIRTY;
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, shared);
>  
>  out_found:
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 7aa5d6117936..24bd227d59f9 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -32,6 +32,8 @@ struct vm_fault;
>   *
>   * IOMAP_F_DIRTY indicates the inode has uncommitted metadata needed to access
>   * written data and requires fdatasync to commit them to persistent storage.
> + * This needs to take into account metadata changes that *may* be made at IO
> + * completion, such as file size updates from direct IO.
>   */
>  #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
>  #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
> -- 
> 2.20.1
>

Dave Chinner Oct. 17, 2019, 9:56 p.m. UTC | #2
On Thu, Oct 17, 2019 at 11:39:17AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 17, 2019 at 07:56:11PM +0200, Christoph Hellwig wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When doing a direct IO that spans the current EOF, and there are
> > written blocks beyond EOF that extend beyond the current write, the
> > only metadata update that needs to be done is a file size extension.
> > 
> > However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> > there is IO completion metadata updates required, and hence we may
> > fail to correctly sync file size extensions made in IO completion
> > when O_DSYNC writes are being used and the hardware supports FUA.
> > 
> > Hence when setting IOMAP_F_DIRTY, we need to also take into account
> > whether the iomap spans the current EOF. If it does, then we need to
> > mark it dirty so that IO completion will call generic_write_sync()
> > to flush the inode size update to stable storage correctly.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Looks ok, but need fixes tag.  Also, might it be wise to split off the
> ext4 section into a separate patch so that it can be backported
> separately?

I've done a bit more digging on this, and the ext4 part is not
needed for DAX as IOMAP_F_DIRTY is only used in the page fault path
and hence can't change the file size. As such, this only affects
direct IO. Hence the ext4 hunk can be added to the ext4 iomap-dio
patchset that is being developed rather than being in this patch.

Fixes: 3460cac1ca76 ("iomap: Use FUA for pure data O_DSYNC DIO writes")

Cheers,

Dave.

Matthew Bobrowski Oct. 17, 2019, 11:08 p.m. UTC | #3
On Fri, Oct 18, 2019 at 08:56:13AM +1100, Dave Chinner wrote:
> On Thu, Oct 17, 2019 at 11:39:17AM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 17, 2019 at 07:56:11PM +0200, Christoph Hellwig wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When doing a direct IO that spans the current EOF, and there are
> > > written blocks beyond EOF that extend beyond the current write, the
> > > only metadata update that needs to be done is a file size extension.
> > > 
> > > However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> > > there is IO completion metadata updates required, and hence we may
> > > fail to correctly sync file size extensions made in IO completion
> > > when O_DSYNC writes are being used and the hardware supports FUA.
> > > 
> > > Hence when setting IOMAP_F_DIRTY, we need to also take into account
> > > whether the iomap spans the current EOF. If it does, then we need to
> > > mark it dirty so that IO completion will call generic_write_sync()
> > > to flush the inode size update to stable storage correctly.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > 
> > Looks ok, but need fixes tag.  Also, might it be wise to split off the
> > ext4 section into a separate patch so that it can be backported
> > separately?
> 
> I've done a bit more digging on this, and the ext4 part is not
> needed for DAX as IOMAP_F_DIRTY is only used in the page fault path
> and hence can't change the file size. As such, this only affects
> direct IO. Hence the ext4 hunk can be added to the ext4 iomap-dio
> patchset that is being developed rather than being in this patch.

Noted, thanks Dave. I've incorporated the ext4 related change into my patch
series.

--<M>--

Darrick J. Wong Oct. 18, 2019, 1:02 a.m. UTC | #4
On Fri, Oct 18, 2019 at 10:08:14AM +1100, Matthew Bobrowski wrote:
> On Fri, Oct 18, 2019 at 08:56:13AM +1100, Dave Chinner wrote:
> > On Thu, Oct 17, 2019 at 11:39:17AM -0700, Darrick J. Wong wrote:
> > > On Thu, Oct 17, 2019 at 07:56:11PM +0200, Christoph Hellwig wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > When doing a direct IO that spans the current EOF, and there are
> > > > written blocks beyond EOF that extend beyond the current write, the
> > > > only metadata update that needs to be done is a file size extension.
> > > > 
> > > > However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> > > > there is IO completion metadata updates required, and hence we may
> > > > fail to correctly sync file size extensions made in IO completion
> > > > when O_DSYNC writes are being used and the hardware supports FUA.
> > > > 
> > > > Hence when setting IOMAP_F_DIRTY, we need to also take into account
> > > > whether the iomap spans the current EOF. If it does, then we need to
> > > > mark it dirty so that IO completion will call generic_write_sync()
> > > > to flush the inode size update to stable storage correctly.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > 
> > > Looks ok, but need fixes tag.  Also, might it be wise to split off the
> > > ext4 section into a separate patch so that it can be backported
> > > separately?
> > 
> > I've done a bit more digging on this, and the ext4 part is not
> > needed for DAX as IOMAP_F_DIRTY is only used in the page fault path
> > and hence can't change the file size. As such, this only affects
> > direct IO. Hence the ext4 hunk can be added to the ext4 iomap-dio
> > patchset that is being developed rather than being in this patch.
> 
> Noted, thanks Dave. I've incorporated the ext4 related change into my patch
> series.

Ok, I've dropped the ext4 hunk from my branch.

--D

> --<M>--

Christoph Hellwig Oct. 18, 2019, 7:20 a.m. UTC | #5
On Thu, Oct 17, 2019 at 11:39:17AM -0700, Darrick J. Wong wrote:
> Looks ok, but need fixes tag.  Also, might it be wise to split off the
> ext4 section into a separate patch so that it can be backported
> separately?

I'll let Dave handle all that.  I've just pulled it in here as multiple
patches conflict with this one.

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 516faa280ced..e9dc52537e5b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,9 +3523,16 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			return ret;
 	}
 
+	/*
+	 * Writes that span EOF might trigger an IO size update on completion,
+	 * so consider them to be dirty for the purposes of O_DSYNC even if
+	 * there is no other metadata changes being made or are pending here.
+	 */
 	iomap->flags = 0;
-	if (ext4_inode_datasync_dirty(inode))
+	if (ext4_inode_datasync_dirty(inode) ||
+	    offset + length > i_size_read(inode))
 		iomap->flags |= IOMAP_F_DIRTY;
+
 	iomap->bdev = inode->i_sb->s_bdev;
 	iomap->dax_dev = sbi->s_daxdev;
 	iomap->offset = (u64)first_block << blkbits;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f780e223b118..32993c2acbd9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1049,6 +1049,13 @@  xfs_file_iomap_begin(
 	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
 
 out_finish:
+	/*
+	 * Writes that span EOF might trigger an IO size update on completion,
+	 * so consider them to be dirty for the purposes of O_DSYNC even if
+	 * there is no other metadata changes pending or have been made here.
+	 */
+	if ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode))
+		iomap->flags |= IOMAP_F_DIRTY;
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, shared);
 
 out_found:
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7aa5d6117936..24bd227d59f9 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -32,6 +32,8 @@  struct vm_fault;
  *
  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata needed to access
  * written data and requires fdatasync to commit them to persistent storage.
+ * This needs to take into account metadata changes that *may* be made at IO
+ * completion, such as file size updates from direct IO.
  */
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
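
The updated IOMAP_F_DIRTY comment matters because of how the flag is
consumed on the O_DSYNC direct IO path. The sketch below is a simplified
userspace model of the behaviour described in the commit message, not the
actual iomap direct IO code: enum model_sync_path, model_dsync_path() and
model_dsync_examples() are hypothetical names, and the real code takes
further conditions into account.

```c
#include <assert.h>	/* used by the usage checks at the bottom */
#include <stdbool.h>

#define IOMAP_F_DIRTY	0x02	/* uncommitted metadata, as in iomap.h */

/* Hypothetical labels for what happens when the write completes. */
enum model_sync_path {
	MODEL_NO_SYNC,		/* not an O_DSYNC write, nothing to flush */
	MODEL_FUA_ONLY,		/* data written with FUA, no metadata flush */
	MODEL_WRITE_SYNC,	/* completion calls generic_write_sync() */
};

/*
 * A clean mapping on FUA-capable hardware lets an O_DSYNC write rely on
 * FUA alone. A dirty mapping forces the completion-time sync, which is
 * what makes a file size update reach stable storage.
 */
static inline enum model_sync_path model_dsync_path(bool o_dsync,
						    bool device_has_fua,
						    unsigned int iomap_flags)
{
	if (!o_dsync)
		return MODEL_NO_SYNC;
	if (device_has_fua && !(iomap_flags & IOMAP_F_DIRTY))
		return MODEL_FUA_ONLY;
	return MODEL_WRITE_SYNC;
}

/* Usage: the same O_DSYNC write, before and after being marked dirty. */
static inline void model_dsync_examples(void)
{
	assert(model_dsync_path(true, true, 0) == MODEL_FUA_ONLY);
	assert(model_dsync_path(true, true, IOMAP_F_DIRTY) == MODEL_WRITE_SYNC);
}
```

In this model, a write spanning EOF that is left clean takes the FUA-only
path and the size extension is never synced. Marking it IOMAP_F_DIRTY
moves it to the generic_write_sync() path.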