diff mbox series

[1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents

Message ID 20181119211742.8824-2-david@fromorbit.com (mailing list archive)
State New, archived
Headers show
Series iomap: data corruption fixes and more | expand

Commit Message

Dave Chinner Nov. 19, 2018, 9:17 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When we write into an unwritten extent via direct IO, we dirty
metadata on IO completion to convert the unwritten extent to
written. However, when we do the FUA optimisation checks, the inode
may be clean and so we issue a FUA write into the unwritten extent.
This means we then bypass the generic_write_sync() call after
unwritten extent conversion has ben done and we don't force the
modified metadata to stable storage.

This violates O_DSYNC semantics. The window of exposure is a single
IO, as the next DIO write will see the inode has dirty metadata and
hence will not use the FUA optimisation. Calling
generic_write_sync() after completion of the second IO will also
sync the first write and it's metadata.

Fix this by avoiding the FUA optimisation when writing to unwritten
extents.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig Nov. 20, 2018, 7:48 a.m. UTC | #1
On Tue, Nov 20, 2018 at 08:17:38AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we write into an unwritten extent via direct IO, we dirty
> metadata on IO completion to convert the unwritten extent to
> written. However, when we do the FUA optimisation checks, the inode
> may be clean and so we issue a FUA write into the unwritten extent.
> This means we then bypass the generic_write_sync() call after
> unwritten extent conversion has ben done and we don't force the
> modified metadata to stable storage.
> 
> This violates O_DSYNC semantics. The window of exposure is a single
> IO, as the next DIO write will see the inode has dirty metadata and
> hence will not use the FUA optimisation. Calling
> generic_write_sync() after completion of the second IO will also
> sync the first write and it's metadata.
> 
> Fix this by avoiding the FUA optimisation when writing to unwritten
> extents.

Ouch, yes.  We can't skip the log force when converting unwritten
extent.  If we really cared we could try to use FUA and only do
the log force vs needing a full device flush, but that would require
a fair amount of of work.

So this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong Nov. 20, 2018, 11 p.m. UTC | #2
On Tue, Nov 20, 2018 at 08:17:38AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we write into an unwritten extent via direct IO, we dirty
> metadata on IO completion to convert the unwritten extent to
> written. However, when we do the FUA optimisation checks, the inode
> may be clean and so we issue a FUA write into the unwritten extent.
> This means we then bypass the generic_write_sync() call after
> unwritten extent conversion has ben done and we don't force the
> modified metadata to stable storage.
> 
> This violates O_DSYNC semantics. The window of exposure is a single
> IO, as the next DIO write will see the inode has dirty metadata and
> hence will not use the FUA optimisation. Calling
> generic_write_sync() after completion of the second IO will also
> sync the first write and it's metadata.
> 
> Fix this by avoiding the FUA optimisation when writing to unwritten
> extents.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/iomap.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 64ce240217a1..72f3864a2e6b 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1596,12 +1596,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>  	if (iomap->flags & IOMAP_F_NEW) {
>  		need_zeroout = true;
> -	} else {
> +	} else if (iomap->type == IOMAP_MAPPED) {
>  		/*
> -		 * Use a FUA write if we need datasync semantics, this
> -		 * is a pure data IO that doesn't require any metadata
> -		 * updates and the underlying device supports FUA. This
> -		 * allows us to avoid cache flushes on IO completion.
> +		 * Use a FUA write if we need datasync semantics, this is a pure
> +		 * data IO that doesn't require any metadata updates (including
> +		 * after IO completion such as unwritten extent conversion) and
> +		 * the underlying device supports FUA. This allows us to avoid
> +		 * cache flushes on IO completion.
>  		 */
>  		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>  		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
> -- 
> 2.19.1
>
diff mbox series

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 64ce240217a1..72f3864a2e6b 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1596,12 +1596,13 @@  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 
 	if (iomap->flags & IOMAP_F_NEW) {
 		need_zeroout = true;
-	} else {
+	} else if (iomap->type == IOMAP_MAPPED) {
 		/*
-		 * Use a FUA write if we need datasync semantics, this
-		 * is a pure data IO that doesn't require any metadata
-		 * updates and the underlying device supports FUA. This
-		 * allows us to avoid cache flushes on IO completion.
+		 * Use a FUA write if we need datasync semantics, this is a pure
+		 * data IO that doesn't require any metadata updates (including
+		 * after IO completion such as unwritten extent conversion) and
+		 * the underlying device supports FUA. This allows us to avoid
+		 * cache flushes on IO completion.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&