
[v3,1/2] btrfs: ensure fast fsync waits for ordered extents after a write failure

Message ID: bc50545a6356d32644de712bbd0e6128276596a2.1715964570.git.fdmanana@suse.com
State: New
Series: btrfs: fix logging unwritten extents after failure in write paths

Commit Message

Filipe Manana May 17, 2024, 4:52 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because dropping the extent maps happens when completing ordered
extents, at btrfs_finish_one_ordered(), and the completion of an ordered
extent is executed in a work queue.

As a result, a fast fsync can start logging file extent items based on
existing extent maps before the ordered extents complete, producing a log
with file extent items that point to unwritten extents. If a crash happens
afterwards and the log tree is replayed the next time the fs is mounted,
the file ends up corrupt.

This can happen for both direct IO writes and buffered writes.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extent maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete, which is an intentional behaviour in order to reduce
   latency.

For the buffered write case, here's one example:

1) A fast fsync begins, and it starts by flushing delalloc and waiting for
   the writeback to complete by calling filemap_fdatawait_range();

2) Flushing the delalloc created a new extent map X;

3) During the writeback some IO error happened, and at the end io callback
   (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
   sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
   completion;

4) After queuing the ordered extent completion, the end io callback clears
   the writeback flag from all pages (or folios), and from that moment the
   fast fsync can proceed;

5) The fast fsync proceeds, sees extent map X and logs a file extent item
   based on it, resulting in a log that points to an unwritten data
   extent - the ordered extent completion hasn't run yet, since it
   happens only after the logging.

To fix this, make btrfs_finish_ordered_extent() set the inode flag
BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
so that a fast fsync will wait for ordered extent completion.

Note that this issue of using extent maps that point to unwritten
locations cannot happen for reads, because read paths start by locking
the extent range and waiting for any ordered extents in the range to
complete before looking for extent maps.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h  | 19 +++++++++++++------
 fs/btrfs/file.c         | 15 +++++++++++++++
 fs/btrfs/ordered-data.c | 31 +++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 6 deletions(-)

Comments

Qu Wenruo May 18, 2024, 5:28 a.m. UTC | #1
On 2024/5/18 02:22, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> [...]
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Thanks for the updated commit message, it makes the race window much clearer.

And since we no longer try to run finish_ordered_io() inside the endio
function, we should no longer hit the memory allocation warning inside
irq context.

But the inode->lock usage seems unsafe to me, see the inline comment below:
[...]
> @@ -478,10 +485,10 @@ static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
>   	 * while ->last_trans was not yet updated in the current transaction,
>   	 * and therefore has a lower value.
>   	 */
> -	spin_lock(&inode->lock);
> +	spin_lock_irqsave(&inode->lock, flags);
>   	if (inode->last_reflink_trans < inode->last_trans)
>   		inode->last_reflink_trans = inode->last_trans;
> -	spin_unlock(&inode->lock);
> +	spin_unlock_irqrestore(&inode->lock, flags);

IIRC this is not how we make lock usage irq safe.
We need all users of the lock to use the irq variants.

Or we can hit situation like:

	Thread A

	spin_lock(&inode->lock);
--- IRQ happens for the endio ---
	spin_lock_irqsave();

Then we deadlock.

Thus all inode->lock users need to use the irq variant, which would be
a huge change.

I guess unconditionally waiting for ordered extents inside
btrfs_sync_file() would be too slow?

Thanks,
Qu

Filipe Manana May 20, 2024, 9:46 a.m. UTC | #2
On Sat, May 18, 2024 at 6:28 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/5/18 02:22, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > [...]
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks for the updated commit messages, it's much clear for the race window.
>
> And since we no longer try to run finish_ordered_io() inside the endio
> function, we should no longer hit the memory allocation warning inside
> irq context.
>
> But the inode->lock usage seems unsafe to me, comment inlined below:
> [...]
> > @@ -478,10 +485,10 @@ static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
> >        * while ->last_trans was not yet updated in the current transaction,
> >        * and therefore has a lower value.
> >        */
> > -     spin_lock(&inode->lock);
> > +     spin_lock_irqsave(&inode->lock, flags);
> >       if (inode->last_reflink_trans < inode->last_trans)
> >               inode->last_reflink_trans = inode->last_trans;
> > -     spin_unlock(&inode->lock);
> > +     spin_unlock_irqrestore(&inode->lock, flags);
>
> IIRC this is not how we change the lock usage to be irq safe.
> We need all lock users to use irq variants.
>
> Or we can hit situation like:
>
>         Thread A
>
>         spin_lock(&inode->lock);
> --- IRQ happens for the endio ---
>         spin_lock_irqsave();
>
> Then we dead lock.
>
> Thus all inode->lock users needs to use the irq variant, which would be
> a huge change.

Indeed, I missed that, thanks.

>
> I guess if we unconditionally wait for ordered extents inside
> btrfs_sync_file() would be too slow?

No way. Not waiting for ordered extent completion is one of the main
things that makes a fast fsync faster than a full fsync.
It's ok to wait only in case of errors, since they are unexpected and
unlikely, and in error cases the ordered extent completion doesn't do
much (no trees to update).

Fixed in v4, thanks.


Patch

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3c8bc7a8ebdd..26ee7251ad6c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -56,7 +56,9 @@  enum {
 	 /*
 	  * Always set under the VFS' inode lock, otherwise it can cause races
 	  * during fsync (we start as a fast fsync and then end up in a full
-	  * fsync racing with ordered extent completion).
+	  * fsync racing with ordered extent completion). It's also safe to be
+	  * set during writeback without holding the VFS' inode lock because
+	  * a fsync flushes and waits for delalloc before starting to log.
 	  */
 	BTRFS_INODE_NEEDS_FULL_SYNC,
 	BTRFS_INODE_COPY_EVERYTHING,
@@ -453,14 +455,19 @@  static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
 }
 
 /*
- * Should be called while holding the inode's VFS lock in exclusive mode, or
- * while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in
+ * Should be called while holding the inode's VFS lock in exclusive/shared mode,
+ * or while holding the inode's mmap lock (struct btrfs_inode::i_mmap_lock) in
  * either shared or exclusive mode, or in a context where no one else can access
  * the inode concurrently (during inode creation or when loading an inode from
- * disk).
+ * disk). Can also be called in irq context from btrfs_finish_ordered_extent()
+ * because at that point we are either in a direct IO write, in which case the
+ * inode's VFS lock is held, or in the end IO callback of a buffered write, in
+ * which case fsync always waits for writeback completion.
  */
 static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
 {
+	unsigned long flags;
+
 	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
 	/*
 	 * The inode may have been part of a reflink operation in the last
@@ -478,10 +485,10 @@  static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
 	 * while ->last_trans was not yet updated in the current transaction,
 	 * and therefore has a lower value.
 	 */
-	spin_lock(&inode->lock);
+	spin_lock_irqsave(&inode->lock, flags);
 	if (inode->last_reflink_trans < inode->last_trans)
 		inode->last_reflink_trans = inode->last_trans;
-	spin_unlock(&inode->lock);
+	spin_unlock_irqrestore(&inode->lock, flags);
 }
 
 static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0c7c1b42028e..d635bc0c01df 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1894,6 +1894,21 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
 						      &ctx.ordered_extents);
 		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
+		if (ret)
+			goto out_release_extents;
+
+		/*
+		 * Check again the full sync flag, because it may have been set
+		 * during the end IO callback (end_bbio_data_write() ->
+		 * btrfs_finish_ordered_extent()) in case an error happened and
+		 * we need to wait for ordered extents to complete so that any
+		 * extent maps that point to unwritten locations are dropped and
+		 * we don't log them.
+		 */
+		full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+				     &BTRFS_I(inode)->runtime_flags);
+		if (full_sync)
+			ret = btrfs_wait_ordered_range(inode, start, len);
 	}
 
 	if (ret)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 44157e43fd2a..55a9aeed7344 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -388,6 +388,37 @@  bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 
+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to unwritten locations if we got an error either
+	 * before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up having
+	 * a fast fsync running before that. In the case of direct IO, once we
+	 * unlock the inode the fsync might start, and we queue the completion
+	 * before unlocking the inode. In the case of buffered IO when writeback
+	 * finishes (end_bbio_data_write()) we queue the completion, so if the
+	 * writeback was triggered by a fast fsync, the fsync might start
+	 * logging before ordered extent completion runs in the work queue.
+	 *
+	 * The fast fsync will log file extent items based on the extent maps it
+	 * finds, so if by the time it collects extent maps the ordered extent
+	 * completion didn't happen yet, it will log file extent items that
+	 * point to unwritten extents, resulting in a corruption if a crash
+	 * happens and the log tree is replayed. Note that a fast fsync does not
+	 * wait for completion of ordered extents in order to reduce latency.
+	 *
+	 * Set a flag in the inode so that the next fast fsync will wait for
+	 * ordered extents to complete before starting to log.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		btrfs_set_inode_full_sync(inode);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;