diff mbox series

[v2,1/2] btrfs: immediately drop extent maps after failed COW write

Message ID c9d7a03ee9730e1d864cb6fbe2d511dd8899a953.1715798440.git.fdmanana@suse.com (mailing list archive)
State New
Headers show
Series btrfs: fix a bug in the direct IO write path for COW writes | expand

Commit Message

Filipe Manana May 15, 2024, 6:51 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because the ordered extent completion for a failed write is
executed in a work queue. This means that once the write path unlocks the
inode, a fast fsync can come and log the extent maps created by the write
attempt before the work queue completes the ordered extent.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extents maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete in order to reduce latency.

So to fix this make btrfs_finish_ordered_extent() drop the extent maps
in the range if an error happened for a COW write.

Note that this issues of using extent maps that point to unwritten
locations can not happen for reads, because in read paths we start by
locking the extent range and wait for any ordered extents in the range
to complete before looking for extent maps.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Qu Wenruo May 15, 2024, 10:02 p.m. UTC | #1
在 2024/5/16 04:21, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> If a write path in COW mode fails, either before submitting a bio for the
> new extents or an actual IO error happens, we can end up allowing a fast
> fsync to log file extent items that point to unwritten extents.
>
> This is because the ordered extent completion for a failed write is
> executed in a work queue. This means that once the write path unlocks the
> inode, a fast fsync can come and log the extent maps created by the write
> attempt before the work queue completes the ordered extent.
>
> For example consider a direct IO write, in COW mode, that fails at
> btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
> error:
>
> 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
>     set to false, meaning an error happened;
>
> 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
>     flag;
>
> 3) btrfs_finish_ordered_extent() queues the completion of the ordered
>     extent - so that btrfs_finish_one_ordered() will be executed later in
>     a work queue. That function will drop extents maps in the range when
>     it's executed, since the extent maps point to unwritten locations
>     (signaled by the BTRFS_ORDERED_IOERR flag);
>
> 4) After calling btrfs_finish_ordered_extent() we keep going down the
>     write path and unlock the inode;
>
> 5) After that a fast fsync starts and locks the inode;
>
> 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
>     task sees the extent maps that point to the unwritten locations and
>     logs file extent items based on them - it does not know they are
>     unwritten, and the fast fsync path does not wait for ordered extents
>     to complete in order to reduce latency.
>
> So to fix this make btrfs_finish_ordered_extent() drop the extent maps
> in the range if an error happened for a COW write.
>
> Note that this issues of using extent maps that point to unwritten
> locations can not happen for reads, because in read paths we start by
> locking the extent range and wait for any ordered extents in the range
> to complete before looking for extent maps.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks for the detailed explanation on the issue.

Thanks,
Qu
> ---
>   fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 304d94f6d29b..3a3f21da6eb7 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>   	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
>   	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
>
> +	/*
> +	 * If this is a COW write it means we created new extent maps for the
> +	 * range and they point to an unwritten location if we got an error
> +	 * either before submitting a bio or during IO.
> +	 *
> +	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
> +	 * are queuing its completion below. During completion, at
> +	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
> +	 * unwritten extents.
> +	 *
> +	 * However because completion runs in a work queue we can end up
> +	 * unlocking the inode before the ordered extent is completed.
> +	 *
> +	 * That means that a fast fsync can happen before the work queue
> +	 * executes the completion of the ordered extent, and in that case
> +	 * the fsync will use the extent maps that point to unwritten extents,
> +	 * resulting in logging file extent items that point to unwritten
> +	 * locations. Unlike read paths, a fast fsync doesn't wait for ordered
> +	 * extent completion before proceeding (intentional to reduce latency).
> +	 *
> +	 * To be safe drop the new extent maps in the range (if are doing COW)
> +	 * right here before we unlock the inode and allow a fsync to run.
> +	 */
> +	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
> +		btrfs_drop_extent_map_range(inode, file_offset,
> +					    file_offset + len - 1, false);
> +
>   	if (ret)
>   		btrfs_queue_ordered_fn(ordered);
>   	return ret;
diff mbox series

Patch

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 304d94f6d29b..3a3f21da6eb7 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -388,6 +388,33 @@  bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 
+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to an unwritten location if we got an error
+	 * either before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up
+	 * unlocking the inode before the ordered extent is completed.
+	 *
+	 * That means that a fast fsync can happen before the work queue
+	 * executes the completion of the ordered extent, and in that case
+	 * the fsync will use the extent maps that point to unwritten extents,
+	 * resulting in logging file extent items that point to unwritten
+	 * locations. Unlike read paths, a fast fsync doesn't wait for ordered
+	 * extent completion before proceeding (intentional to reduce latency).
+	 *
+	 * To be safe drop the new extent maps in the range (if are doing COW)
+	 * right here before we unlock the inode and allow a fsync to run.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		btrfs_drop_extent_map_range(inode, file_offset,
+					    file_offset + len - 1, false);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;