diff mbox series

[1/2] btrfs: drop extent maps after failed COW dio write

Message ID affef2b4ccd4d0f9a0272cd5883a5922d36bac31.1715688057.git.fdmanana@suse.com (mailing list archive)
State New
Headers show
Series btrfs: fix a bug in the direct IO write path for COW writes | expand

Commit Message

Filipe Manana May 14, 2024, 2:23 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

During a COW direct IO write if the call to btrfs_extract_ordered_extent()
fails, we don't submit any bios, cleanup the ordered extent but we leave
the extent maps we created for the write around.

These extent maps point to locations on disk that were not written to,
since we haven't submitted a bio for them, and they are new an in the list
of modified extents.

This means that if we fsync the file after, we log file extent items based
on those extent maps, which are invalid since they point to unwritten
locations. If a power failure happens after the fsync, on log replay we
get a corrupted file range.

Fix this by dropping the extent maps if btrfs_extract_ordered_extent()
failed.

Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Comments

Qu Wenruo May 14, 2024, 10:15 p.m. UTC | #1
在 2024/5/14 23:53, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> During a COW direct IO write if the call to btrfs_extract_ordered_extent()
> fails, we don't submit any bios, cleanup the ordered extent but we leave
> the extent maps we created for the write around.
>
> These extent maps point to locations on disk that were not written to,
> since we haven't submitted a bio for them, and they are new an in the list
> of modified extents.
>
> This means that if we fsync the file after, we log file extent items based
> on those extent maps, which are invalid since they point to unwritten
> locations. If a power failure happens after the fsync, on log replay we
> get a corrupted file range.
>
> Fix this by dropping the extent maps if btrfs_extract_ordered_extent()
> failed.
>
> Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Just a small question inlined below.
> ---
>   fs/btrfs/inode.c | 21 +++++++++++++++++++--
>   1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5a1014122088..f04852e44123 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
>   	 * remaining pages is blocked on the outstanding ordered extent.
>   	 */
>   	if (iter->flags & IOMAP_WRITE) {
> +		struct btrfs_ordered_extent *oe = dio_data->ordered;
>   		int ret;
>
> -		ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
> +		ret = btrfs_extract_ordered_extent(bbio, oe);
>   		if (ret) {
> -			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> +			/*
> +			 * If this is a COW write it means we created new extent
> +			 * maps for the range and they point to an unwritten
> +			 * location since we got an error and we don't submit
> +			 * a bio. We must drop any extent maps within the range,
> +			 * otherwise a fast fsync would log them and after a
> +			 * crash and log replay we would have file extent items
> +			 * that point to unwritten locations (garbage).
> +			 */

Is regular read also being affected?

Since the em is already inserted, and we're doing DIO, there should be
no page cache for the whole dio range (as long as we didn't fallback to
buffered IO).

In that case, the problem is not only limited to fsync, but also any read.

Thanks,
Qu
> +			if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
> +				const u64 start = oe->file_offset;
> +				const u64 end = start + oe->num_bytes - 1;
> +
> +				btrfs_drop_extent_map_range(bbio->inode, start, end, false);
> +			}
> +
> +			btrfs_finish_ordered_extent(oe, NULL,
>   						    file_offset, dip->bytes,
>   						    !ret);
>   			bio->bi_status = errno_to_blk_status(ret);
Filipe Manana May 15, 2024, 9:47 a.m. UTC | #2
On Tue, May 14, 2024 at 11:15 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/5/14 23:53, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > During a COW direct IO write if the call to btrfs_extract_ordered_extent()
> > fails, we don't submit any bios, cleanup the ordered extent but we leave
> > the extent maps we created for the write around.
> >
> > These extent maps point to locations on disk that were not written to,
> > since we haven't submitted a bio for them, and they are new an in the list
> > of modified extents.
> >
> > This means that if we fsync the file after, we log file extent items based
> > on those extent maps, which are invalid since they point to unwritten
> > locations. If a power failure happens after the fsync, on log replay we
> > get a corrupted file range.
> >
> > Fix this by dropping the extent maps if btrfs_extract_ordered_extent()
> > failed.
> >
> > Fixes: b73a6fd1b1ef ("btrfs: split partial dio bios before submit")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Just a small question inlined below.
> > ---
> >   fs/btrfs/inode.c | 21 +++++++++++++++++++--
> >   1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 5a1014122088..f04852e44123 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7888,11 +7888,28 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
> >        * remaining pages is blocked on the outstanding ordered extent.
> >        */
> >       if (iter->flags & IOMAP_WRITE) {
> > +             struct btrfs_ordered_extent *oe = dio_data->ordered;
> >               int ret;
> >
> > -             ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
> > +             ret = btrfs_extract_ordered_extent(bbio, oe);
> >               if (ret) {
> > -                     btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> > +                     /*
> > +                      * If this is a COW write it means we created new extent
> > +                      * maps for the range and they point to an unwritten
> > +                      * location since we got an error and we don't submit
> > +                      * a bio. We must drop any extent maps within the range,
> > +                      * otherwise a fast fsync would log them and after a
> > +                      * crash and log replay we would have file extent items
> > +                      * that point to unwritten locations (garbage).
> > +                      */
>
> Is regular read also being affected?

No, it isn't, I'll explain below.

>
> Since the em is already inserted, and we're doing DIO, there should be
> no page cache for the whole dio range (as long as we didn't fallback to
> buffered IO).
>
> In that case, the problem is not only limited to fsync, but also any read.

Nop, it has nothing to do with not being buffered IO.

So what happens in this particular error path is that on error we call
btrfs_finish_ordered_extent() with "uptodate" false.

This sets the IO_ERR flag on the ordered extent and then queues the
ordered extent completion,
which results in executing btrfs_finish_one_ordered() in a work queue.

There we check the error flag on the ordered extent, and in fact we
drop extent maps in range - exactly what this patch is doing.

The problem is that once we unlock the inode in the dio write path,
another task can call fsync.

If the fsync is a "fast fsync", it does not wait for ordered extents
to complete before going over new extent maps.
So if the work queue job only finishes after the fsync logs extent
maps, we get the corruption - a log tree with file extent items
pointing to unwritten extents.

For the read path this doesn't happen.

Because for reads, when they start we lock the extent range and wait
for any ordered extents in the range to complete (we call
btrfs_lock_and_flush_ordered_range()),
which will result in dropping the extent maps in the range when the
ordered extents complete.

These are probably a lot of non-obvious details.
So I'll send a v2 with these details in the changelog and comment.

I also noticed there's at least one more error path with the same
problem, so I'll move the call to drop extent maps to
btrfs_finish_ordered_extent().

I'm giving a full fstests run with that updated version and sending it
out later.

Thanks.


>
> Thanks,
> Qu
> > +                     if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
> > +                             const u64 start = oe->file_offset;
> > +                             const u64 end = start + oe->num_bytes - 1;
> > +
> > +                             btrfs_drop_extent_map_range(bbio->inode, start, end, false);
> > +                     }
> > +
> > +                     btrfs_finish_ordered_extent(oe, NULL,
> >                                                   file_offset, dip->bytes,
> >                                                   !ret);
> >                       bio->bi_status = errno_to_blk_status(ret);
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5a1014122088..f04852e44123 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7888,11 +7888,28 @@  static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	 * remaining pages is blocked on the outstanding ordered extent.
 	 */
 	if (iter->flags & IOMAP_WRITE) {
+		struct btrfs_ordered_extent *oe = dio_data->ordered;
 		int ret;
 
-		ret = btrfs_extract_ordered_extent(bbio, dio_data->ordered);
+		ret = btrfs_extract_ordered_extent(bbio, oe);
 		if (ret) {
-			btrfs_finish_ordered_extent(dio_data->ordered, NULL,
+			/*
+			 * If this is a COW write it means we created new extent
+			 * maps for the range and they point to an unwritten
+			 * location since we got an error and we don't submit
+			 * a bio. We must drop any extent maps within the range,
+			 * otherwise a fast fsync would log them and after a
+			 * crash and log replay we would have file extent items
+			 * that point to unwritten locations (garbage).
+			 */
+			if (!test_bit(BTRFS_ORDERED_NOCOW, &oe->flags)) {
+				const u64 start = oe->file_offset;
+				const u64 end = start + oe->num_bytes - 1;
+
+				btrfs_drop_extent_map_range(bbio->inode, start, end, false);
+			}
+
+			btrfs_finish_ordered_extent(oe, NULL,
 						    file_offset, dip->bytes,
 						    !ret);
 			bio->bi_status = errno_to_blk_status(ret);