diff mbox series

btrfs: fix a race which can lead to unreported write failure

Message ID 8bc4c2aaa8d32ad92838b3778a85660ba7c6bfa8.1695617943.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix a race which can lead to unreported write failure | expand

Commit Message

Qu Wenruo Sept. 25, 2023, 4:59 a.m. UTC
[RACE]
For write back to chunks with multiple mirrors, there is a race that due
to when the bioc->error is checked, we may falsely consider a write is
successful:

               Thread A                |            Thread B
---------------------------------------+----------------------------------
 btrfs_orig_write_end_io()             | btrfs_clone_write_end_io()
 |  this bio failed                    | |  this bio failed
 |                                     | |
 |- atommic_inc(&bioc->error);         | |
 |- atomic_read(&bioc->error)          | |
 |  So far we only hit one error,      | |
 |  thus can still consider the write  | |
 |  succeeded                          | |
 `- bio->bi_status = BLK_STS_OK;       | |
                                       | `- atomic_inc(&bioc->error);

This can lead to data loss, especially for metadata which by default
goes with duplication.

[FIX]
Instead of only relying on btrfs_orig_write_end_io() to determine if the
bio is successful, also check the error inside the
btrfs_clone_write_end_io().

If any call site found we have exceed the tolerance, mark the original
bio as failed.

Yes, we still have races between atomic_inc() and atomic_read() in all
the endio threads.

But we have ensured the last thread calling atomic_read() would have a
correct view to do the final call, thus fixing the problem.

Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig Sept. 25, 2023, 8:57 a.m. UTC | #1
On Mon, Sep 25, 2023 at 02:29:28PM +0930, Qu Wenruo wrote:
>                Thread A                |            Thread B
> ---------------------------------------+----------------------------------
>  btrfs_orig_write_end_io()             | btrfs_clone_write_end_io()
>  |  this bio failed                    | |  this bio failed
>  |                                     | |
>  |- atommic_inc(&bioc->error);         | |
>  |- atomic_read(&bioc->error)          | |
>  |  So far we only hit one error,      | |
>  |  thus can still consider the write  | |
>  |  succeeded                          | |
>  `- bio->bi_status = BLK_STS_OK;       | |
>                                        | `- atomic_inc(&bioc->error);

I don't think this scenario can happen.

btrfs_clone_write_end_io increments bioc->error before calling
bio_endio on the orig_bio, which is the earliest time
btrfs_orig_write_end_io will be called for this original bio due to
the earlier bio_inc_remaining.
Qu Wenruo Sept. 25, 2023, 9:17 a.m. UTC | #2
On 2023/9/25 18:27, Christoph Hellwig wrote:
> On Mon, Sep 25, 2023 at 02:29:28PM +0930, Qu Wenruo wrote:
>>                 Thread A                |            Thread B
>> ---------------------------------------+----------------------------------
>>   btrfs_orig_write_end_io()             | btrfs_clone_write_end_io()
>>   |  this bio failed                    | |  this bio failed
>>   |                                     | |
>>   |- atommic_inc(&bioc->error);         | |
>>   |- atomic_read(&bioc->error)          | |
>>   |  So far we only hit one error,      | |
>>   |  thus can still consider the write  | |
>>   |  succeeded                          | |
>>   `- bio->bi_status = BLK_STS_OK;       | |
>>                                         | `- atomic_inc(&bioc->error);
>
> I don't think this scenario can happen.
>
> btrfs_clone_write_end_io increments bioc->error before calling
> bio_endio on the orig_bio, which is the earliest time
> btrfs_orig_write_end_io will be called for this original bio due to
> the earlier bio_inc_remaining.
>
Sorry, I couldn't find out a hard guarantee to make
btrfs_orig_write_end_io() to always be executed after
btrfs_clone_write_end_io().

We can have a case that there is a RAID1, the first disk is an SSD while
the 2nd disk is a slow HDD.

And the last mirror of a write (aka, the btrfs_orig_write_end_io()) is
submitted to the faster disk, thus it can set orig_bio to BLK_STS_OK
even if the faster disk failed the write.

Yes, the chained bio ensured the original endio function can only be
called after all chained (cloned) ones finished, but there is no ensured
on the orig_bio->bi_status checks.
Thus we need to do the check on all chained and embedded bios.

Thanks,
Qu
Christoph Hellwig Sept. 25, 2023, 9:20 a.m. UTC | #3
On Mon, Sep 25, 2023 at 06:47:50PM +0930, Qu Wenruo wrote:
> Sorry, I couldn't find out a hard guarantee to make
> btrfs_orig_write_end_io() to always be executed after
> btrfs_clone_write_end_io().

That is how bio_inc_remaining works.  For each call to it you need
an additional bio_endio call to counter it before ->end_io is called.

So no matter how fast orig_bio is executed by the underlying
driver/hardware, btrfs_orig_write_end_io will only be called once
both orig_bio itself and any clone has completed.
Qu Wenruo Sept. 25, 2023, 9:39 a.m. UTC | #4
On 2023/9/25 18:50, Christoph Hellwig wrote:
> On Mon, Sep 25, 2023 at 06:47:50PM +0930, Qu Wenruo wrote:
>> Sorry, I couldn't find out a hard guarantee to make
>> btrfs_orig_write_end_io() to always be executed after
>> btrfs_clone_write_end_io().
>
> That is how bio_inc_remaining works.  For each call to it you need
> an additional bio_endio call to counter it before ->end_io is called.
>
> So no matter how fast orig_bio is executed by the underlying
> driver/hardware, btrfs_orig_write_end_io will only be called once
> both orig_bio itself and any clone has completed.

Oh, that's what I missed.

After checking bio_endio(), it's indeed call bio_remaining_done() to
check the chained ones, and only after that, the real bio would be called.

Thus even if the last mirror finishes first, it would only reduce
__bi_remaining, and exit, not really calling the btrfs_orig_write_end_io().

If the last bio finished is a cloned one, it would call bio_endio() on
the orig_bio, thus trigger the real endio for it (aka,
btrfs_orig_write_end_io()).

Thanks for pointing this out,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 42f1f87f1872..4eef135b57af 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -417,8 +417,6 @@  static void btrfs_orig_write_end_io(struct bio *bio)
 	 */
 	if (atomic_read(&bioc->error) > bioc->max_errors)
 		bio->bi_status = BLK_STS_IOERR;
-	else
-		bio->bi_status = BLK_STS_OK;
 
 	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
@@ -435,6 +433,9 @@  static void btrfs_clone_write_end_io(struct bio *bio)
 		stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	}
 
+	if (atomic_read(&stripe->bioc->error) >= stripe->bioc->max_errors)
+		stripe->bioc->orig_bio->bi_status = BLK_STS_IOERR;
+
 	/* Pass on control to the original bio this one was cloned from */
 	bio_endio(stripe->bioc->orig_bio);
 	bio_put(bio);