Message ID | 8bc4c2aaa8d32ad92838b3778a85660ba7c6bfa8.1695617943.git.wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix a race which can lead to unreported write failure | expand |
On Mon, Sep 25, 2023 at 02:29:28PM +0930, Qu Wenruo wrote: > Thread A | Thread B > ---------------------------------------+---------------------------------- > btrfs_orig_write_end_io() | btrfs_clone_write_end_io() > | this bio failed | | this bio failed > | | | > |- atommic_inc(&bioc->error); | | > |- atomic_read(&bioc->error) | | > | So far we only hit one error, | | > | thus can still consider the write | | > | succeeded | | > `- bio->bi_status = BLK_STS_OK; | | > | `- atomic_inc(&bioc->error); I don't think this scenario can happen. btrfs_clone_write_end_io increments bioc->error before calling bio_endio on the orig_bio, which is the earliest time btrfs_orig_write_end_io will be called for this original bio due to the earlier bio_inc_remaining.
On 2023/9/25 18:27, Christoph Hellwig wrote: > On Mon, Sep 25, 2023 at 02:29:28PM +0930, Qu Wenruo wrote: >> Thread A | Thread B >> ---------------------------------------+---------------------------------- >> btrfs_orig_write_end_io() | btrfs_clone_write_end_io() >> | this bio failed | | this bio failed >> | | | >> |- atommic_inc(&bioc->error); | | >> |- atomic_read(&bioc->error) | | >> | So far we only hit one error, | | >> | thus can still consider the write | | >> | succeeded | | >> `- bio->bi_status = BLK_STS_OK; | | >> | `- atomic_inc(&bioc->error); > > I don't think this scenario can happen. > > btrfs_clone_write_end_io increments bioc->error before calling > bio_endio on the orig_bio, which is the earliest time > btrfs_orig_write_end_io will be called for this original bio due to > the earlier bio_inc_remaining. > Sorry, I couldn't find out a hard guarantee to make btrfs_orig_write_end_io() to always be executed after btrfs_clone_write_end_io(). We can have a case that there is a RAID1, the first disk is an SSD while the 2nd disk is a slow HDD. And the last mirror of a write (aka, the btrfs_orig_write_end_io()) is submitted to the faster disk, thus it can set orig_bio to BLK_STS_OK even if the faster disk failed the write. Yes, the chained bio ensured the original endio function can only be called after all chained (cloned) ones finished, but there is no ensured on the orig_bio->bi_status checks. Thus we need to do the check on all chained and embedded bios. Thanks, Qu
On Mon, Sep 25, 2023 at 06:47:50PM +0930, Qu Wenruo wrote: > Sorry, I couldn't find out a hard guarantee to make > btrfs_orig_write_end_io() to always be executed after > btrfs_clone_write_end_io(). That is how bio_inc_remaining works. For each call to it you need an additional bio_endio call to counter it before ->end_io is called. So no matter how fast orig_bio is executed by the underlying driver/hardware, btrfs_orig_write_end_io will only be called once both orig_bio itself and any clone has completed.
On 2023/9/25 18:50, Christoph Hellwig wrote: > On Mon, Sep 25, 2023 at 06:47:50PM +0930, Qu Wenruo wrote: >> Sorry, I couldn't find out a hard guarantee to make >> btrfs_orig_write_end_io() to always be executed after >> btrfs_clone_write_end_io(). > > That is how bio_inc_remaining works. For each call to it you need > an additional bio_endio call to counter it before ->end_io is called. > > So no matter how fast orig_bio is executed by the underlying > driver/hardware, btrfs_orig_write_end_io will only be called once > both orig_bio itself and any clone has completed. Oh, that's what I missed. After checking bio_endio(), it's indeed call bio_remaining_done() to check the chained ones, and only after that, the real bio would be called. Thus even if the last mirror finishes first, it would only reduce __bi_remaining, and exit, not really calling the btrfs_orig_write_end_io(). If the last bio finished is a cloned one, it would call bio_endio() on the orig_bio, thus trigger the real endio for it (aka, btrfs_orig_write_end_io()). Thanks for pointing this out, Qu
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index 42f1f87f1872..4eef135b57af 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -417,8 +417,6 @@ static void btrfs_orig_write_end_io(struct bio *bio) */ if (atomic_read(&bioc->error) > bioc->max_errors) bio->bi_status = BLK_STS_IOERR; - else - bio->bi_status = BLK_STS_OK; btrfs_orig_bbio_end_io(bbio); btrfs_put_bioc(bioc); @@ -435,6 +433,9 @@ static void btrfs_clone_write_end_io(struct bio *bio) stripe->physical = bio->bi_iter.bi_sector << SECTOR_SHIFT; } + if (atomic_read(&stripe->bioc->error) >= stripe->bioc->max_errors) + stripe->bioc->orig_bio->bi_status = BLK_STS_IOERR; + /* Pass on control to the original bio this one was cloned from */ bio_endio(stripe->bioc->orig_bio); bio_put(bio);
[RACE] For write back to chunks with multiple mirrors, there is a race that due to when the bioc->error is checked, we may falsely consider a write is successful: Thread A | Thread B ---------------------------------------+---------------------------------- btrfs_orig_write_end_io() | btrfs_clone_write_end_io() | this bio failed | | this bio failed | | | |- atommic_inc(&bioc->error); | | |- atomic_read(&bioc->error) | | | So far we only hit one error, | | | thus can still consider the write | | | succeeded | | `- bio->bi_status = BLK_STS_OK; | | | `- atomic_inc(&bioc->error); This can lead to data loss, especially for metadata which by default goes with duplication. [FIX] Instead of only relying on btrfs_orig_write_end_io() to determine if the bio is successful, also check the error inside the btrfs_clone_write_end_io(). If any call site found we have exceed the tolerance, mark the original bio as failed. Yes, we still have races between atomic_inc() and atomic_read() in all the endio threads. But we have ensured the last thread calling atomic_read() would have a correct view to do the final call, thus fixing the problem. Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning") Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/bio.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)