Message ID | 4895772fd73872ee2cc23d152e50db28a5ca5bbc.1696867165.git.johannes.thumshirn@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: perform stripe tree lookup for scrub | expand |
On 2023/10/10 02:30, Johannes Thumshirn wrote: > In case we're scrubbing a filesystem which uses the RAID stripe tree for > multi-device logical to physical address translation, perform an extra > block mapping step to get the real on0disk stripe length from the stripe > tree when scrubbing the sectors. > > This prevents a double completion of the btrfs_bio caused by splitting the > underlying bio and ultimately a use-after-free. My concern is still the same, why we hit double endio calls in the first place? In the bio layer, if the bbio->inode is NULL, the real endio would only be called when all the split bios finished, thus it doesn't seem to cause double endio calls. Mind to explain it more? > > Cc: Qu Wenru <wqu@suse.com> > Fixes: 0708614f9c28 ("btrfs: scrub: implement raid stripe tree support") > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > --- > fs/btrfs/scrub.c | 26 ++++++++++++++++++++++---- > 1 file changed, 22 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > index c477a14f1281..2ac574bde350 100644 > --- a/fs/btrfs/scrub.c > +++ b/fs/btrfs/scrub.c > @@ -1640,6 +1640,7 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, > { > struct btrfs_fs_info *fs_info = stripe->bg->fs_info; > struct btrfs_bio *bbio = NULL; > + u64 stripe_len = BTRFS_STRIPE_LEN; > int mirror = stripe->mirror_num; > int i; > > @@ -1651,8 +1652,9 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, > > /* The current sector cannot be merged, submit the bio. */ > if (bbio && > - ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) || > - bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) { > + ((i > 0 && > + !test_bit(i - 1, &stripe->extent_sector_bitmap)) || > + bbio->bio.bi_iter.bi_size >= stripe_len)) { > ASSERT(bbio->bio.bi_iter.bi_size); > atomic_inc(&stripe->pending_io); > btrfs_submit_bio(bbio, mirror); > @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, > } > > if (!bbio) { > + struct btrfs_io_stripe io_stripe = {}; > + struct btrfs_io_context *bioc = NULL; > + const u64 logical = stripe->logical + > + (i << fs_info->sectorsize_bits); > + int err; > + > bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ, > fs_info, scrub_read_endio, stripe); > - bbio->bio.bi_iter.bi_sector = (stripe->logical + > - (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT; > + bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT; > + > + io_stripe.is_scrub = true; > + err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical, > + &stripe_len, &bioc, &io_stripe, > + &mirror); Another thing is, I noticed that for zoned devices, we still go through the repair path, just skip the writeback and rely on relocation to repair. In that case, would scrub_stripe_submit_repair_read() need the same treatment or can be completely skipped instead? Thanks, Qu > + btrfs_put_bioc(bioc); > + if (err) { > + btrfs_bio_end_io(bbio, > + errno_to_blk_status(err)); > + return; > + } > } > > __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
On 09.10.23 22:48, Qu Wenruo wrote: > > > On 2023/10/10 02:30, Johannes Thumshirn wrote: >> In case we're scrubbing a filesystem which uses the RAID stripe tree for >> multi-device logical to physical address translation, perform an extra >> block mapping step to get the real on0disk stripe length from the stripe >> tree when scrubbing the sectors. >> >> This prevents a double completion of the btrfs_bio caused by splitting the >> underlying bio and ultimately a use-after-free. > > My concern is still the same, why we hit double endio calls in the first > place? > > In the bio layer, if the bbio->inode is NULL, the real endio would only > be called when all the split bios finished, thus it doesn't seem to > cause double endio calls. > > Mind to explain it more? Hmm indeed you're right. The patch probably only masks the UAF. On the other hand, there's no point in submitting a bio for a range that needs to be split, if we can avoid it. Regarding the UAF, the KASAN report points to an object allocated by btrfs_bio_alloc() in scrub_submit_extent_sector_read(), so it's the bbio. Let me check if changing bbio->pending_ios from atomic_t to refcount_t does give some hints here. Still I think the patch is still useful regardless of the UAF. >> @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, >> } >> >> if (!bbio) { >> + struct btrfs_io_stripe io_stripe = {}; >> + struct btrfs_io_context *bioc = NULL; >> + const u64 logical = stripe->logical + >> + (i << fs_info->sectorsize_bits); >> + int err; >> + >> bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ, >> fs_info, scrub_read_endio, stripe); >> - bbio->bio.bi_iter.bi_sector = (stripe->logical + >> - (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT; >> + bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT; >> + >> + io_stripe.is_scrub = true; >> + err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical, >> + &stripe_len, &bioc, &io_stripe, >> + &mirror); > > Another thing is, I noticed that for zoned devices, we still go through > the repair path, just skip the writeback and rely on relocation to repair. > > In that case, would scrub_stripe_submit_repair_read() need the same > treatment or can be completely skipped instead? I think for zoned devices we should go via relocation repair. But currently there is nothing that prevents RST form being used with regular non-zoned devices (and for the RAID56 write hole fixes with RST we'd have regular devies as well) so we need the fix there as well.
On 2023/10/10 19:07, Johannes Thumshirn wrote: > On 09.10.23 22:48, Qu Wenruo wrote: >> >> >> On 2023/10/10 02:30, Johannes Thumshirn wrote: >>> In case we're scrubbing a filesystem which uses the RAID stripe tree for >>> multi-device logical to physical address translation, perform an extra >>> block mapping step to get the real on0disk stripe length from the stripe >>> tree when scrubbing the sectors. >>> >>> This prevents a double completion of the btrfs_bio caused by splitting the >>> underlying bio and ultimately a use-after-free. >> >> My concern is still the same, why we hit double endio calls in the first >> place? >> >> In the bio layer, if the bbio->inode is NULL, the real endio would only >> be called when all the split bios finished, thus it doesn't seem to >> cause double endio calls. >> >> Mind to explain it more? > > > Hmm indeed you're right. The patch probably only masks the UAF. On the > other hand, there's no point in submitting a bio for a range that needs > to be split, if we can avoid it. This is the opposite IIRC, the idea of introducing bio layer is to move the RAID stripe split into that new bio layer, so upper layer doesn't need to bother the split. The same applies to the RST stripe AFAIK. > > Regarding the UAF, the KASAN report points to an object allocated by > btrfs_bio_alloc() in scrub_submit_extent_sector_read(), so it's the bbio. > > Let me check if changing bbio->pending_ios from atomic_t to refcount_t > does give some hints here. > > Still I think the patch is still useful regardless of the UAF. > >>> @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, >>> } >>> >>> if (!bbio) { >>> + struct btrfs_io_stripe io_stripe = {}; >>> + struct btrfs_io_context *bioc = NULL; >>> + const u64 logical = stripe->logical + >>> + (i << fs_info->sectorsize_bits); >>> + int err; >>> + >>> bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ, >>> fs_info, scrub_read_endio, stripe); >>> - bbio->bio.bi_iter.bi_sector = (stripe->logical + >>> - (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT; >>> + bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT; >>> + >>> + io_stripe.is_scrub = true; >>> + err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical, >>> + &stripe_len, &bioc, &io_stripe, >>> + &mirror); >> >> Another thing is, I noticed that for zoned devices, we still go through >> the repair path, just skip the writeback and rely on relocation to repair. >> >> In that case, would scrub_stripe_submit_repair_read() need the same >> treatment or can be completely skipped instead? > > I think for zoned devices we should go via relocation repair. But > currently there is nothing that prevents RST form being used with > regular non-zoned devices (and for the RAID56 write hole fixes with RST > we'd have regular devies as well) so we need the fix there as well. > Can we have a more unified behavior? Like if we go emulated zoned mode (running zoned mode on non-zoned devices), we just go all the zoned way by relocation other than in-place write. Or is the balance too heavy for just repairing one sector? If so, you may want to considering splitting the per-fs zoned status between per-device zoned status. As in that case, we need to determine how the repair should be done. (in-place write for emulated zoned RST mode, or relocation if that bad mirror is on a real zoned device). I have no special preference here, either would work for me. Thanks, Qu
On Tue, Oct 10, 2023 at 08:37:10AM +0000, Johannes Thumshirn wrote: > On 09.10.23 22:48, Qu Wenruo wrote: > > > > > > On 2023/10/10 02:30, Johannes Thumshirn wrote: > >> In case we're scrubbing a filesystem which uses the RAID stripe tree for > >> multi-device logical to physical address translation, perform an extra > >> block mapping step to get the real on0disk stripe length from the stripe > >> tree when scrubbing the sectors. > >> > >> This prevents a double completion of the btrfs_bio caused by splitting the > >> underlying bio and ultimately a use-after-free. > > > > My concern is still the same, why we hit double endio calls in the first > > place? > > > > In the bio layer, if the bbio->inode is NULL, the real endio would only > > be called when all the split bios finished, thus it doesn't seem to > > cause double endio calls. > > > > Mind to explain it more? > > > Hmm indeed you're right. The patch probably only masks the UAF. On the > other hand, there's no point in submitting a bio for a range that needs > to be split, if we can avoid it. > > Regarding the UAF, the KASAN report points to an object allocated by > btrfs_bio_alloc() in scrub_submit_extent_sector_read(), so it's the bbio. > > Let me check if changing bbio->pending_ios from atomic_t to refcount_t > does give some hints here. > > Still I think the patch is still useful regardless of the UAF. I had merged the fixup yesterday before Qu replied, as I read it's not entirely wrong as the RST is improved incrementally but please let me know if you'd like to revert this change.
On 10.10.23 15:49, David Sterba wrote: > I had merged the fixup yesterday before Qu replied, as I read it's not > entirely wrong as the RST is improved incrementally but please let me > know if you'd like to revert this change. I personally prefer having it in TBH. I'm still giving it some more days to investigate the UAF in case I've missed something. As for the layering violation part, we're already looking at extents at the logical address space level, so for RST we need to also check the physical level, as the mapping isn't necessarily 1:1.
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index c477a14f1281..2ac574bde350 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1640,6 +1640,7 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, { struct btrfs_fs_info *fs_info = stripe->bg->fs_info; struct btrfs_bio *bbio = NULL; + u64 stripe_len = BTRFS_STRIPE_LEN; int mirror = stripe->mirror_num; int i; @@ -1651,8 +1652,9 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, /* The current sector cannot be merged, submit the bio. */ if (bbio && - ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) || - bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) { + ((i > 0 && + !test_bit(i - 1, &stripe->extent_sector_bitmap)) || + bbio->bio.bi_iter.bi_size >= stripe_len)) { ASSERT(bbio->bio.bi_iter.bi_size); atomic_inc(&stripe->pending_io); btrfs_submit_bio(bbio, mirror); @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx, } if (!bbio) { + struct btrfs_io_stripe io_stripe = {}; + struct btrfs_io_context *bioc = NULL; + const u64 logical = stripe->logical + + (i << fs_info->sectorsize_bits); + int err; + bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ, fs_info, scrub_read_endio, stripe); - bbio->bio.bi_iter.bi_sector = (stripe->logical + - (i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT; + bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT; + + io_stripe.is_scrub = true; + err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical, + &stripe_len, &bioc, &io_stripe, + &mirror); + btrfs_put_bioc(bioc); + if (err) { + btrfs_bio_end_io(bbio, + errno_to_blk_status(err)); + return; + } } __bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);
In case we're scrubbing a filesystem which uses the RAID stripe tree for multi-device logical to physical address translation, perform an extra block mapping step to get the real on0disk stripe length from the stripe tree when scrubbing the sectors. This prevents a double completion of the btrfs_bio caused by splitting the underlying bio and ultimately a use-after-free. Cc: Qu Wenru <wqu@suse.com> Fixes: 0708614f9c28 ("btrfs: scrub: implement raid stripe tree support") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> --- fs/btrfs/scrub.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-)