btrfs: perform stripe tree lookup for scrub

Message ID	4895772fd73872ee2cc23d152e50db28a5ca5bbc.1696867165.git.johannes.thumshirn@wdc.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-btrfs-owner@vger.kernel.org> IronPort-SDR: gXl8BcBtLmTbnWaKl6/HFskSMYfbh69gtoMSLjtsLiactmQq76vjA0CKmuQS/yCEswTpoVEI/u bvOOVlS+YI0YfAM3pE22oYftAGGoLRI+a/7CVq8yWnl2REIj171sQ4B4eKwVaJpvUrG66Tub9I Qb9S6zZy1cCoG5Fkag/B0+exiakbCDxqy8GbtJ1pEa/XY9Kn0DvfeBxivxhxUinY9x2ui057rY bjbnrcusH5fXFNOLuNc9wzNYWVYuP9Aps7ZCFWok/OLdDTsHv2ZP/T0CKAuKFCu19CV0BtyW2l yMA= IronPort-SDR: u5eyCUk4aDzFOSpPGewps3iRAVP9S2GGkAn0lNJbbpa8gdmmCXqVT9/M/rHp075uNw1KQ3ynDz MZxqnba2YVHrWwnf2UhQ/XMltD0TEEPCKuH0NKdK98RIxBpZbYXf9PrDcdNl2HguijmomasmPb 1V+3srHwb6Oejtf9INrSIRh0lMBCTRAMPG75GhUzp1dHKt3na6ECTIhz6YcRe4YRNlCZZ6TQ5O jfS+9z0gVw0F33xF/fpkx8h3mdouvhZer/u733I1q/fJ4QzrHTuV+7BuXj257VmZgDmG1ancd/ NM0= WDCIronportException: Internal From: Johannes Thumshirn <johannes.thumshirn@wdc.com> To: David Sterba <dsterba@suse.cz> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>, linux-btrfs@vger.kernel.org, Qu Wenru <wqu@suse.com> Subject: [PATCH] btrfs: perform stripe tree lookup for scrub Date: Mon, 9 Oct 2023 09:00:13 -0700 Message-ID: <4895772fd73872ee2cc23d152e50db28a5ca5bbc.1696867165.git.johannes.thumshirn@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	btrfs: perform stripe tree lookup for scrub \| expand btrfs: perform stripe tree lookup for scrub

Johannes Thumshirn Oct. 9, 2023, 4 p.m. UTC

In case we're scrubbing a filesystem which uses the RAID stripe tree for
multi-device logical to physical address translation, perform an extra
block mapping step to get the real on0disk stripe length from the stripe
tree when scrubbing the sectors.

This prevents a double completion of the btrfs_bio caused by splitting the
underlying bio and ultimately a use-after-free.

Cc: Qu Wenru <wqu@suse.com>
Fixes: 0708614f9c28 ("btrfs: scrub: implement raid stripe tree support")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/scrub.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

Qu Wenruo Oct. 9, 2023, 8:48 p.m. UTC | #1

On 2023/10/10 02:30, Johannes Thumshirn wrote:
> In case we're scrubbing a filesystem which uses the RAID stripe tree for
> multi-device logical to physical address translation, perform an extra
> block mapping step to get the real on0disk stripe length from the stripe
> tree when scrubbing the sectors.
> 
> This prevents a double completion of the btrfs_bio caused by splitting the
> underlying bio and ultimately a use-after-free.

My concern is still the same, why we hit double endio calls in the first 
place?

In the bio layer, if the bbio->inode is NULL, the real endio would only 
be called when all the split bios finished, thus it doesn't seem to 
cause double endio calls.

Mind to explain it more?
> 
> Cc: Qu Wenru <wqu@suse.com>
> Fixes: 0708614f9c28 ("btrfs: scrub: implement raid stripe tree support")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   fs/btrfs/scrub.c | 26 ++++++++++++++++++++++----
>   1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index c477a14f1281..2ac574bde350 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1640,6 +1640,7 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
>   {
>   	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
>   	struct btrfs_bio *bbio = NULL;
> +	u64 stripe_len = BTRFS_STRIPE_LEN;
>   	int mirror = stripe->mirror_num;
>   	int i;
>   
> @@ -1651,8 +1652,9 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
>   
>   		/* The current sector cannot be merged, submit the bio. */
>   		if (bbio &&
> -		    ((i > 0 && !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> -		     bbio->bio.bi_iter.bi_size >= BTRFS_STRIPE_LEN)) {
> +		    ((i > 0 &&
> +		      !test_bit(i - 1, &stripe->extent_sector_bitmap)) ||
> +		     bbio->bio.bi_iter.bi_size >= stripe_len)) {
>   			ASSERT(bbio->bio.bi_iter.bi_size);
>   			atomic_inc(&stripe->pending_io);
>   			btrfs_submit_bio(bbio, mirror);
> @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
>   		}
>   
>   		if (!bbio) {
> +			struct btrfs_io_stripe io_stripe = {};
> +			struct btrfs_io_context *bioc = NULL;
> +			const u64 logical = stripe->logical +
> +					    (i << fs_info->sectorsize_bits);
> +			int err;
> +
>   			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
>   					       fs_info, scrub_read_endio, stripe);
> -			bbio->bio.bi_iter.bi_sector = (stripe->logical +
> -				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
> +			bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT;
> +
> +			io_stripe.is_scrub = true;
> +			err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
> +					      &stripe_len, &bioc, &io_stripe,
> +					      &mirror);

Another thing is, I noticed that for zoned devices, we still go through 
the repair path, just skip the writeback and rely on relocation to repair.

In that case, would scrub_stripe_submit_repair_read() need the same 
treatment or can be completely skipped instead?

Thanks,
Qu
> +			btrfs_put_bioc(bioc);
> +			if (err) {
> +				btrfs_bio_end_io(bbio,
> +						 errno_to_blk_status(err));
> +				return;
> +			}
>   		}
>   
>   		__bio_add_page(&bbio->bio, page, fs_info->sectorsize, pgoff);

Johannes Thumshirn Oct. 10, 2023, 8:37 a.m. UTC | #2

On 09.10.23 22:48, Qu Wenruo wrote:
> 
> 
> On 2023/10/10 02:30, Johannes Thumshirn wrote:
>> In case we're scrubbing a filesystem which uses the RAID stripe tree for
>> multi-device logical to physical address translation, perform an extra
>> block mapping step to get the real on0disk stripe length from the stripe
>> tree when scrubbing the sectors.
>>
>> This prevents a double completion of the btrfs_bio caused by splitting the
>> underlying bio and ultimately a use-after-free.
> 
> My concern is still the same, why we hit double endio calls in the first
> place?
> 
> In the bio layer, if the bbio->inode is NULL, the real endio would only
> be called when all the split bios finished, thus it doesn't seem to
> cause double endio calls.
> 
> Mind to explain it more?


Hmm indeed you're right. The patch probably only masks the UAF. On the 
other hand, there's no point in submitting a bio for a range that needs 
to be split, if we can avoid it.

Regarding the UAF, the KASAN report points to an object allocated by 
btrfs_bio_alloc() in scrub_submit_extent_sector_read(), so it's the bbio.

Let me check if changing bbio->pending_ios from atomic_t to refcount_t 
does give some hints here.

Still I think the patch is still useful regardless of the UAF.

>> @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
>>    		}
>>    
>>    		if (!bbio) {
>> +			struct btrfs_io_stripe io_stripe = {};
>> +			struct btrfs_io_context *bioc = NULL;
>> +			const u64 logical = stripe->logical +
>> +					    (i << fs_info->sectorsize_bits);
>> +			int err;
>> +
>>    			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
>>    					       fs_info, scrub_read_endio, stripe);
>> -			bbio->bio.bi_iter.bi_sector = (stripe->logical +
>> -				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
>> +			bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT;
>> +
>> +			io_stripe.is_scrub = true;
>> +			err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
>> +					      &stripe_len, &bioc, &io_stripe,
>> +					      &mirror);
> 
> Another thing is, I noticed that for zoned devices, we still go through
> the repair path, just skip the writeback and rely on relocation to repair.
> 
> In that case, would scrub_stripe_submit_repair_read() need the same
> treatment or can be completely skipped instead?

I think for zoned devices we should go via relocation repair. But 
currently there is nothing that prevents RST form being used with 
regular non-zoned devices (and for the RAID56 write hole fixes with RST 
we'd have regular devies as well) so we need the fix there as well.

Qu Wenruo Oct. 10, 2023, 8:51 a.m. UTC | #3

On 2023/10/10 19:07, Johannes Thumshirn wrote:
> On 09.10.23 22:48, Qu Wenruo wrote:
>>
>>
>> On 2023/10/10 02:30, Johannes Thumshirn wrote:
>>> In case we're scrubbing a filesystem which uses the RAID stripe tree for
>>> multi-device logical to physical address translation, perform an extra
>>> block mapping step to get the real on0disk stripe length from the stripe
>>> tree when scrubbing the sectors.
>>>
>>> This prevents a double completion of the btrfs_bio caused by splitting the
>>> underlying bio and ultimately a use-after-free.
>>
>> My concern is still the same, why we hit double endio calls in the first
>> place?
>>
>> In the bio layer, if the bbio->inode is NULL, the real endio would only
>> be called when all the split bios finished, thus it doesn't seem to
>> cause double endio calls.
>>
>> Mind to explain it more?
>
>
> Hmm indeed you're right. The patch probably only masks the UAF. On the
> other hand, there's no point in submitting a bio for a range that needs
> to be split, if we can avoid it.

This is the opposite IIRC, the idea of introducing bio layer is to move
the RAID stripe split into that new bio layer, so upper layer doesn't
need to bother the split.

The same applies to the RST stripe AFAIK.

>
> Regarding the UAF, the KASAN report points to an object allocated by
> btrfs_bio_alloc() in scrub_submit_extent_sector_read(), so it's the bbio.
>
> Let me check if changing bbio->pending_ios from atomic_t to refcount_t
> does give some hints here.
>
> Still I think the patch is still useful regardless of the UAF.
>
>>> @@ -1660,10 +1662,26 @@ static void scrub_submit_extent_sector_read(struct scrub_ctx *sctx,
>>>     		}
>>>
>>>     		if (!bbio) {
>>> +			struct btrfs_io_stripe io_stripe = {};
>>> +			struct btrfs_io_context *bioc = NULL;
>>> +			const u64 logical = stripe->logical +
>>> +					    (i << fs_info->sectorsize_bits);
>>> +			int err;
>>> +
>>>     			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
>>>     					       fs_info, scrub_read_endio, stripe);
>>> -			bbio->bio.bi_iter.bi_sector = (stripe->logical +
>>> -				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
>>> +			bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT;
>>> +
>>> +			io_stripe.is_scrub = true;
>>> +			err = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
>>> +					      &stripe_len, &bioc, &io_stripe,
>>> +					      &mirror);
>>
>> Another thing is, I noticed that for zoned devices, we still go through
>> the repair path, just skip the writeback and rely on relocation to repair.
>>
>> In that case, would scrub_stripe_submit_repair_read() need the same
>> treatment or can be completely skipped instead?
>
> I think for zoned devices we should go via relocation repair. But
> currently there is nothing that prevents RST form being used with
> regular non-zoned devices (and for the RAID56 write hole fixes with RST
> we'd have regular devies as well) so we need the fix there as well.
>

Can we have a more unified behavior? Like if we go emulated zoned mode
(running zoned mode on non-zoned devices), we just go all the zoned way
by relocation other than in-place write.

Or is the balance too heavy for just repairing one sector?
If so, you may want to considering splitting the per-fs zoned status
between per-device zoned status.
As in that case, we need to determine how the repair should be done.
(in-place write for emulated zoned RST mode, or relocation if that bad
mirror is on a real zoned device).

I have no special preference here, either would work for me.

Thanks,
Qu

David Sterba Oct. 10, 2023, 1:42 p.m. UTC | #4

On Tue, Oct 10, 2023 at 08:37:10AM +0000, Johannes Thumshirn wrote:
> On 09.10.23 22:48, Qu Wenruo wrote:
> > 
> > 
> > On 2023/10/10 02:30, Johannes Thumshirn wrote:
> >> In case we're scrubbing a filesystem which uses the RAID stripe tree for
> >> multi-device logical to physical address translation, perform an extra
> >> block mapping step to get the real on0disk stripe length from the stripe
> >> tree when scrubbing the sectors.
> >>
> >> This prevents a double completion of the btrfs_bio caused by splitting the
> >> underlying bio and ultimately a use-after-free.
> > 
> > My concern is still the same, why we hit double endio calls in the first
> > place?
> > 
> > In the bio layer, if the bbio->inode is NULL, the real endio would only
> > be called when all the split bios finished, thus it doesn't seem to
> > cause double endio calls.
> > 
> > Mind to explain it more?
> 
> 
> Hmm indeed you're right. The patch probably only masks the UAF. On the 
> other hand, there's no point in submitting a bio for a range that needs 
> to be split, if we can avoid it.
> 
> Regarding the UAF, the KASAN report points to an object allocated by 
> btrfs_bio_alloc() in scrub_submit_extent_sector_read(), so it's the bbio.
> 
> Let me check if changing bbio->pending_ios from atomic_t to refcount_t 
> does give some hints here.
> 
> Still I think the patch is still useful regardless of the UAF.

I had merged the fixup yesterday before Qu replied, as I read it's not
entirely wrong as the RST is improved incrementally but please let me
know if you'd like to revert this change.

Johannes Thumshirn Oct. 10, 2023, 4:46 p.m. UTC | #5

On 10.10.23 15:49, David Sterba wrote:
> I had merged the fixup yesterday before Qu replied, as I read it's not 
> entirely wrong as the RST is improved incrementally but please let me 
> know if you'd like to revert this change.

I personally prefer having it in TBH. I'm still giving it some more days 
to investigate the UAF in case I've missed something.

As for the layering violation part, we're already looking at extents at 
the logical address space level, so for RST we need to also check the 
physical level, as the mapping isn't necessarily 1:1.

btrfs: perform stripe tree lookup for scrub

Commit Message

Comments

Patch