diff mbox series

[v2,3/3] btrfs: update stripe_extent delete loop assumptions

Message ID 20240711-b4-rst-updates-v2-3-d7b8113d88b7@kernel.org (mailing list archive)
State New, archived
Series btrfs: more RAID stripe tree updates

Commit Message

Johannes Thumshirn July 11, 2024, 6:21 a.m. UTC
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs_delete_raid_extent() was written under the assumption that its
call chain always passes a start/length tuple that matches a single
extent. But btrfs_delete_raid_extent() is called by
do_free_extent_accounting(), which in turn is called by
__btrfs_free_extent().

But this call-chain passes in a start address and a length that can
possibly match multiple on-disk extents.

To handle this, adjust the start and length of each btree node lookup
so that we do not delete beyond the requested range.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Qu Wenruo July 11, 2024, 6:55 a.m. UTC | #1
On 2024/7/11 15:51, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> btrfs_delete_raid_extent() was written under the assumption that its
> call chain always passes a start/length tuple that matches a single
> extent. But btrfs_delete_raid_extent() is called by
> do_free_extent_accounting(), which in turn is called by
> __btrfs_free_extent().

But from the call site __btrfs_free_extent(), it is still called for a
single extent.

Otherwise we would get an error and abort the current transaction.

> 
> But this call-chain passes in a start address and a length that can
> possibly match multiple on-disk extents.

Mind giving a more detailed example of this?

Thanks,
Qu

> 
> To make this possible, we have to adjust the start and length of each
> btree node lookup, to not delete beyond the requested range.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   fs/btrfs/raid-stripe-tree.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index fd56535b2289..6f65be334637 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>   		if (ret)
>   			break;
>   
> +		start += key.offset;
> +		length -= key.offset;
> +		if (length == 0)
> +			break;
> +
>   		btrfs_release_path(path);
>   	}
>   
>
Qu Wenruo July 11, 2024, 7:44 a.m. UTC | #2
On 2024/7/11 16:25, Qu Wenruo wrote:
> 
> 
> On 2024/7/11 15:51, Johannes Thumshirn wrote:
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> btrfs_delete_raid_extent() was written under the assumption that its
>> call chain always passes a start/length tuple that matches a single
>> extent. But btrfs_delete_raid_extent() is called by
>> do_free_extent_accounting(), which in turn is called by
>> __btrfs_free_extent().
> 
> But from the call site __btrfs_free_extent(), it is still called for a 
> single extent.
> 
> Or we will get an error and abort the current transaction.

Or does it mean one data extent can have multiple RST entries?

Is that non-zoned RST specific behavior?
As far as I remember we split ordered extents for zoned devices, so
there should always be one extent for each split OE.

Thanks,
Qu
> 
>>
>> But this call-chain passes in a start address and a length that can
>> possibly match multiple on-disk extents.
> 
> Mind giving a more detailed example of this?
> 
> Thanks,
> Qu
> 
>>
>> To make this possible, we have to adjust the start and length of each
>> btree node lookup, to not delete beyond the requested range.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>   fs/btrfs/raid-stripe-tree.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>> index fd56535b2289..6f65be334637 100644
>> --- a/fs/btrfs/raid-stripe-tree.c
>> +++ b/fs/btrfs/raid-stripe-tree.c
>> @@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>>           if (ret)
>>               break;
>> +        start += key.offset;
>> +        length -= key.offset;
>> +        if (length == 0)
>> +            break;
>> +
>>           btrfs_release_path(path);
>>       }
>>
>
Qu Wenruo July 11, 2024, 7:55 a.m. UTC | #3
On 2024/7/11 17:14, Qu Wenruo wrote:
> 
> 
> On 2024/7/11 16:25, Qu Wenruo wrote:
>>
>>
>> On 2024/7/11 15:51, Johannes Thumshirn wrote:
>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> btrfs_delete_raid_extent() was written under the assumption that its
>>> call chain always passes a start/length tuple that matches a single
>>> extent. But btrfs_delete_raid_extent() is called by
>>> do_free_extent_accounting(), which in turn is called by
>>> __btrfs_free_extent().
>>
>> But from the call site __btrfs_free_extent(), it is still called for a 
>> single extent.
>>
>> Or we will get an error and abort the current transaction.
> 
> Or does it mean one data extent can have multiple RST entries?
> 
> Is that non-zoned RST specific behavior?
> As far as I remember we split ordered extents for zoned devices, so
> there should always be one extent for each split OE.


OK, it's indeed an RST specific behavior (at least for RST with 
non-zoned devices).

I can have the following layout:

         item 15 key (258 EXTENT_DATA 419430400) itemoff 15306 itemsize 53
                 generation 10 type 1 (regular)
                 extent data disk byte 1808793600 nr 117440512
                 extent data offset 0 nr 117440512 ram 117440512
                 extent compression 0 (none)

Which is a large data extent with 112MiB length.

Meanwhile for the RST entries there are 3 split ones:

         item 13 key (1808793600 RAID_STRIPE 33619968) itemoff 15835 itemsize 32
                         stripe 0 devid 2 physical 1787822080
                         stripe 1 devid 1 physical 1808793600
         item 14 key (1842413568 RAID_STRIPE 58789888) itemoff 15803 itemsize 32
                         stripe 0 devid 2 physical 1821442048
                         stripe 1 devid 1 physical 1842413568
         item 15 key (1901203456 RAID_STRIPE 25030656) itemoff 15771 itemsize 32
                         stripe 0 devid 2 physical 1880231936
                         stripe 1 devid 1 physical 1901203456

So yes, it's possible to have multiple RST entries for a single data
extent; it's no longer the old zoned behavior.

In that case, the patch looks fine to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu


> 
> Thanks,
> Qu
>>
>>>
>>> But this call-chain passes in a start address and a length that can
>>> possibly match multiple on-disk extents.
>>
>> Mind giving a more detailed example of this?
>>
>> Thanks,
>> Qu
>>
>>>
>>> To make this possible, we have to adjust the start and length of each
>>> btree node lookup, to not delete beyond the requested range.
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>> ---
>>>   fs/btrfs/raid-stripe-tree.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>>> index fd56535b2289..6f65be334637 100644
>>> --- a/fs/btrfs/raid-stripe-tree.c
>>> +++ b/fs/btrfs/raid-stripe-tree.c
>>> @@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>>>           if (ret)
>>>               break;
>>> +        start += key.offset;
>>> +        length -= key.offset;
>>> +        if (length == 0)
>>> +            break;
>>> +
>>>           btrfs_release_path(path);
>>>       }
>>>
>>
>

Patch

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index fd56535b2289..6f65be334637 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -66,6 +66,11 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 		if (ret)
 			break;
 
+		start += key.offset;
+		length -= key.offset;
+		if (length == 0)
+			break;
+
 		btrfs_release_path(path);
 	}