diff mbox series

[v2,06/14] btrfs: fix deletion of a range spanning parts two RAID stripe extents

Message ID 20250107-rst-delete-fixes-v2-6-0c7b14c0aac2@kernel.org (mailing list archive)
State New, archived
Headers show
Series btrfs: more RST delete fixes | expand

Commit Message

Johannes Thumshirn Jan. 7, 2025, 12:47 p.m. UTC
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When a user requests the deletion of a range that spans multiple stripe
extents and btrfs_search_slot() returns us the second RAID stripe extent,
we need to pick the previous item and truncate it, if there's still a
range to delete left, move on to the next item.

The following diagram illustrates the operation:

 |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
        |--- keep  ---|--- drop ---|

While at it, comment the trivial case of a whole item delete as well.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/raid-stripe-tree.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Filipe Manana Jan. 9, 2025, 3:24 p.m. UTC | #1
On Tue, Jan 7, 2025 at 12:50 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> When a user requests the deletion of a range that spans multiple stripe
> extents and btrfs_search_slot() returns us the second RAID stripe extent,
> we need to pick the previous item and truncate it, if there's still a
> range to delete left, move on to the next item.
>
> The following diagram illustrates the operation:
>
>  |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
>         |--- keep  ---|--- drop ---|
>
> While at it, comment the trivial case of a whole item delete as well.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/raid-stripe-tree.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
>
> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> index 79f8f692aaa8f6df2c9482fbd7777c2812528f65..893d963951315abfc734e1ca232b3087b7889431 100644
> --- a/fs/btrfs/raid-stripe-tree.c
> +++ b/fs/btrfs/raid-stripe-tree.c
> @@ -103,6 +103,31 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>                 found_end = found_start + key.offset;
>                 ret = 0;
>
> +               /*
> +                * The stripe extent starts before the range we want to delete,
> +                * but the range spans more than one stripe extent:
> +                *
> +                * |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
> +                *        |--- keep  ---|--- drop ---|
> +                *
> +                * This means we have to get the previous item, truncate its
> +                * length and then restart the search.
> +                */
> +               if (found_start > start) {
> +
> +                       ret = btrfs_previous_item(stripe_root, path, start,
> +                                                 BTRFS_RAID_STRIPE_KEY);
> +                       if (ret < 0)
> +                               break;
> +                       ret = 0;
> +
> +                       leaf = path->nodes[0];
> +                       slot = path->slots[0];
> +                       btrfs_item_key_to_cpu(leaf, &key, slot);
> +                       found_start = key.objectid;
> +                       found_end = found_start + key.offset;

Hum, this isn't safe, ignoring the case where btrfs_previous_item()
returns 1, meaning there's no previous item.

In that case previous_item() returns pointing to the same leaf and
slot, and then below we delete the item instead of trimming it
(increasing its range start and decreasing its length).

Thanks.

> +               }
> +
>                 if (key.type != BTRFS_RAID_STRIPE_KEY)
>                         break;
>
> @@ -156,6 +181,9 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>                         break;
>                 }
>
> +               /*
> +                * Finally we can delete the whole item, no more special cases.
> +                */
>                 ret = btrfs_del_item(trans, stripe_root, path);
>                 if (ret)
>                         break;
>
> --
> 2.43.0
>
>
Johannes Thumshirn Jan. 10, 2025, 11:33 a.m. UTC | #2
On 09.01.25 16:24, Filipe Manana wrote:
> On Tue, Jan 7, 2025 at 12:50 PM Johannes Thumshirn <jth@kernel.org> wrote:
>>
>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> When a user requests the deletion of a range that spans multiple stripe
>> extents and btrfs_search_slot() returns us the second RAID stripe extent,
>> we need to pick the previous item and truncate it, if there's still a
>> range to delete left, move on to the next item.
>>
>> The following diagram illustrates the operation:
>>
>>   |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
>>          |--- keep  ---|--- drop ---|
>>
>> While at it, comment the trivial case of a whole item delete as well.
>>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> ---
>>   fs/btrfs/raid-stripe-tree.c | 28 ++++++++++++++++++++++++++++
>>   1 file changed, 28 insertions(+)
>>
>> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
>> index 79f8f692aaa8f6df2c9482fbd7777c2812528f65..893d963951315abfc734e1ca232b3087b7889431 100644
>> --- a/fs/btrfs/raid-stripe-tree.c
>> +++ b/fs/btrfs/raid-stripe-tree.c
>> @@ -103,6 +103,31 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
>>                  found_end = found_start + key.offset;
>>                  ret = 0;
>>
>> +               /*
>> +                * The stripe extent starts before the range we want to delete,
>> +                * but the range spans more than one stripe extent:
>> +                *
>> +                * |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
>> +                *        |--- keep  ---|--- drop ---|
>> +                *
>> +                * This means we have to get the previous item, truncate its
>> +                * length and then restart the search.
>> +                */
>> +               if (found_start > start) {
>> +
>> +                       ret = btrfs_previous_item(stripe_root, path, start,
>> +                                                 BTRFS_RAID_STRIPE_KEY);
>> +                       if (ret < 0)
>> +                               break;
>> +                       ret = 0;
>> +
>> +                       leaf = path->nodes[0];
>> +                       slot = path->slots[0];
>> +                       btrfs_item_key_to_cpu(leaf, &key, slot);
>> +                       found_start = key.objectid;
>> +                       found_end = found_start + key.offset;
> 
> Hum, this isn't safe, ignoring the case where btrfs_previous_item()
> returns 1, meaning there's no previous item.
> 
> In that case previous_item() returns pointing to the same leaf and
> slot, and then below we delete the item instead of trimming it
> (increasing its range start and decreasing its length).

Good catch!

But what should we do when we end up in this situation? Doesn't that 
mean that either do_free_extent_accounting() passed in a bogus range or 
btrfs_previous_item() should've done one more call to btrfs_pref_leaf()?
Filipe Manana Jan. 10, 2025, 4:20 p.m. UTC | #3
On Fri, Jan 10, 2025 at 11:33 AM Johannes Thumshirn
<Johannes.Thumshirn@wdc.com> wrote:
>
> On 09.01.25 16:24, Filipe Manana wrote:
> > On Tue, Jan 7, 2025 at 12:50 PM Johannes Thumshirn <jth@kernel.org> wrote:
> >>
> >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >>
> >> When a user requests the deletion of a range that spans multiple stripe
> >> extents and btrfs_search_slot() returns us the second RAID stripe extent,
> >> we need to pick the previous item and truncate it, if there's still a
> >> range to delete left, move on to the next item.
> >>
> >> The following diagram illustrates the operation:
> >>
> >>   |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
> >>          |--- keep  ---|--- drop ---|
> >>
> >> While at it, comment the trivial case of a whole item delete as well.
> >>
> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >> ---
> >>   fs/btrfs/raid-stripe-tree.c | 28 ++++++++++++++++++++++++++++
> >>   1 file changed, 28 insertions(+)
> >>
> >> diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
> >> index 79f8f692aaa8f6df2c9482fbd7777c2812528f65..893d963951315abfc734e1ca232b3087b7889431 100644
> >> --- a/fs/btrfs/raid-stripe-tree.c
> >> +++ b/fs/btrfs/raid-stripe-tree.c
> >> @@ -103,6 +103,31 @@ int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
> >>                  found_end = found_start + key.offset;
> >>                  ret = 0;
> >>
> >> +               /*
> >> +                * The stripe extent starts before the range we want to delete,
> >> +                * but the range spans more than one stripe extent:
> >> +                *
> >> +                * |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
> >> +                *        |--- keep  ---|--- drop ---|
> >> +                *
> >> +                * This means we have to get the previous item, truncate its
> >> +                * length and then restart the search.
> >> +                */
> >> +               if (found_start > start) {
> >> +
> >> +                       ret = btrfs_previous_item(stripe_root, path, start,
> >> +                                                 BTRFS_RAID_STRIPE_KEY);
> >> +                       if (ret < 0)
> >> +                               break;
> >> +                       ret = 0;
> >> +
> >> +                       leaf = path->nodes[0];
> >> +                       slot = path->slots[0];
> >> +                       btrfs_item_key_to_cpu(leaf, &key, slot);
> >> +                       found_start = key.objectid;
> >> +                       found_end = found_start + key.offset;
> >
> > Hum, this isn't safe, ignoring the case where btrfs_previous_item()
> > returns 1, meaning there's no previous item.
> >
> > In that case previous_item() returns pointing to the same leaf and
> > slot, and then below we delete the item instead of trimming it
> > (increasing its range start and decreasing its length).
>
> Good catch!
>
> But what should we do when we end up in this situation? Doesn't that
> mean that either do_free_extent_accounting() passed in a bogus range or
> btrfs_previous_item() should've done one more call to btrfs_pref_leaf()?

When it returns 1 it means there's no previous leaf - the item we
processed was the first in the tree, so any future calls to
btrfs_prev_leaf() will keep returning 1 (unless some other task
inserts smaller keys).
The right thing is probably to log an error telling the target range,
dump the leaf, abort the transaction and return an error.



>
diff mbox series

Patch

diff --git a/fs/btrfs/raid-stripe-tree.c b/fs/btrfs/raid-stripe-tree.c
index 79f8f692aaa8f6df2c9482fbd7777c2812528f65..893d963951315abfc734e1ca232b3087b7889431 100644
--- a/fs/btrfs/raid-stripe-tree.c
+++ b/fs/btrfs/raid-stripe-tree.c
@@ -103,6 +103,31 @@  int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 		found_end = found_start + key.offset;
 		ret = 0;
 
+		/*
+		 * The stripe extent starts before the range we want to delete,
+		 * but the range spans more than one stripe extent:
+		 *
+		 * |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
+		 *        |--- keep  ---|--- drop ---|
+		 *
+		 * This means we have to get the previous item, truncate its
+		 * length and then restart the search.
+		 */
+		if (found_start > start) {
+
+			ret = btrfs_previous_item(stripe_root, path, start,
+						  BTRFS_RAID_STRIPE_KEY);
+			if (ret < 0)
+				break;
+			ret = 0;
+
+			leaf = path->nodes[0];
+			slot = path->slots[0];
+			btrfs_item_key_to_cpu(leaf, &key, slot);
+			found_start = key.objectid;
+			found_end = found_start + key.offset;
+		}
+
 		if (key.type != BTRFS_RAID_STRIPE_KEY)
 			break;
 
@@ -156,6 +181,9 @@  int btrfs_delete_raid_extent(struct btrfs_trans_handle *trans, u64 start, u64 le
 			break;
 		}
 
+		/*
+		 * Finally we can delete the whole item, no more special cases.
+		 */
 		ret = btrfs_del_item(trans, stripe_root, path);
 		if (ret)
 			break;