diff mbox series

[v3] btrfs: trim: fix underflow in trim length to prevent access beyond device boundary

Message ID 20200731094815.104794-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [v3] btrfs: trim: fix underflow in trim length to prevent access beyond device boundary | expand

Commit Message

Qu Wenruo July 31, 2020, 9:48 a.m. UTC
[BUG]
The following script can lead to tons of beyond device boundary access:

  mkfs.btrfs -f $dev -b 10G
  mount $dev $mnt
  trimfs $mnt
  btrfs filesystem resize 1:-1G $mnt
  trimfs $mnt

[CAUSE]
Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
find_first_clear_extent_bit"), we try to avoid trimming ranges that's
already trimmed.

So we check device->alloc_state by finding the first range which doesn't
have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.

But if we shrunk the device, that bits are not cleared, thus we could
easily got a range starts beyond the shrunk device size.

This results the returned @start and @end are all beyond device size,
then we call "end = min(end, device->total_bytes -1);" making @end
smaller than device size.

Then finally we goes "len = end - start + 1", totally underflow the
result, and lead to the beyond-device-boundary access.

[FIX]
This patch will fix the problem in two ways:
- Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
  This is the root fix

- Add extra safe net when trimming free device extents
  We check and warn if the returned range is already beyond current
  device.

Link: https://github.com/kdave/btrfs-progs/issues/282
Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
Changelog:
v2:
- Add proper fixes tag
- Add extra warning for beyond device end case
- Add graceful exit for already trimmed case
v3:
- Don't return EUCLEAN for beyond boundary access
- Rephrase the warning message for beyond boundary access
---
 fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++
 fs/btrfs/volumes.c     | 12 ++++++++++++
 2 files changed, 33 insertions(+)

Comments

Filipe Manana July 31, 2020, 10:05 a.m. UTC | #1
On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> The following script can lead to tons of beyond device boundary access:
>
>   mkfs.btrfs -f $dev -b 10G
>   mount $dev $mnt
>   trimfs $mnt
>   btrfs filesystem resize 1:-1G $mnt
>   trimfs $mnt
>
> [CAUSE]
> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
> find_first_clear_extent_bit"), we try to avoid trimming ranges that's
> already trimmed.
>
> So we check device->alloc_state by finding the first range which doesn't
> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
>
> But if we shrunk the device, that bits are not cleared, thus we could
> easily got a range starts beyond the shrunk device size.
>
> This results the returned @start and @end are all beyond device size,
> then we call "end = min(end, device->total_bytes -1);" making @end
> smaller than device size.
>
> Then finally we goes "len = end - start + 1", totally underflow the
> result, and lead to the beyond-device-boundary access.
>
> [FIX]
> This patch will fix the problem in two ways:
> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
>   This is the root fix
>
> - Add extra safe net when trimming free device extents
>   We check and warn if the returned range is already beyond current
>   device.
>
> Link: https://github.com/kdave/btrfs-progs/issues/282
> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> ---
> Changelog:
> v2:
> - Add proper fixes tag
> - Add extra warning for beyond device end case
> - Add graceful exit for already trimmed case
> v3:
> - Don't return EUCLEAN for beyond boundary access
> - Rephrase the warning message for beyond boundary access
> ---
>  fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++
>  fs/btrfs/volumes.c     | 12 ++++++++++++
>  2 files changed, 33 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index fa7d83051587..7c5e0961c93b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -33,6 +33,7 @@
>  #include "delalloc-space.h"
>  #include "block-group.h"
>  #include "discard.h"
> +#include "rcu-string.h"
>
>  #undef SCRAMBLE_DELAYED_REFS
>
> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
>                                             &start, &end,
>                                             CHUNK_TRIMMED | CHUNK_ALLOCATED);
>
> +               /* CHUNK_* bits not cleared properly */
> +               if (start > device->total_bytes) {
> +                       WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> +                       btrfs_warn_in_rcu(fs_info,
> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
> +                                         start, end - start + 1,
> +                                         rcu_str_deref(device->name),
> +                                         device->total_bytes);
> +                       mutex_unlock(&fs_info->chunk_mutex);
> +                       ret = 0;
> +                       break;
> +               }
> +
> +               /* The remaining part has already been trimmed */
> +               if (start == device->total_bytes) {
> +                       mutex_unlock(&fs_info->chunk_mutex);
> +                       ret = 0;
> +                       break;
> +               }

Sorry I missed this earlier, but why is this a special case? Couldn't
this be merged into the previous check?
Why is an offset matching the ending of the device not considered unexpected?

I also don't understand the comment, what is the remaining part?

Thanks.

> +
>                 /* Ensure we skip the reserved area in the first 1M */
>                 start = max_t(u64, start, SZ_1M);
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d7670e2a9f39..4e51ef68ea72 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>         }
>
>         mutex_lock(&fs_info->chunk_mutex);
> +       /*
> +        * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the
> +        * current device boundary.
> +        * This shouldn't fail, as alloc_state should only utilize those two
> +        * bits, thus we shouldn't alloc new memory for clearing the status.
> +        *
> +        * So here we just do an ASSERT() to catch future behavior change.
> +        */
> +       ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
> +                               CHUNK_TRIMMED | CHUNK_ALLOCATED);
> +       ASSERT(!ret);
> +
>         btrfs_device_set_disk_total_bytes(device, new_size);
>         if (list_empty(&device->post_commit_list))
>                 list_add_tail(&device->post_commit_list,
> --
> 2.28.0
>
Qu Wenruo July 31, 2020, 10:20 a.m. UTC | #2
On 2020/7/31 下午6:05, Filipe Manana wrote:
> On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> The following script can lead to tons of beyond device boundary access:
>>
>>   mkfs.btrfs -f $dev -b 10G
>>   mount $dev $mnt
>>   trimfs $mnt
>>   btrfs filesystem resize 1:-1G $mnt
>>   trimfs $mnt
>>
>> [CAUSE]
>> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
>> find_first_clear_extent_bit"), we try to avoid trimming ranges that's
>> already trimmed.
>>
>> So we check device->alloc_state by finding the first range which doesn't
>> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
>>
>> But if we shrunk the device, that bits are not cleared, thus we could
>> easily got a range starts beyond the shrunk device size.
>>
>> This results the returned @start and @end are all beyond device size,
>> then we call "end = min(end, device->total_bytes -1);" making @end
>> smaller than device size.
>>
>> Then finally we goes "len = end - start + 1", totally underflow the
>> result, and lead to the beyond-device-boundary access.
>>
>> [FIX]
>> This patch will fix the problem in two ways:
>> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
>>   This is the root fix
>>
>> - Add extra safe net when trimming free device extents
>>   We check and warn if the returned range is already beyond current
>>   device.
>>
>> Link: https://github.com/kdave/btrfs-progs/issues/282
>> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Add proper fixes tag
>> - Add extra warning for beyond device end case
>> - Add graceful exit for already trimmed case
>> v3:
>> - Don't return EUCLEAN for beyond boundary access
>> - Rephrase the warning message for beyond boundary access
>> ---
>>  fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++
>>  fs/btrfs/volumes.c     | 12 ++++++++++++
>>  2 files changed, 33 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index fa7d83051587..7c5e0961c93b 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -33,6 +33,7 @@
>>  #include "delalloc-space.h"
>>  #include "block-group.h"
>>  #include "discard.h"
>> +#include "rcu-string.h"
>>
>>  #undef SCRAMBLE_DELAYED_REFS
>>
>> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
>>                                             &start, &end,
>>                                             CHUNK_TRIMMED | CHUNK_ALLOCATED);
>>
>> +               /* CHUNK_* bits not cleared properly */
>> +               if (start > device->total_bytes) {
>> +                       WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
>> +                       btrfs_warn_in_rcu(fs_info,
>> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
>> +                                         start, end - start + 1,
>> +                                         rcu_str_deref(device->name),
>> +                                         device->total_bytes);
>> +                       mutex_unlock(&fs_info->chunk_mutex);
>> +                       ret = 0;
>> +                       break;
>> +               }
>> +
>> +               /* The remaining part has already been trimmed */
>> +               if (start == device->total_bytes) {
>> +                       mutex_unlock(&fs_info->chunk_mutex);
>> +                       ret = 0;
>> +                       break;
>> +               }
> 
> Sorry I missed this earlier, but why is this a special case? Couldn't
> this be merged into the previous check?
> Why is an offset matching the ending of the device not considered unexpected?

For such example:
		0		1g		2g
device 1:	|///////////////|               |
|//| = Allocated space
|  | = Free space.

After one fstrim, [1G, 2G) get trimmed.
So in the alloc_state we have
		0		1G		2G
device 1:	|  		|***************|
|**| = CHUNK_TRIMMED bits set

Here we just focus on the unallocated space, ignoring the block group parts.

Then we run fstrim again.
We call find_first_clear_extent_bit(start == 1G), then we got the result
start == 2G, end = U64_MAX.

In that case, we got start == device->total_bytes, and it's completely
valid.

> 
> I also don't understand the comment, what is the remaining part?

The remaining means the unallocated space from the @start of
find_first_clear_extent_bit().

Any better suggestion?

Thanks,
Qu

> 
> Thanks.
> 
>> +
>>                 /* Ensure we skip the reserved area in the first 1M */
>>                 start = max_t(u64, start, SZ_1M);
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index d7670e2a9f39..4e51ef68ea72 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>>         }
>>
>>         mutex_lock(&fs_info->chunk_mutex);
>> +       /*
>> +        * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the
>> +        * current device boundary.
>> +        * This shouldn't fail, as alloc_state should only utilize those two
>> +        * bits, thus we shouldn't alloc new memory for clearing the status.
>> +        *
>> +        * So here we just do an ASSERT() to catch future behavior change.
>> +        */
>> +       ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
>> +                               CHUNK_TRIMMED | CHUNK_ALLOCATED);
>> +       ASSERT(!ret);
>> +
>>         btrfs_device_set_disk_total_bytes(device, new_size);
>>         if (list_empty(&device->post_commit_list))
>>                 list_add_tail(&device->post_commit_list,
>> --
>> 2.28.0
>>
> 
>
Qu Wenruo July 31, 2020, 10:38 a.m. UTC | #3
On 2020/7/31 下午6:20, Qu Wenruo wrote:
> 
> 
> On 2020/7/31 下午6:05, Filipe Manana wrote:
>> On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote:
>>>
>>> [BUG]
>>> The following script can lead to tons of beyond device boundary access:
>>>
>>>   mkfs.btrfs -f $dev -b 10G
>>>   mount $dev $mnt
>>>   trimfs $mnt
>>>   btrfs filesystem resize 1:-1G $mnt
>>>   trimfs $mnt
>>>
>>> [CAUSE]
>>> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
>>> find_first_clear_extent_bit"), we try to avoid trimming ranges that's
>>> already trimmed.
>>>
>>> So we check device->alloc_state by finding the first range which doesn't
>>> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
>>>
>>> But if we shrunk the device, that bits are not cleared, thus we could
>>> easily got a range starts beyond the shrunk device size.
>>>
>>> This results the returned @start and @end are all beyond device size,
>>> then we call "end = min(end, device->total_bytes -1);" making @end
>>> smaller than device size.
>>>
>>> Then finally we goes "len = end - start + 1", totally underflow the
>>> result, and lead to the beyond-device-boundary access.
>>>
>>> [FIX]
>>> This patch will fix the problem in two ways:
>>> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
>>>   This is the root fix
>>>
>>> - Add extra safe net when trimming free device extents
>>>   We check and warn if the returned range is already beyond current
>>>   device.
>>>
>>> Link: https://github.com/kdave/btrfs-progs/issues/282
>>> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>> Changelog:
>>> v2:
>>> - Add proper fixes tag
>>> - Add extra warning for beyond device end case
>>> - Add graceful exit for already trimmed case
>>> v3:
>>> - Don't return EUCLEAN for beyond boundary access
>>> - Rephrase the warning message for beyond boundary access
>>> ---
>>>  fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++
>>>  fs/btrfs/volumes.c     | 12 ++++++++++++
>>>  2 files changed, 33 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index fa7d83051587..7c5e0961c93b 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -33,6 +33,7 @@
>>>  #include "delalloc-space.h"
>>>  #include "block-group.h"
>>>  #include "discard.h"
>>> +#include "rcu-string.h"
>>>
>>>  #undef SCRAMBLE_DELAYED_REFS
>>>
>>> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
>>>                                             &start, &end,
>>>                                             CHUNK_TRIMMED | CHUNK_ALLOCATED);
>>>
>>> +               /* CHUNK_* bits not cleared properly */
>>> +               if (start > device->total_bytes) {
>>> +                       WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
>>> +                       btrfs_warn_in_rcu(fs_info,
>>> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
>>> +                                         start, end - start + 1,
>>> +                                         rcu_str_deref(device->name),
>>> +                                         device->total_bytes);
>>> +                       mutex_unlock(&fs_info->chunk_mutex);
>>> +                       ret = 0;
>>> +                       break;
>>> +               }
>>> +
>>> +               /* The remaining part has already been trimmed */
>>> +               if (start == device->total_bytes) {
>>> +                       mutex_unlock(&fs_info->chunk_mutex);
>>> +                       ret = 0;
>>> +                       break;
>>> +               }
>>
>> Sorry I missed this earlier, but why is this a special case? Couldn't
>> this be merged into the previous check?
>> Why is an offset matching the ending of the device not considered unexpected?
> 
> For such example:
> 		0		1g		2g
> device 1:	|///////////////|               |
> |//| = Allocated space
> |  | = Free space.
> 
> After one fstrim, [1G, 2G) get trimmed.
> So in the alloc_state we have
> 		0		1G		2G
> device 1:	|  		|***************|
> |**| = CHUNK_TRIMMED bits set
> 
> Here we just focus on the unallocated space, ignoring the block group parts.
> 
> Then we run fstrim again.
> We call find_first_clear_extent_bit(start == 1G), then we got the result
> start == 2G, end = U64_MAX.
> 
> In that case, we got start == device->total_bytes, and it's completely
> valid.

Sorry, although this is correct, it's duplicated with the later checks:

                end = min(end, device->total_bytes - 1);

                len = end - start + 1;

                /* We didn't find any extents */
                if (!len) {
                        mutex_unlock(&fs_info->chunk_mutex);
                        ret = 0;
                        break;
                }

If we got a returned start == device->total_bytes, then here we would
hit len == 0, and exit.

So my (start == device->total_bytes) is duplicated.

I guess the existing one is easier to understand, thus my extra check
should be removed.

Thanks,
Qu

> 
>>
>> I also don't understand the comment, what is the remaining part?
> 
> The remaining means the unallocated space from the @start of
> find_first_clear_extent_bit().
> 
> Any better suggestion?
> 
> Thanks,
> Qu
> 
>>
>> Thanks.
>>
>>> +
>>>                 /* Ensure we skip the reserved area in the first 1M */
>>>                 start = max_t(u64, start, SZ_1M);
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index d7670e2a9f39..4e51ef68ea72 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>>>         }
>>>
>>>         mutex_lock(&fs_info->chunk_mutex);
>>> +       /*
>>> +        * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the
>>> +        * current device boundary.
>>> +        * This shouldn't fail, as alloc_state should only utilize those two
>>> +        * bits, thus we shouldn't alloc new memory for clearing the status.
>>> +        *
>>> +        * So here we just do an ASSERT() to catch future behavior change.
>>> +        */
>>> +       ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
>>> +                               CHUNK_TRIMMED | CHUNK_ALLOCATED);
>>> +       ASSERT(!ret);
>>> +
>>>         btrfs_device_set_disk_total_bytes(device, new_size);
>>>         if (list_empty(&device->post_commit_list))
>>>                 list_add_tail(&device->post_commit_list,
>>> --
>>> 2.28.0
>>>
>>
>>
>
Filipe Manana July 31, 2020, 10:40 a.m. UTC | #4
On Fri, Jul 31, 2020 at 11:21 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2020/7/31 下午6:05, Filipe Manana wrote:
> > On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG]
> >> The following script can lead to tons of beyond device boundary access:
> >>
> >>   mkfs.btrfs -f $dev -b 10G
> >>   mount $dev $mnt
> >>   trimfs $mnt
> >>   btrfs filesystem resize 1:-1G $mnt
> >>   trimfs $mnt
> >>
> >> [CAUSE]
> >> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
> >> find_first_clear_extent_bit"), we try to avoid trimming ranges that's
> >> already trimmed.
> >>
> >> So we check device->alloc_state by finding the first range which doesn't
> >> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
> >>
> >> But if we shrunk the device, that bits are not cleared, thus we could
> >> easily got a range starts beyond the shrunk device size.
> >>
> >> This results the returned @start and @end are all beyond device size,
> >> then we call "end = min(end, device->total_bytes -1);" making @end
> >> smaller than device size.
> >>
> >> Then finally we goes "len = end - start + 1", totally underflow the
> >> result, and lead to the beyond-device-boundary access.
> >>
> >> [FIX]
> >> This patch will fix the problem in two ways:
> >> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
> >>   This is the root fix
> >>
> >> - Add extra safe net when trimming free device extents
> >>   We check and warn if the returned range is already beyond current
> >>   device.
> >>
> >> Link: https://github.com/kdave/btrfs-progs/issues/282
> >> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >> Changelog:
> >> v2:
> >> - Add proper fixes tag
> >> - Add extra warning for beyond device end case
> >> - Add graceful exit for already trimmed case
> >> v3:
> >> - Don't return EUCLEAN for beyond boundary access
> >> - Rephrase the warning message for beyond boundary access
> >> ---
> >>  fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++
> >>  fs/btrfs/volumes.c     | 12 ++++++++++++
> >>  2 files changed, 33 insertions(+)
> >>
> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >> index fa7d83051587..7c5e0961c93b 100644
> >> --- a/fs/btrfs/extent-tree.c
> >> +++ b/fs/btrfs/extent-tree.c
> >> @@ -33,6 +33,7 @@
> >>  #include "delalloc-space.h"
> >>  #include "block-group.h"
> >>  #include "discard.h"
> >> +#include "rcu-string.h"
> >>
> >>  #undef SCRAMBLE_DELAYED_REFS
> >>
> >> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
> >>                                             &start, &end,
> >>                                             CHUNK_TRIMMED | CHUNK_ALLOCATED);
> >>
> >> +               /* CHUNK_* bits not cleared properly */
> >> +               if (start > device->total_bytes) {
> >> +                       WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> >> +                       btrfs_warn_in_rcu(fs_info,
> >> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
> >> +                                         start, end - start + 1,
> >> +                                         rcu_str_deref(device->name),
> >> +                                         device->total_bytes);
> >> +                       mutex_unlock(&fs_info->chunk_mutex);
> >> +                       ret = 0;
> >> +                       break;
> >> +               }
> >> +
> >> +               /* The remaining part has already been trimmed */
> >> +               if (start == device->total_bytes) {
> >> +                       mutex_unlock(&fs_info->chunk_mutex);
> >> +                       ret = 0;
> >> +                       break;
> >> +               }
> >
> > Sorry I missed this earlier, but why is this a special case? Couldn't
> > this be merged into the previous check?
> > Why is an offset matching the ending of the device not considered unexpected?
>
> For such example:
>                 0               1g              2g
> device 1:       |///////////////|               |
> |//| = Allocated space
> |  | = Free space.
>
> After one fstrim, [1G, 2G) get trimmed.
> So in the alloc_state we have
>                 0               1G              2G
> device 1:       |               |***************|
> |**| = CHUNK_TRIMMED bits set
>
> Here we just focus on the unallocated space, ignoring the block group parts.
>
> Then we run fstrim again.
> We call find_first_clear_extent_bit(start == 1G), then we got the result
> start == 2G, end = U64_MAX.
>
> In that case, we got start == device->total_bytes, and it's completely
> valid.

Ok. But this can happen without shrinking the device before, and it
seems we already handle it, or is the existing handling buggy?
If it is, it should be replaced or updated.

Thanks.

>
> >
> > I also don't understand the comment, what is the remaining part?
>
> The remaining means the unallocated space from the @start of
> find_first_clear_extent_bit().
>
> Any better suggestion?
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >> +
> >>                 /* Ensure we skip the reserved area in the first 1M */
> >>                 start = max_t(u64, start, SZ_1M);
> >>
> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> index d7670e2a9f39..4e51ef68ea72 100644
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
> >>         }
> >>
> >>         mutex_lock(&fs_info->chunk_mutex);
> >> +       /*
> >> +        * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the
> >> +        * current device boundary.
> >> +        * This shouldn't fail, as alloc_state should only utilize those two
> >> +        * bits, thus we shouldn't alloc new memory for clearing the status.
> >> +        *
> >> +        * So here we just do an ASSERT() to catch future behavior change.
> >> +        */
> >> +       ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
> >> +                               CHUNK_TRIMMED | CHUNK_ALLOCATED);
> >> +       ASSERT(!ret);
> >> +
> >>         btrfs_device_set_disk_total_bytes(device, new_size);
> >>         if (list_empty(&device->post_commit_list))
> >>                 list_add_tail(&device->post_commit_list,
> >> --
> >> 2.28.0
> >>
> >
> >
>
Filipe Manana July 31, 2020, 10:42 a.m. UTC | #5
On Fri, Jul 31, 2020 at 11:38 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2020/7/31 下午6:20, Qu Wenruo wrote:
> >
> >
> > On 2020/7/31 下午6:05, Filipe Manana wrote:
> >> On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote:
> >>>
> >>> [BUG]
> >>> The following script can lead to tons of beyond device boundary access:
> >>>
> >>>   mkfs.btrfs -f $dev -b 10G
> >>>   mount $dev $mnt
> >>>   trimfs $mnt
> >>>   btrfs filesystem resize 1:-1G $mnt
> >>>   trimfs $mnt
> >>>
> >>> [CAUSE]
> >>> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to
> >>> find_first_clear_extent_bit"), we try to avoid trimming ranges that's
> >>> already trimmed.
> >>>
> >>> So we check device->alloc_state by finding the first range which doesn't
> >>> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
> >>>
> >>> But if we shrunk the device, that bits are not cleared, thus we could
> >>> easily got a range starts beyond the shrunk device size.
> >>>
> >>> This results the returned @start and @end are all beyond device size,
> >>> then we call "end = min(end, device->total_bytes -1);" making @end
> >>> smaller than device size.
> >>>
> >>> Then finally we goes "len = end - start + 1", totally underflow the
> >>> result, and lead to the beyond-device-boundary access.
> >>>
> >>> [FIX]
> >>> This patch will fix the problem in two ways:
> >>> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
> >>>   This is the root fix
> >>>
> >>> - Add extra safe net when trimming free device extents
> >>>   We check and warn if the returned range is already beyond current
> >>>   device.
> >>>
> >>> Link: https://github.com/kdave/btrfs-progs/issues/282
> >>> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
> >>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >>> ---
> >>> Changelog:
> >>> v2:
> >>> - Add proper fixes tag
> >>> - Add extra warning for beyond device end case
> >>> - Add graceful exit for already trimmed case
> >>> v3:
> >>> - Don't return EUCLEAN for beyond boundary access
> >>> - Rephrase the warning message for beyond boundary access
> >>> ---
> >>>  fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++
> >>>  fs/btrfs/volumes.c     | 12 ++++++++++++
> >>>  2 files changed, 33 insertions(+)
> >>>
> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>> index fa7d83051587..7c5e0961c93b 100644
> >>> --- a/fs/btrfs/extent-tree.c
> >>> +++ b/fs/btrfs/extent-tree.c
> >>> @@ -33,6 +33,7 @@
> >>>  #include "delalloc-space.h"
> >>>  #include "block-group.h"
> >>>  #include "discard.h"
> >>> +#include "rcu-string.h"
> >>>
> >>>  #undef SCRAMBLE_DELAYED_REFS
> >>>
> >>> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
> >>>                                             &start, &end,
> >>>                                             CHUNK_TRIMMED | CHUNK_ALLOCATED);
> >>>
> >>> +               /* CHUNK_* bits not cleared properly */
> >>> +               if (start > device->total_bytes) {
> >>> +                       WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> >>> +                       btrfs_warn_in_rcu(fs_info,
> >>> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
> >>> +                                         start, end - start + 1,
> >>> +                                         rcu_str_deref(device->name),
> >>> +                                         device->total_bytes);
> >>> +                       mutex_unlock(&fs_info->chunk_mutex);
> >>> +                       ret = 0;
> >>> +                       break;
> >>> +               }
> >>> +
> >>> +               /* The remaining part has already been trimmed */
> >>> +               if (start == device->total_bytes) {
> >>> +                       mutex_unlock(&fs_info->chunk_mutex);
> >>> +                       ret = 0;
> >>> +                       break;
> >>> +               }
> >>
> >> Sorry I missed this earlier, but why is this a special case? Couldn't
> >> this be merged into the previous check?
> >> Why is an offset matching the ending of the device not considered unexpected?
> >
> > For such example:
> >               0               1g              2g
> > device 1:     |///////////////|               |
> > |//| = Allocated space
> > |  | = Free space.
> >
> > After one fstrim, [1G, 2G) get trimmed.
> > So in the alloc_state we have
> >               0               1G              2G
> > device 1:     |               |***************|
> > |**| = CHUNK_TRIMMED bits set
> >
> > Here we just focus on the unallocated space, ignoring the block group parts.
> >
> > Then we run fstrim again.
> > We call find_first_clear_extent_bit(start == 1G), then we got the result
> > start == 2G, end = U64_MAX.
> >
> > In that case, we got start == device->total_bytes, and it's completely
> > valid.
>
> Sorry, although this is correct, it's duplicated with the later checks:
>
>                 end = min(end, device->total_bytes - 1);
>
>                 len = end - start + 1;
>
>                 /* We didn't find any extents */
>                 if (!len) {
>                         mutex_unlock(&fs_info->chunk_mutex);
>                         ret = 0;
>                         break;
>                 }
>
> If we got a returned start == device->total_bytes, then here we would
> hit len == 0, and exit.
>
> So my (start == device->total_bytes) is duplicated.
>
> I guess the existing one is easier to understand, thus my extra check
> should be removed.

Hit reply too soon before seeing this reply. Yes, it seems correct to
me as well.

Thanks.

>
> Thanks,
> Qu
>
> >
> >>
> >> I also don't understand the comment, what is the remaining part?
> >
> > The remaining means the unallocated space from the @start of
> > find_first_clear_extent_bit().
> >
> > Any better suggestion?
> >
> > Thanks,
> > Qu
> >
> >>
> >> Thanks.
> >>
> >>> +
> >>>                 /* Ensure we skip the reserved area in the first 1M */
> >>>                 start = max_t(u64, start, SZ_1M);
> >>>
> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >>> index d7670e2a9f39..4e51ef68ea72 100644
> >>> --- a/fs/btrfs/volumes.c
> >>> +++ b/fs/btrfs/volumes.c
> >>> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
> >>>         }
> >>>
> >>>         mutex_lock(&fs_info->chunk_mutex);
> >>> +       /*
> >>> +        * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the
> >>> +        * current device boundary.
> >>> +        * This shouldn't fail, as alloc_state should only utilize those two
> >>> +        * bits, thus we shouldn't alloc new memory for clearing the status.
> >>> +        *
> >>> +        * So here we just do an ASSERT() to catch future behavior change.
> >>> +        */
> >>> +       ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
> >>> +                               CHUNK_TRIMMED | CHUNK_ALLOCATED);
> >>> +       ASSERT(!ret);
> >>> +
> >>>         btrfs_device_set_disk_total_bytes(device, new_size);
> >>>         if (list_empty(&device->post_commit_list))
> >>>                 list_add_tail(&device->post_commit_list,
> >>> --
> >>> 2.28.0
> >>>
> >>
> >>
> >
>
kernel test robot July 31, 2020, 8:52 p.m. UTC | #6
Hi Qu,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on v5.8-rc7 next-20200731]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-trim-fix-underflow-in-trim-length-to-prevent-access-beyond-device-boundary/20200731-174928
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-s022-20200731 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-14) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-115-g5fc204f2-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> fs/btrfs/extent-tree.c:5676:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> fs/btrfs/extent-tree.c:5676:25: sparse:    struct rcu_string [noderef] __rcu *
>> fs/btrfs/extent-tree.c:5676:25: sparse:    struct rcu_string *
   fs/btrfs/extent-tree.c:1769:9: sparse: sparse: context imbalance in 'run_and_cleanup_extent_op' - unexpected unlock
   fs/btrfs/extent-tree.c:1842:28: sparse: sparse: context imbalance in 'cleanup_ref_head' - unexpected unlock
   fs/btrfs/extent-tree.c:1918:36: sparse: sparse: context imbalance in 'btrfs_run_delayed_refs_for_head' - unexpected unlock
   fs/btrfs/extent-tree.c:1983:21: sparse: sparse: context imbalance in '__btrfs_run_delayed_refs' - wrong count at exit

vim +5676 fs/btrfs/extent-tree.c

  5619	
  5620	/*
  5621	 * It used to be that old block groups would be left around forever.
  5622	 * Iterating over them would be enough to trim unused space.  Since we
  5623	 * now automatically remove them, we also need to iterate over unallocated
  5624	 * space.
  5625	 *
  5626	 * We don't want a transaction for this since the discard may take a
  5627	 * substantial amount of time.  We don't require that a transaction be
  5628	 * running, but we do need to take a running transaction into account
  5629	 * to ensure that we're not discarding chunks that were released or
  5630	 * allocated in the current transaction.
  5631	 *
  5632	 * Holding the chunks lock will prevent other threads from allocating
  5633	 * or releasing chunks, but it won't prevent a running transaction
  5634	 * from committing and releasing the memory that the pending chunks
  5635	 * list head uses.  For that, we need to take a reference to the
  5636	 * transaction and hold the commit root sem.  We only need to hold
  5637	 * it while performing the free space search since we have already
  5638	 * held back allocations.
  5639	 */
  5640	static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
  5641	{
  5642		u64 start = SZ_1M, len = 0, end = 0;
  5643		int ret;
  5644	
  5645		*trimmed = 0;
  5646	
  5647		/* Discard not supported = nothing to do. */
  5648		if (!blk_queue_discard(bdev_get_queue(device->bdev)))
  5649			return 0;
  5650	
  5651		/* Not writable = nothing to do. */
  5652		if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
  5653			return 0;
  5654	
  5655		/* No free space = nothing to do. */
  5656		if (device->total_bytes <= device->bytes_used)
  5657			return 0;
  5658	
  5659		ret = 0;
  5660	
  5661		while (1) {
  5662			struct btrfs_fs_info *fs_info = device->fs_info;
  5663			u64 bytes;
  5664	
  5665			ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
  5666			if (ret)
  5667				break;
  5668	
  5669			find_first_clear_extent_bit(&device->alloc_state, start,
  5670						    &start, &end,
  5671						    CHUNK_TRIMMED | CHUNK_ALLOCATED);
  5672	
  5673			/* CHUNK_* bits not cleared properly */
  5674			if (start > device->total_bytes) {
  5675				WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
> 5676				btrfs_warn_in_rcu(fs_info,
  5677	"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
  5678						  start, end - start + 1,
  5679						  rcu_str_deref(device->name),
  5680						  device->total_bytes);
  5681				mutex_unlock(&fs_info->chunk_mutex);
  5682				ret = 0;
  5683				break;
  5684			}
  5685	
  5686			/* The remaining part has already been trimmed */
  5687			if (start == device->total_bytes) {
  5688				mutex_unlock(&fs_info->chunk_mutex);
  5689				ret = 0;
  5690				break;
  5691			}
  5692	
  5693			/* Ensure we skip the reserved area in the first 1M */
  5694			start = max_t(u64, start, SZ_1M);
  5695	
  5696			/*
  5697			 * If find_first_clear_extent_bit find a range that spans the
  5698			 * end of the device it will set end to -1, in this case it's up
  5699			 * to the caller to trim the value to the size of the device.
  5700			 */
  5701			end = min(end, device->total_bytes - 1);
  5702	
  5703			len = end - start + 1;
  5704	
  5705			/* We didn't find any extents */
  5706			if (!len) {
  5707				mutex_unlock(&fs_info->chunk_mutex);
  5708				ret = 0;
  5709				break;
  5710			}
  5711	
  5712			ret = btrfs_issue_discard(device->bdev, start, len,
  5713						  &bytes);
  5714			if (!ret)
  5715				set_extent_bits(&device->alloc_state, start,
  5716						start + bytes - 1,
  5717						CHUNK_TRIMMED);
  5718			mutex_unlock(&fs_info->chunk_mutex);
  5719	
  5720			if (ret)
  5721				break;
  5722	
  5723			start += len;
  5724			*trimmed += bytes;
  5725	
  5726			if (fatal_signal_pending(current)) {
  5727				ret = -ERESTARTSYS;
  5728				break;
  5729			}
  5730	
  5731			cond_resched();
  5732		}
  5733	
  5734		return ret;
  5735	}
  5736	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff mbox series

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fa7d83051587..7c5e0961c93b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -33,6 +33,7 @@ 
 #include "delalloc-space.h"
 #include "block-group.h"
 #include "discard.h"
+#include "rcu-string.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -5669,6 +5670,26 @@  static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed)
 					    &start, &end,
 					    CHUNK_TRIMMED | CHUNK_ALLOCATED);
 
+		/* CHUNK_* bits not cleared properly */
+		if (start > device->total_bytes) {
+			WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+			btrfs_warn_in_rcu(fs_info,
+"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu",
+					  start, end - start + 1,
+					  rcu_str_deref(device->name),
+					  device->total_bytes);
+			mutex_unlock(&fs_info->chunk_mutex);
+			ret = 0;
+			break;
+		}
+
+		/* The remaining part has already been trimmed */
+		if (start == device->total_bytes) {
+			mutex_unlock(&fs_info->chunk_mutex);
+			ret = 0;
+			break;
+		}
+
 		/* Ensure we skip the reserved area in the first 1M */
 		start = max_t(u64, start, SZ_1M);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d7670e2a9f39..4e51ef68ea72 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4720,6 +4720,18 @@  int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	}
 
 	mutex_lock(&fs_info->chunk_mutex);
+	/*
+	 * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the
+	 * current device boundary.
+	 * This shouldn't fail, as alloc_state should only utilize those two
+	 * bits, thus we shouldn't alloc new memory for clearing the status.
+	 *
+	 * So here we just do an ASSERT() to catch future behavior change.
+	 */
+	ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1,
+				CHUNK_TRIMMED | CHUNK_ALLOCATED);
+	ASSERT(!ret);
+
 	btrfs_device_set_disk_total_bytes(device, new_size);
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,