Message ID | 20200731094815.104794-1-wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] btrfs: trim: fix underflow in trim length to prevent access beyond device boundary | expand |
On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote: > > [BUG] > The following script can lead to tons of beyond device boundary access: > > mkfs.btrfs -f $dev -b 10G > mount $dev $mnt > trimfs $mnt > btrfs filesystem resize 1:-1G $mnt > trimfs $mnt > > [CAUSE] > Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to > find_first_clear_extent_bit"), we try to avoid trimming ranges that's > already trimmed. > > So we check device->alloc_state by finding the first range which doesn't > have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. > > But if we shrunk the device, that bits are not cleared, thus we could > easily got a range starts beyond the shrunk device size. > > This results the returned @start and @end are all beyond device size, > then we call "end = min(end, device->total_bytes -1);" making @end > smaller than device size. > > Then finally we goes "len = end - start + 1", totally underflow the > result, and lead to the beyond-device-boundary access. > > [FIX] > This patch will fix the problem in two ways: > - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device > This is the root fix > > - Add extra safe net when trimming free device extents > We check and warn if the returned range is already beyond current > device. > > Link: https://github.com/kdave/btrfs-progs/issues/282 > Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") > Signed-off-by: Qu Wenruo <wqu@suse.com> > Reviewed-by: Filipe Manana <fdmanana@suse.com> > --- > Changelog: > v2: > - Add proper fixes tag > - Add extra warning for beyond device end case > - Add graceful exit for already trimmed case > v3: > - Don't return EUCLEAN for beyond boundary access > - Rephrase the warning message for beyond boundary access > --- > fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++ > fs/btrfs/volumes.c | 12 ++++++++++++ > 2 files changed, 33 insertions(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index fa7d83051587..7c5e0961c93b 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -33,6 +33,7 @@ > #include "delalloc-space.h" > #include "block-group.h" > #include "discard.h" > +#include "rcu-string.h" > > #undef SCRAMBLE_DELAYED_REFS > > @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) > &start, &end, > CHUNK_TRIMMED | CHUNK_ALLOCATED); > > + /* CHUNK_* bits not cleared properly */ > + if (start > device->total_bytes) { > + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); > + btrfs_warn_in_rcu(fs_info, > +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", > + start, end - start + 1, > + rcu_str_deref(device->name), > + device->total_bytes); > + mutex_unlock(&fs_info->chunk_mutex); > + ret = 0; > + break; > + } > + > + /* The remaining part has already been trimmed */ > + if (start == device->total_bytes) { > + mutex_unlock(&fs_info->chunk_mutex); > + ret = 0; > + break; > + } Sorry I missed this earlier, but why is this a special case? Couldn't this be merged into the previous check? Why is an offset matching the ending of the device not considered unexpected? I also don't understand the comment, what is the remaining part? Thanks. > + > /* Ensure we skip the reserved area in the first 1M */ > start = max_t(u64, start, SZ_1M); > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index d7670e2a9f39..4e51ef68ea72 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) > } > > mutex_lock(&fs_info->chunk_mutex); > + /* > + * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the > + * current device boundary. > + * This shouldn't fail, as alloc_state should only utilize those two > + * bits, thus we shouldn't alloc new memory for clearing the status. > + * > + * So here we just do an ASSERT() to catch future behavior change. > + */ > + ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1, > + CHUNK_TRIMMED | CHUNK_ALLOCATED); > + ASSERT(!ret); > + > btrfs_device_set_disk_total_bytes(device, new_size); > if (list_empty(&device->post_commit_list)) > list_add_tail(&device->post_commit_list, > -- > 2.28.0 >
On 2020/7/31 下午6:05, Filipe Manana wrote: > On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote: >> >> [BUG] >> The following script can lead to tons of beyond device boundary access: >> >> mkfs.btrfs -f $dev -b 10G >> mount $dev $mnt >> trimfs $mnt >> btrfs filesystem resize 1:-1G $mnt >> trimfs $mnt >> >> [CAUSE] >> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to >> find_first_clear_extent_bit"), we try to avoid trimming ranges that's >> already trimmed. >> >> So we check device->alloc_state by finding the first range which doesn't >> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. >> >> But if we shrunk the device, that bits are not cleared, thus we could >> easily got a range starts beyond the shrunk device size. >> >> This results the returned @start and @end are all beyond device size, >> then we call "end = min(end, device->total_bytes -1);" making @end >> smaller than device size. >> >> Then finally we goes "len = end - start + 1", totally underflow the >> result, and lead to the beyond-device-boundary access. >> >> [FIX] >> This patch will fix the problem in two ways: >> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device >> This is the root fix >> >> - Add extra safe net when trimming free device extents >> We check and warn if the returned range is already beyond current >> device. >> >> Link: https://github.com/kdave/btrfs-progs/issues/282 >> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> Reviewed-by: Filipe Manana <fdmanana@suse.com> >> --- >> Changelog: >> v2: >> - Add proper fixes tag >> - Add extra warning for beyond device end case >> - Add graceful exit for already trimmed case >> v3: >> - Don't return EUCLEAN for beyond boundary access >> - Rephrase the warning message for beyond boundary access >> --- >> fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++ >> fs/btrfs/volumes.c | 12 ++++++++++++ >> 2 files changed, 33 insertions(+) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index fa7d83051587..7c5e0961c93b 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -33,6 +33,7 @@ >> #include "delalloc-space.h" >> #include "block-group.h" >> #include "discard.h" >> +#include "rcu-string.h" >> >> #undef SCRAMBLE_DELAYED_REFS >> >> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) >> &start, &end, >> CHUNK_TRIMMED | CHUNK_ALLOCATED); >> >> + /* CHUNK_* bits not cleared properly */ >> + if (start > device->total_bytes) { >> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); >> + btrfs_warn_in_rcu(fs_info, >> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", >> + start, end - start + 1, >> + rcu_str_deref(device->name), >> + device->total_bytes); >> + mutex_unlock(&fs_info->chunk_mutex); >> + ret = 0; >> + break; >> + } >> + >> + /* The remaining part has already been trimmed */ >> + if (start == device->total_bytes) { >> + mutex_unlock(&fs_info->chunk_mutex); >> + ret = 0; >> + break; >> + } > > Sorry I missed this earlier, but why is this a special case? Couldn't > this be merged into the previous check? > Why is an offset matching the ending of the device not considered unexpected? For such example: 0 1g 2g device 1: |///////////////| | |//| = Allocated space | | = Free space. After one fstrim, [1G, 2G) get trimmed. So in the alloc_state we have 0 1G 2G device 1: | |***************| |**| = CHUNK_TRIMMED bits set Here we just focus on the unallocated space, ignoring the block group parts. Then we run fstrim again. We call find_first_clear_extent_bit(start == 1G), then we got the result start == 2G, end = U64_MAX. In that case, we got start == device->total_bytes, and it's completely valid. > > I also don't understand the comment, what is the remaining part? The remaining means the unallocated space from the @start of find_first_clear_extent_bit(). Any better suggestion? Thanks, Qu > > Thanks. > >> + >> /* Ensure we skip the reserved area in the first 1M */ >> start = max_t(u64, start, SZ_1M); >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index d7670e2a9f39..4e51ef68ea72 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) >> } >> >> mutex_lock(&fs_info->chunk_mutex); >> + /* >> + * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the >> + * current device boundary. >> + * This shouldn't fail, as alloc_state should only utilize those two >> + * bits, thus we shouldn't alloc new memory for clearing the status. >> + * >> + * So here we just do an ASSERT() to catch future behavior change. >> + */ >> + ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1, >> + CHUNK_TRIMMED | CHUNK_ALLOCATED); >> + ASSERT(!ret); >> + >> btrfs_device_set_disk_total_bytes(device, new_size); >> if (list_empty(&device->post_commit_list)) >> list_add_tail(&device->post_commit_list, >> -- >> 2.28.0 >> > >
On 2020/7/31 下午6:20, Qu Wenruo wrote: > > > On 2020/7/31 下午6:05, Filipe Manana wrote: >> On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote: >>> >>> [BUG] >>> The following script can lead to tons of beyond device boundary access: >>> >>> mkfs.btrfs -f $dev -b 10G >>> mount $dev $mnt >>> trimfs $mnt >>> btrfs filesystem resize 1:-1G $mnt >>> trimfs $mnt >>> >>> [CAUSE] >>> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to >>> find_first_clear_extent_bit"), we try to avoid trimming ranges that's >>> already trimmed. >>> >>> So we check device->alloc_state by finding the first range which doesn't >>> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. >>> >>> But if we shrunk the device, that bits are not cleared, thus we could >>> easily got a range starts beyond the shrunk device size. >>> >>> This results the returned @start and @end are all beyond device size, >>> then we call "end = min(end, device->total_bytes -1);" making @end >>> smaller than device size. >>> >>> Then finally we goes "len = end - start + 1", totally underflow the >>> result, and lead to the beyond-device-boundary access. >>> >>> [FIX] >>> This patch will fix the problem in two ways: >>> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device >>> This is the root fix >>> >>> - Add extra safe net when trimming free device extents >>> We check and warn if the returned range is already beyond current >>> device. >>> >>> Link: https://github.com/kdave/btrfs-progs/issues/282 >>> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") >>> Signed-off-by: Qu Wenruo <wqu@suse.com> >>> Reviewed-by: Filipe Manana <fdmanana@suse.com> >>> --- >>> Changelog: >>> v2: >>> - Add proper fixes tag >>> - Add extra warning for beyond device end case >>> - Add graceful exit for already trimmed case >>> v3: >>> - Don't return EUCLEAN for beyond boundary access >>> - Rephrase the warning message for beyond boundary access >>> --- >>> fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++ >>> fs/btrfs/volumes.c | 12 ++++++++++++ >>> 2 files changed, 33 insertions(+) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index fa7d83051587..7c5e0961c93b 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -33,6 +33,7 @@ >>> #include "delalloc-space.h" >>> #include "block-group.h" >>> #include "discard.h" >>> +#include "rcu-string.h" >>> >>> #undef SCRAMBLE_DELAYED_REFS >>> >>> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) >>> &start, &end, >>> CHUNK_TRIMMED | CHUNK_ALLOCATED); >>> >>> + /* CHUNK_* bits not cleared properly */ >>> + if (start > device->total_bytes) { >>> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); >>> + btrfs_warn_in_rcu(fs_info, >>> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", >>> + start, end - start + 1, >>> + rcu_str_deref(device->name), >>> + device->total_bytes); >>> + mutex_unlock(&fs_info->chunk_mutex); >>> + ret = 0; >>> + break; >>> + } >>> + >>> + /* The remaining part has already been trimmed */ >>> + if (start == device->total_bytes) { >>> + mutex_unlock(&fs_info->chunk_mutex); >>> + ret = 0; >>> + break; >>> + } >> >> Sorry I missed this earlier, but why is this a special case? Couldn't >> this be merged into the previous check? >> Why is an offset matching the ending of the device not considered unexpected? > > For such example: > 0 1g 2g > device 1: |///////////////| | > |//| = Allocated space > | | = Free space. > > After one fstrim, [1G, 2G) get trimmed. > So in the alloc_state we have > 0 1G 2G > device 1: | |***************| > |**| = CHUNK_TRIMMED bits set > > Here we just focus on the unallocated space, ignoring the block group parts. > > Then we run fstrim again. > We call find_first_clear_extent_bit(start == 1G), then we got the result > start == 2G, end = U64_MAX. > > In that case, we got start == device->total_bytes, and it's completely > valid. Sorry, although this is correct, it's duplicated with the later checks: end = min(end, device->total_bytes - 1); len = end - start + 1; /* We didn't find any extents */ if (!len) { mutex_unlock(&fs_info->chunk_mutex); ret = 0; break; } If we got a returned start == device->total_bytes, then here we would hit len == 0, and exit. So my (start == device->total_bytes) is duplicated. I guess the existing one is easier to understand, thus my extra check should be removed. Thanks, Qu > >> >> I also don't understand the comment, what is the remaining part? > > The remaining means the unallocated space from the @start of > find_first_clear_extent_bit(). > > Any better suggestion? > > Thanks, > Qu > >> >> Thanks. >> >>> + >>> /* Ensure we skip the reserved area in the first 1M */ >>> start = max_t(u64, start, SZ_1M); >>> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index d7670e2a9f39..4e51ef68ea72 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) >>> } >>> >>> mutex_lock(&fs_info->chunk_mutex); >>> + /* >>> + * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the >>> + * current device boundary. >>> + * This shouldn't fail, as alloc_state should only utilize those two >>> + * bits, thus we shouldn't alloc new memory for clearing the status. >>> + * >>> + * So here we just do an ASSERT() to catch future behavior change. >>> + */ >>> + ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1, >>> + CHUNK_TRIMMED | CHUNK_ALLOCATED); >>> + ASSERT(!ret); >>> + >>> btrfs_device_set_disk_total_bytes(device, new_size); >>> if (list_empty(&device->post_commit_list)) >>> list_add_tail(&device->post_commit_list, >>> -- >>> 2.28.0 >>> >> >> >
On Fri, Jul 31, 2020 at 11:21 AM Qu Wenruo <wqu@suse.com> wrote: > > > > On 2020/7/31 下午6:05, Filipe Manana wrote: > > On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote: > >> > >> [BUG] > >> The following script can lead to tons of beyond device boundary access: > >> > >> mkfs.btrfs -f $dev -b 10G > >> mount $dev $mnt > >> trimfs $mnt > >> btrfs filesystem resize 1:-1G $mnt > >> trimfs $mnt > >> > >> [CAUSE] > >> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to > >> find_first_clear_extent_bit"), we try to avoid trimming ranges that's > >> already trimmed. > >> > >> So we check device->alloc_state by finding the first range which doesn't > >> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. > >> > >> But if we shrunk the device, that bits are not cleared, thus we could > >> easily got a range starts beyond the shrunk device size. > >> > >> This results the returned @start and @end are all beyond device size, > >> then we call "end = min(end, device->total_bytes -1);" making @end > >> smaller than device size. > >> > >> Then finally we goes "len = end - start + 1", totally underflow the > >> result, and lead to the beyond-device-boundary access. > >> > >> [FIX] > >> This patch will fix the problem in two ways: > >> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device > >> This is the root fix > >> > >> - Add extra safe net when trimming free device extents > >> We check and warn if the returned range is already beyond current > >> device. > >> > >> Link: https://github.com/kdave/btrfs-progs/issues/282 > >> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") > >> Signed-off-by: Qu Wenruo <wqu@suse.com> > >> Reviewed-by: Filipe Manana <fdmanana@suse.com> > >> --- > >> Changelog: > >> v2: > >> - Add proper fixes tag > >> - Add extra warning for beyond device end case > >> - Add graceful exit for already trimmed case > >> v3: > >> - Don't return EUCLEAN for beyond boundary access > >> - Rephrase the warning message for beyond boundary access > >> --- > >> fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++ > >> fs/btrfs/volumes.c | 12 ++++++++++++ > >> 2 files changed, 33 insertions(+) > >> > >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > >> index fa7d83051587..7c5e0961c93b 100644 > >> --- a/fs/btrfs/extent-tree.c > >> +++ b/fs/btrfs/extent-tree.c > >> @@ -33,6 +33,7 @@ > >> #include "delalloc-space.h" > >> #include "block-group.h" > >> #include "discard.h" > >> +#include "rcu-string.h" > >> > >> #undef SCRAMBLE_DELAYED_REFS > >> > >> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) > >> &start, &end, > >> CHUNK_TRIMMED | CHUNK_ALLOCATED); > >> > >> + /* CHUNK_* bits not cleared properly */ > >> + if (start > device->total_bytes) { > >> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); > >> + btrfs_warn_in_rcu(fs_info, > >> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", > >> + start, end - start + 1, > >> + rcu_str_deref(device->name), > >> + device->total_bytes); > >> + mutex_unlock(&fs_info->chunk_mutex); > >> + ret = 0; > >> + break; > >> + } > >> + > >> + /* The remaining part has already been trimmed */ > >> + if (start == device->total_bytes) { > >> + mutex_unlock(&fs_info->chunk_mutex); > >> + ret = 0; > >> + break; > >> + } > > > > Sorry I missed this earlier, but why is this a special case? Couldn't > > this be merged into the previous check? > > Why is an offset matching the ending of the device not considered unexpected? > > For such example: > 0 1g 2g > device 1: |///////////////| | > |//| = Allocated space > | | = Free space. > > After one fstrim, [1G, 2G) get trimmed. > So in the alloc_state we have > 0 1G 2G > device 1: | |***************| > |**| = CHUNK_TRIMMED bits set > > Here we just focus on the unallocated space, ignoring the block group parts. > > Then we run fstrim again. > We call find_first_clear_extent_bit(start == 1G), then we got the result > start == 2G, end = U64_MAX. > > In that case, we got start == device->total_bytes, and it's completely > valid. Ok. But this can happen without shrinking the device before, and it seems we already handle it, or is the existing handling buggy? If it is, it should be replaced or updated. Thanks. > > > > > I also don't understand the comment, what is the remaining part? > > The remaining means the unallocated space from the @start of > find_first_clear_extent_bit(). > > Any better suggestion? > > Thanks, > Qu > > > > > Thanks. > > > >> + > >> /* Ensure we skip the reserved area in the first 1M */ > >> start = max_t(u64, start, SZ_1M); > >> > >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > >> index d7670e2a9f39..4e51ef68ea72 100644 > >> --- a/fs/btrfs/volumes.c > >> +++ b/fs/btrfs/volumes.c > >> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) > >> } > >> > >> mutex_lock(&fs_info->chunk_mutex); > >> + /* > >> + * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the > >> + * current device boundary. > >> + * This shouldn't fail, as alloc_state should only utilize those two > >> + * bits, thus we shouldn't alloc new memory for clearing the status. > >> + * > >> + * So here we just do an ASSERT() to catch future behavior change. > >> + */ > >> + ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1, > >> + CHUNK_TRIMMED | CHUNK_ALLOCATED); > >> + ASSERT(!ret); > >> + > >> btrfs_device_set_disk_total_bytes(device, new_size); > >> if (list_empty(&device->post_commit_list)) > >> list_add_tail(&device->post_commit_list, > >> -- > >> 2.28.0 > >> > > > > >
On Fri, Jul 31, 2020 at 11:38 AM Qu Wenruo <wqu@suse.com> wrote: > > > > On 2020/7/31 下午6:20, Qu Wenruo wrote: > > > > > > On 2020/7/31 下午6:05, Filipe Manana wrote: > >> On Fri, Jul 31, 2020 at 10:49 AM Qu Wenruo <wqu@suse.com> wrote: > >>> > >>> [BUG] > >>> The following script can lead to tons of beyond device boundary access: > >>> > >>> mkfs.btrfs -f $dev -b 10G > >>> mount $dev $mnt > >>> trimfs $mnt > >>> btrfs filesystem resize 1:-1G $mnt > >>> trimfs $mnt > >>> > >>> [CAUSE] > >>> Since commit 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to > >>> find_first_clear_extent_bit"), we try to avoid trimming ranges that's > >>> already trimmed. > >>> > >>> So we check device->alloc_state by finding the first range which doesn't > >>> have CHUNK_TRIMMED and CHUNK_ALLOCATED not set. > >>> > >>> But if we shrunk the device, that bits are not cleared, thus we could > >>> easily got a range starts beyond the shrunk device size. > >>> > >>> This results the returned @start and @end are all beyond device size, > >>> then we call "end = min(end, device->total_bytes -1);" making @end > >>> smaller than device size. > >>> > >>> Then finally we goes "len = end - start + 1", totally underflow the > >>> result, and lead to the beyond-device-boundary access. > >>> > >>> [FIX] > >>> This patch will fix the problem in two ways: > >>> - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device > >>> This is the root fix > >>> > >>> - Add extra safe net when trimming free device extents > >>> We check and warn if the returned range is already beyond current > >>> device. > >>> > >>> Link: https://github.com/kdave/btrfs-progs/issues/282 > >>> Fixes: 929be17a9b49 ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit") > >>> Signed-off-by: Qu Wenruo <wqu@suse.com> > >>> Reviewed-by: Filipe Manana <fdmanana@suse.com> > >>> --- > >>> Changelog: > >>> v2: > >>> - Add proper fixes tag > >>> - Add extra warning for beyond device end case > >>> - Add graceful exit for already trimmed case > >>> v3: > >>> - Don't return EUCLEAN for beyond boundary access > >>> - Rephrase the warning message for beyond boundary access > >>> --- > >>> fs/btrfs/extent-tree.c | 21 +++++++++++++++++++++ > >>> fs/btrfs/volumes.c | 12 ++++++++++++ > >>> 2 files changed, 33 insertions(+) > >>> > >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > >>> index fa7d83051587..7c5e0961c93b 100644 > >>> --- a/fs/btrfs/extent-tree.c > >>> +++ b/fs/btrfs/extent-tree.c > >>> @@ -33,6 +33,7 @@ > >>> #include "delalloc-space.h" > >>> #include "block-group.h" > >>> #include "discard.h" > >>> +#include "rcu-string.h" > >>> > >>> #undef SCRAMBLE_DELAYED_REFS > >>> > >>> @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) > >>> &start, &end, > >>> CHUNK_TRIMMED | CHUNK_ALLOCATED); > >>> > >>> + /* CHUNK_* bits not cleared properly */ > >>> + if (start > device->total_bytes) { > >>> + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); > >>> + btrfs_warn_in_rcu(fs_info, > >>> +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", > >>> + start, end - start + 1, > >>> + rcu_str_deref(device->name), > >>> + device->total_bytes); > >>> + mutex_unlock(&fs_info->chunk_mutex); > >>> + ret = 0; > >>> + break; > >>> + } > >>> + > >>> + /* The remaining part has already been trimmed */ > >>> + if (start == device->total_bytes) { > >>> + mutex_unlock(&fs_info->chunk_mutex); > >>> + ret = 0; > >>> + break; > >>> + } > >> > >> Sorry I missed this earlier, but why is this a special case? Couldn't > >> this be merged into the previous check? > >> Why is an offset matching the ending of the device not considered unexpected? > > > > For such example: > > 0 1g 2g > > device 1: |///////////////| | > > |//| = Allocated space > > | | = Free space. > > > > After one fstrim, [1G, 2G) get trimmed. > > So in the alloc_state we have > > 0 1G 2G > > device 1: | |***************| > > |**| = CHUNK_TRIMMED bits set > > > > Here we just focus on the unallocated space, ignoring the block group parts. > > > > Then we run fstrim again. > > We call find_first_clear_extent_bit(start == 1G), then we got the result > > start == 2G, end = U64_MAX. > > > > In that case, we got start == device->total_bytes, and it's completely > > valid. > > Sorry, although this is correct, it's duplicated with the later checks: > > end = min(end, device->total_bytes - 1); > > len = end - start + 1; > > /* We didn't find any extents */ > if (!len) { > mutex_unlock(&fs_info->chunk_mutex); > ret = 0; > break; > } > > If we got a returned start == device->total_bytes, then here we would > hit len == 0, and exit. > > So my (start == device->total_bytes) is duplicated. > > I guess the existing one is easier to understand, thus my extra check > should be removed. Hit reply too soon before seeing this reply. Yes, it seems correct to me as well. Thanks. > > Thanks, > Qu > > > > >> > >> I also don't understand the comment, what is the remaining part? > > > > The remaining means the unallocated space from the @start of > > find_first_clear_extent_bit(). > > > > Any better suggestion? > > > > Thanks, > > Qu > > > >> > >> Thanks. > >> > >>> + > >>> /* Ensure we skip the reserved area in the first 1M */ > >>> start = max_t(u64, start, SZ_1M); > >>> > >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > >>> index d7670e2a9f39..4e51ef68ea72 100644 > >>> --- a/fs/btrfs/volumes.c > >>> +++ b/fs/btrfs/volumes.c > >>> @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) > >>> } > >>> > >>> mutex_lock(&fs_info->chunk_mutex); > >>> + /* > >>> + * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the > >>> + * current device boundary. > >>> + * This shouldn't fail, as alloc_state should only utilize those two > >>> + * bits, thus we shouldn't alloc new memory for clearing the status. > >>> + * > >>> + * So here we just do an ASSERT() to catch future behavior change. > >>> + */ > >>> + ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1, > >>> + CHUNK_TRIMMED | CHUNK_ALLOCATED); > >>> + ASSERT(!ret); > >>> + > >>> btrfs_device_set_disk_total_bytes(device, new_size); > >>> if (list_empty(&device->post_commit_list)) > >>> list_add_tail(&device->post_commit_list, > >>> -- > >>> 2.28.0 > >>> > >> > >> > > >
Hi Qu, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on kdave/for-next] [also build test WARNING on v5.8-rc7 next-20200731] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-trim-fix-underflow-in-trim-length-to-prevent-access-beyond-device-boundary/20200731-174928 base: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next config: x86_64-randconfig-s022-20200731 (attached as .config) compiler: gcc-9 (Debian 9.3.0-14) 9.3.0 reproduce: # apt-get install sparse # sparse version: v0.6.2-115-g5fc204f2-dirty # save the attached .config to linux build tree make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> sparse warnings: (new ones prefixed by >>) >> fs/btrfs/extent-tree.c:5676:25: sparse: sparse: incompatible types in comparison expression (different address spaces): >> fs/btrfs/extent-tree.c:5676:25: sparse: struct rcu_string [noderef] __rcu * >> fs/btrfs/extent-tree.c:5676:25: sparse: struct rcu_string * fs/btrfs/extent-tree.c:1769:9: sparse: sparse: context imbalance in 'run_and_cleanup_extent_op' - unexpected unlock fs/btrfs/extent-tree.c:1842:28: sparse: sparse: context imbalance in 'cleanup_ref_head' - unexpected unlock fs/btrfs/extent-tree.c:1918:36: sparse: sparse: context imbalance in 'btrfs_run_delayed_refs_for_head' - unexpected unlock fs/btrfs/extent-tree.c:1983:21: sparse: sparse: context imbalance in '__btrfs_run_delayed_refs' - wrong count at exit vim +5676 fs/btrfs/extent-tree.c 5619 5620 /* 5621 * It used to be that old block groups would be left around forever. 5622 * Iterating over them would be enough to trim unused space. Since we 5623 * now automatically remove them, we also need to iterate over unallocated 5624 * space. 5625 * 5626 * We don't want a transaction for this since the discard may take a 5627 * substantial amount of time. We don't require that a transaction be 5628 * running, but we do need to take a running transaction into account 5629 * to ensure that we're not discarding chunks that were released or 5630 * allocated in the current transaction. 5631 * 5632 * Holding the chunks lock will prevent other threads from allocating 5633 * or releasing chunks, but it won't prevent a running transaction 5634 * from committing and releasing the memory that the pending chunks 5635 * list head uses. For that, we need to take a reference to the 5636 * transaction and hold the commit root sem. We only need to hold 5637 * it while performing the free space search since we have already 5638 * held back allocations. 5639 */ 5640 static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) 5641 { 5642 u64 start = SZ_1M, len = 0, end = 0; 5643 int ret; 5644 5645 *trimmed = 0; 5646 5647 /* Discard not supported = nothing to do. */ 5648 if (!blk_queue_discard(bdev_get_queue(device->bdev))) 5649 return 0; 5650 5651 /* Not writable = nothing to do. */ 5652 if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) 5653 return 0; 5654 5655 /* No free space = nothing to do. */ 5656 if (device->total_bytes <= device->bytes_used) 5657 return 0; 5658 5659 ret = 0; 5660 5661 while (1) { 5662 struct btrfs_fs_info *fs_info = device->fs_info; 5663 u64 bytes; 5664 5665 ret = mutex_lock_interruptible(&fs_info->chunk_mutex); 5666 if (ret) 5667 break; 5668 5669 find_first_clear_extent_bit(&device->alloc_state, start, 5670 &start, &end, 5671 CHUNK_TRIMMED | CHUNK_ALLOCATED); 5672 5673 /* CHUNK_* bits not cleared properly */ 5674 if (start > device->total_bytes) { 5675 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); > 5676 btrfs_warn_in_rcu(fs_info, 5677 "ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", 5678 start, end - start + 1, 5679 rcu_str_deref(device->name), 5680 device->total_bytes); 5681 mutex_unlock(&fs_info->chunk_mutex); 5682 ret = 0; 5683 break; 5684 } 5685 5686 /* The remaining part has already been trimmed */ 5687 if (start == device->total_bytes) { 5688 mutex_unlock(&fs_info->chunk_mutex); 5689 ret = 0; 5690 break; 5691 } 5692 5693 /* Ensure we skip the reserved area in the first 1M */ 5694 start = max_t(u64, start, SZ_1M); 5695 5696 /* 5697 * If find_first_clear_extent_bit find a range that spans the 5698 * end of the device it will set end to -1, in this case it's up 5699 * to the caller to trim the value to the size of the device. 5700 */ 5701 end = min(end, device->total_bytes - 1); 5702 5703 len = end - start + 1; 5704 5705 /* We didn't find any extents */ 5706 if (!len) { 5707 mutex_unlock(&fs_info->chunk_mutex); 5708 ret = 0; 5709 break; 5710 } 5711 5712 ret = btrfs_issue_discard(device->bdev, start, len, 5713 &bytes); 5714 if (!ret) 5715 set_extent_bits(&device->alloc_state, start, 5716 start + bytes - 1, 5717 CHUNK_TRIMMED); 5718 mutex_unlock(&fs_info->chunk_mutex); 5719 5720 if (ret) 5721 break; 5722 5723 start += len; 5724 *trimmed += bytes; 5725 5726 if (fatal_signal_pending(current)) { 5727 ret = -ERESTARTSYS; 5728 break; 5729 } 5730 5731 cond_resched(); 5732 } 5733 5734 return ret; 5735 } 5736 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index fa7d83051587..7c5e0961c93b 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -33,6 +33,7 @@ #include "delalloc-space.h" #include "block-group.h" #include "discard.h" +#include "rcu-string.h" #undef SCRAMBLE_DELAYED_REFS @@ -5669,6 +5670,26 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, u64 *trimmed) &start, &end, CHUNK_TRIMMED | CHUNK_ALLOCATED); + /* CHUNK_* bits not cleared properly */ + if (start > device->total_bytes) { + WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); + btrfs_warn_in_rcu(fs_info, +"ignoring attempt to trim beyond device size: offset %llu length %llu device %s device size %llu", + start, end - start + 1, + rcu_str_deref(device->name), + device->total_bytes); + mutex_unlock(&fs_info->chunk_mutex); + ret = 0; + break; + } + + /* The remaining part has already been trimmed */ + if (start == device->total_bytes) { + mutex_unlock(&fs_info->chunk_mutex); + ret = 0; + break; + } + /* Ensure we skip the reserved area in the first 1M */ start = max_t(u64, start, SZ_1M); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d7670e2a9f39..4e51ef68ea72 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -4720,6 +4720,18 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size) } mutex_lock(&fs_info->chunk_mutex); + /* + * Also clear any CHUNK_TRIMMED and CHUNK_ALLOCATED bits beyond the + * current device boundary. + * This shouldn't fail, as alloc_state should only utilize those two + * bits, thus we shouldn't alloc new memory for clearing the status. + * + * So here we just do an ASSERT() to catch future behavior change. + */ + ret = clear_extent_bits(&device->alloc_state, new_size, (u64)-1, + CHUNK_TRIMMED | CHUNK_ALLOCATED); + ASSERT(!ret); + btrfs_device_set_disk_total_bytes(device, new_size); if (list_empty(&device->post_commit_list)) list_add_tail(&device->post_commit_list,