mbox series

[v4,00/11] btrfs-progs: Support for SKINNY_BG_TREE feature

Message ID 20200505000230.4454-1-wqu@suse.com (mailing list archive)
Headers show
Series btrfs-progs: Support for SKINNY_BG_TREE feature | expand

Message

Qu Wenruo May 5, 2020, 12:02 a.m. UTC
This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/skinny_bg_tree
Which is based on v5.6 tag, with extra cleanups (sent to mail list) applied.

This patchset provides the needed user space infrastructure for SKINNY_BG_TREE
feature.

Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
is needed to convert existing fs (unmounted) to new format, and
vice-verse.

Now btrfstune can convert regular extent tree fs to bg tree fs to
improve mount time.

For the performance improvement, please check the kernel patchset cover
letter or the last patch.
(SPOILER ALERT: It's super fast)

Changelog:
v2:
- Rebase to v5.2.2 tag
- Add btrfstune ability to convert existing fs to BG_TREE feature

v3:
- Fix a bug that temp chunks are not cleaned up properly
  This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
  It should be called after temp chunks cleaned up.

- Fix a bug that an extent buffer get leaked
  This is caused by newly created bg tree not added to dirty list.

v4:
- Go with skinny bg tree other than regular block group item
  We're introducing a new incompatible feature anyway, why not go
  extreme?

- Use the same refactor as kernel.
  To make code much cleaner and easier to read.

- Add the ability to rollback to regular extent tree.
  So confident tester can try SKINNY_BG_TREE using their real world
  data, and rollback if they still want to mount it using older kernels.


Qu Wenruo (11):
  btrfs-progs: check/lowmem: Lookup block group item in a seperate
    function
  btrfs-progs: block-group: Refactor how we read one block group item
  btrfs-progs: Rename btrfs_remove_block_group() and
    free_block_group_item()
  btrfs-progs: block-group: Refactor how we insert a block group item
  btrfs-progs: block-group: Rename write_one_cahce_group()
  btrfs-progs: Introduce rw support for skinny_bg_tree
  btrfs-progs: mkfs: Introduce -O skinny-bg-tree
  btrfs-progs: dump-tree/dump-super: Introduce support for skinny bg
    tree
  btrfs-progs: check: Introduce support for bg-tree feature
  btrfs-progs: btrfstune: Allow to enable bg-tree feature offline
  btrfs-progs: btrfstune: Allow user to rollback to regular extent tree

 Documentation/btrfstune.asciidoc |  10 +
 btrfstune.c                      |  36 +-
 check/common.h                   |   4 +-
 check/main.c                     |  60 +++-
 check/mode-lowmem.c              | 140 +++++---
 cmds/inspect-dump-super.c        |   1 +
 cmds/inspect-dump-tree.c         |   6 +
 cmds/rescue-chunk-recover.c      |   5 +-
 common/fsfeatures.c              |   6 +
 ctree.h                          |  23 +-
 disk-io.c                        |  20 ++
 extent-tree.c                    | 546 +++++++++++++++++++++++++------
 mkfs/common.c                    |   3 +-
 mkfs/common.h                    |   3 +
 mkfs/main.c                      |  13 +-
 print-tree.c                     |   4 +
 root-tree.c                      |   6 +-
 transaction.c                    |   2 +
 18 files changed, 738 insertions(+), 150 deletions(-)

Comments

David Sterba May 11, 2020, 6:58 p.m. UTC | #1
On Tue, May 05, 2020 at 08:02:19AM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/skinny_bg_tree
> Which is based on v5.6 tag, with extra cleanups (sent to mail list) applied.
> 
> This patchset provides the needed user space infrastructure for SKINNY_BG_TREE
> feature.
> 
> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
> is needed to convert existing fs (unmounted) to new format, and
> vice-verse.
> 
> Now btrfstune can convert regular extent tree fs to bg tree fs to
> improve mount time.
> 
> For the performance improvement, please check the kernel patchset cover
> letter or the last patch.
> (SPOILER ALERT: It's super fast)
> 
> Changelog:
> v2:
> - Rebase to v5.2.2 tag
> - Add btrfstune ability to convert existing fs to BG_TREE feature
> 
> v3:
> - Fix a bug that temp chunks are not cleaned up properly
>   This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
>   It should be called after temp chunks cleaned up.
> 
> - Fix a bug that an extent buffer get leaked
>   This is caused by newly created bg tree not added to dirty list.
> 
> v4:
> - Go with skinny bg tree other than regular block group item
>   We're introducing a new incompatible feature anyway, why not go
>   extreme?
> 
> - Use the same refactor as kernel.
>   To make code much cleaner and easier to read.
> 
> - Add the ability to rollback to regular extent tree.
>   So confident tester can try SKINNY_BG_TREE using their real world
>   data, and rollback if they still want to mount it using older kernels.
>
> Qu Wenruo (11):
>   btrfs-progs: check/lowmem: Lookup block group item in a seperate
>     function
>   btrfs-progs: block-group: Refactor how we read one block group item
>   btrfs-progs: Rename btrfs_remove_block_group() and
>     free_block_group_item()
>   btrfs-progs: block-group: Refactor how we insert a block group item
>   btrfs-progs: block-group: Rename write_one_cahce_group()

I'll add the above patches independently, for the rest I don't know. I
still think the separate tree is somehow wrong so have to convince
myself that it's not.
Qu Wenruo May 12, 2020, 12:26 a.m. UTC | #2
On 2020/5/12 上午2:58, David Sterba wrote:
> On Tue, May 05, 2020 at 08:02:19AM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/btrfs-progs/tree/skinny_bg_tree
>> Which is based on v5.6 tag, with extra cleanups (sent to mail list) applied.
>>
>> This patchset provides the needed user space infrastructure for SKINNY_BG_TREE
>> feature.
>>
>> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
>> is needed to convert existing fs (unmounted) to new format, and
>> vice-verse.
>>
>> Now btrfstune can convert regular extent tree fs to bg tree fs to
>> improve mount time.
>>
>> For the performance improvement, please check the kernel patchset cover
>> letter or the last patch.
>> (SPOILER ALERT: It's super fast)
>>
>> Changelog:
>> v2:
>> - Rebase to v5.2.2 tag
>> - Add btrfstune ability to convert existing fs to BG_TREE feature
>>
>> v3:
>> - Fix a bug that temp chunks are not cleaned up properly
>>   This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
>>   It should be called after temp chunks cleaned up.
>>
>> - Fix a bug that an extent buffer get leaked
>>   This is caused by newly created bg tree not added to dirty list.
>>
>> v4:
>> - Go with skinny bg tree other than regular block group item
>>   We're introducing a new incompatible feature anyway, why not go
>>   extreme?
>>
>> - Use the same refactor as kernel.
>>   To make code much cleaner and easier to read.
>>
>> - Add the ability to rollback to regular extent tree.
>>   So confident tester can try SKINNY_BG_TREE using their real world
>>   data, and rollback if they still want to mount it using older kernels.
>>
>> Qu Wenruo (11):
>>   btrfs-progs: check/lowmem: Lookup block group item in a seperate
>>     function
>>   btrfs-progs: block-group: Refactor how we read one block group item
>>   btrfs-progs: Rename btrfs_remove_block_group() and
>>     free_block_group_item()
>>   btrfs-progs: block-group: Refactor how we insert a block group item
>>   btrfs-progs: block-group: Rename write_one_cahce_group()
> 
> I'll add the above patches independently, for the rest I don't know. I
> still think the separate tree is somehow wrong so have to convince
> myself that it's not.

No problem.

Since the refactor would be the basis for whatever the final method we
choose, it should be pretty OK.
Even if we go something like (0, NEW_BLOCK_GROUP_ITEM, bytenr) in extent
tree, the refactor still makes a lot of sense.

BTW, if we merge the skip_bg mount option before this patchset, we could
even make the skinny_bg_tree feature RO compactable.

So it would be a good time reviewing that feature.

Thanks,
Qu
Qu Wenruo May 12, 2020, 2:30 a.m. UTC | #3
On 2020/5/12 上午2:58, David Sterba wrote:
> On Tue, May 05, 2020 at 08:02:19AM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/btrfs-progs/tree/skinny_bg_tree
>> Which is based on v5.6 tag, with extra cleanups (sent to mail list) applied.
>>
>> This patchset provides the needed user space infrastructure for SKINNY_BG_TREE
>> feature.
>>
>> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
>> is needed to convert existing fs (unmounted) to new format, and
>> vice-verse.
>>
>> Now btrfstune can convert regular extent tree fs to bg tree fs to
>> improve mount time.
>>
>> For the performance improvement, please check the kernel patchset cover
>> letter or the last patch.
>> (SPOILER ALERT: It's super fast)
>>
>> Changelog:
>> v2:
>> - Rebase to v5.2.2 tag
>> - Add btrfstune ability to convert existing fs to BG_TREE feature
>>
>> v3:
>> - Fix a bug that temp chunks are not cleaned up properly
>>   This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
>>   It should be called after temp chunks cleaned up.
>>
>> - Fix a bug that an extent buffer get leaked
>>   This is caused by newly created bg tree not added to dirty list.
>>
>> v4:
>> - Go with skinny bg tree other than regular block group item
>>   We're introducing a new incompatible feature anyway, why not go
>>   extreme?
>>
>> - Use the same refactor as kernel.
>>   To make code much cleaner and easier to read.
>>
>> - Add the ability to rollback to regular extent tree.
>>   So confident tester can try SKINNY_BG_TREE using their real world
>>   data, and rollback if they still want to mount it using older kernels.
>>
>> Qu Wenruo (11):
>>   btrfs-progs: check/lowmem: Lookup block group item in a seperate
>>     function
>>   btrfs-progs: block-group: Refactor how we read one block group item
>>   btrfs-progs: Rename btrfs_remove_block_group() and
>>     free_block_group_item()
>>   btrfs-progs: block-group: Refactor how we insert a block group item
>>   btrfs-progs: block-group: Rename write_one_cahce_group()
> 
> I'll add the above patches independently, for the rest I don't know. I
> still think the separate tree is somehow wrong so have to convince
> myself that it's not.
> 
One interesting advantage here is, separate block group tree would
hugely reduce the possibility to fail to mount due to corrupted extent tree.
There are two reports of different corruption on extent tree already in
the mail list in the last 24 hours.

While the skinny bg tree could hugely reduce the amount of block group
items, which means less possibility to corrupt.

And since we have less tree blocks for block group tree, the cow cost
would also be reduced obviously.
As one BGI (just a key) get modified, all modification to other keys in
that leaf won't lead to new COW until next transaction.

So personally I believe it's much better than regular extent tree.

Thanks,
Qu
Nikolay Borisov May 12, 2020, 8:21 a.m. UTC | #4
On 12.05.20 г. 5:30 ч., Qu Wenruo wrote:
> 
> 
> On 2020/5/12 上午2:58, David Sterba wrote:
>> On Tue, May 05, 2020 at 08:02:19AM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/btrfs-progs/tree/skinny_bg_tree
>>> Which is based on v5.6 tag, with extra cleanups (sent to mail list) applied.
>>>
>>> This patchset provides the needed user space infrastructure for SKINNY_BG_TREE
>>> feature.
>>>
>>> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
>>> is needed to convert existing fs (unmounted) to new format, and
>>> vice-verse.
>>>
>>> Now btrfstune can convert regular extent tree fs to bg tree fs to
>>> improve mount time.
>>>
>>> For the performance improvement, please check the kernel patchset cover
>>> letter or the last patch.
>>> (SPOILER ALERT: It's super fast)
>>>
>>> Changelog:
>>> v2:
>>> - Rebase to v5.2.2 tag
>>> - Add btrfstune ability to convert existing fs to BG_TREE feature
>>>
>>> v3:
>>> - Fix a bug that temp chunks are not cleaned up properly
>>>   This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
>>>   It should be called after temp chunks cleaned up.
>>>
>>> - Fix a bug that an extent buffer get leaked
>>>   This is caused by newly created bg tree not added to dirty list.
>>>
>>> v4:
>>> - Go with skinny bg tree other than regular block group item
>>>   We're introducing a new incompatible feature anyway, why not go
>>>   extreme?
>>>
>>> - Use the same refactor as kernel.
>>>   To make code much cleaner and easier to read.
>>>
>>> - Add the ability to rollback to regular extent tree.
>>>   So confident tester can try SKINNY_BG_TREE using their real world
>>>   data, and rollback if they still want to mount it using older kernels.
>>>
>>> Qu Wenruo (11):
>>>   btrfs-progs: check/lowmem: Lookup block group item in a seperate
>>>     function
>>>   btrfs-progs: block-group: Refactor how we read one block group item
>>>   btrfs-progs: Rename btrfs_remove_block_group() and
>>>     free_block_group_item()
>>>   btrfs-progs: block-group: Refactor how we insert a block group item
>>>   btrfs-progs: block-group: Rename write_one_cahce_group()
>>
>> I'll add the above patches independently, for the rest I don't know. I
>> still think the separate tree is somehow wrong so have to convince
>> myself that it's not.
>>
> One interesting advantage here is, separate block group tree would
> hugely reduce the possibility to fail to mount due to corrupted extent tree.
> There are two reports of different corruption on extent tree already in
> the mail list in the last 24 hours.
> 
> While the skinny bg tree could hugely reduce the amount of block group
> items, which means less possibility to corrupt.
> 
> And since we have less tree blocks for block group tree, the cow cost
> would also be reduced obviously.
> As one BGI (just a key) get modified, all modification to other keys in
> that leaf won't lead to new COW until next transaction.
> 
> So personally I believe it's much better than regular extent tree.

Perhaps it will be more convincing if you could substantiate those
claims with numbers. I.e run some benchmarks and show numbers under what
cases the added complexity brings positives to the table.

> 
> Thanks,
> Qu
>
Qu Wenruo May 12, 2020, 8:44 a.m. UTC | #5
On 2020/5/12 下午4:21, Nikolay Borisov wrote:
> 
> 
> On 12.05.20 г. 5:30 ч., Qu Wenruo wrote:
>>
>>
>> On 2020/5/12 上午2:58, David Sterba wrote:
>>> On Tue, May 05, 2020 at 08:02:19AM +0800, Qu Wenruo wrote:
>>>> This patchset can be fetched from github:
>>>> https://github.com/adam900710/btrfs-progs/tree/skinny_bg_tree
>>>> Which is based on v5.6 tag, with extra cleanups (sent to mail list) applied.
>>>>
>>>> This patchset provides the needed user space infrastructure for SKINNY_BG_TREE
>>>> feature.
>>>>
>>>> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
>>>> is needed to convert existing fs (unmounted) to new format, and
>>>> vice-verse.
>>>>
>>>> Now btrfstune can convert regular extent tree fs to bg tree fs to
>>>> improve mount time.
>>>>
>>>> For the performance improvement, please check the kernel patchset cover
>>>> letter or the last patch.
>>>> (SPOILER ALERT: It's super fast)
>>>>
>>>> Changelog:
>>>> v2:
>>>> - Rebase to v5.2.2 tag
>>>> - Add btrfstune ability to convert existing fs to BG_TREE feature
>>>>
>>>> v3:
>>>> - Fix a bug that temp chunks are not cleaned up properly
>>>>   This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
>>>>   It should be called after temp chunks cleaned up.
>>>>
>>>> - Fix a bug that an extent buffer get leaked
>>>>   This is caused by newly created bg tree not added to dirty list.
>>>>
>>>> v4:
>>>> - Go with skinny bg tree other than regular block group item
>>>>   We're introducing a new incompatible feature anyway, why not go
>>>>   extreme?
>>>>
>>>> - Use the same refactor as kernel.
>>>>   To make code much cleaner and easier to read.
>>>>
>>>> - Add the ability to rollback to regular extent tree.
>>>>   So confident tester can try SKINNY_BG_TREE using their real world
>>>>   data, and rollback if they still want to mount it using older kernels.
>>>>
>>>> Qu Wenruo (11):
>>>>   btrfs-progs: check/lowmem: Lookup block group item in a seperate
>>>>     function
>>>>   btrfs-progs: block-group: Refactor how we read one block group item
>>>>   btrfs-progs: Rename btrfs_remove_block_group() and
>>>>     free_block_group_item()
>>>>   btrfs-progs: block-group: Refactor how we insert a block group item
>>>>   btrfs-progs: block-group: Rename write_one_cahce_group()
>>>
>>> I'll add the above patches independently, for the rest I don't know. I
>>> still think the separate tree is somehow wrong so have to convince
>>> myself that it's not.
>>>
>> One interesting advantage here is, separate block group tree would
>> hugely reduce the possibility to fail to mount due to corrupted extent tree.
>> There are two reports of different corruption on extent tree already in
>> the mail list in the last 24 hours.
>>
>> While the skinny bg tree could hugely reduce the amount of block group
>> items, which means less possibility to corrupt.
>>
>> And since we have less tree blocks for block group tree, the cow cost
>> would also be reduced obviously.
>> As one BGI (just a key) get modified, all modification to other keys in
>> that leaf won't lead to new COW until next transaction.
>>
>> So personally I believe it's much better than regular extent tree.
> 
> Perhaps it will be more convincing if you could substantiate those
> claims with numbers. I.e run some benchmarks and show numbers under what
> cases the added complexity brings positives to the table.

Exactly in the patch implementing the feature:


          |  Extent tree  |  Skinny bg tree  |
----------------------------------------------
  nodes   |            55 |                1 |
  leaves  |          1025 |                7 |
  total   |          1080 |                8 |

That's 1T used fs with 4K node size, which has at least 1024 data block
groups.

The above number is the needed tree blocks to be iterated to read all
block group items.

Thanks,
Qu
> 
>>
>> Thanks,
>> Qu
>>