[v2,0/7] btrfs-progs: Support for BG_TREE feature

Message ID 20191008044936.157873-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Oct. 8, 2019, 4:49 a.m. UTC
This patchset can be fetched from github:
https://github.com/adam900710/btrfs-progs/tree/bg_tree
It is based on the v5.2.2 tag.

This patchset provides the needed user space infrastructure for BG_TREE
feature.

Since it's a new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
is needed to convert an existing (unmounted) fs to the new format.

Now btrfstune can convert a regular extent tree fs to a bg tree fs to
improve mount time.
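
A quick usage sketch (the mkfs option name comes from the patch subjects
below; the btrfstune switch shown is only an assumed placeholder, check
the btrfstune patch for the real interface):

  # create a new fs with the feature enabled
  mkfs.btrfs -O bg-tree /dev/sdX

  # convert an existing, unmounted fs offline (hypothetical switch)
  btrfstune -b /dev/sdX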

For the performance improvement, please check the kernel patchset cover
letter or the last patch.
(SPOILER ALERT: It's super fast)

Changelog:
v2:
- Rebase to v5.2.2 tag
- Add btrfstune ability to convert existing fs to BG_TREE feature

Qu Wenruo (7):
  btrfs-progs: Refactor excluded extent functions to use fs_info
  btrfs-progs: Refactor btrfs_read_block_groups()
  btrfs-progs: Enable read-write ability for 'bg_tree' feature
  btrfs-progs: mkfs: Introduce -O bg-tree
  btrfs-progs: dump-tree/dump-super: Introduce support for bg tree
  btrfs-progs: check: Introduce support for bg-tree feature
  btrfs-progs: btrfstune: Allow to enable bg-tree feature offline

 Documentation/btrfstune.asciidoc |   6 +
 btrfstune.c                      |  44 +++-
 check/main.c                     |   7 +-
 check/mode-lowmem.c              |   9 +-
 cmds/inspect-dump-super.c        |   3 +-
 cmds/inspect-dump-tree.c         |   5 +
 common/fsfeatures.c              |   6 +
 ctree.h                          |  18 +-
 disk-io.c                        |  29 ++-
 extent-tree.c                    | 364 +++++++++++++++++++++++--------
 mkfs/common.c                    |   6 +-
 mkfs/main.c                      |   9 +
 print-tree.c                     |   3 +
 transaction.c                    |   1 +
 14 files changed, 402 insertions(+), 108 deletions(-)

Comments

David Sterba Oct. 14, 2019, 3:17 p.m. UTC | #1
On Tue, Oct 08, 2019 at 12:49:29PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/bg_tree
> It is based on the v5.2.2 tag.
> 
> This patchset provides the needed user space infrastructure for BG_TREE
> feature.
> 
> Since it's a new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
> is needed to convert an existing (unmounted) fs to the new format.
> 
> Now btrfstune can convert a regular extent tree fs to a bg tree fs to
> improve mount time.

Have we settled the argument whether to use a new tree or key tricks for
the block group data? I think we have not, and I will read the previous
discussions. For a feature like this I want to be sure we understand all
the pros and cons.
Qu Wenruo Oct. 15, 2019, 12:32 a.m. UTC | #2
On 2019/10/14 11:17 p.m., David Sterba wrote:
> On Tue, Oct 08, 2019 at 12:49:29PM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/btrfs-progs/tree/bg_tree
>> It is based on the v5.2.2 tag.
>>
>> This patchset provides the needed user space infrastructure for BG_TREE
>> feature.
>>
>> Since it's a new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
>> is needed to convert an existing (unmounted) fs to the new format.
>>
>> Now btrfstune can convert a regular extent tree fs to a bg tree fs to
>> improve mount time.
> 
> Have we settled the argument whether to use a new tree or key tricks for
> the block group data? I think we have not, and I will read the previous
> discussions. For a feature like this I want to be sure we understand all
> the pros and cons.
> 
Yep, we haven't settled on whether to create a new tree or re-organize
the keys.

But as I said in the last discussion, I see no obvious pro in using the
existing extent tree to hold the new block group item keys, even if we
can pack them all together.

And for backup roots, indeed I forgot to add this feature.
But to me that's a minor point, not a show stopper.

The most important aspect to me is to allow real world users of super
large filesystems to try this feature, to prove the usefulness of this
design beyond my on-paper analysis.

That's why I'm pushing the patchset, even if it may not pass review.
I just want to keep an up-to-date branch so that anyone who needs the
patches can grab and try them.

Thanks,
Qu
David Sterba Oct. 16, 2019, 11:16 a.m. UTC | #3
On Tue, Oct 15, 2019 at 08:32:30AM +0800, Qu Wenruo wrote:
> > Have we settled the argument whether to use a new tree or key tricks for
> > the block group data? I think we have not, and I will read the previous
> > discussions. For a feature like this I want to be sure we understand all
> > the pros and cons.
> > 
> Yep, we haven't settled on whether to create a new tree or re-organize
> the keys.
> 
> But as I said in the last discussion, I see no obvious pro in using the
> existing extent tree to hold the new block group item keys, even if we
> can pack them all together.

For me the obvious pro is the minimal change to the existing set of trees.

> And for backup roots, indeed I forgot to add this feature.
> But to me that's a minor point, not a show stopper.
>
> The most important aspect to me is to allow real world users of super
> large filesystems to try this feature, to prove the usefulness of this
> design beyond my on-paper analysis.
> 
> That's why I'm pushing the patchset, even if it may not pass review.
> I just want to keep an up-to-date branch so that anyone who needs the
> patches can grab and try them.

Ok, that's fine, and I can add the branch to for-next for ease of testing.
I'm working on a prototype that does it the bg item key way; it compiles
and creates an almost correct filesystem, so I have to fix that before
posting. The patches are on top of your bg-tree feature so we could have
both in the same kernel for testing.
Qu Wenruo Oct. 16, 2019, 11:19 a.m. UTC | #4
On 2019/10/16 7:16 p.m., David Sterba wrote:
> On Tue, Oct 15, 2019 at 08:32:30AM +0800, Qu Wenruo wrote:
>>> Have we settled the argument whether to use a new tree or key tricks for
>>> the block group data? I think we have not, and I will read the previous
>>> discussions. For a feature like this I want to be sure we understand all
>>> the pros and cons.
>>>
>> Yep, we haven't settled on whether to create a new tree or re-organize
>> the keys.
>>
>> But as I said in the last discussion, I see no obvious pro in using the
>> existing extent tree to hold the new block group item keys, even if we
>> can pack them all together.
> 
> For me the obvious pro is the minimal change to the existing set of trees.

That's interesting.

And indeed, since we're dealing with one less tree, there is no chance
of causing the bug mentioned by Josef.
> 
>> And for backup roots, indeed I forgot to add this feature.
>> But to me that's a minor point, not a show stopper.
>>
>> The most important aspect to me is to allow real world users of super
>> large filesystems to try this feature, to prove the usefulness of this
>> design beyond my on-paper analysis.
>>
>> That's why I'm pushing the patchset, even if it may not pass review.
>> I just want to keep an up-to-date branch so that anyone who needs the
>> patches can grab and try them.
> 
> Ok, that's fine, and I can add the branch to for-next for ease of testing.
> I'm working on a prototype that does it the bg item key way; it compiles
> and creates an almost correct filesystem, so I have to fix that before
> posting. The patches are on top of your bg-tree feature so we could have
> both in the same kernel for testing.

That's great!

As long as we're pushing a solution to the mount time problem, I
couldn't be happier!

Then I guess no matter which version gets merged upstream, the
patchset is already meaningful.

Thanks,
Qu
David Sterba Oct. 18, 2019, 5:27 p.m. UTC | #5
On Wed, Oct 16, 2019 at 11:19:54AM +0000, Qu WenRuo wrote:
> >> The most important aspect to me is to allow real world users of super
> >> large filesystems to try this feature, to prove the usefulness of this
> >> design beyond my on-paper analysis.
> >>
> >> That's why I'm pushing the patchset, even if it may not pass review.
> >> I just want to keep an up-to-date branch so that anyone who needs the
> >> patches can grab and try them.
> > 
> > Ok, that's fine, and I can add the branch to for-next for ease of testing.
> > I'm working on a prototype that does it the bg item key way; it compiles
> > and creates an almost correct filesystem, so I have to fix that before
> > posting. The patches are on top of your bg-tree feature so we could have
> > both in the same kernel for testing.
> 
> That's great!
> 
> As long as we're pushing a solution to the mount time problem, I
> couldn't be happier!
> 
> Then I guess no matter which version gets merged upstream, the
> patchset is already meaningful.

We'll see what works in the end. I'm getting to the point where the
prototype almost works and am debugging weird problems or making sure
it's correct. So I'll dump the ideas here and link to the code so you
can have a look.

We agree on the point that the block group items must be packed. The key
approach should move the new BGI to the beginning, i.e. the key type is
smaller than anything that appears in the extent tree. I chose 100 for
the prototype; it could change.

To keep changes to a minimum, the new BGI uses the same block group item,
so the only difference then becomes how we search for the items.

Packing of the items is done by swapping the key objectid and offset.

Normal BGI has bg.start == key.objectid and bg.length == key.offset, and
the objectid is the thing that scatters the items all over the tree.

So the new BGI has bg.length == key.objectid and bg.start == key.offset.
As most block groups are of the same size, or from a small set of sizes,
they're packed.
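
A minimal sketch of the two layouts, with a simplified stand-in for
struct btrfs_key and the prototype type value 100 from above (the NEW
constant name is illustrative, not a settled on-disk value):

#include <stdint.h>

struct btrfs_key {
        uint64_t objectid;
        uint8_t type;
        uint64_t offset;
};

#define BTRFS_BLOCK_GROUP_ITEM_KEY      192     /* existing key type */
#define BTRFS_BLOCK_GROUP_ITEM_NEW_KEY  100     /* prototype value, may change */

/* Classic layout: the objectid is the bg start, so the items scatter. */
static void bgi_key(struct btrfs_key *key, uint64_t start, uint64_t length)
{
        key->objectid = start;
        key->type = BTRFS_BLOCK_GROUP_ITEM_KEY;
        key->offset = length;
}

/*
 * New layout: the objectid is the bg length; as most block groups share
 * only a few sizes, the items sort next to each other near the start of
 * the extent tree.
 */
static void bgi_new_key(struct btrfs_key *key, uint64_t start, uint64_t length)
{
        key->objectid = length;
        key->type = BTRFS_BLOCK_GROUP_ITEM_NEW_KEY;
        key->offset = start;
}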

The nice thing is that a lot of code can be shared between BGI and new
BGI; it just needs some care with searches, inserts and search key
advances.

I'm now stuck a bit at mkfs, where the 8M block groups are separated by
some METADATA_ITEMs

 item 0 key (8388608 BLOCK_GROUP_ITEM_NEW 13631488) itemoff 16259 itemsize 24
         block group NEW used 0 chunk_objectid 256 flags DATA
 item 1 key (8388608 BLOCK_GROUP_ITEM_NEW 22020096) itemoff 16235 itemsize 24
         block group NEW used 16384 chunk_objectid 256 flags SYSTEM|DUP
 item 2 key (22036480 METADATA_ITEM 0) itemoff 16202 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root CHUNK_TREE
 item 3 key (30408704 METADATA_ITEM 0) itemoff 16169 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root FS_TREE
 item 4 key (30474240 METADATA_ITEM 0) itemoff 16136 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root CSUM_TREE
 item 5 key (30490624 METADATA_ITEM 0) itemoff 16103 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root DATA_RELOC_TREE
 item 6 key (30507008 METADATA_ITEM 0) itemoff 16070 itemsize 33
         refs 1 gen 4 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root UUID_TREE
 item 7 key (30523392 METADATA_ITEM 0) itemoff 16037 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root EXTENT_TREE
 item 8 key (30539776 METADATA_ITEM 0) itemoff 16004 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root DEV_TREE
 item 9 key (30556160 METADATA_ITEM 0) itemoff 15971 itemsize 33
         refs 1 gen 5 flags TREE_BLOCK
         tree block skinny level 0
         tree block backref root ROOT_TREE
 item 10 key (107347968 BLOCK_GROUP_ITEM_NEW 30408704) itemoff 15947 itemsize 24
         block group NEW used 114688 chunk_objectid 256 flags METADATA|DUP

After item 10, the rest of the block groups would appear, and then
basically the rest of the extent tree with many other items.

I don't want to make hardcoded assumptions, e.g. that the maximum
objectid is 1G, but so far I have not been able to come up with a
generic and reliable formula for how to set up the key for the next
search so it reaches item (107347968 BLOCK_GROUP_ITEM_NEW 30408704) once
(8388608 BLOCK_GROUP_ITEM_NEW 22020096) has been processed.

The swapped objectid and offset is the hard part for search because we
lose the linearity of the block group start. Advancing the objectid by
one and searching again, i.e. (8388608+1 BGI_NEW 22020096), will land on
the next metadata item. Iterating the objectid would eventually reach
the 1G block group items, but what do we do after the last 1G item is
found and we want to decide whether to continue or not?
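
To make the failure mode concrete, a sketch of the naive advance, reusing
the simplified btrfs_key and the illustrative constant from the sketch
above:

#include <stdint.h>

struct btrfs_key {
        uint64_t objectid;
        uint8_t type;
        uint64_t offset;
};

#define BTRFS_BLOCK_GROUP_ITEM_NEW_KEY  100     /* prototype value */

/*
 * Naive advance after processing one new BGI. With the swapped layout
 * the objectid is the block group *length*, so objectid + 1 (the
 * 8388608+1 above) is not the next size class: the search lands on
 * whatever item sorts next (a METADATA_ITEM here), and nothing tells us
 * when the last size class has been visited.
 */
static void naive_advance(struct btrfs_key *key)
{
        key->objectid++;
        key->type = BTRFS_BLOCK_GROUP_ITEM_NEW_KEY;
        key->offset = 0;
}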

This would be easy with the bg_tree, because we'd know that all items in
the tree are just the block group items. Some sort of enumeration could
work for bg_key too, but I don't have something solid.

The WIP is in my github repository branch dev/bg-key. It's not on top of
the bg_tree branch for now. The kernel part will be very similar once
the progs side is done.

Feedback welcome.
Qu Wenruo Oct. 19, 2019, 12:04 a.m. UTC | #6
On 2019/10/19 1:27 a.m., David Sterba wrote:
> On Wed, Oct 16, 2019 at 11:19:54AM +0000, Qu WenRuo wrote:
>>>> The most important aspect to me is to allow real world users of super
>>>> large filesystems to try this feature, to prove the usefulness of this
>>>> design beyond my on-paper analysis.
>>>>
>>>> That's why I'm pushing the patchset, even if it may not pass review.
>>>> I just want to keep an up-to-date branch so that anyone who needs the
>>>> patches can grab and try them.
>>>
>>> Ok, that's fine, and I can add the branch to for-next for ease of testing.
>>> I'm working on a prototype that does it the bg item key way; it compiles
>>> and creates an almost correct filesystem, so I have to fix that before
>>> posting. The patches are on top of your bg-tree feature so we could have
>>> both in the same kernel for testing.
>>
>> That's great!
>>
>> As long as we're pushing a solution to the mount time problem, I
>> couldn't be happier!
>>
>> Then I guess no matter which version gets merged upstream, the
>> patchset is already meaningful.
> 
> We'll see what works in the end. I'm getting to the point where the
> prototype almost works and am debugging weird problems or making sure
> it's correct. So I'll dump the ideas here and link to the code so you
> can have a look.

That's wonderful.
Although I guess my patchset should provide hints on where to modify
the code, since there are only a limited number of places where we
modify the block group item.

> 
> We agree on the point that the block group items must be packed. The key
> approach should move the new BGI to the beginning, i.e. the key type is
> smaller than anything that appears in the extent tree. I chose 100 for
> the prototype; it could change.
> 
> To keep changes to a minimum, the new BGI uses the same block group item,
> so the only difference then becomes how we search for the items.

If we're introducing a new block group item, I hope to make a minor
change: remove the chunk_objectid member, or even flags, to make it
more compact, so that the BGI subtree becomes even smaller.

But I guess since you don't want to modify the BGI structure, and want
to keep the code modification minimal, it may not be a good idea right
now.
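
For reference, the current 24-byte on-disk item next to a sketch of the
trimmed-down variant suggested here (the compact struct is an assumption,
not a settled format; chunk_objectid is the constant 256 in practice, as
the dumps in this thread also show):

#include <stdint.h>

typedef uint64_t __le64;        /* stand-in for the on-disk LE type */

/* Current on-disk block group item (24 bytes). */
struct btrfs_block_group_item {
        __le64 used;
        __le64 chunk_objectid;
        __le64 flags;
} __attribute__((packed));

/* Hypothetical compact variant: only the used bytes remain (8 bytes). */
struct btrfs_block_group_item_compact {
        __le64 used;
} __attribute__((packed));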

> 
> Packing of the items is done by swapping the key objectid and offset.
> 
> Normal BGI has bg.start == key.objectid and bg.length == key.offset, and
> the objectid is the thing that scatters the items all over the tree.
> 
> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
> As most block groups are of the same size, or from a small set of sizes,
> they're packed.

That doesn't look optimized enough.

bg.length can be up to 1G, which means extents that start below 1G can
still sort before the BGIs.

I believe we should have a fixed objectid for these new BGIs, so that
they are guaranteed to be at the beginning of the extent tree.
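
A sketch of the fixed-objectid idea (constant and names illustrative,
key struct simplified as before):

#include <stdint.h>

struct btrfs_key {
        uint64_t objectid;
        uint8_t type;
        uint64_t offset;
};

#define BTRFS_BLOCK_GROUP_ITEM_NEW_KEY  100     /* prototype value */
#define BTRFS_BGI_NEW_OBJECTID          0ULL    /* any fixed, low value */

/*
 * All new BGIs share one objectid, so they form one contiguous run at
 * the very start of the extent tree regardless of block group size, and
 * a plain key advance on the offset enumerates them all.
 */
static void bgi_fixed_key(struct btrfs_key *key, uint64_t bg_start)
{
        key->objectid = BTRFS_BGI_NEW_OBJECTID;
        key->type = BTRFS_BLOCK_GROUP_ITEM_NEW_KEY;
        key->offset = bg_start; /* keeps the items ordered and unique */
}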

> 
> The nice thing is that a lot of code can be shared between BGI and new
> BGI; it just needs some care with searches, inserts and search key
> advances.

Exactly, but since we're introducing a new key type, I prefer to
perfect it: not only change the key, but also the block group item
structure, to make it more compact.

Although from the design aspect, it looks like the BG tree along with
the new BGI would be the best design.

The new BG key would go (bg start, NEW BGI TYPE, used) with no data. It
would provide the most compact on-disk format.
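
A sketch of that most compact form: the item payload disappears entirely
and the used bytes ride in the key offset (names illustrative as before).
One consequence worth noting is that every change of 'used' then means
deleting and re-inserting a key rather than updating an item in place,
which is presumably why this fits a dedicated bg tree better than the
extent tree:

#include <stdint.h>

struct btrfs_key {
        uint64_t objectid;
        uint8_t type;
        uint64_t offset;
};

#define BTRFS_BLOCK_GROUP_ITEM_NEW_KEY  100     /* illustrative */

/* Zero-payload variant: 17 key bytes per block group, item size 0. */
static void bgi_zero_payload_key(struct btrfs_key *key, uint64_t bg_start,
                                 uint64_t used)
{
        key->objectid = bg_start;       /* sorted by bg start again */
        key->type = BTRFS_BLOCK_GROUP_ITEM_NEW_KEY;
        key->offset = used;             /* no item body at all */
}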

> 
> I'm now stuck a bit at mkfs, where the 8M block groups are separated by
> some METADATA_ITEMs
> 
>  item 0 key (8388608 BLOCK_GROUP_ITEM_NEW 13631488) itemoff 16259 itemsize 24
>          block group NEW used 0 chunk_objectid 256 flags DATA
>  item 1 key (8388608 BLOCK_GROUP_ITEM_NEW 22020096) itemoff 16235 itemsize 24
>          block group NEW used 16384 chunk_objectid 256 flags SYSTEM|DUP
>  item 2 key (22036480 METADATA_ITEM 0) itemoff 16202 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root CHUNK_TREE
>  item 3 key (30408704 METADATA_ITEM 0) itemoff 16169 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root FS_TREE

Exactly the problem I described.


>  item 4 key (30474240 METADATA_ITEM 0) itemoff 16136 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root CSUM_TREE
>  item 5 key (30490624 METADATA_ITEM 0) itemoff 16103 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root DATA_RELOC_TREE
>  item 6 key (30507008 METADATA_ITEM 0) itemoff 16070 itemsize 33
>          refs 1 gen 4 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root UUID_TREE
>  item 7 key (30523392 METADATA_ITEM 0) itemoff 16037 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root EXTENT_TREE
>  item 8 key (30539776 METADATA_ITEM 0) itemoff 16004 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root DEV_TREE
>  item 9 key (30556160 METADATA_ITEM 0) itemoff 15971 itemsize 33
>          refs 1 gen 5 flags TREE_BLOCK
>          tree block skinny level 0
>          tree block backref root ROOT_TREE
>  item 10 key (107347968 BLOCK_GROUP_ITEM_NEW 30408704) itemoff 15947 itemsize 24
>          block group NEW used 114688 chunk_objectid 256 flags METADATA|DUP
> 
> After item 10, the rest of the block groups would appear, and then
> basically the rest of the extent tree with many other items.
> 
> I don't want to make hardcoded assumptions, e.g. that the maximum
> objectid is 1G, but so far I have not been able to come up with a
> generic and reliable formula for how to set up the key for the next
> search so it reaches item (107347968 BLOCK_GROUP_ITEM_NEW 30408704) once
> (8388608 BLOCK_GROUP_ITEM_NEW 22020096) has been processed.
> 
> The swapped objectid and offset is the hard part for search because we
> lose the linearity of the block group start. Advancing the objectid by
> one and searching again, i.e. (8388608+1 BGI_NEW 22020096), will land on
> the next metadata item. Iterating the objectid would eventually reach
> the 1G block group items, but what do we do after the last 1G item is
> found and we want to decide whether to continue or not?
> 
> This would be easy with the bg_tree, because we'd know that all items in
> the tree are just the block group items. Some sort of enumeration could
> work for bg_key too, but I don't have something solid.

Why not use a fixed objectid for the BGI and just ignore the bg.len part?

We have chunk<->BGI verification code, so having no bg.len is not a
problem at all; we can still make sure chunk<->bg is 1:1 mapped and
still verify bg.used.
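
A sketch of that verification idea; lookup_chunk() is a hypothetical
helper standing in for a chunk mapping search:

#include <stdint.h>

struct chunk_map {
        uint64_t start;
        uint64_t length;
};

/* hypothetical: find the chunk starting at @start, or return NULL */
extern struct chunk_map *lookup_chunk(uint64_t start);

/*
 * bg.len is redundant because block groups map 1:1 to chunks: the
 * length can always be recovered from the chunk, and a missing or
 * misaligned chunk is exactly the corruption the verification catches.
 */
static int bg_length_from_chunk(uint64_t bg_start, uint64_t *length)
{
        struct chunk_map *map = lookup_chunk(bg_start);

        if (!map || map->start != bg_start)
                return -1;
        *length = map->length;
        return 0;
}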

Thanks,
Qu

> 
> The WIP is in my github repository branch dev/bg-key. It's not on top of
> the bg_tree branch for now. The kernel part will be very similar once
> the progs side is done.
> 
> Feedback welcome.
>
David Sterba Oct. 21, 2019, 3:44 p.m. UTC | #7
On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
> That's wonderful.
> Although I guess my patchset should provide hints on where to modify
> the code, since there are only a limited number of places where we
> modify the block group item.

I indeed started at your patchset and grepped for BG_TREE, adding
another branch.

> > We agree on the point that the block group items must be packed. The key
> > approach should move the new BGI to the beginning, i.e. the key type is
> > smaller than anything that appears in the extent tree. I chose 100 for
> > the prototype; it could change.
> > 
> > To keep changes to a minimum, the new BGI uses the same block group item,
> > so the only difference then becomes how we search for the items.
> 
> If we're introducing a new block group item, I hope to make a minor
> change: remove the chunk_objectid member, or even flags, to make it
> more compact, so that the BGI subtree becomes even smaller.

Yeah that can be done.

> But I guess since you don't want to modify the BGI structure, and want
> to keep the code modification minimal, it may not be a good idea right
> now.

As long as the changes are bearable, e.g. a minor refactoring of block
group access (the cache.key serving as offset and length) is fine. And
if we can make the b-tree item more compact then let's do it.

> > Packing of the items is done by swapping the key objectid and offset.
> > 
> > Normal BGI has bg.start == key.objectid and bg.length == key.offset, and
> > the objectid is the thing that scatters the items all over the tree.
> > 
> > So the new BGI has bg.length == key.objectid and bg.start == key.offset.
> > As most block groups are of the same size, or from a small set of sizes,
> > they're packed.
> 
> That doesn't look optimized enough.
> 
> bg.length can be up to 1G, which means extents that start below 1G can
> still sort before the BGIs.

This should not be a big problem, the items are still grouped together.
Mkfs does 8M, we can have 256M or 1G. On average there could be several
packed groups, which I think is fine, and the estimated overhead would be
a few more seeks.

> I believe we should have a fixed objectid for these new BGIs, so that
> they are guaranteed to be at the beginning of the extent tree.

That was my idea too, but that also requires adding one more member to
the item to track the length. Currently the key saves those bytes. With
the proposed change to drop chunk_objectid, the overall size of BGI
would not increase, so this still sounds ok. And all the problems with
searching would go away.

> > The nice thing is that a lot of code can be shared between BGI and new
> > BGI; it just needs some care with searches, inserts and search key
> > advances.
> 
> Exactly, but since we're introducing a new key type, I prefer to
> perfect it: not only change the key, but also the block group item
> structure, to make it more compact.
> 
> Although from the design aspect, it looks like the BG tree along with
> the new BGI would be the best design.
> 
> The new BG key would go (bg start, NEW BGI TYPE, used) with no data. It
> would provide the most compact on-disk format.

That's very compact. Given the 'bg start' objectid, we can't use the
same key for the extent tree item.

> > This would be easy with the bg_tree, because we'd know that all items in
> > the tree are just the block group items. Some sort of enumeration could
> > work for bg_key too, but I don't have something solid.
> 
> Why not use a fixed objectid for the BGI and just ignore the bg.len part?

I wanted to avoid storing a useless number, as it costs 8 bytes per item,
and the simple swap of objectid/offset was the first idea for avoiding it.

> We have chunk<->BGI verification code, so having no bg.len is not a
> problem at all; we can still make sure chunk<->bg is 1:1 mapped and
> still verify bg.used.

This is all great, sure, and makes the extensions easy. Let's try to
work out the best of each approach we have so far. Having a separate tree
for block groups may open some future options.
Qu Wenruo Oct. 22, 2019, 12:49 a.m. UTC | #8
On 2019/10/21 11:44 p.m., David Sterba wrote:
> On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
>> That's wonderful.
>> Although I guess my patchset should provide hints on where to modify
>> the code, since there are only a limited number of places where we
>> modify the block group item.
> 
> I indeed started at your patchset and grepped for BG_TREE, adding
> another branch.
> 
>>> We agree on the point that the block group items must be packed. The key
>>> approach should move the new BGI to the beginning, i.e. the key type is
>>> smaller than anything that appears in the extent tree. I chose 100 for
>>> the prototype; it could change.
>>>
>>> To keep changes to a minimum, the new BGI uses the same block group item,
>>> so the only difference then becomes how we search for the items.
>>
>> If we're introducing a new block group item, I hope to make a minor
>> change: remove the chunk_objectid member, or even flags, to make it
>> more compact, so that the BGI subtree becomes even smaller.
> 
> Yeah that can be done.
> 
>> But I guess since you don't want to modify the BGI structure, and want
>> to keep the code modification minimal, it may not be a good idea right
>> now.
> 
> As long as the changes are bearable, e.g. a minor refactoring of block
> group access (the cache.key serving as offset and length) is fine. And
> if we can make the b-tree item more compact then let's do it.
> 
>>> Packing of the items is done by swapping the key objectid and offset.
>>>
>>> Normal BGI has bg.start == key.objectid and bg.length == key.offset, and
>>> the objectid is the thing that scatters the items all over the tree.
>>>
>>> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
>>> As most block groups are of the same size, or from a small set of sizes,
>>> they're packed.
>>
>> That doesn't look optimized enough.
>>
>> bg.length can be up to 1G, which means extents that start below 1G can
>> still sort before the BGIs.
> 
> This should not be a big problem, the items are still grouped together.
> Mkfs does 8M, we can have 256M or 1G. On average there could be several
> packed groups, which I think is fine, and the estimated overhead would be
> a few more seeks.
> 
>> I believe we should have a fixed objectid for these new BGIs, so that
>> they are guaranteed to be at the beginning of the extent tree.
> 
> That was my idea too, but that also requires adding one more member to
> the item to track the length. Currently the key saves those bytes. With
> the proposed change to drop chunk_objectid, the overall size of BGI
> would not increase, so this still sounds ok. And all the problems with
> searching would go away.
> 
>>> The nice thing is that a lot of code can be shared between BGI and new
>>> BGI; it just needs some care with searches, inserts and search key
>>> advances.
>>
>> Exactly, but since we're introducing a new key type, I prefer to
>> perfect it: not only change the key, but also the block group item
>> structure, to make it more compact.
>>
>> Although from the design aspect, it looks like the BG tree along with
>> the new BGI would be the best design.
>>
>> The new BG key would go (bg start, NEW BGI TYPE, used) with no data. It
>> would provide the most compact on-disk format.
> 
> That's very compact. Given the 'bg start' objectid, we can't use the
> same key for the extent tree item.
> 
>>> This would be easy with the bg_tree, because we'd know that all items in
>>> the tree are just the block group items. Some sort of enumeration could
>>> work for bg_key too, but I don't have something solid.
>>
>> Why not use a fixed objectid for the BGI and just ignore the bg.len part?
> 
> I wanted to avoid storing a useless number, as it costs 8 bytes per item,
> and the simple swap of objectid/offset was the first idea for avoiding it.
> 
>> We have chunk<->BGI verification code, so having no bg.len is not a
>> problem at all; we can still make sure chunk<->bg is 1:1 mapped and
>> still verify bg.used.
> 
> This is all great, sure, and makes the extensions easy. Let's try to
> work out the best of each approach we have so far. Having a separate tree
> for block groups may open some future options.

Great, I'll explore the most compact key-only method with BG_TREE.

And maybe also try the fixed key objectid solution, just dropping the
bg.length, while keeping the current BGI structure.

Thanks,
Qu
Qu Wenruo Oct. 22, 2019, 6:30 a.m. UTC | #9
BTW, there is one important compatibility problem related to all the
BGI-related features.

Although I'm marking the BG_TREE feature as an incompatible feature, in
theory it should be RO compatible.

Except for the extent/bg tree, we *should* be able to read the fs
without any problem.

But the problem is, the current btrfs mount (including btrfs-check)
still needs to go through the block group item search, even for a
permanent RO mount.

This is where my rescue mount option patchset gets involved.
If we had such a skip_bg feature earlier, we could afford to make all
these potential features RO compatible.


Now my question is: should we still mark this feature as an
incompatible feature?

Thanks,
Qu


On 2019/10/22 8:49 a.m., Qu Wenruo wrote:
> 
> 
> On 2019/10/21 下午11:44, David Sterba wrote:
>> On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
>>> That's wonderful.
>>> Although I guess my patchset should provide hints on where to modify
>>> the code, since there are only a limited number of places where we
>>> modify the block group item.
>>
>> I indeed started at your patchset and grepped for BG_TREE, adding
>> another branch.
>>
>>>> We agree on the point that the block group items must be packed. The key
>>>> approach should move the new BGI to the beginning, i.e. the key type is
>>>> smaller than anything that appears in the extent tree. I chose 100 for
>>>> the prototype; it could change.
>>>>
>>>> To keep changes to a minimum, the new BGI uses the same block group item,
>>>> so the only difference then becomes how we search for the items.
>>>
>>> If we're introducing a new block group item, I hope to make a minor
>>> change: remove the chunk_objectid member, or even flags, to make it
>>> more compact, so that the BGI subtree becomes even smaller.
>>
>> Yeah that can be done.
>>
>>> But I guess since you don't want to modify the BGI structure, and want
>>> to keep the code modification minimal, it may not be a good idea right
>>> now.
>>
>> As long as the changes are bearable, e.g. a minor refactoring of block
>> group access (the cache.key serving as offset and length) is fine. And
>> if we can make the b-tree item more compact then let's do it.
>>
>>>> Packing of the items is done by swapping the key objectid and offset.
>>>>
>>>> Normal BGI has bg.start == key.objectid and bg.length == key.offset, and
>>>> the objectid is the thing that scatters the items all over the tree.
>>>>
>>>> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
>>>> As most block groups are of the same size, or from a small set of sizes,
>>>> they're packed.
>>>
>>> That doesn't look optimized enough.
>>>
>>> bg.length can be up to 1G, which means extents that start below 1G can
>>> still sort before the BGIs.
>>
>> This should not be a big problem, the items are still grouped together.
>> Mkfs does 8M, we can have 256M or 1G. On average there could be several
>> packed groups, which I think is fine, and the estimated overhead would be
>> a few more seeks.
>>
>>> I believe we should have a fixed objectid for these new BGIs, so that
>>> they are guaranteed to be at the beginning of the extent tree.
>>
>> That was my idea too, but that also requires adding one more member to
>> the item to track the length. Currently the key saves those bytes. With
>> the proposed change to drop chunk_objectid, the overall size of BGI
>> would not increase, so this still sounds ok. And all the problems with
>> searching would go away.
>>
>>>> The nice thing is that a lot of code can be shared between BGI and new
>>>> BGI; it just needs some care with searches, inserts and search key
>>>> advances.
>>>
>>> Exactly, but since we're introducing a new key type, I prefer to
>>> perfect it: not only change the key, but also the block group item
>>> structure, to make it more compact.
>>>
>>> Although from the design aspect, it looks like the BG tree along with
>>> the new BGI would be the best design.
>>>
>>> The new BG key would go (bg start, NEW BGI TYPE, used) with no data. It
>>> would provide the most compact on-disk format.
>>
>> That's very compact. Given the 'bg start' objectid, we can't use the
>> same key for the extent tree item.
>>
>>>> This would be easy with the bg_tree, because we'd know that all items in
>>>> the tree are just the block group items. Some sort of enumeration could
>>>> work for bg_key too, but I don't have something solid.
>>>
>>> Why not use a fixed objectid for the BGI and just ignore the bg.len part?
>>
>> I wanted to avoid storing a useless number, as it costs 8 bytes per item,
>> and the simple swap of objectid/offset was the first idea for avoiding it.
>>
>>> We have chunk<->BGI verification code, so having no bg.len is not a
>>> problem at all; we can still make sure chunk<->bg is 1:1 mapped and
>>> still verify bg.used.
>>
>> This is all great, sure, and makes the extensions easy. Let's try to
>> work out the best of each approach we have so far. Having a separate tree
>> for block groups may open some future options.
> 
> Great, I'll explore the most compact key-only method with BG_TREE.
> 
> And maybe also try the fixed key objectid solution, just dropping the
> bg.length, while keeping the current BGI structure.
> 
> Thanks,
> Qu
>
David Sterba Oct. 22, 2019, 12:23 p.m. UTC | #10
On Tue, Oct 22, 2019 at 02:30:22PM +0800, Qu Wenruo wrote:
> BTW, there is one important compatibility problem related to all the
> BGI-related features.
> 
> Although I'm marking the BG_TREE feature as an incompatible feature, in
> theory it should be RO compatible.

It could be RO compatible, yes.

> Except for the extent/bg tree, we *should* be able to read the fs
> without any problem.
> 
> But the problem is, the current btrfs mount (including btrfs-check)
> still needs to go through the block group item search, even for a
> permanent RO mount.
> 
> This is where my rescue mount option patchset gets involved.
> If we had such a skip_bg feature earlier, we could afford to make all
> these potential features RO compatible.
> 
> 
> Now my question is: should we still mark this feature as an
> incompatible feature?

In some way it would probably have to be incompat, either full or RO.
Since unknown tree items are ignored, if the rest of the filesystem
provides enough information to access the data, then incompat RO sounds
like the best option. And that's probably independent of how exactly the
new BGI is done.
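
A sketch of the mount-time semantics of the two feature classes (a
simplification of the superblock feature checks, not the exact kernel
code):

#include <stdint.h>

/*
 * Full incompat: a kernel that doesn't know the bit refuses to mount at
 * all. RO compat: an old kernel may still mount read-only; only a
 * writable mount is refused.
 */
static int may_mount(uint64_t sb_incompat, uint64_t sb_compat_ro,
                     uint64_t known_incompat, uint64_t known_compat_ro,
                     int readonly)
{
        if (sb_incompat & ~known_incompat)
                return 0;
        if ((sb_compat_ro & ~known_compat_ro) && !readonly)
                return 0;
        return 1;
}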
Qu Wenruo Oct. 22, 2019, 12:27 p.m. UTC | #11
On 2019/10/22 8:23 p.m., David Sterba wrote:
> On Tue, Oct 22, 2019 at 02:30:22PM +0800, Qu Wenruo wrote:
>> BTW, there is one important compatibility problem related to all the
>> BGI-related features.
>>
>> Although I'm marking the BG_TREE feature as an incompatible feature, in
>> theory it should be RO compatible.
> 
> It could be RO compatible, yes.
> 
>> Except for the extent/bg tree, we *should* be able to read the fs
>> without any problem.
>>
>> But the problem is, the current btrfs mount (including btrfs-check)
>> still needs to go through the block group item search, even for a
>> permanent RO mount.
>>
>> This is where my rescue mount option patchset gets involved.
>> If we had such a skip_bg feature earlier, we could afford to make all
>> these potential features RO compatible.
>>
>>
>> Now my question is: should we still mark this feature as an
>> incompatible feature?
> 
> In some way it would probably have to be incompat, either full or RO.
> Since unknown tree items are ignored, if the rest of the filesystem
> provides enough information to access the data, then incompat RO sounds
> like the best option. And that's probably independent of how exactly
> the new BGI is done.
> 
But the problem is, an older kernel can't mount it at all, no matter RO or RW.

It's completely dependent on how an older kernel handles missing block
group items.

If the current kernel had something like skip_bg, or fell back to RO +
skip_bg by default, then it would really be RO compatible.
But we all know how picky current btrfs is about missing block group
items...

So, I guess we will end up with a RO compatible feature that can't
really be mounted RO by older kernels?

Thanks,
Qu