mbox series

[0/2] Use new incompat feature BG_TREE to hugely reduce mount time

Message ID 20190102052945.16325-1-wqu@suse.com (mailing list archive)
Headers show
Series Use new incompat feature BG_TREE to hugely reduce mount time | expand

Message

Qu Wenruo Jan. 2, 2019, 5:29 a.m. UTC
This patchset can be fetched from:
https://github.com/adam900710/linux/tree/bg_tree
Which is based on v4.20-rc1 tag.

This patchset will hugely reduce mount time of large fs by putting all
block group items into its own tree.

The old behavior will try to read out all block group items at mount
time, however due to the key of block group items are scattered across
tons of extent items, we must call btrfs_search_slot() for each block
group.

It works fine for small fs, but when number of block groups goes beyond
200, such tree search will become a random read, causing obvious slow
down.

On the other hand, btrfs_read_chunk_tree() is still very fast, since we
put CHUNK_ITEMS into their own tree and package them next to each other.


Following this idea, we could do the same thing for block group items,
so instead of triggering btrfs_search_slot() for each block group, we
just call btrfs_next_item() and under most case we could finish in
memory, and hugely speed up mount (see BENCHMARK below).

The only disadvantage is, this method introduce an incompatible feature,
so existing fs can't use this feature directly.
Either specify it at mkfs time, or use btrfs-progs offline convert tool
(*).

*: Mkfs and convert tool are doing the same work, however I haven't
decide if I should put this feature to btrfstune.

[[Benchmark]]
Physical device:	HDD (7200RPM)
Nodesize:		4K  (to bump up tree height)
Used size:		250G
Total size:		500G
Extent data size:	1M

All file extents on disk is in 1M size, ensured by using fallocate.

Without patchset:
Use ftrace function graph:

  3)               |  open_ctree [btrfs]() {
  3)               |    btrfs_read_chunk_tree [btrfs]() {
  3) * 69033.31 us |    }
  3)               |    btrfs_verify_dev_extents [btrfs]() {
  3) * 90376.15 us |    }
  3)               |    btrfs_read_block_groups [btrfs]() {
  2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
  2) $ 3168384 us  |  } /* open_ctree [btrfs] */

 btrfs_read_block_groups() takes 87% of the total mount time,

With patchset, and use -O bg-tree mkfs option:
  7)               |  open_ctree [btrfs]() {
  7)               |    btrfs_read_chunk_tree [btrfs]() {
  7) # 2448.562 us |    }
  7)               |    btrfs_verify_dev_extents [btrfs]() {
  7) * 19802.02 us |    }
  7)               |    btrfs_read_block_groups [btrfs]() {
  7) # 8610.397 us |    }
  7) @ 113498.6 us |  }

  open_ctree() time is only 3% of original mount time.
  And btrfs_read_block_groups() only takes 7.6% of total open_ctree()
  execution time.

Changelog:
RFC->v1:
 - Fix memory leak for fs_info->bg_root at module unload time.
 - Add sysfs features interface.
 - Testing shows no regression, so no RFC tag now.

Qu Wenruo (2):
  btrfs: Refactor btrfs_read_block_groups()
  btrfs: Introduce new incompat feature, BG_TREE

 fs/btrfs/ctree.h                |   5 +-
 fs/btrfs/disk-io.c              |  13 ++
 fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
 fs/btrfs/sysfs.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   3 +
 6 files changed, 208 insertions(+), 116 deletions(-)

Comments

Qu Wenruo Jan. 8, 2019, 7:34 a.m. UTC | #1
Future proof benchmark for bg-tree.

The benchmark in the cover letter is in fact a pretty bad case for
original mode, with a lot of EXTENT_ITEMs bumping up extent tree height,
along with small node size to make searching extent tree super slow.
(Although that doesn't make any difference for bg-tree).


Now here is the best case scenario for original mode.
Just using plain fallocate to fill 12T, so every EXTENT_DATA will be at
its maximum size, causing minimal noise for block group items iteration.

For short conclusion:
Bg-tree still faster than best case original extent tree, by around 30%.
Bg-tree should be fast enough to mount the 12T filled fs less than *ONE*
*SECOND*.

And due to the fact that bg-tree doesn't care how crowded the extent
tree is, its block group iteration time is just O(N).
So for 12T filled fs, bg-tree should always mount around the same speed,
no matter how many extents are.

For full details, please check the sheet:

https://docs.google.com/spreadsheets/d/1YfZXiGoL9EoPcWGh0x1gIldS1ymsPgqZFQdngM-Iep0/edit?usp=sharing


Thanks,
Qu

On 2019/1/2 下午1:29, Qu Wenruo wrote:
> This patchset can be fetched from:
> https://github.com/adam900710/linux/tree/bg_tree
> Which is based on v4.20-rc1 tag.
> 
> This patchset will hugely reduce mount time of large fs by putting all
> block group items into its own tree.
> 
> The old behavior will try to read out all block group items at mount
> time, however due to the key of block group items are scattered across
> tons of extent items, we must call btrfs_search_slot() for each block
> group.
> 
> It works fine for small fs, but when number of block groups goes beyond
> 200, such tree search will become a random read, causing obvious slow
> down.
> 
> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
> put CHUNK_ITEMS into their own tree and package them next to each other.
> 
> 
> Following this idea, we could do the same thing for block group items,
> so instead of triggering btrfs_search_slot() for each block group, we
> just call btrfs_next_item() and under most case we could finish in
> memory, and hugely speed up mount (see BENCHMARK below).
> 
> The only disadvantage is, this method introduce an incompatible feature,
> so existing fs can't use this feature directly.
> Either specify it at mkfs time, or use btrfs-progs offline convert tool
> (*).
> 
> *: Mkfs and convert tool are doing the same work, however I haven't
> decide if I should put this feature to btrfstune.
> 
> [[Benchmark]]
> Physical device:	HDD (7200RPM)
> Nodesize:		4K  (to bump up tree height)
> Used size:		250G
> Total size:		500G
> Extent data size:	1M
> 
> All file extents on disk is in 1M size, ensured by using fallocate.
> 
> Without patchset:
> Use ftrace function graph:
> 
>   3)               |  open_ctree [btrfs]() {
>   3)               |    btrfs_read_chunk_tree [btrfs]() {
>   3) * 69033.31 us |    }
>   3)               |    btrfs_verify_dev_extents [btrfs]() {
>   3) * 90376.15 us |    }
>   3)               |    btrfs_read_block_groups [btrfs]() {
>   2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
>   2) $ 3168384 us  |  } /* open_ctree [btrfs] */
> 
>  btrfs_read_block_groups() takes 87% of the total mount time,
> 
> With patchset, and use -O bg-tree mkfs option:
>   7)               |  open_ctree [btrfs]() {
>   7)               |    btrfs_read_chunk_tree [btrfs]() {
>   7) # 2448.562 us |    }
>   7)               |    btrfs_verify_dev_extents [btrfs]() {
>   7) * 19802.02 us |    }
>   7)               |    btrfs_read_block_groups [btrfs]() {
>   7) # 8610.397 us |    }
>   7) @ 113498.6 us |  }
> 
>   open_ctree() time is only 3% of original mount time.
>   And btrfs_read_block_groups() only takes 7.6% of total open_ctree()
>   execution time.
> 
> Changelog:
> RFC->v1:
>  - Fix memory leak for fs_info->bg_root at module unload time.
>  - Add sysfs features interface.
>  - Testing shows no regression, so no RFC tag now.
> 
> Qu Wenruo (2):
>   btrfs: Refactor btrfs_read_block_groups()
>   btrfs: Introduce new incompat feature, BG_TREE
> 
>  fs/btrfs/ctree.h                |   5 +-
>  fs/btrfs/disk-io.c              |  13 ++
>  fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
>  fs/btrfs/sysfs.c                |   2 +
>  include/uapi/linux/btrfs.h      |   1 +
>  include/uapi/linux/btrfs_tree.h |   3 +
>  6 files changed, 208 insertions(+), 116 deletions(-)
>
Hans van Kranenburg Feb. 25, 2019, 1:02 p.m. UTC | #2
Hi Qu,

On 1/2/19 6:29 AM, Qu Wenruo wrote:
> This patchset can be fetched from:
> https://github.com/adam900710/linux/tree/bg_tree
> Which is based on v4.20-rc1 tag.
> 
> This patchset will hugely reduce mount time of large fs by putting all
> block group items into its own tree.

I'm going to test this.

Is there some specific command you'd like me to use to produce results
for benchmarks?

> The old behavior will try to read out all block group items at mount
> time, however due to the key of block group items are scattered across
> tons of extent items, we must call btrfs_search_slot() for each block
> group.
> 
> It works fine for small fs, but when number of block groups goes beyond
> 200, such tree search will become a random read, causing obvious slow
> down.
> 
> On the other hand, btrfs_read_chunk_tree() is still very fast, since we
> put CHUNK_ITEMS into their own tree and package them next to each other.
> 
> 
> Following this idea, we could do the same thing for block group items,
> so instead of triggering btrfs_search_slot() for each block group, we
> just call btrfs_next_item() and under most case we could finish in
> memory, and hugely speed up mount (see BENCHMARK below).
> 
> The only disadvantage is, this method introduce an incompatible feature,
> so existing fs can't use this feature directly.
> Either specify it at mkfs time, or use btrfs-progs offline convert tool
> (*).
> 
> *: Mkfs and convert tool are doing the same work, however I haven't
> decide if I should put this feature to btrfstune.
> 
> [[Benchmark]]
> Physical device:	HDD (7200RPM)
> Nodesize:		4K  (to bump up tree height)
> Used size:		250G
> Total size:		500G
> Extent data size:	1M
> 
> All file extents on disk is in 1M size, ensured by using fallocate.
> 
> Without patchset:
> Use ftrace function graph:
> 
>   3)               |  open_ctree [btrfs]() {
>   3)               |    btrfs_read_chunk_tree [btrfs]() {
>   3) * 69033.31 us |    }
>   3)               |    btrfs_verify_dev_extents [btrfs]() {
>   3) * 90376.15 us |    }
>   3)               |    btrfs_read_block_groups [btrfs]() {
>   2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
>   2) $ 3168384 us  |  } /* open_ctree [btrfs] */
> 
>  btrfs_read_block_groups() takes 87% of the total mount time,
> 
> With patchset, and use -O bg-tree mkfs option:
>   7)               |  open_ctree [btrfs]() {
>   7)               |    btrfs_read_chunk_tree [btrfs]() {
>   7) # 2448.562 us |    }
>   7)               |    btrfs_verify_dev_extents [btrfs]() {
>   7) * 19802.02 us |    }
>   7)               |    btrfs_read_block_groups [btrfs]() {
>   7) # 8610.397 us |    }
>   7) @ 113498.6 us |  }
> 
>   open_ctree() time is only 3% of original mount time.
>   And btrfs_read_block_groups() only takes 7.6% of total open_ctree()
>   execution time.
> 
> Changelog:
> RFC->v1:
>  - Fix memory leak for fs_info->bg_root at module unload time.
>  - Add sysfs features interface.
>  - Testing shows no regression, so no RFC tag now.
> 
> Qu Wenruo (2):
>   btrfs: Refactor btrfs_read_block_groups()
>   btrfs: Introduce new incompat feature, BG_TREE
> 
>  fs/btrfs/ctree.h                |   5 +-
>  fs/btrfs/disk-io.c              |  13 ++
>  fs/btrfs/extent-tree.c          | 300 ++++++++++++++++++++------------
>  fs/btrfs/sysfs.c                |   2 +
>  include/uapi/linux/btrfs.h      |   1 +
>  include/uapi/linux/btrfs_tree.h |   3 +
>  6 files changed, 208 insertions(+), 116 deletions(-)
>
Qu Wenruo Feb. 25, 2019, 1:10 p.m. UTC | #3
On 2019/2/25 下午9:02, Hans van Kranenburg wrote:
> Hi Qu,
> 
> On 1/2/19 6:29 AM, Qu Wenruo wrote:
>> This patchset can be fetched from:
>> https://github.com/adam900710/linux/tree/bg_tree
>> Which is based on v4.20-rc1 tag.
>>
>> This patchset will hugely reduce mount time of large fs by putting all
>> block group items into its own tree.
> 
> I'm going to test this.
> 
> Is there some specific command you'd like me to use to produce results
> for benchmarks?

Nope. If I provides some commands, those must be super beneficial for
BG_TREE.

So just use whatever you like, near real world use case would be more
appreciated.

Thanks,
Qu
Hans van Kranenburg Feb. 25, 2019, 1:16 p.m. UTC | #4
On 2/25/19 2:10 PM, Qu Wenruo wrote:
> 
> 
> On 2019/2/25 下午9:02, Hans van Kranenburg wrote:
>> Hi Qu,
>>
>> On 1/2/19 6:29 AM, Qu Wenruo wrote:
>>> This patchset can be fetched from:
>>> https://github.com/adam900710/linux/tree/bg_tree
>>> Which is based on v4.20-rc1 tag.
>>>
>>> This patchset will hugely reduce mount time of large fs by putting all
>>> block group items into its own tree.
>>
>> I'm going to test this.
>>
>> Is there some specific command you'd like me to use to produce results
>> for benchmarks?
> 
> Nope. If I provides some commands, those must be super beneficial for
> BG_TREE.

Ah, I meant, to produce output for the mount timings, like your 'ftrace
function graph'.

> So just use whatever you like, near real world use case would be more
> appreciated.

Hans
Qu Wenruo Feb. 25, 2019, 2:46 p.m. UTC | #5
On 2019/2/25 下午9:16, Hans van Kranenburg wrote:
> On 2/25/19 2:10 PM, Qu Wenruo wrote:
>>
>>
>> On 2019/2/25 下午9:02, Hans van Kranenburg wrote:
>>> Hi Qu,
>>>
>>> On 1/2/19 6:29 AM, Qu Wenruo wrote:
>>>> This patchset can be fetched from:
>>>> https://github.com/adam900710/linux/tree/bg_tree
>>>> Which is based on v4.20-rc1 tag.
>>>>
>>>> This patchset will hugely reduce mount time of large fs by putting all
>>>> block group items into its own tree.
>>>
>>> I'm going to test this.
>>>
>>> Is there some specific command you'd like me to use to produce results
>>> for benchmarks?
>>
>> Nope. If I provides some commands, those must be super beneficial for
>> BG_TREE.
> 
> Ah, I meant, to produce output for the mount timings, like your 'ftrace
> function graph'.

I don't have anything better than ftrace function graph so far.

I assume you may need extra scripts to parse such output to create some
beautiful graph.

Thanks,
Qu

> 
>> So just use whatever you like, near real world use case would be more
>> appreciated.
> 
> Hans
>