Message ID | 20190102052945.16325-1-wqu@suse.com (mailing list archive) |
---|---|
Headers | show |
Series | Use new incompat feature BG_TREE to hugely reduce mount time | expand |
Future proof benchmark for bg-tree. The benchmark in the cover letter is in fact a pretty bad case for original mode, with a lot of EXTENT_ITEMs bumping up extent tree height, along with small node size to make searching extent tree super slow. (Although that doesn't make any difference for bg-tree). Now here is the best case scenario for original mode. Just using plain fallocate to fill 12T, so every EXTENT_DATA will be at its maximum size, causing minimal noise for block group items iteration. For short conclusion: Bg-tree still faster than best case original extent tree, by around 30%. Bg-tree should be fast enough to mount the 12T filled fs less than *ONE* *SECOND*. And due to the fact that bg-tree doesn't care how crowded the extent tree is, its block group iteration time is just O(N). So for 12T filled fs, bg-tree should always mount around the same speed, no matter how many extents are. For full details, please check the sheet: https://docs.google.com/spreadsheets/d/1YfZXiGoL9EoPcWGh0x1gIldS1ymsPgqZFQdngM-Iep0/edit?usp=sharing Thanks, Qu On 2019/1/2 下午1:29, Qu Wenruo wrote: > This patchset can be fetched from: > https://github.com/adam900710/linux/tree/bg_tree > Which is based on v4.20-rc1 tag. > > This patchset will hugely reduce mount time of large fs by putting all > block group items into its own tree. > > The old behavior will try to read out all block group items at mount > time, however due to the key of block group items are scattered across > tons of extent items, we must call btrfs_search_slot() for each block > group. > > It works fine for small fs, but when number of block groups goes beyond > 200, such tree search will become a random read, causing obvious slow > down. > > On the other hand, btrfs_read_chunk_tree() is still very fast, since we > put CHUNK_ITEMS into their own tree and package them next to each other. > > > Following this idea, we could do the same thing for block group items, > so instead of triggering btrfs_search_slot() for each block group, we > just call btrfs_next_item() and under most case we could finish in > memory, and hugely speed up mount (see BENCHMARK below). > > The only disadvantage is, this method introduce an incompatible feature, > so existing fs can't use this feature directly. > Either specify it at mkfs time, or use btrfs-progs offline convert tool > (*). > > *: Mkfs and convert tool are doing the same work, however I haven't > decide if I should put this feature to btrfstune. > > [[Benchmark]] > Physical device: HDD (7200RPM) > Nodesize: 4K (to bump up tree height) > Used size: 250G > Total size: 500G > Extent data size: 1M > > All file extents on disk is in 1M size, ensured by using fallocate. > > Without patchset: > Use ftrace function graph: > > 3) | open_ctree [btrfs]() { > 3) | btrfs_read_chunk_tree [btrfs]() { > 3) * 69033.31 us | } > 3) | btrfs_verify_dev_extents [btrfs]() { > 3) * 90376.15 us | } > 3) | btrfs_read_block_groups [btrfs]() { > 2) $ 2733853 us | } /* btrfs_read_block_groups [btrfs] */ > 2) $ 3168384 us | } /* open_ctree [btrfs] */ > > btrfs_read_block_groups() takes 87% of the total mount time, > > With patchset, and use -O bg-tree mkfs option: > 7) | open_ctree [btrfs]() { > 7) | btrfs_read_chunk_tree [btrfs]() { > 7) # 2448.562 us | } > 7) | btrfs_verify_dev_extents [btrfs]() { > 7) * 19802.02 us | } > 7) | btrfs_read_block_groups [btrfs]() { > 7) # 8610.397 us | } > 7) @ 113498.6 us | } > > open_ctree() time is only 3% of original mount time. > And btrfs_read_block_groups() only takes 7.6% of total open_ctree() > execution time. > > Changelog: > RFC->v1: > - Fix memory leak for fs_info->bg_root at module unload time. > - Add sysfs features interface. > - Testing shows no regression, so no RFC tag now. > > Qu Wenruo (2): > btrfs: Refactor btrfs_read_block_groups() > btrfs: Introduce new incompat feature, BG_TREE > > fs/btrfs/ctree.h | 5 +- > fs/btrfs/disk-io.c | 13 ++ > fs/btrfs/extent-tree.c | 300 ++++++++++++++++++++------------ > fs/btrfs/sysfs.c | 2 + > include/uapi/linux/btrfs.h | 1 + > include/uapi/linux/btrfs_tree.h | 3 + > 6 files changed, 208 insertions(+), 116 deletions(-) >
Hi Qu, On 1/2/19 6:29 AM, Qu Wenruo wrote: > This patchset can be fetched from: > https://github.com/adam900710/linux/tree/bg_tree > Which is based on v4.20-rc1 tag. > > This patchset will hugely reduce mount time of large fs by putting all > block group items into its own tree. I'm going to test this. Is there some specific command you'd like me to use to produce results for benchmarks? > The old behavior will try to read out all block group items at mount > time, however due to the key of block group items are scattered across > tons of extent items, we must call btrfs_search_slot() for each block > group. > > It works fine for small fs, but when number of block groups goes beyond > 200, such tree search will become a random read, causing obvious slow > down. > > On the other hand, btrfs_read_chunk_tree() is still very fast, since we > put CHUNK_ITEMS into their own tree and package them next to each other. > > > Following this idea, we could do the same thing for block group items, > so instead of triggering btrfs_search_slot() for each block group, we > just call btrfs_next_item() and under most case we could finish in > memory, and hugely speed up mount (see BENCHMARK below). > > The only disadvantage is, this method introduce an incompatible feature, > so existing fs can't use this feature directly. > Either specify it at mkfs time, or use btrfs-progs offline convert tool > (*). > > *: Mkfs and convert tool are doing the same work, however I haven't > decide if I should put this feature to btrfstune. > > [[Benchmark]] > Physical device: HDD (7200RPM) > Nodesize: 4K (to bump up tree height) > Used size: 250G > Total size: 500G > Extent data size: 1M > > All file extents on disk is in 1M size, ensured by using fallocate. > > Without patchset: > Use ftrace function graph: > > 3) | open_ctree [btrfs]() { > 3) | btrfs_read_chunk_tree [btrfs]() { > 3) * 69033.31 us | } > 3) | btrfs_verify_dev_extents [btrfs]() { > 3) * 90376.15 us | } > 3) | btrfs_read_block_groups [btrfs]() { > 2) $ 2733853 us | } /* btrfs_read_block_groups [btrfs] */ > 2) $ 3168384 us | } /* open_ctree [btrfs] */ > > btrfs_read_block_groups() takes 87% of the total mount time, > > With patchset, and use -O bg-tree mkfs option: > 7) | open_ctree [btrfs]() { > 7) | btrfs_read_chunk_tree [btrfs]() { > 7) # 2448.562 us | } > 7) | btrfs_verify_dev_extents [btrfs]() { > 7) * 19802.02 us | } > 7) | btrfs_read_block_groups [btrfs]() { > 7) # 8610.397 us | } > 7) @ 113498.6 us | } > > open_ctree() time is only 3% of original mount time. > And btrfs_read_block_groups() only takes 7.6% of total open_ctree() > execution time. > > Changelog: > RFC->v1: > - Fix memory leak for fs_info->bg_root at module unload time. > - Add sysfs features interface. > - Testing shows no regression, so no RFC tag now. > > Qu Wenruo (2): > btrfs: Refactor btrfs_read_block_groups() > btrfs: Introduce new incompat feature, BG_TREE > > fs/btrfs/ctree.h | 5 +- > fs/btrfs/disk-io.c | 13 ++ > fs/btrfs/extent-tree.c | 300 ++++++++++++++++++++------------ > fs/btrfs/sysfs.c | 2 + > include/uapi/linux/btrfs.h | 1 + > include/uapi/linux/btrfs_tree.h | 3 + > 6 files changed, 208 insertions(+), 116 deletions(-) >
On 2019/2/25 下午9:02, Hans van Kranenburg wrote: > Hi Qu, > > On 1/2/19 6:29 AM, Qu Wenruo wrote: >> This patchset can be fetched from: >> https://github.com/adam900710/linux/tree/bg_tree >> Which is based on v4.20-rc1 tag. >> >> This patchset will hugely reduce mount time of large fs by putting all >> block group items into its own tree. > > I'm going to test this. > > Is there some specific command you'd like me to use to produce results > for benchmarks? Nope. If I provides some commands, those must be super beneficial for BG_TREE. So just use whatever you like, near real world use case would be more appreciated. Thanks, Qu
On 2/25/19 2:10 PM, Qu Wenruo wrote: > > > On 2019/2/25 下午9:02, Hans van Kranenburg wrote: >> Hi Qu, >> >> On 1/2/19 6:29 AM, Qu Wenruo wrote: >>> This patchset can be fetched from: >>> https://github.com/adam900710/linux/tree/bg_tree >>> Which is based on v4.20-rc1 tag. >>> >>> This patchset will hugely reduce mount time of large fs by putting all >>> block group items into its own tree. >> >> I'm going to test this. >> >> Is there some specific command you'd like me to use to produce results >> for benchmarks? > > Nope. If I provides some commands, those must be super beneficial for > BG_TREE. Ah, I meant, to produce output for the mount timings, like your 'ftrace function graph'. > So just use whatever you like, near real world use case would be more > appreciated. Hans
On 2019/2/25 下午9:16, Hans van Kranenburg wrote: > On 2/25/19 2:10 PM, Qu Wenruo wrote: >> >> >> On 2019/2/25 下午9:02, Hans van Kranenburg wrote: >>> Hi Qu, >>> >>> On 1/2/19 6:29 AM, Qu Wenruo wrote: >>>> This patchset can be fetched from: >>>> https://github.com/adam900710/linux/tree/bg_tree >>>> Which is based on v4.20-rc1 tag. >>>> >>>> This patchset will hugely reduce mount time of large fs by putting all >>>> block group items into its own tree. >>> >>> I'm going to test this. >>> >>> Is there some specific command you'd like me to use to produce results >>> for benchmarks? >> >> Nope. If I provides some commands, those must be super beneficial for >> BG_TREE. > > Ah, I meant, to produce output for the mount timings, like your 'ftrace > function graph'. I don't have anything better than ftrace function graph so far. I assume you may need extra scripts to parse such output to create some beautiful graph. Thanks, Qu > >> So just use whatever you like, near real world use case would be more >> appreciated. > > Hans >