[v2] btrfs: speedup mount time with readahead chunk tree

Message ID 20200707035944.15150-1-robbieko@synology.com
State New

Commit Message

robbieko July 7, 2020, 3:59 a.m. UTC
From: Robbie Ko <robbieko@synology.com>

When mounting, we always need to read the whole chunk tree;
when there are too many chunk items, most of the time is
spent in btrfs_read_chunk_tree, because we only read one
leaf at a time.

It is unreasonable to limit the readahead mechanism to a
range of 64k, so we have removed that limit.

In addition, we added reada_maximum_size to customize the
readahead size; the default is 64k to maintain the
original behavior.

So we fix this by using the readahead mechanism and setting
the readahead max size to ULLONG_MAX, which reads all the
leaves after the key in the node when reading a level 1 node.

I have a test environment as follows:

200TB btrfs volume: used 192TB

Data, single: total=192.00TiB, used=192.00TiB
System, DUP: total=40.00MiB, used=19.91MiB
Metadata, DUP: total=63.00GiB, used=46.46GiB
GlobalReserve, single: total=2.00GiB, used=0.00B

chunk tree level : 2
chunk tree:
   nodes: 4
   leaves: 1270
   total: 1274
chunk tree size: 19.9 MB
SYSTEM chunks count : 2 (8MB, 32MB)

btrfs_read_chunk_tree spends the following time:
before: 1.89s
patch: 0.27s
Speed increase of about 85%.

Signed-off-by: Robbie Ko <robbieko@synology.com>
---
Changelog:
v2:
- add performance testing
- remove readahead logical bytenr 64k limit
- add reada_maximum_size for customize
---
 fs/btrfs/ctree.c   | 23 +++++++++++------------
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/volumes.c |  2 ++
 3 files changed, 14 insertions(+), 12 deletions(-)

Comments

David Sterba July 7, 2020, 7:25 p.m. UTC | #1
On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> When mounting, we always need to read the whole chunk tree,
> when there are too many chunk items, most of the time is
> spent on btrfs_read_chunk_tree, because we only read one
> leaf at a time.
> 
> It is unreasonable to limit the readahead mechanism to a
> range of 64k, so we have removed that limit.
> 
> In addition we added reada_maximum_size to customize the
> size of the pre-reader, The default is 64k to maintain the
> original behavior.
> 
> So we fix this by used readahead mechanism, and set readahead
> max size to ULLONG_MAX which reads all the leaves after the
> key in the node when reading a level 1 node.

The readahead of the chunk tree is a special case, as we know we will
need the whole tree; in all other cases the search readahead is
supposed to read only one leaf.

For that reason I don't want to touch the current path readahead logic
at all and do the chunk tree readahead in one go instead of the
per-search.

Also I don't like to see a size increase of btrfs_path just to use the
custom size once.

The idea of the whole tree readahead is to do something like:

- find first item
- start readahead on all leaves from its level 1 node parent
  (readahead_tree_block)
- when the level 1 parent changes during iterating items, start the
  readahead again

This skips readahead of all nodes above level 1, if you find a nicer way
to readahead the whole tree I won't object, but for the first
implementation the level 1 seems ok to me.

> I have a test environment as follows:
> 
> 200TB btrfs volume: used 192TB
> 
> Data, single: total=192.00TiB, used=192.00TiB
> System, DUP: total=40.00MiB, used=19.91MiB

Can you please check what's the chunk tree height? 'btrfs inspect
tree-stats' prints that but it takes long as needs to go through the
whole metadata, so extracting it from 'btrfs inspect dump-tree -c chunk'
would be faster. Thanks.
robbieko July 8, 2020, 2:19 a.m. UTC | #2
On 2020/7/8 3:25 AM, David Sterba wrote:
> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>> From: Robbie Ko <robbieko@synology.com>
>>
>> When mounting, we always need to read the whole chunk tree,
>> when there are too many chunk items, most of the time is
>> spent on btrfs_read_chunk_tree, because we only read one
>> leaf at a time.
>>
>> It is unreasonable to limit the readahead mechanism to a
>> range of 64k, so we have removed that limit.
>>
>> In addition we added reada_maximum_size to customize the
>> size of the pre-reader, The default is 64k to maintain the
>> original behavior.
>>
>> So we fix this by used readahead mechanism, and set readahead
>> max size to ULLONG_MAX which reads all the leaves after the
>> key in the node when reading a level 1 node.
> The readahead of chunk tree is a special case as we know we will need
> the whole tree, in all other cases the search readahead needs is
> supposed to read only one leaf.

If, in most cases, readahead requires that only one leaf be read, then
reada_maximum_size should be nodesize instead of 64k, or using
reada_maximum_nr (default: 1) seems better.

>
> For that reason I don't want to touch the current path readahead logic
> at all and do the chunk tree readahead in one go instead of the
> per-search.

I don't see why we shouldn't change readahead itself: limiting the
current readahead to a 64k logical address range is very unreasonable,
and there is a good chance that the logical address of the next leaf
node will not fall within that 64k, so the existing readahead is
almost useless.

>
> Also I don't like to see size increase of btrfs_path just to use the
> custom once.

This variable is the parameter that controls the speed of the readahead,
and I think it should be adjustable, not hard-coded in the readahead
function. In the future, more scenarios will benefit.
For example, BGTREE will be improved significantly:
my own tests have improved the speed by almost 500%.
Reference: https://lwn.net/Articles/801990/

>
> The idea of the whole tree readahead is to do something like:
>
> - find first item
> - start readahead on all leaves from its level 1 node parent
>    (readahead_tree_block)
> - when the level 1 parent changes during iterating items, start the
>    readahead again
>
> This skips readahead of all nodes above level 1, if you find a nicer way
> to readahead the whole tree I won't object, but for the first
> implementation the level 1 seems ok to me.
>
>> I have a test environment as follows:
>>
>> 200TB btrfs volume: used 192TB
>>
>> Data, single: total=192.00TiB, used=192.00TiB
>> System, DUP: total=40.00MiB, used=19.91MiB
> Can you please check what's the chunk tree height? 'btrfs inspect
> tree-stats' prints that but it takes long as needs to go through the
> whole metadata, so extracting it from 'btrfs inspect dump-tree -c chunk'
> would be faster. Thanks.
Chunk tree height 3, level (0-2)
David Sterba July 8, 2020, 2:04 p.m. UTC | #3
On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote:
> On 2020/7/8 3:25 AM, David Sterba wrote:
> > On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
> >> From: Robbie Ko <robbieko@synology.com>
> >>
> >> When mounting, we always need to read the whole chunk tree,
> >> when there are too many chunk items, most of the time is
> >> spent on btrfs_read_chunk_tree, because we only read one
> >> leaf at a time.
> >>
> >> It is unreasonable to limit the readahead mechanism to a
> >> range of 64k, so we have removed that limit.
> >>
> >> In addition we added reada_maximum_size to customize the
> >> size of the pre-reader, The default is 64k to maintain the
> >> original behavior.
> >>
> >> So we fix this by used readahead mechanism, and set readahead
> >> max size to ULLONG_MAX which reads all the leaves after the
> >> key in the node when reading a level 1 node.
> > The readahead of chunk tree is a special case as we know we will need
> > the whole tree, in all other cases the search readahead needs is
> > supposed to read only one leaf.
> 
> If, in most cases, readahead requires that only one leaf be read, then
> reada_ maximum_size should be nodesize instead of 64k, or use
> reada_maximum_ nr (default:1) seems better.
> 
> >
> > For that reason I don't want to touch the current path readahead logic
> > at all and do the chunk tree readahead in one go instead of the
> > per-search.
> 
> I don't know why we don't make the change to readahead, because the current
> readahead is limited to the logical address in 64k is very unreasonable,
> and there is a good chance that the logical address of the next leaf 
> node will
> not appear in 64k, so the existing readahead is almost useless.

I see, and it seems that the assumption about the layout and the chances
of successfully reading blocks ahead is not valid. The logic of readahead
could be improved but that would need more performance evaluation.

> > Also I don't like to see size increase of btrfs_path just to use the
> > custom once.
> 
> This variable is the parameter that controls the speed of the readahead,
> and I think it should be adjustable, not the hard code in the readahead 
> function.

Yes, but it takes 8 bytes and stays constant, for a value that does not
even need 8 bytes.

I just don't want to touch the generic readahead logic that is started
by b-tree search because that would affect all workloads, while you're
interested in speeding up the chunk tree load.

> In the future, more scenarios will be available.
> For example, BGTREE. will be improved significantly faster,
> My own tests have improved the speed by almost 500%.
> Reference: https://lwn.net/Articles/801990/

The optimized block group items would be a huge win even without the
readahead but that's a different problem.

> > The idea of the whole tree readahead is to do something like:
> >
> > - find first item
> > - start readahead on all leaves from its level 1 node parent
> >    (readahead_tree_block)
> > - when the level 1 parent changes during iterating items, start the
> >    readahead again
> >
> > This skips readahead of all nodes above level 1, if you find a nicer way
> > to readahead the whole tree I won't object, but for the first
> > implementation the level 1 seems ok to me.
> >
> >> I have a test environment as follows:
> >>
> >> 200TB btrfs volume: used 192TB
> >>
> >> Data, single: total=192.00TiB, used=192.00TiB
> >> System, DUP: total=40.00MiB, used=19.91MiB
> > Can you please check what's the chunk tree height? 'btrfs inspect
> > tree-stats' prints that but it takes long as needs to go through the
> > whole metadata, so extracting it from 'btrfs inspect dump-tree -c chunk'
> > would be faster. Thanks.
> Chunk tree height 3, level (0-2)

Thanks.
Holger Hoffstätte July 8, 2020, 2:57 p.m. UTC | #4
On 2020-07-08 16:04, David Sterba wrote:
> On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote:
>> On 2020/7/8 3:25 AM, David Sterba wrote:
>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>>>> From: Robbie Ko <robbieko@synology.com>
>>>>
>>>> When mounting, we always need to read the whole chunk tree,
>>>> when there are too many chunk items, most of the time is
>>>> spent on btrfs_read_chunk_tree, because we only read one
>>>> leaf at a time.
>>>>
>>>> It is unreasonable to limit the readahead mechanism to a
>>>> range of 64k, so we have removed that limit.
>>>>
>>>> In addition we added reada_maximum_size to customize the
>>>> size of the pre-reader, The default is 64k to maintain the
>>>> original behavior.
>>>>
>>>> So we fix this by used readahead mechanism, and set readahead
>>>> max size to ULLONG_MAX which reads all the leaves after the
>>>> key in the node when reading a level 1 node.
>>> The readahead of chunk tree is a special case as we know we will need
>>> the whole tree, in all other cases the search readahead needs is
>>> supposed to read only one leaf.
>>
>> If, in most cases, readahead requires that only one leaf be read, then
>> reada_ maximum_size should be nodesize instead of 64k, or use
>> reada_maximum_ nr (default:1) seems better.
>>
>>>
>>> For that reason I don't want to touch the current path readahead logic
>>> at all and do the chunk tree readahead in one go instead of the
>>> per-search.
>>
>> I don't know why we don't make the change to readahead, because the current
>> readahead is limited to the logical address in 64k is very unreasonable,
>> and there is a good chance that the logical address of the next leaf
>> node will
>> not appear in 64k, so the existing readahead is almost useless.
> 
> I see and it seems that the assumption about layout and chances
> succesfuly read blocks ahead is not valid. The logic of readahead could
> be improved but that would need more performance evaluation.

FWIW I gave this a try and see the following numbers, averaged over multiple
mount/unmount cycles on spinning rust:

without patch : ~2.7s
with patch    : ~4.5s

..ahem..

-h
David Sterba July 8, 2020, 3:21 p.m. UTC | #5
On Wed, Jul 08, 2020 at 04:57:56PM +0200, Holger Hoffstätte wrote:
> On 2020-07-08 16:04, David Sterba wrote:
> > On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote:
> >> On 2020/7/8 3:25 AM, David Sterba wrote:
> >> I don't know why we don't make the change to readahead, because the current
> >> readahead is limited to the logical address in 64k is very unreasonable,
> >> and there is a good chance that the logical address of the next leaf
> >> node will
> >> not appear in 64k, so the existing readahead is almost useless.
> > 
> > I see and it seems that the assumption about layout and chances
> > succesfuly read blocks ahead is not valid. The logic of readahead could
> > be improved but that would need more performance evaluation.
> 
> FWIW I gave this a try and see the following numbers, averaged over multiple
> mount/unmount cycles on spinning rust:
> 
> without patch : ~2.7s
> with patch    : ~4.5s
> 
> ..ahem..

Hard to argue against numbers, thanks for posting that.
David Sterba July 8, 2020, 9:11 p.m. UTC | #6
On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote:
> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
> > From: Robbie Ko <robbieko@synology.com>
> > 
> > When mounting, we always need to read the whole chunk tree,
> > when there are too many chunk items, most of the time is
> > spent on btrfs_read_chunk_tree, because we only read one
> > leaf at a time.
> > 
> > It is unreasonable to limit the readahead mechanism to a
> > range of 64k, so we have removed that limit.
> > 
> > In addition we added reada_maximum_size to customize the
> > size of the pre-reader, The default is 64k to maintain the
> > original behavior.
> > 
> > So we fix this by used readahead mechanism, and set readahead
> > max size to ULLONG_MAX which reads all the leaves after the
> > key in the node when reading a level 1 node.
> 
> The readahead of chunk tree is a special case as we know we will need
> the whole tree, in all other cases the search readahead needs is
> supposed to read only one leaf.
> 
> For that reason I don't want to touch the current path readahead logic
> at all and do the chunk tree readahead in one go instead of the
> per-search.
> 
> Also I don't like to see size increase of btrfs_path just to use the
> custom once.
> 
> The idea of the whole tree readahead is to do something like:
> 
> - find first item
> - start readahead on all leaves from its level 1 node parent
>   (readahead_tree_block)
> - when the level 1 parent changes during iterating items, start the
>   readahead again
> 
> This skips readahead of all nodes above level 1, if you find a nicer way
> to readahead the whole tree I won't object, but for the first
> implementation the level 1 seems ok to me.

Patch below, I tried to create large system chunk by fallocate on a
sparse loop device, but got only 1 node on level 1 so the readahead
cannot show off.

# btrfs fi df .
Data, single: total=59.83TiB, used=59.83TiB
System, single: total=36.00MiB, used=6.20MiB
Metadata, single: total=1.01GiB, used=91.78MiB
GlobalReserve, single: total=26.80MiB, used=0.00B

There were 395 leaf nodes that got read ahead, time between the first
and last is 0.83s and the block group tree read took about 40 seconds.
This was in a VM with file-backed images, and the loop device was
constructed from these devices so it's spinning rust.

I don't have results for non-prefetched mount to compare at the moment.


----
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c7a3d4d730a3..e19891243199 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7013,6 +7013,19 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+void readahead_tree_node_children(struct extent_buffer *node)
+{
+	int i;
+	const int nr_items = btrfs_header_nritems(node);
+
+	for (i = 0; i < nr_items; i++) {
+		u64 start;
+
+		start = btrfs_node_blockptr(node, i);
+		readahead_tree_block(node->fs_info, start);
+	}
+}
+
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root = fs_info->chunk_root;
@@ -7023,6 +7036,7 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	int ret;
 	int slot;
 	u64 total_dev = 0;
+	u64 last_ra_node = 0;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -7048,6 +7062,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	if (ret < 0)
 		goto error;
 	while (1) {
+		struct extent_buffer *node;
+
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 		if (slot >= btrfs_header_nritems(leaf)) {
@@ -7058,6 +7074,13 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 				goto error;
 			break;
 		}
+		node = path->nodes[1];
+		if (node) {
+			if (last_ra_node != node->start) {
+				readahead_tree_node_children(node);
+				last_ra_node = node->start;
+			}
+		}
 		btrfs_item_key_to_cpu(leaf, &found_key, slot);
 		if (found_key.type == BTRFS_DEV_ITEM_KEY) {
 			struct btrfs_dev_item *dev_item;
robbieko July 9, 2020, 1:46 a.m. UTC | #7
On 2020/7/8 10:57 PM, Holger Hoffstätte wrote:
> On 2020-07-08 16:04, David Sterba wrote:
>> On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote:
>>> On 2020/7/8 3:25 AM, David Sterba wrote:
>>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>>>>> From: Robbie Ko <robbieko@synology.com>
>>>>>
>>>>> When mounting, we always need to read the whole chunk tree,
>>>>> when there are too many chunk items, most of the time is
>>>>> spent on btrfs_read_chunk_tree, because we only read one
>>>>> leaf at a time.
>>>>>
>>>>> It is unreasonable to limit the readahead mechanism to a
>>>>> range of 64k, so we have removed that limit.
>>>>>
>>>>> In addition we added reada_maximum_size to customize the
>>>>> size of the pre-reader, The default is 64k to maintain the
>>>>> original behavior.
>>>>>
>>>>> So we fix this by used readahead mechanism, and set readahead
>>>>> max size to ULLONG_MAX which reads all the leaves after the
>>>>> key in the node when reading a level 1 node.
>>>> The readahead of chunk tree is a special case as we know we will need
>>>> the whole tree, in all other cases the search readahead needs is
>>>> supposed to read only one leaf.
>>>
>>> If, in most cases, readahead requires that only one leaf be read, then
>>> reada_ maximum_size should be nodesize instead of 64k, or use
>>> reada_maximum_ nr (default:1) seems better.
>>>
>>>>
>>>> For that reason I don't want to touch the current path readahead logic
>>>> at all and do the chunk tree readahead in one go instead of the
>>>> per-search.
>>>
>>> I don't know why we don't make the change to readahead, because the 
>>> current
>>> readahead is limited to the logical address in 64k is very 
>>> unreasonable,
>>> and there is a good chance that the logical address of the next leaf
>>> node will
>>> not appear in 64k, so the existing readahead is almost useless.
>>
>> I see and it seems that the assumption about layout and chances
>> succesfuly read blocks ahead is not valid. The logic of readahead could
>> be improved but that would need more performance evaluation.
>
> FWIW I gave this a try and see the following numbers, averaged over 
> multiple
> mount/unmount cycles on spinning rust:
>
> without patch : ~2.7s
> with patch    : ~4.5s
>
> ..ahem..
>
I have the following two questions for you.
1. What is the version you are using?
2. Can you please measure the time of btrfs_read_chunk_tree alone?

I think the problem you are having is that btrfs_read_block_groups is
slowing down because it is using the wrong READA flag, which is causing
a lot of useless IO's when reading the block group.

This can be fixed with the following commit:
btrfs: block-group: don't set the wrong READA flag for
btrfs_read_block_groups()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.8-rc4&id=83fe9e12b0558eae519351cff00da1e06bc054d2

> -h
robbieko July 9, 2020, 2:38 a.m. UTC | #8
On 2020/7/9 5:11 AM, David Sterba wrote:
> On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote:
>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>>> From: Robbie Ko <robbieko@synology.com>
>>>
>>> When mounting, we always need to read the whole chunk tree,
>>> when there are too many chunk items, most of the time is
>>> spent on btrfs_read_chunk_tree, because we only read one
>>> leaf at a time.
>>>
>>> It is unreasonable to limit the readahead mechanism to a
>>> range of 64k, so we have removed that limit.
>>>
>>> In addition we added reada_maximum_size to customize the
>>> size of the pre-reader, The default is 64k to maintain the
>>> original behavior.
>>>
>>> So we fix this by used readahead mechanism, and set readahead
>>> max size to ULLONG_MAX which reads all the leaves after the
>>> key in the node when reading a level 1 node.
>> The readahead of chunk tree is a special case as we know we will need
>> the whole tree, in all other cases the search readahead needs is
>> supposed to read only one leaf.
>>
>> For that reason I don't want to touch the current path readahead logic
>> at all and do the chunk tree readahead in one go instead of the
>> per-search.
>>
>> Also I don't like to see size increase of btrfs_path just to use the
>> custom once.
>>
>> The idea of the whole tree readahead is to do something like:
>>
>> - find first item
>> - start readahead on all leaves from its level 1 node parent
>>    (readahead_tree_block)
>> - when the level 1 parent changes during iterating items, start the
>>    readahead again
>>
>> This skips readahead of all nodes above level 1, if you find a nicer way
>> to readahead the whole tree I won't object, but for the first
>> implementation the level 1 seems ok to me.
> Patch below, I tried to create large system chunk by fallocate on a
> sparse loop device, but got only 1 node on level 1 so the readahead
> cannot show off.
>
> # btrfs fi df .
> Data, single: total=59.83TiB, used=59.83TiB
> System, single: total=36.00MiB, used=6.20MiB
> Metadata, single: total=1.01GiB, used=91.78MiB
> GlobalReserve, single: total=26.80MiB, used=0.00B
>
> There were 395 leaf nodes that got read ahead, time between the first
> and last is 0.83s and the block group tree read took about 40 seconds.
> This was in a VM with file-backed images, and the loop device was
> constructed from these devices so it's spinning rust.
>
> I don't have results for non-prefetched mount to compare at the moment.
>
I think what you're doing is working.

But there are many similar problems that need to be improved.

1. load_free_space_tree
We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and
BTRFS_FREE_SPACE_EXTENT_KEY until the next FREE_SPACE_INFO_KEY.

2. populate_free_space_tree
We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY 
until the end of the block group

3. btrfs_real_readdir
We need as many reads as possible (inode, BTRFS_DIR_INDEX_KEY).

4. btrfs_clone
We need as many reads as possible (inode, BTRFS_EXTENT_DATA_KEY).

5. btrfs_verify_dev_extents
We need to read all the BTRFS_DEV_EXTENT_KEYs.

6. caching_kthread (inode-map.c)
We need all the BTRFS_INODE_ITEM_KEY of fs_tree to build the inode map

For the above cases, it is not possible to write special readahead code
for each one. We have to provide a new readahead framework that lets
the caller determine the scope of the readahead needed.
The possible parameters of the readahead are as follows:
1. reada_maximum_nr : read at most this many leaves at a time.
2. reada_max_key : early termination condition for READA_FORWARD.
3. reada_min_key : early termination condition for READA_BACK.

We need to review all users of readahead to confirm the readahead
behavior. For example, in scrub_enumerate_chunks readahead has very
little effect, because most of the time is spent in scrub_chunk, and
processing scrub_chunk for all the DEV_EXTENTs in a leaf takes a
long time. If the dev tree has been modified in the meantime, the
previously read-ahead leaf may be useless.


> ----
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c7a3d4d730a3..e19891243199 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7013,6 +7013,19 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
>   	return ret;
>   }
>   
> +void readahead_tree_node_children(struct extent_buffer *node)
> +{
> +	int i;
> +	const int nr_items = btrfs_header_nritems(node);
> +
> +	for (i = 0; i < nr_items; i++) {
> +		u64 start;
> +
> +		start = btrfs_node_blockptr(node, i);
> +		readahead_tree_block(node->fs_info, start);
> +	}
> +}
> +
>   int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>   {
>   	struct btrfs_root *root = fs_info->chunk_root;
> @@ -7023,6 +7036,7 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>   	int ret;
>   	int slot;
>   	u64 total_dev = 0;
> +	u64 last_ra_node = 0;
>   
>   	path = btrfs_alloc_path();
>   	if (!path)
> @@ -7048,6 +7062,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>   	if (ret < 0)
>   		goto error;
>   	while (1) {
> +		struct extent_buffer *node;
> +
>   		leaf = path->nodes[0];
>   		slot = path->slots[0];
>   		if (slot >= btrfs_header_nritems(leaf)) {
> @@ -7058,6 +7074,13 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
>   				goto error;
>   			break;
>   		}
> +		node = path->nodes[1];
> +		if (node) {
> +			if (last_ra_node != node->start) {
> +				readahead_tree_node_children(node);
> +				last_ra_node = node->start;
> +			}
> +		}
>   		btrfs_item_key_to_cpu(leaf, &found_key, slot);
>   		if (found_key.type == BTRFS_DEV_ITEM_KEY) {
>   			struct btrfs_dev_item *dev_item;
Holger Hoffstätte July 9, 2020, 7:17 a.m. UTC | #9
On 2020-07-09 03:46, Robbie Ko wrote:
> 
> On 2020/7/8 10:57 PM, Holger Hoffstätte wrote:
>> On 2020-07-08 16:04, David Sterba wrote:
>>> On Wed, Jul 08, 2020 at 10:19:22AM +0800, Robbie Ko wrote:
>>>> On 2020/7/8 3:25 AM, David Sterba wrote:
>>>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>>>>>> From: Robbie Ko <robbieko@synology.com>
>>>>>>
>>>>>> When mounting, we always need to read the whole chunk tree,
>>>>>> when there are too many chunk items, most of the time is
>>>>>> spent on btrfs_read_chunk_tree, because we only read one
>>>>>> leaf at a time.
>>>>>>
>>>>>> It is unreasonable to limit the readahead mechanism to a
>>>>>> range of 64k, so we have removed that limit.
>>>>>>
>>>>>> In addition we added reada_maximum_size to customize the
>>>>>> size of the pre-reader, The default is 64k to maintain the
>>>>>> original behavior.
>>>>>>
>>>>>> So we fix this by used readahead mechanism, and set readahead
>>>>>> max size to ULLONG_MAX which reads all the leaves after the
>>>>>> key in the node when reading a level 1 node.
>>>>> The readahead of chunk tree is a special case as we know we will need
>>>>> the whole tree, in all other cases the search readahead needs is
>>>>> supposed to read only one leaf.
>>>>
>>>> If, in most cases, readahead requires that only one leaf be read, then
>>>> reada_ maximum_size should be nodesize instead of 64k, or use
>>>> reada_maximum_ nr (default:1) seems better.
>>>>
>>>>>
>>>>> For that reason I don't want to touch the current path readahead logic
>>>>> at all and do the chunk tree readahead in one go instead of the
>>>>> per-search.
>>>>
>>>> I don't know why we don't make the change to readahead, because the current
>>>> readahead is limited to the logical address in 64k is very unreasonable,
>>>> and there is a good chance that the logical address of the next leaf
>>>> node will
>>>> not appear in 64k, so the existing readahead is almost useless.
>>>
>>> I see and it seems that the assumption about layout and chances
>>> succesfuly read blocks ahead is not valid. The logic of readahead could
>>> be improved but that would need more performance evaluation.
>>
>> FWIW I gave this a try and see the following numbers, averaged over multiple
>> mount/unmount cycles on spinning rust:
>>
>> without patch : ~2.7s
>> with patch    : ~4.5s
>>
>> ..ahem..
>>
> I have the following two questions for you.
> 1. What is the version you are using?

5.7.8 + a few select patches from 5.8.

> 2. Can you please measure the time of btrfs_read_chunk_tree alone?

No perf on this system & not enough time right now, sorry.
But it shouldn't matter either way, see below.

> I think the problem you are having is that btrfs_read_block_groups is
> slowing down because it is using the wrong READA flag, which is causing
> a lot of useless IO's when reading the block group.
> 
> This can be fixed with the following commit.
> btrfs: block-group: don't set the wrong READA flag for btrfs_read_block_groups()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.8-rc4&id=83fe9e12b0558eae519351cff00da1e06bc054d2

Ah yes, that was missing. However it doesn't seem to improve things
that much either; with 83fe9e12 but with or without your patch I now get
~2.8..~2.9s mount time. Probably because I don't have that many
metadata block groups (only 4GB).

 From a conceptual perspective it is probably much easier just to
merge the bgtree patchset, since that does the right thing without
upsetting the overall readahead apple cart.

thanks,
Holger
David Sterba July 9, 2020, 9:13 a.m. UTC | #10
On Thu, Jul 09, 2020 at 10:38:42AM +0800, Robbie Ko wrote:
> On 2020/7/9 5:11 AM, David Sterba wrote:
> > On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote:
> >> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
> >> This skips readahead of all nodes above level 1, if you find a nicer way
> >> to readahead the whole tree I won't object, but for the first
> >> implementation the level 1 seems ok to me.
> > Patch below, I tried to create large system chunk by fallocate on a
> > sparse loop device, but got only 1 node on level 1 so the readahead
> > cannot show off.
> >
> > # btrfs fi df .
> > Data, single: total=59.83TiB, used=59.83TiB
> > System, single: total=36.00MiB, used=6.20MiB
> > Metadata, single: total=1.01GiB, used=91.78MiB
> > GlobalReserve, single: total=26.80MiB, used=0.00B
> >
> > There were 395 leaf nodes that got read ahead, time between the first
> > and last is 0.83s and the block group tree read took about 40 seconds.
> > This was in a VM with file-backed images, and the loop device was
> > constructed from these devices so it's spinning rust.
> >
> > I don't have results for non-prefetched mount to compare at the moment.
> >
> I think what you're doing is working.
> 
> But there are many similar problems that need to be improved.

Agreed, but this started from 'let's readahead chunk tree' and now we
drifted to fixing or perhaps making better use of readahead in several
other areas.

> 1. load_free_space_tree
> We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and
> BTRFS_FREE_SPACE_EXTENT_KEY until the next FREE_SPACE_INFO_KEY.
> 
> 2. populate_free_space_tree
> We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY 
> until the end of the block group
> 
> 3. btrfs_real_readdir
> We need as many reads as possible (inode, BTRFS_DIR_INDEX_KEY).
> 
> 4. btrfs_clone
> We need as many reads as possible (inode, BTRFS_EXTENT_DATA_KEY).
> 
> 5. btrfs_verify_dev_extents
> We need to read all the BTRFS_DEV_EXTENT_KEYs.

Each case needs to be evaluated separately because the items live in
different trees and other item types could be scattered among the ones
we're interested in.

But the list gives an insight in what types of readahead we might need,
like full key range [from, to], or just all items of one key type.

> 6. caching_kthread (inode-map.c)
> We need all the BTRFS_INODE_ITEM_KEY of fs_tree to build the inode map
> 
> For the above cases.
> It is not practical to write special readahead code for each case.
> We should provide a new readahead framework that
> enables the caller to determine the scope of readahead needed.
> The possible parameters of the readahead are as follows:
> 1. reada_maximum_nr : read at most this many leaves at a time.
> 2. reada_max_key : early termination condition for READA_FORWARD.
> 3. reada_min_key : early termination condition for READA_BACK.

Yeah something like that.
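As an editor's illustration of the parameters Robbie proposes, here is a minimal userspace sketch (all names, the flat integer keys, and the control struct are hypothetical, not kernel API) of how such per-path bounds could gate which sibling leaves get prefetched:

```c
#include <stdint.h>

typedef uint64_t tree_key;	/* simplified stand-in for struct btrfs_key */

/* Hypothetical per-path readahead bounds, mirroring the suggestion above. */
struct reada_ctl {
	uint32_t reada_maximum_nr;	/* cap on leaves prefetched per call */
	tree_key reada_min_key;		/* READA_BACK stops below this key */
	tree_key reada_max_key;		/* READA_FORWARD stops above this key */
};

enum reada_dir { READA_BACK_SKETCH, READA_FORWARD_SKETCH };

/*
 * Count how many sibling leaves would be prefetched, walking from
 * `slot` in direction `dir` over the node's keys[0..nritems).
 */
static uint32_t reada_count(const struct reada_ctl *ctl, enum reada_dir dir,
			    const tree_key *keys, uint32_t nritems,
			    uint32_t slot)
{
	uint32_t n = 0;

	while (n < ctl->reada_maximum_nr) {
		if (dir == READA_FORWARD_SKETCH) {
			if (slot + 1 >= nritems)
				break;
			slot++;
			if (keys[slot] > ctl->reada_max_key)
				break;	/* early termination */
		} else {
			if (slot == 0)
				break;
			slot--;
			if (keys[slot] < ctl->reada_min_key)
				break;	/* early termination */
		}
		n++;	/* the kernel would issue readahead_tree_block() here */
	}
	return n;
}
```

A caller such as load_free_space_tree would then set reada_max_key to the next FREE_SPACE_INFO_KEY, and btrfs_read_chunk_tree would leave both bounds wide open.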

> We also need to review all users of readahead to confirm that the
> behavior of readahead is what they need.
> For example, in scrub_enumerate_chunks readahead has very little
> effect, because most of the time is spent in scrub_chunk, and
> processing scrub_chunk for all the DEV_EXTENTs in a leaf takes a
> long time. If the dev tree has been modified in the meantime, the
> previously prefetched leaf may be useless.

Yes that's another case where doing the readahead is useless.

So now the question is whether we should start with the easy cases,
with specific readahead, and then unify them under a common API, or
try to figure out the API first and then audit all users.

I'd be more in favor of the former, as it gives us a baseline where
the readahead is implemented optimally; the follow-up API cleanup
would then need to keep that performance.

The latter is IMHO harder just because getting an API right on the first
try usually does not work.
robbieko July 10, 2020, 1:54 a.m. UTC | #11
On 2020/7/9 5:13 PM, David Sterba wrote:
> On Thu, Jul 09, 2020 at 10:38:42AM +0800, Robbie Ko wrote:
>> On 2020/7/9 5:11 AM, David Sterba wrote:
>>> On Tue, Jul 07, 2020 at 09:25:11PM +0200, David Sterba wrote:
>>>> On Tue, Jul 07, 2020 at 11:59:44AM +0800, robbieko wrote:
>>>> This skips readahead of all nodes above level 1, if you find a nicer way
>>>> to readahead the whole tree I won't object, but for the first
>>>> implementation the level 1 seems ok to me.
>>> Patch below, I tried to create large system chunk by fallocate on a
>>> sparse loop device, but got only 1 node on level 1 so the readahead
>>> cannot show off.
>>>
>>> # btrfs fi df .
>>> Data, single: total=59.83TiB, used=59.83TiB
>>> System, single: total=36.00MiB, used=6.20MiB
>>> Metadata, single: total=1.01GiB, used=91.78MiB
>>> GlobalReserve, single: total=26.80MiB, used=0.00B
>>>
>>> There were 395 leaf nodes that got read ahead, time between the first
>>> and last is 0.83s and the block group tree read took about 40 seconds.
>>> This was in a VM with file-backed images, and the loop device was
>>> constructed from these devices so it's spinning rust.
>>>
>>> I don't have results for non-prefetched mount to compare at the moment.
>>>
>> I think your approach works.
>>
>> But there are many similar places that could be improved.
> Agreed, but this started from 'let's readahead chunk tree' and now we
> drifted to fixing or perhaps making better use of readahead in several
> other areas.
>
>> 1. load_free_space_tree
>> We need to read all BTRFS_FREE_SPACE_BITMAP_KEY and
>> BTRFS_FREE_SPACE_EXTENT_KEY until the next FREE_SPACE_INFO_KEY.
>>
>> 2. populate_free_space_tree
>> We need to read all BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY
>> until the end of the block group
>>
>> 3. btrfs_real_readdir
>> We need as many reads as possible (inode, BTRFS_DIR_INDEX_KEY).
>>
>> 4. btrfs_clone
>> We need as many reads as possible (inode, BTRFS_EXTENT_DATA_KEY).
>>
>> 5. btrfs_verify_dev_extents
>> We need to read all the BTRFS_DEV_EXTENT_KEYs.
> Each case needs to be evaluated separately because the items live in
> different trees and other item types could be scattered among the ones
> we're interested in.
>
> But the list gives an insight in what types of readahead we might need,
> like full key range [from, to], or just all items of one key type.
>
>> 6. caching_kthread (inode-map.c)
>> We need all the BTRFS_INODE_ITEM_KEY of fs_tree to build the inode map
>>
>> For the above cases.
>> It is not practical to write special readahead code for each case.
>> We should provide a new readahead framework that
>> enables the caller to determine the scope of readahead needed.
>> The possible parameters of the readahead are as follows:
>> 1. reada_maximum_nr : read at most this many leaves at a time.
>> 2. reada_max_key : early termination condition for READA_FORWARD.
>> 3. reada_min_key : early termination condition for READA_BACK.
> Yeah something like that.
>
>> We also need to review all users of readahead to confirm that the
>> behavior of readahead is what they need.
>> For example, in scrub_enumerate_chunks readahead has very little
>> effect, because most of the time is spent in scrub_chunk, and
>> processing scrub_chunk for all the DEV_EXTENTs in a leaf takes a
>> long time. If the dev tree has been modified in the meantime, the
>> previously prefetched leaf may be useless.
> Yes that's another case where doing the readahead is useless.
>
> So now the question is whether we should start with the easy cases,
> with specific readahead, and then unify them under a common API, or
> try to figure out the API first and then audit all users.
>
> I'd be more in favor of the former, as it gives us a baseline where
> the readahead is implemented optimally; the follow-up API cleanup
> would then need to keep that performance.
>
> The latter is IMHO harder just because getting an API right on the first
> try usually does not work.

Okay, I agree with you.
Please submit your patch to speed up the chunk tree read.
Thank you for your help.
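
To illustrate what the patch below changes, here is an editor's userspace sketch of the prefetch loop in reada_for_search() before and after (a simplification: forward direction only, a fixed illustrative 16K nodesize, and plain bytenr arrays instead of extent buffers):

```c
#include <stdint.h>

#define SKETCH_BLOCKSIZE 16384u	/* illustrative nodesize */

/*
 * Pre-patch behavior: prefetch only leaves whose bytenr lies within
 * 64K of the target, and stop after 64K read or 32 slots scanned.
 */
static unsigned reada_old(const uint64_t *bytenr, unsigned nritems,
			  unsigned slot)
{
	uint64_t target = bytenr[slot];
	uint64_t nread = 0;
	unsigned nscan = 0, nr = slot, count = 0;

	while (nr + 1 < nritems) {
		uint64_t search = bytenr[++nr];

		if ((search <= target && target - search <= 65536) ||
		    (search > target && search - target <= 65536)) {
			count++;		/* readahead_tree_block() */
			nread += SKETCH_BLOCKSIZE;
		}
		nscan++;
		if (nread > 65536 || nscan > 32)
			break;
	}
	return count;
}

/*
 * Post-patch behavior: prefetch every following leaf until the
 * per-path reada_maximum_size budget is exceeded.
 */
static unsigned reada_new(const uint64_t *bytenr, unsigned nritems,
			  unsigned slot, uint64_t reada_maximum_size)
{
	uint64_t nread = 0;
	unsigned nr = slot, count = 0;

	while (nr + 1 < nritems) {
		(void)bytenr[++nr];	/* readahead_tree_block() */
		count++;
		nread += SKETCH_BLOCKSIZE;
		if (nread > reada_maximum_size)
			break;
	}
	return count;
}
```

With the old 64K window, leaves whose bytenrs are scattered across the disk get almost no readahead at all; with reada_maximum_size set to ULLONG_MAX every remaining leaf referenced by the level-1 node is prefetched, which is what makes the chunk tree read fast.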

Patch
diff mbox series

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 3a7648bff42c..dc84f526cd93 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -75,7 +75,14 @@  size_t __const btrfs_get_num_csums(void)
 
 struct btrfs_path *btrfs_alloc_path(void)
 {
-	return kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS);
+	struct btrfs_path *path;
+
+	path = kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS);
+
+	if (path)
+		path->reada_maximum_size = 65536;
+
+	return path;
 }
 
 /* this also releases the path */
@@ -2161,12 +2168,10 @@  static void reada_for_search(struct btrfs_fs_info *fs_info,
 	struct btrfs_disk_key disk_key;
 	u32 nritems;
 	u64 search;
-	u64 target;
 	u64 nread = 0;
 	struct extent_buffer *eb;
 	u32 nr;
 	u32 blocksize;
-	u32 nscan = 0;
 
 	if (level != 1)
 		return;
@@ -2184,8 +2189,6 @@  static void reada_for_search(struct btrfs_fs_info *fs_info,
 		return;
 	}
 
-	target = search;
-
 	nritems = btrfs_header_nritems(node);
 	nr = slot;
 
@@ -2205,13 +2208,9 @@  static void reada_for_search(struct btrfs_fs_info *fs_info,
 				break;
 		}
 		search = btrfs_node_blockptr(node, nr);
-		if ((search <= target && target - search <= 65536) ||
-		    (search > target && search - target <= 65536)) {
-			readahead_tree_block(fs_info, search);
-			nread += blocksize;
-		}
-		nscan++;
-		if ((nread > 65536 || nscan > 32))
+		readahead_tree_block(fs_info, search);
+		nread += blocksize;
+		if (nread > path->reada_maximum_size)
 			break;
 	}
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d404cce8ae40..ea88a6473eb8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -360,6 +360,7 @@  struct btrfs_path {
 	/* if there is real range locking, this locks field will change */
 	u8 locks[BTRFS_MAX_LEVEL];
 	u8 reada;
+	u64 reada_maximum_size;
 	/* keep some upper locks as we walk down */
 	u8 lowest_level;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0d6e785bcb98..fc87b2e9e865 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7043,6 +7043,8 @@  int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
+	path->reada = READA_FORWARD;
+	path->reada_maximum_size = ULLONG_MAX;
 
 	/*
 	 * uuid_mutex is needed only if we are mounting a sprout FS