[RFC,00/17] btrfs: implementation of priority aware allocator
mbox series

Message ID 20181128031148.357-1-suy.fnst@cn.fujitsu.com
Headers show
Series
  • btrfs: implementation of priority aware allocator
Related show

Message

Su Yue Nov. 28, 2018, 3:11 a.m. UTC
This patchset can be fetched from repo:
https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
Since patchset 'btrfs: Refactor find_free_extent()' does a nice work
to simplify find_free_extent(). This patchset dependents on the refactor.
The base is the commit in kdave/misc-next:

commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
Author: Nikolay Borisov <nborisov@suse.com>
Date:   Tue Nov 6 16:40:20 2018 +0200

    btrfs: Always try all copies when reading extent buffers


This patchset introduces a new mount option named 'priority_alloc=%s',
%s is supported to be "usage" and "off" now. The mount option changes
the way find_free_extent() how to search block groups.

Previously, block groups are stored in list of btrfs_space_info
by start position. When call find_free_extent() if no hint,
block_groups are searched one by one.

Design of priority aware allocator:
Block group has its own priority. We split priorities to many levels,
block groups are split to different trees according priorities.
And those trees are sorted by their levels and stored in space_info.
Once find_free_extent() is called, try to search block groups in higher
priority level then lower level. Then a block group with higher
priority is more likely to be used.

Pros:
1) Reduce the frequency of balance.
   The block group with a higher usage rate will be used preferentially
   for allocating extents. Free the empty block groups with pinned bytes
   as non-zero.[1]

2) The priority of empty block group with pinned bytes as non-zero
   will be set as the lowest.
   
3) Support zoned block device.[2]
   For metadata allocation, the block group in conventional zones
   will be used as much as possible regardless of usage rate.
   Will do it in future.
   
Cons:
1) Expectable performance regression.
   The degree of the decline is temporarily unknown.
   The user can disable block group priority to get the full performance.

TESTS:

If use usage as priority(the only available option), empty block group
is much harder to be reused.

About block group usage:
  Disk: 4 x 1T HDD gathered in LVM.

  Run script to create files and delete files randomly in loop.
  The num of files to create are double than to delete.

  Default mount option result:
  https://i.loli.net/2018/11/28/5bfdfdf08c760.png

  Priority aware allocator(usage) result:
  https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png

  X coordinate means total disk usage, Y coordinate means avg block
  group usage.

  Due to fragmentation of extents, the different are not obvious,
  only about 1% improvement....
  						   
Performance regression:
   I have ran sysbench on our machine with SSD in multi combinations,
   no obvious regression found.
   However in theory, the new allocator may cost more time in some
   cases.

[1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
[2] https://lkml.org/lkml/2018/8/16/174

---
Due to some reasons includes time and hardware, the use-case is not
outstanding enough. And some codes are dirty but I can't found another
way. So I named it as RFC.
     Any comments and suggestions are welcome.
     
Su Yue (17):
  btrfs: priority alloc: prepare of priority aware allocator
  btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
  btrfs: priority alloc: introduce compute_block_group_priority/usage
  btrfs: priority alloc: add functions to create/remove priority trees
  btrfs: priority alloc: introduce functions to add block group to
    priority tree
  btrfs: priority alloc: introduce three macros to mark block group
    status
  btrfs: priority alloc: add functions to remove block group from
    priority tree
  btrfs: priority alloc: add btrfs_update_block_group_priority()
  btrfs: priority alloc: call create/remove_priority_trees in space_info
  btrfs: priority alloc: call add_block_group_priority while reading or
    making block group
  btrfs: priority alloc: remove block group from priority tree while
    removing block group
  btrfs: priority alloc: introduce find_free_extent_search()
  btrfs: priority alloc: modify find_free_extent() to fit priority
    allocator
  btrfs: priority alloc: introduce btrfs_set_bg_updating and call
    btrfs_update_block_group_prioriy
  btrfs: priority alloc: write bg->priority_groups_sem while waiting
    reservation
  btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
    race in btrfs_delete_unused_bgs()
  btrfs: add mount option "priority_alloc=%s"

 fs/btrfs/ctree.h            |  28 ++
 fs/btrfs/extent-tree.c      | 672 +++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c |   3 +
 fs/btrfs/super.c            |  18 +
 fs/btrfs/transaction.c      |   1 +
 5 files changed, 681 insertions(+), 41 deletions(-)

Comments

Qu Wenruo Nov. 28, 2018, 4:04 a.m. UTC | #1
On 2018/11/28 上午11:11, Su Yue wrote:
> This patchset can be fetched from repo:
> https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
> Since patchset 'btrfs: Refactor find_free_extent()' does a nice work
> to simplify find_free_extent(). This patchset dependents on the refactor.
> The base is the commit in kdave/misc-next:
> 
> commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
> Author: Nikolay Borisov <nborisov@suse.com>
> Date:   Tue Nov 6 16:40:20 2018 +0200
> 
>     btrfs: Always try all copies when reading extent buffers
> 
> 
> This patchset introduces a new mount option named 'priority_alloc=%s',
> %s is supported to be "usage" and "off" now. The mount option changes
> the way find_free_extent() how to search block groups.
> 
> Previously, block groups are stored in list of btrfs_space_info
> by start position. When call find_free_extent() if no hint,
> block_groups are searched one by one.
> 
> Design of priority aware allocator:
> Block group has its own priority. We split priorities to many levels,
> block groups are split to different trees according priorities.
> And those trees are sorted by their levels and stored in space_info.
> Once find_free_extent() is called, try to search block groups in higher
> priority level then lower level. Then a block group with higher
> priority is more likely to be used.
> 
> Pros:
> 1) Reduce the frequency of balance.
>    The block group with a higher usage rate will be used preferentially
>    for allocating extents. Free the empty block groups with pinned bytes
>    as non-zero.[1]
> 
> 2) The priority of empty block group with pinned bytes as non-zero
>    will be set as the lowest.
>    
> 3) Support zoned block device.[2]
>    For metadata allocation, the block group in conventional zones
>    will be used as much as possible regardless of usage rate.
>    Will do it in future.

Personally I'm a big fan of the priority aware extent allocator.

So nice job!

>    
> Cons:
> 1) Expectable performance regression.
>    The degree of the decline is temporarily unknown.
>    The user can disable block group priority to get the full performance.
> 
> TESTS:
> 
> If use usage as priority(the only available option), empty block group
> is much harder to be reused.
> 
> About block group usage:
>   Disk: 4 x 1T HDD gathered in LVM.
> 
>   Run script to create files and delete files randomly in loop.
>   The num of files to create are double than to delete.
> 
>   Default mount option result:
>   https://i.loli.net/2018/11/28/5bfdfdf08c760.png
> 
>   Priority aware allocator(usage) result:
>   https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png
> 
>   X coordinate means total disk usage, Y coordinate means avg block
>   group usage.
> 
>   Due to fragmentation of extents, the different are not obvious,
>   only about 1% improvement....

I think you're using the wrong indicator to show the difference.

The real indicator should not be overall block group usage, but:
1) Number of block groups
2) Usage distribution of the block groups

If the number of block groups isn't much different, then we should go
check the distribution.
E.g. all bgs with 97% usage is not as good mostly 100% bgs and several
near 10% bgs.

And we should check the usage distribution between metadata and data bgs.
For data bg, we could hit some fragmentation problem, while for meta bgs
all extents are in the same size, thus may have a better performance for
metadata.

Thus we could do better for the test result.

>   						   
> Performance regression:
>    I have ran sysbench on our machine with SSD in multi combinations,
>    no obvious regression found.
>    However in theory, the new allocator may cost more time in some
>    cases.

Isn't that a good news? :)

> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
> [2] https://lkml.org/lkml/2018/8/16/174
> 
> ---
> Due to some reasons includes time and hardware, the use-case is not
> outstanding enough.

As discussed offline, another cause would be data extent fragmentations.
E.g we have a lot of small 4K holes but the request is a big 128M.
In that case btrfs_reserve_extent() could still trigger a new data chunk
other than return the 4K holes found.

Thanks,
Qu

> And some codes are dirty but I can't found another
> way. So I named it as RFC.
>      Any comments and suggestions are welcome.
>      
> Su Yue (17):
>   btrfs: priority alloc: prepare of priority aware allocator
>   btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
>   btrfs: priority alloc: introduce compute_block_group_priority/usage
>   btrfs: priority alloc: add functions to create/remove priority trees
>   btrfs: priority alloc: introduce functions to add block group to
>     priority tree
>   btrfs: priority alloc: introduce three macros to mark block group
>     status
>   btrfs: priority alloc: add functions to remove block group from
>     priority tree
>   btrfs: priority alloc: add btrfs_update_block_group_priority()
>   btrfs: priority alloc: call create/remove_priority_trees in space_info
>   btrfs: priority alloc: call add_block_group_priority while reading or
>     making block group
>   btrfs: priority alloc: remove block group from priority tree while
>     removing block group
>   btrfs: priority alloc: introduce find_free_extent_search()
>   btrfs: priority alloc: modify find_free_extent() to fit priority
>     allocator
>   btrfs: priority alloc: introduce btrfs_set_bg_updating and call
>     btrfs_update_block_group_prioriy
>   btrfs: priority alloc: write bg->priority_groups_sem while waiting
>     reservation
>   btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
>     race in btrfs_delete_unused_bgs()
>   btrfs: add mount option "priority_alloc=%s"
> 
>  fs/btrfs/ctree.h            |  28 ++
>  fs/btrfs/extent-tree.c      | 672 +++++++++++++++++++++++++++++++++---
>  fs/btrfs/free-space-cache.c |   3 +
>  fs/btrfs/super.c            |  18 +
>  fs/btrfs/transaction.c      |   1 +
>  5 files changed, 681 insertions(+), 41 deletions(-)
>
Su Yue Dec. 2, 2018, 5:28 a.m. UTC | #2
On 2018/11/28 12:04 PM, Qu Wenruo wrote:
> 
> 
> On 2018/11/28 上午11:11, Su Yue wrote:
>> This patchset can be fetched from repo:
>> https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
>> Since patchset 'btrfs: Refactor find_free_extent()' does a nice work
>> to simplify find_free_extent(). This patchset dependents on the refactor.
>> The base is the commit in kdave/misc-next:
>>
>> commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
>> Author: Nikolay Borisov <nborisov@suse.com>
>> Date:   Tue Nov 6 16:40:20 2018 +0200
>>
>>      btrfs: Always try all copies when reading extent buffers
>>
>>
>> This patchset introduces a new mount option named 'priority_alloc=%s',
>> %s is supported to be "usage" and "off" now. The mount option changes
>> the way find_free_extent() how to search block groups.
>>
>> Previously, block groups are stored in list of btrfs_space_info
>> by start position. When call find_free_extent() if no hint,
>> block_groups are searched one by one.
>>
>> Design of priority aware allocator:
>> Block group has its own priority. We split priorities to many levels,
>> block groups are split to different trees according priorities.
>> And those trees are sorted by their levels and stored in space_info.
>> Once find_free_extent() is called, try to search block groups in higher
>> priority level then lower level. Then a block group with higher
>> priority is more likely to be used.
>>
>> Pros:
>> 1) Reduce the frequency of balance.
>>     The block group with a higher usage rate will be used preferentially
>>     for allocating extents. Free the empty block groups with pinned bytes
>>     as non-zero.[1]
>>
>> 2) The priority of empty block group with pinned bytes as non-zero
>>     will be set as the lowest.
>>     
>> 3) Support zoned block device.[2]
>>     For metadata allocation, the block group in conventional zones
>>     will be used as much as possible regardless of usage rate.
>>     Will do it in future.
> 
> Personally I'm a big fan of the priority aware extent allocator.
> 
> So nice job!
>

Thanks for the offline help.

>>     
>> Cons:
>> 1) Expectable performance regression.
>>     The degree of the decline is temporarily unknown.
>>     The user can disable block group priority to get the full performance.
>>
>> TESTS:
>>
>> If use usage as priority(the only available option), empty block group
>> is much harder to be reused.
>>
>> About block group usage:
>>    Disk: 4 x 1T HDD gathered in LVM.
>>
>>    Run script to create files and delete files randomly in loop.
>>    The num of files to create are double than to delete.
>>
>>    Default mount option result:
>>    https://i.loli.net/2018/11/28/5bfdfdf08c760.png
>>
>>    Priority aware allocator(usage) result:
>>    https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png
>>
>>    X coordinate means total disk usage, Y coordinate means avg block
>>    group usage.
>>
>>    Due to fragmentation of extents, the different are not obvious,
>>    only about 1% improvement....
> 
> I think you're using the wrong indicator to show the difference.
> 
> The real indicator should not be overall block group usage, but:
> 1) Number of block groups
> 2) Usage distribution of the block groups
> 
> If the number of block groups isn't much different, then we should go
> check the distribution.
> E.g. all bgs with 97% usage is not as good mostly 100% bgs and several
> near 10% bgs.
>

Took some time to write scripts for summary:
Avg of percentage of block groups during disk usage from 0% to 100%

For block groups whose usage >= 98%, default: 31.09%, priorty_alloc: 46.73%


For block groups whose usage >= 95%, default: 57.69%, priorty_alloc: 64.24%


For block groups whose usage >= 90%, default: 79.87%, priorty_alloc: 80.2%

So this patchset does work in improvement of block groups usages.


> And we should check the usage distribution between metadata and data bgs.
> For data bg, we could hit some fragmentation problem, while for meta bgs
> all extents are in the same size, thus may have a better performance for
> metadata.
> 
> Thus we could do better for the test result.
> 
>>    						
>> Performance regression:
>>     I have ran sysbench on our machine with SSD in multi combinations,
>>     no obvious regression found.
>>     However in theory, the new allocator may cost more time in some
>>     cases.
> 
> Isn't that a good news? :)
> 

Yeah.
>>
>> [1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
>> [2] https://lkml.org/lkml/2018/8/16/174
>>
>> ---
>> Due to some reasons includes time and hardware, the use-case is not
>> outstanding enough.
> 
> As discussed offline, another cause would be data extent fragmentations.
> E.g we have a lot of small 4K holes but the request is a big 128M.
> In that case btrfs_reserve_extent() could still trigger a new data chunk
> other than return the 4K holes found.
> 

IMO, this is another business. Doing it in another patchset is prefered.

Thanks,
Su

> Thanks,
> Qu
> 
>> And some codes are dirty but I can't found another
>> way. So I named it as RFC.
>>       Any comments and suggestions are welcome.
>>       
>> Su Yue (17):
>>    btrfs: priority alloc: prepare of priority aware allocator
>>    btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
>>    btrfs: priority alloc: introduce compute_block_group_priority/usage
>>    btrfs: priority alloc: add functions to create/remove priority trees
>>    btrfs: priority alloc: introduce functions to add block group to
>>      priority tree
>>    btrfs: priority alloc: introduce three macros to mark block group
>>      status
>>    btrfs: priority alloc: add functions to remove block group from
>>      priority tree
>>    btrfs: priority alloc: add btrfs_update_block_group_priority()
>>    btrfs: priority alloc: call create/remove_priority_trees in space_info
>>    btrfs: priority alloc: call add_block_group_priority while reading or
>>      making block group
>>    btrfs: priority alloc: remove block group from priority tree while
>>      removing block group
>>    btrfs: priority alloc: introduce find_free_extent_search()
>>    btrfs: priority alloc: modify find_free_extent() to fit priority
>>      allocator
>>    btrfs: priority alloc: introduce btrfs_set_bg_updating and call
>>      btrfs_update_block_group_prioriy
>>    btrfs: priority alloc: write bg->priority_groups_sem while waiting
>>      reservation
>>    btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
>>      race in btrfs_delete_unused_bgs()
>>    btrfs: add mount option "priority_alloc=%s"
>>
>>   fs/btrfs/ctree.h            |  28 ++
>>   fs/btrfs/extent-tree.c      | 672 +++++++++++++++++++++++++++++++++---
>>   fs/btrfs/free-space-cache.c |   3 +
>>   fs/btrfs/super.c            |  18 +
>>   fs/btrfs/transaction.c      |   1 +
>>   5 files changed, 681 insertions(+), 41 deletions(-)
>>