[RFC,0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()

Message ID 20191225133938.115733-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Dec. 25, 2019, 1:39 p.m. UTC
There are several bug reports of ENOSPC error in
btrfs_run_delalloc_range().

With some extra info from one reporter, it turns out that
can_overcommit() is using a wrong way to calculate allocatable metadata
space.

The most typical case would look like:
  devid 1 unallocated:	1G
  devid 2 unallocated:  10G
  metadata profile:	RAID1

In the above case we can allocate at most 1G of metadata chunks, due to the
unbalanced free space across devices.
But the current can_overcommit() uses a factor-based calculation, which never
considers how the free space is distributed among devices.


To address this problem, this series introduces a per-profile available
space array, which gets updated every time a chunk is allocated or removed,
or a device is grown or shrunk.

This provides a quick way for hot paths like can_overcommit() to grab an
estimate of how many bytes they can over-commit.

The per-profile available space calculation tries to mirror the behavior of
the chunk allocator, so it handles uneven disks pretty well.

The RFC tag is here because I'm not yet confident enough about the
implementation.
I'm not sure whether this is the proper way to go, or just an over-engineered
mess.

Qu Wenruo (3):
  btrfs: Introduce per-profile available space facility
  btrfs: Update per-profile available space when device size/used space
    get updated
  btrfs: space-info: Use per-profile available space in can_overcommit()

 fs/btrfs/space-info.c |  15 ++--
 fs/btrfs/volumes.c    | 205 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h    |  10 +++
 3 files changed, 203 insertions(+), 27 deletions(-)

Comments

Josef Bacik Dec. 27, 2019, 6:32 p.m. UTC | #1
On 12/25/19 8:39 AM, Qu Wenruo wrote:
> There are several bug reports of ENOSPC error in
> btrfs_run_delalloc_range().
> 
> With some extra info from one reporter, it turns out that
> can_overcommit() is using a wrong way to calculate allocatable metadata
> space.
> 
> The most typical case would look like:
>    devid 1 unallocated:	1G
>    devid 2 unallocated:  10G
>    metadata profile:	RAID1
> 
> In above case, we can at most allocate 1G chunk for metadata, due to
> unbalanced disk free space.
> But current can_overcommit() uses factor based calculation, which never
> consider the disk free space balance.
> 
> 
> To address this problem, here comes the per-profile available space
> array, which gets updated every time a chunk get allocated/removed or a
> device get grown or shrunk.
> 
> This provides a quick way for hotter place like can_overcommit() to grab
> an estimation on how many bytes it can over-commit.
> 
> The per-profile available space calculation tries to keep the behavior
> of chunk allocator, thus it can handle uneven disks pretty well.
> 
> The RFC tag is here because I'm not yet confident enough about the
> implementation.
> I'm not sure this is the proper to go, or just a over-engineered mess.
> 

In general I like the approach, however re-doing the whole calculation every time we 
add or remove a chunk seems like overkill.  Can we get away with just doing the 
big expensive calculation on mount, and then adjusting available up and down as we 
add and remove chunks?  We know the chunk allocator is not going to allow us to 
allocate more chunks than our smallest device can hold, so we can be sure we're never 
going to go negative, and removing chunks is easy again for the same reason.

Other than that the approach seems sound to me.  Thanks,

Josef
Qu Wenruo Dec. 28, 2019, 1:09 a.m. UTC | #2
On 2019/12/28 2:32 AM, Josef Bacik wrote:
> On 12/25/19 8:39 AM, Qu Wenruo wrote:
>> There are several bug reports of ENOSPC error in
>> btrfs_run_delalloc_range().
>>
>> With some extra info from one reporter, it turns out that
>> can_overcommit() is using a wrong way to calculate allocatable metadata
>> space.
>>
>> The most typical case would look like:
>>    devid 1 unallocated:    1G
>>    devid 2 unallocated:  10G
>>    metadata profile:    RAID1
>>
>> In above case, we can at most allocate 1G chunk for metadata, due to
>> unbalanced disk free space.
>> But current can_overcommit() uses factor based calculation, which never
>> consider the disk free space balance.
>>
>>
>> To address this problem, here comes the per-profile available space
>> array, which gets updated every time a chunk get allocated/removed or a
>> device get grown or shrunk.
>>
>> This provides a quick way for hotter place like can_overcommit() to grab
>> an estimation on how many bytes it can over-commit.
>>
>> The per-profile available space calculation tries to keep the behavior
>> of chunk allocator, thus it can handle uneven disks pretty well.
>>
>> The RFC tag is here because I'm not yet confident enough about the
>> implementation.
>> I'm not sure this is the proper to go, or just a over-engineered mess.
>>
> 
> In general I like the approach, however re-doing the whole calculation
> once we add or remove a chunk seems like overkill.  Can we get away with
> just doing the big expensive calculation on mount, and then adjust
> available up and down as we add and remove chunks?

That looks good at a quick glance, but in practice it may not work as
expected, mostly due to small differences introduced by the sorting.

The current chunk allocator sorts devices by max hole size as the primary
sort key, which can produce a different result in some corner cases.
Without a proper re-calculation, the difference may drift larger and larger.

Thus I prefer to be a little safer and redo the calculation each time a
chunk gets allocated or removed.
And that calculation is not that heavy: it just iterates the device list
several times, and all accesses are in-memory without sleeping, so it
should be pretty fast.

Thanks,
Qu

>  We know the chunk
> allocator is not going to allow us to allocate more chunks than our
> smallest device, so we can be sure we're never going to go negative, and
> removing chunks is easy again because of the same reason.
> 
> Other than that the approach seems sound to me.  Thanks,
> 
> Josef
Josef Bacik Dec. 30, 2019, 2:29 p.m. UTC | #3
On 12/27/19 8:09 PM, Qu Wenruo wrote:
> 
> 
> On 2019/12/28 2:32 AM, Josef Bacik wrote:
>> On 12/25/19 8:39 AM, Qu Wenruo wrote:
>>> There are several bug reports of ENOSPC error in
>>> btrfs_run_delalloc_range().
>>>
>>> With some extra info from one reporter, it turns out that
>>> can_overcommit() is using a wrong way to calculate allocatable metadata
>>> space.
>>>
>>> The most typical case would look like:
>>>     devid 1 unallocated:    1G
>>>     devid 2 unallocated:  10G
>>>     metadata profile:    RAID1
>>>
>>> In above case, we can at most allocate 1G chunk for metadata, due to
>>> unbalanced disk free space.
>>> But current can_overcommit() uses factor based calculation, which never
>>> consider the disk free space balance.
>>>
>>>
>>> To address this problem, here comes the per-profile available space
>>> array, which gets updated every time a chunk get allocated/removed or a
>>> device get grown or shrunk.
>>>
>>> This provides a quick way for hotter place like can_overcommit() to grab
>>> an estimation on how many bytes it can over-commit.
>>>
>>> The per-profile available space calculation tries to keep the behavior
>>> of chunk allocator, thus it can handle uneven disks pretty well.
>>>
>>> The RFC tag is here because I'm not yet confident enough about the
>>> implementation.
>>> I'm not sure this is the proper to go, or just a over-engineered mess.
>>>
>>
>> In general I like the approach, however re-doing the whole calculation
>> once we add or remove a chunk seems like overkill.  Can we get away with
>> just doing the big expensive calculation on mount, and then adjust
>> available up and down as we add and remove chunks?
> 
> That looks good on a quick glance, but in practice it may not work as
> expected, mostly due to the small difference in sort.
> 
> Current chunk allocator works by sorting the max hole size as primary
> sort index, thus it may cause difference on some corner case.
> Without proper re-calculation, the difference may drift larger and larger.
> 
> Thus I prefer to be a little safer to do extra calculation each time
> chunk get allocated/remove.
> And that calculation is not that heavy, it just iterate the device lists
> several times, and all access are in-memory without sleep, it should be
> pretty fast.
> 

Ahh, I hadn't thought of different hole sizes.  You're right that it shouldn't 
matter in practice; it's not like chunk allocation is a fast path.  This seems 
reasonable to me then, I'll go through the patches properly.  Thanks,

Josef