[v3,1/3] btrfs: Introduce per-profile available space facility
diff mbox series

Message ID 20200106061343.18772-2-wqu@suse.com
State New
Headers show
Series
  • Introduce per-profile available space array to avoid over-confident can_overcommit()
Related show

Commit Message

Qu Wenruo Jan. 6, 2020, 6:13 a.m. UTC
[PROBLEM]
There are some locations in btrfs requiring accurate estimation on how
many new bytes can be allocated on unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is, that function is not generic purposed enough, can't
  handle things like RAID5/6.

Current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
metadata.
But in fact we can only get 1T free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce per-profile available space calculation,
which can give an estimation based on chunk-allocator-like behavior.

The difference between it and chunk allocator is mostly on rounding and
[0, 1M) reserved space handling, which shouldn't cause practical impact.

The newly introduced per-profile available space calculation will
calculate available space for each type, using chunk-allocator like
calculation.

With that facility, for above device layout we get the full available
space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

As we always select the device with least free space, the device with
the most space will be the first to be utilized, just like chunk
allocator.
For above 1T + 10T device, we will allocate a 1T virtual chunk
in the first iteration, then run out of device in next iteration.

Thus only get 1T free space for RAID1 type, just like what chunk
allocator would do.

The patch will update such per-profile available space at the following
timing:
- Mount time
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 218 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  11 +++
 2 files changed, 209 insertions(+), 20 deletions(-)

Comments

David Sterba Jan. 6, 2020, 2:32 p.m. UTC | #1
On Mon, Jan 06, 2020 at 02:13:41PM +0800, Qu Wenruo wrote:
> +/*
> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
> + * @allocated_size
> + * Return -ENOSPC if we have no more space to allocate virtual chunk
> + *
> + * *: virtual chunk is a space holder for per-profile available space
> + *    calculator.
> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
> + *    only affects per-profile available space calulation.
> + */
> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_device_info *devices_info,
> +			       enum btrfs_raid_types type,
> +			       u64 to_alloc, u64 *allocated)
> +{
> +	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 stripe_size;
> +	int i;
> +	int ndevs = 0;
> +
> +	lockdep_assert_held(&fs_info->chunk_mutex);
> +
> +	/* Go through devices to collect their unallocated space */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> +		u64 avail;
> +		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> +					&device->dev_state) ||
> +		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
> +			continue;
> +
> +		if (device->total_bytes > device->bytes_used +
> +				device->virtual_allocated)
> +			avail = device->total_bytes - device->bytes_used -
> +				device->virtual_allocated;
> +		else
> +			avail = 0;
> +
> +		/* And exclude the [0, 1M) reserved space */
> +		if (avail > SZ_1M)
> +			avail -= SZ_1M;
> +		else
> +			avail = 0;
> +
> +		if (avail == 0)
> +			continue;
> +		/*
> +		 * Unlike chunk allocator, we don't care about stripe or hole
> +		 * size, so here we use @avail directly
> +		 */
> +		devices_info[ndevs].dev_offset = 0;
> +		devices_info[ndevs].total_avail = avail;
> +		devices_info[ndevs].max_avail = avail;
> +		devices_info[ndevs].dev = device;
> +		++ndevs;
> +	}
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +	ndevs -= ndevs % raid_attr->devs_increment;
> +	if (ndevs < raid_attr->devs_min)
> +		return -ENOSPC;
> +	if (raid_attr->devs_max)
> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
> +	else
> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
> +
> +	/*
> +	 * Now allocate a virtual chunk using the unallocate space of the
> +	 * device with the least unallocated space.
> +	 */
> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
> +				 fs_info->sectorsize);
> +
> +	/* We can't directly do round_up for (u64)-1 as that would result 0 */
> +	if (to_alloc != (u64)-1)
> +		stripe_size = min_t(u64, stripe_size,
> +				    round_up(div_u64(to_alloc, ndevs),
> +					     fs_info->sectorsize));
> +	if (stripe_size == 0)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndevs; i++)
> +		devices_info[i].dev->virtual_allocated += stripe_size;
> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
> +		     raid_attr->ncopies;
> +	return 0;
> +}
> +
> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_raid_types type)
> +{
> +	struct btrfs_device_info *devices_info = NULL;
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 allocated;
> +	u64 result = 0;
> +	int ret = 0;
> +
> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
> +
> +	/* Not enough devices, quick exit, just update the result */
> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
> +		goto out;
> +
> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> +			       GFP_NOFS);

Calling kcalloc is another potential slowdown, for the statfs
considerations.

> +	if (!devices_info) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	/* Clear virtual chunk used space for each device */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +	while (ret == 0) {
> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
> +					  (u64)-1, &allocated);
> +		if (ret == 0)
> +			result += allocated;
> +	}
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +out:
> +	kfree(devices_info);
> +	if (ret < 0 && ret != -ENOSPC)
> +		return ret;
> +	spin_lock(&fs_devices->per_profile_lock);
> +	fs_devices->per_profile_avail[type] = result;
> +	spin_unlock(&fs_devices->per_profile_lock);
> +	return 0;
> +}
> +
> +/*
> + * Calculate the per-profile available space array.
> + *
> + * Return 0 if we succeeded updating the array.
> + * Return <0 if something went wrong. (ENOMEM)
> + */
> +static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
> +{
> +	int i;
> +	int ret;
> +
> +	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> +		ret = calc_one_profile_avail(fs_info, i);
> +		if (ret < 0)
> +			break;
> +	}
> +	return ret;
> +}
> +
>  int btrfs_grow_device(struct btrfs_trans_handle *trans,
>  		      struct btrfs_device *device, u64 new_size)
>  {
> @@ -2635,6 +2806,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
>  	struct btrfs_super_block *super_copy = fs_info->super_copy;
>  	u64 old_total;
>  	u64 diff;
> +	int ret;
>  
>  	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
>  		return -EACCES;
> @@ -2661,7 +2833,12 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
>  	if (list_empty(&device->post_commit_list))
>  		list_add_tail(&device->post_commit_list,
>  			      &trans->transaction->dev_update_list);
> +	ret = calc_per_profile_avail(fs_info);
>  	mutex_unlock(&fs_info->chunk_mutex);
> +	if (ret < 0) {
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
>  
>  	return btrfs_update_device(trans, device);
>  }
> @@ -2831,7 +3008,13 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>  					device->bytes_used - dev_extent_len);
>  			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
>  			btrfs_clear_space_info_full(fs_info);
> +			ret = calc_per_profile_avail(fs_info);

Adding new failure modes

>  			mutex_unlock(&fs_info->chunk_mutex);
> +			if (ret < 0) {
> +				mutex_unlock(&fs_devices->device_list_mutex);
> +				btrfs_abort_transaction(trans, ret);
> +				goto out;
> +			}
>  		}
>  
>  		ret = btrfs_update_device(trans, device);
> @@ -4526,6 +4709,12 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>  		atomic64_sub(diff, &fs_info->free_chunk_space);
>  	}
>  
> +	ret = calc_per_profile_avail(fs_info);
> +	if (ret < 0) {
> +		btrfs_abort_transaction(trans, ret);
> +		btrfs_end_transaction(trans);
> +		goto done;
> +	}
>  	/*
>  	 * Once the device's size has been set to the new size, ensure all
>  	 * in-memory chunks are synced to disk so that the loop below sees them
> @@ -4690,25 +4879,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -138,6 +138,13 @@ struct btrfs_device {
>  	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
>  
>  	struct extent_io_tree alloc_state;
> +
> +	/*
> +	 * the "virtual" allocated space by virtual chunk allocator, which
> +	 * is used to do accurate estimation on available space.
> +	 * Doesn't affect real chunk allocator.
> +	 */
> +	u64 virtual_allocated;

I find it a bit confusing to use 'virtual', though I get what you mean.
As this is per-device it accounts overall space, so the name should
reflect somthing like that. I maybe have a more concrete suggestion once
I read through the whole series.
kbuild test robot Jan. 6, 2020, 11:45 p.m. UTC | #2
Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.5-rc5]
[also build test ERROR on next-20200106]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Introduce-per-profile-available-space-array-to-avoid-over-confident-can_overcommit/20200107-025134
base:    c79f46a282390e0f5b306007bf7b11a46d529538
config: m68k-multi_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation
Qu Wenruo Jan. 7, 2020, 2:13 a.m. UTC | #3
On 2020/1/6 下午10:32, David Sterba wrote:
> On Mon, Jan 06, 2020 at 02:13:41PM +0800, Qu Wenruo wrote:
[...]
>> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
>> +				  enum btrfs_raid_types type)
>> +{
>> +	struct btrfs_device_info *devices_info = NULL;
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 allocated;
>> +	u64 result = 0;
>> +	int ret = 0;
>> +
>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>> +
>> +	/* Not enough devices, quick exit, just update the result */
>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>> +		goto out;
>> +
>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> +			       GFP_NOFS);
> 
> Calling kcalloc is another potential slowdown, for the statfs
> considerations.

That's also what we did in statfs() before, so it shouldn't cause extra
problem.
Furthermore, we didn't use calc_one_profile_avail() directly in statfs()
directly.

Although I get your point, and personally speaking the memory allocation
and extra in-memory device iteration should be pretty fast compared to
__btrfs_alloc_chunk().

Thus I don't think this memory allocation would cause extra trouble,
except the error handling mentioned below.

[...]
>> +			ret = calc_per_profile_avail(fs_info);
> 
> Adding new failure modes

Another solution I have tried is make calc_per_profile_avail() return
void, ignoring the ENOMEM error, and just set the affected profile to 0
available space.

But that method is just delaying ENOMEM, and would cause strange
pre-profile values until next successful update or mount cycle.

Any idea on which method is less worse?

[...]
>>  
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -138,6 +138,13 @@ struct btrfs_device {
>>  	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
>>  
>>  	struct extent_io_tree alloc_state;
>> +
>> +	/*
>> +	 * the "virtual" allocated space by virtual chunk allocator, which
>> +	 * is used to do accurate estimation on available space.
>> +	 * Doesn't affect real chunk allocator.
>> +	 */
>> +	u64 virtual_allocated;
> 
> I find it a bit confusing to use 'virtual', though I get what you mean.
> As this is per-device it accounts overall space, so the name should
> reflect somthing like that. I maybe have a more concrete suggestion once
> I read through the whole series.
> 
Looking forward that naming.

Thanks,
Qu
David Sterba Jan. 8, 2020, 3:04 p.m. UTC | #4
On Tue, Jan 07, 2020 at 10:13:43AM +0800, Qu Wenruo wrote:
> >> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> >> +			       GFP_NOFS);
> > 
> > Calling kcalloc is another potential slowdown, for the statfs
> > considerations.
> 
> That's also what we did in statfs() before, so it shouldn't cause extra
> problem.
> Furthermore, we didn't use calc_one_profile_avail() directly in statfs()
> directly.
> 
> Although I get your point, and personally speaking the memory allocation
> and extra in-memory device iteration should be pretty fast compared to
> __btrfs_alloc_chunk().
> 
> Thus I don't think this memory allocation would cause extra trouble,
> except the error handling mentioned below.

Right, current statfs also does allocation via
btrfs_calc_avail_data_space, so it's the same as before.

> [...]
> >> +			ret = calc_per_profile_avail(fs_info);
> > 
> > Adding new failure modes
> 
> Another solution I have tried is make calc_per_profile_avail() return
> void, ignoring the ENOMEM error, and just set the affected profile to 0
> available space.
> 
> But that method is just delaying ENOMEM, and would cause strange
> pre-profile values until next successful update or mount cycle.
> 
> Any idea on which method is less worse?

Better to return the error than wrong values in this case. As the
numbers are sort of a cache and the mount cycle to get them fixed is not
very user friendly, we need some other way. As this is a global state, a
bit in fs_info::flags can be set and full recalculation attempted at
some point until it succeeds. This would leave the counters stale for
some time but I think still better than if they're suddenly 0.
Qu Wenruo Jan. 8, 2020, 11:53 p.m. UTC | #5
On 2020/1/8 下午11:04, David Sterba wrote:
> On Tue, Jan 07, 2020 at 10:13:43AM +0800, Qu Wenruo wrote:
>>>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>>>> +			       GFP_NOFS);
>>>
>>> Calling kcalloc is another potential slowdown, for the statfs
>>> considerations.
>>
>> That's also what we did in statfs() before, so it shouldn't cause extra
>> problem.
>> Furthermore, we didn't use calc_one_profile_avail() directly in statfs()
>> directly.
>>
>> Although I get your point, and personally speaking the memory allocation
>> and extra in-memory device iteration should be pretty fast compared to
>> __btrfs_alloc_chunk().
>>
>> Thus I don't think this memory allocation would cause extra trouble,
>> except the error handling mentioned below.
> 
> Right, current statfs also does allocation via
> btrfs_calc_avail_data_space, so it's the same as before.
> 
>> [...]
>>>> +			ret = calc_per_profile_avail(fs_info);
>>>
>>> Adding new failure modes
>>
>> Another solution I have tried is make calc_per_profile_avail() return
>> void, ignoring the ENOMEM error, and just set the affected profile to 0
>> available space.
>>
>> But that method is just delaying ENOMEM, and would cause strange
>> pre-profile values until next successful update or mount cycle.
>>
>> Any idea on which method is less worse?
> 
> Better to return the error than wrong values in this case. As the
> numbers are sort of a cache and the mount cycle to get them fixed is not
> very user friendly, we need some other way. As this is a global state, a
> bit in fs_info::flags can be set and full recalculation attempted at
> some point until it succeeds. This would leave the counters stale for
> some time but I think still better than if they're suddenly 0.
> 
If can_over_commit() is the only user of this facility, then either an
extra indicator or sudden 0 is no problem.
As in that case, we just don't over-commit and do extra flush to free
meta space.

But when statfs() is going to use this facility, either sudden 0 nor
indicator is good.
Just image seconds before, we still have several TiB free space, and all
of a sudden, just several GiB free (from allocated data chunks).

User will definitely complain.

Thus I still prefer proper error handling, as when we're low on memory,
a lot of things can go wrong anyway.

Thanks,
Qu
Qu Wenruo Jan. 9, 2020, 6:26 a.m. UTC | #6
On 2020/1/9 上午7:53, Qu WenRuo wrote:
> 
> 
> On 2020/1/8 下午11:04, David Sterba wrote:
>> On Tue, Jan 07, 2020 at 10:13:43AM +0800, Qu Wenruo wrote:
>>>>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>>>>> +			       GFP_NOFS);
>>>>
>>>> Calling kcalloc is another potential slowdown, for the statfs
>>>> considerations.
>>>
>>> That's also what we did in statfs() before, so it shouldn't cause extra
>>> problem.
>>> Furthermore, we didn't use calc_one_profile_avail() directly in statfs()
>>> directly.
>>>
>>> Although I get your point, and personally speaking the memory allocation
>>> and extra in-memory device iteration should be pretty fast compared to
>>> __btrfs_alloc_chunk().
>>>
>>> Thus I don't think this memory allocation would cause extra trouble,
>>> except the error handling mentioned below.
>>
>> Right, current statfs also does allocation via
>> btrfs_calc_avail_data_space, so it's the same as before.
>>
>>> [...]
>>>>> +			ret = calc_per_profile_avail(fs_info);
>>>>
>>>> Adding new failure modes
>>>
>>> Another solution I have tried is make calc_per_profile_avail() return
>>> void, ignoring the ENOMEM error, and just set the affected profile to 0
>>> available space.
>>>
>>> But that method is just delaying ENOMEM, and would cause strange
>>> pre-profile values until next successful update or mount cycle.
>>>
>>> Any idea on which method is less worse?
>>
>> Better to return the error than wrong values in this case. As the
>> numbers are sort of a cache and the mount cycle to get them fixed is not
>> very user friendly, we need some other way. As this is a global state, a
>> bit in fs_info::flags can be set and full recalculation attempted at
>> some point until it succeeds. This would leave the counters stale for
>> some time but I think still better than if they're suddenly 0.

BTW, not sure if this would make you feel less nervous.

When we return ENOMEM from this facility, the timings are:
- Mount
  So really not something would happen, and no problem would be caused
  at all.

- Chunk allocation
  It's from __btrfs_alloc_chunk() which also do memory allocation by
  itself and could return ENOMEM. So no different at error handling.

- Grow device
  This is a little complex.
  My new error handling is aborting transaction as we didn't reset the
  device size to its original size.
  But the existing btrfs_update_devcice() can return -ENOMEM, even
  without resetting device size.
  From this point of view, my new error handling is at least better
  to avoid device size mismatch.

- Shrink device
  This new error handling is overkilling.
  At done tag, we have method to revert to old device size, and we
  haven't done anything yet, so we should be able to recover from that
  situation.

Anyway, I will enhance the error handling part, to make then recover
without aborting transaction for shrinking device and growing device.

Thanks,
Qu

>>
> If can_over_commit() is the only user of this facility, then either an
> extra indicator or sudden 0 is no problem.
> As in that case, we just don't over-commit and do extra flush to free
> meta space.
> 
> But when statfs() is going to use this facility, either sudden 0 nor
> indicator is good.
> Just image seconds before, we still have several TiB free space, and all
> of a sudden, just several GiB free (from allocated data chunks).
> 
> User will definitely complain.
> 
> Thus I still prefer proper error handling, as when we're low on memory,
> a lot of things can go wrong anyway.
> 
> Thanks,
> Qu
>

Patch
diff mbox series

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..e38930390e89 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -349,6 +349,7 @@  static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	INIT_LIST_HEAD(&fs_devs->devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 	if (fsid)
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
@@ -2628,6 +2629,176 @@  static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated any virtual(*) chunk, and restore the size to
+ * @allocated_size
+ * Return -ENOSPC if we have no more space to allocate virtual chunk
+ *
+ * *: virtual chunk is a space holder for per-profile available space
+ *    calculator.
+ *    Such virtual chunks won't take on-disk space, thus called virtual, and
+ *    only affects per-profile available space calulation.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 to_alloc, u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->virtual_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->virtual_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail == 0)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs -= ndevs % raid_attr->devs_increment;
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocate space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+				 fs_info->sectorsize);
+
+	/* We can't directly do round_up for (u64)-1 as that would result 0 */
+	if (to_alloc != (u64)-1)
+		stripe_size = min_t(u64, stripe_size,
+				    round_up(div_u64(to_alloc, ndevs),
+					     fs_info->sectorsize));
+	if (stripe_size == 0)
+		return -ENOSPC;
+
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->virtual_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	while (ret == 0) {
+		ret = alloc_virtual_chunk(fs_info, devices_info, type,
+					  (u64)-1, &allocated);
+		if (ret == 0)
+			result += allocated;
+	}
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	spin_lock(&fs_devices->per_profile_lock);
+	fs_devices->per_profile_avail[type] = result;
+	spin_unlock(&fs_devices->per_profile_lock);
+	return 0;
+}
+
+/*
+ * Calculate the per-profile available space array.
+ *
+ * Return 0 if we succeeded updating the array.
+ * Return <0 if something went wrong. (ENOMEM)
+ */
+static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	int i;
+	int ret;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i);
+		if (ret < 0)
+			break;
+	}
+	return ret;
+}
+
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size)
 {
@@ -2635,6 +2806,7 @@  int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2661,7 +2833,12 @@  int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,
 			      &trans->transaction->dev_update_list);
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
 
 	return btrfs_update_device(trans, device);
 }
@@ -2831,7 +3008,13 @@  int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 					device->bytes_used - dev_extent_len);
 			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
 			btrfs_clear_space_info_full(fs_info);
+			ret = calc_per_profile_avail(fs_info);
 			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret < 0) {
+				mutex_unlock(&fs_devices->device_list_mutex);
+				btrfs_abort_transaction(trans, ret);
+				goto out;
+			}
 		}
 
 		ret = btrfs_update_device(trans, device);
@@ -4526,6 +4709,12 @@  int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		atomic64_sub(diff, &fs_info->free_chunk_space);
 	}
 
+	ret = calc_per_profile_avail(fs_info);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		goto done;
+	}
 	/*
 	 * Once the device's size has been set to the new size, ensure all
 	 * in-memory chunks are synced to disk so that the loop below sees them
@@ -4690,25 +4879,6 @@  static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4992,9 +5162,10 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	free_extent_map(em);
 	check_raid56_incompat_flag(info, type);
 	check_raid1c34_incompat_flag(info, type);
+	ret = calc_per_profile_avail(info);
 
 	kfree(devices_info);
-	return 0;
+	return ret;
 
 error_del_extent:
 	write_lock(&em_tree->lock);
@@ -7629,6 +7800,13 @@  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->chunk_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..5cddfe7cfee8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -138,6 +138,13 @@  struct btrfs_device {
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 
 	struct extent_io_tree alloc_state;
+
+	/*
+	 * the "virtual" allocated space by virtual chunk allocator, which
+	 * is used to do accurate estimation on available space.
+	 * Doesn't affect real chunk allocator.
+	 */
+	u64 virtual_allocated;
 };
 
 /*
@@ -257,6 +264,10 @@  struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	/* Records per-type available space */
+	spinlock_t per_profile_lock;
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64