[v7,1/5] btrfs: Introduce per-profile available space facility
diff mbox series

Message ID 20200211051153.19466-2-wqu@suse.com
State New
Headers show
Series
  • Introduce per-profile available space array to avoid over-confident can_overcommit()
Related show

Commit Message

Qu Wenruo Feb. 11, 2020, 5:11 a.m. UTC
[PROBLEM]
There are some locations in btrfs requiring accurate estimation on how
many new bytes can be allocated on unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is, that function is not generic purposed enough, can't
  handle things like RAID5/6.

Current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
metadata.
But in fact we can only get 1T free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce per-profile available space calculation,
which can give an estimation based on chunk-allocator-like behavior.

The difference between it and chunk allocator is mostly on rounding and
[0, 1M) reserved space handling, which shouldn't cause practical impact.

The newly introduced per-profile available space calculation will
calculate available space for each type, using chunk-allocator like
calculation.

With that facility, for above device layout we get the full available
space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

As we always select the device with least free space, the device with
the most space will be the first to be utilized, just like chunk
allocator.
For above 1T + 10T device, we will allocate a 1T virtual chunk
in the first iteration, then run out of device in next iteration.

Thus only get 1T free space for RAID1 type, just like what chunk
allocator would do.

The patch will update such per-profile available space at the following
timing:
- Mount time
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Those timing are all protected by chunk_mutex, and what we do are only
iterating in-memory only structures, no extra IO triggered, so the
performance impact should be pretty small.

For the extra error handling, the principle is to keep the old behavior.
That's to say, if old error handler would just return an error, then we
follow it, no matter if the caller reverts the device size.

For the proper error handling, they will be added in later patches.
As I don't want to make the core facility bloated by the error handling
code, especially some situation needs quite some new code to handle
errors.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  11 +++
 2 files changed, 207 insertions(+), 20 deletions(-)

Comments

kernel test robot Feb. 13, 2020, 5:12 p.m. UTC | #1
Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.6-rc1]
[also build test ERROR on next-20200213]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Introduce-per-profile-available-space-array-to-avoid-over-confident-can_overcommit/20200213-093844
base:    bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-4) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: fs/btrfs/volumes.o: in function `calc_per_profile_avail':
>> volumes.c:(.text+0x2603): undefined reference to `__udivdi3'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Nikolay Borisov Feb. 20, 2020, 12:49 p.m. UTC | #2
<snip>

> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/volumes.h |  11 +++
>  2 files changed, 207 insertions(+), 20 deletions(-)
> 

<snip>

> +
> +/*
> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
> + * @allocated_size
> + * Return -ENOSPC if we have no more space to allocate virtual chunk
> + *
> + * *: virtual chunk is a space holder for per-profile available space
> + *    calculator.
> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
> + *    only affects per-profile available space calulation.
> + */

Document that the last parameter is an output value which contains the
size of the allocated virtual chunk.

> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_device_info *devices_info,
> +			       enum btrfs_raid_types type,
> +			       u64 *allocated)
> +{
> +	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 stripe_size;
> +	int i;
> +	int ndevs = 0;
> +
> +	lockdep_assert_held(&fs_info->chunk_mutex);
> +
> +	/* Go through devices to collect their unallocated space */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> +		u64 avail;
> +		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> +					&device->dev_state) ||
> +		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
> +			continue;
> +
> +		if (device->total_bytes > device->bytes_used +
> +				device->virtual_allocated)
> +			avail = device->total_bytes - device->bytes_used -
> +				device->virtual_allocated;
> +		else
> +			avail = 0;
> +
> +		/* And exclude the [0, 1M) reserved space */
> +		if (avail > SZ_1M)
> +			avail -= SZ_1M;
> +		else
> +			avail = 0;
> +
> +		if (avail < fs_info->sectorsize)
> +			continue;
> +		/*
> +		 * Unlike chunk allocator, we don't care about stripe or hole
> +		 * size, so here we use @avail directly
> +		 */
> +		devices_info[ndevs].dev_offset = 0;
> +		devices_info[ndevs].total_avail = avail;
> +		devices_info[ndevs].max_avail = avail;
> +		devices_info[ndevs].dev = device;
> +		++ndevs;
> +	}
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +	ndevs -= ndevs % raid_attr->devs_increment;

nit: ndevs = rounddown(ndevs, raid_attr->devs_increment);
makes it more clear what's going on. Since you are working with at most
int it's not a problem for 32bits.


> +	if (ndevs < raid_attr->devs_min)
> +		return -ENOSPC;
> +	if (raid_attr->devs_max)
> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
> +	else
> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));

Instead of casting simply use min_t(int, ndevs, BTRFS_MAX_DEVS...)

> +
> +	/*
> +	 * Now allocate a virtual chunk using the unallocate space of the

nit: missing d after 'unallocate'

> +	 * device with the least unallocated space.
> +	 */
> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
> +				 fs_info->sectorsize);
> +	if (stripe_size == 0)
> +		return -ENOSPC;

Isn't this check redundant - in the loop where you iterate the devices
you always ensure total_avail is at least a sector size, this guarantees
that stripe_size cannot be 0 at this point.

> +
> +	for (i = 0; i < ndevs; i++)
> +		devices_info[i].dev->virtual_allocated += stripe_size;
> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
> +		     raid_attr->ncopies;
> +	return 0;
> +}
> +
> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_raid_types type,
> +				  u64 *result_ret)
> +{
> +	struct btrfs_device_info *devices_info = NULL;
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 allocated;
> +	u64 result = 0;
> +	int ret = 0;
> +

lockdep assert that chunk mutex is held since you access alloc_list.

> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
> +
> +	/* Not enough devices, quick exit, just update the result */
> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
> +		goto out;

You can directly return.

> +
> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> +			       GFP_NOFS);
> +	if (!devices_info) {
> +		ret = -ENOMEM;
> +		goto out;

Same here.

> +	}
> +	/* Clear virtual chunk used space for each device */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +	while (ret == 0) {
> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
> +					  &allocated);
The 'allocated' variable is used only in this loop so declare it in the
loop. Ideally we want variables to have the shortest possible lifecycle.

> +		if (ret == 0)
> +			result += allocated;
> +	}

Why do you need to call this in a loop calling alloc_virtual_chunk once
simply calculate the available space for the given raid type.

> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +out:
> +	kfree(devices_info);
> +	if (ret < 0 && ret != -ENOSPC)
> +		return ret;
> +	*result_ret = result;
> +	return 0;
> +}

<snip>

> @@ -259,6 +266,10 @@ struct btrfs_fs_devices {
>  	struct kobject fsid_kobj;
>  	struct kobject *devices_kobj;
>  	struct completion kobj_unregister;
> +
> +	/* Records per-type available space */
> +	spinlock_t per_profile_lock;
> +	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];

Since every avail quantity is independent of any other, can you turn
this into an array of atomic64 values and get rid of the spinlock? My
head spins how many locks we have in btrfs.

>  };
>  
>  #define BTRFS_BIO_INLINE_CSUM_SIZE	64
>
Qu Wenruo Feb. 20, 2020, 12:59 p.m. UTC | #3
On 2020/2/20 下午8:49, Nikolay Borisov wrote:
> <snip>
>
>>
>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>>  fs/btrfs/volumes.h |  11 +++
>>  2 files changed, 207 insertions(+), 20 deletions(-)
>>
>
> <snip>
>
>> +
>> +/*
>> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
>> + * @allocated_size
>> + * Return -ENOSPC if we have no more space to allocate virtual chunk
>> + *
>> + * *: virtual chunk is a space holder for per-profile available space
>> + *    calculator.
>> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
>> + *    only affects per-profile available space calulation.
>> + */
>
> Document that the last parameter is an output value which contains the
> size of the allocated virtual chunk.
>
>> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
>> +			       struct btrfs_device_info *devices_info,
>> +			       enum btrfs_raid_types type,
>> +			       u64 *allocated)
>> +{
>> +	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 stripe_size;
>> +	int i;
>> +	int ndevs = 0;
>> +
>> +	lockdep_assert_held(&fs_info->chunk_mutex);
>> +
>> +	/* Go through devices to collect their unallocated space */
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
>> +		u64 avail;
>> +		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>> +					&device->dev_state) ||
>> +		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
>> +			continue;
>> +
>> +		if (device->total_bytes > device->bytes_used +
>> +				device->virtual_allocated)
>> +			avail = device->total_bytes - device->bytes_used -
>> +				device->virtual_allocated;
>> +		else
>> +			avail = 0;
>> +
>> +		/* And exclude the [0, 1M) reserved space */
>> +		if (avail > SZ_1M)
>> +			avail -= SZ_1M;
>> +		else
>> +			avail = 0;
>> +
>> +		if (avail < fs_info->sectorsize)
>> +			continue;
>> +		/*
>> +		 * Unlike chunk allocator, we don't care about stripe or hole
>> +		 * size, so here we use @avail directly
>> +		 */
>> +		devices_info[ndevs].dev_offset = 0;
>> +		devices_info[ndevs].total_avail = avail;
>> +		devices_info[ndevs].max_avail = avail;
>> +		devices_info[ndevs].dev = device;
>> +		++ndevs;
>> +	}
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
>> +	ndevs -= ndevs % raid_attr->devs_increment;
>
> nit: ndevs = rounddown(ndevs, raid_attr->devs_increment);

IIRC round_down() can only be used when the alignment is power of 2.

Don't forget we have RAID1C3 now.

> makes it more clear what's going on. Since you are working with at most
> int it's not a problem for 32bits.
>
>
>> +	if (ndevs < raid_attr->devs_min)
>> +		return -ENOSPC;
>> +	if (raid_attr->devs_max)
>> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
>> +	else
>> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
>
> Instead of casting simply use min_t(int, ndevs, BTRFS_MAX_DEVS...)

That looks the same to me...

>
>> +
>> +	/*
>> +	 * Now allocate a virtual chunk using the unallocate space of the
>
> nit: missing d after 'unallocate'
>
>> +	 * device with the least unallocated space.
>> +	 */
>> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
>> +				 fs_info->sectorsize);
>> +	if (stripe_size == 0)
>> +		return -ENOSPC;
>
> Isn't this check redundant - in the loop where you iterate the devices
> you always ensure total_avail is at least a sector size, this guarantees
> that stripe_size cannot be 0 at this point.
>
>> +
>> +	for (i = 0; i < ndevs; i++)
>> +		devices_info[i].dev->virtual_allocated += stripe_size;
>> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
>> +		     raid_attr->ncopies;
>> +	return 0;
>> +}
>> +
>> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
>> +				  enum btrfs_raid_types type,
>> +				  u64 *result_ret)
>> +{
>> +	struct btrfs_device_info *devices_info = NULL;
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 allocated;
>> +	u64 result = 0;
>> +	int ret = 0;
>> +
>
> lockdep assert that chunk mutex is held since you access alloc_list.
>
>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>> +
>> +	/* Not enough devices, quick exit, just update the result */
>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>> +		goto out;
>
> You can directly return.
>
>> +
>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> +			       GFP_NOFS);
>> +	if (!devices_info) {
>> +		ret = -ENOMEM;
>> +		goto out;
>
> Same here.
>
>> +	}
>> +	/* Clear virtual chunk used space for each device */
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +	while (ret == 0) {
>> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
>> +					  &allocated);
> The 'allocated' variable is used only in this loop so declare it in the
> loop. Ideally we want variables to have the shortest possible lifecycle.
>
>> +		if (ret == 0)
>> +			result += allocated;
>> +	}
>
> Why do you need to call this in a loop calling alloc_virtual_chunk once
> simply calculate the available space for the given raid type.

For this case, we must go several loops:
Dev1: 10G
Dev2: 5G
Dev3: 5G
Type: RAID1.

The first loop will use 5G from dev1, 5G from dev2.
Then the 2nd loop will use the remaining 5G from dev1, 5G from dev3.

And that's the core problem per-profile available space system want to
address.

>
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +out:
>> +	kfree(devices_info);
>> +	if (ret < 0 && ret != -ENOSPC)
>> +		return ret;
>> +	*result_ret = result;
>> +	return 0;
>> +}
>
> <snip>
>
>> @@ -259,6 +266,10 @@ struct btrfs_fs_devices {
>>  	struct kobject fsid_kobj;
>>  	struct kobject *devices_kobj;
>>  	struct completion kobj_unregister;
>> +
>> +	/* Records per-type available space */
>> +	spinlock_t per_profile_lock;
>> +	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
>
> Since every avail quantity is independent of any other, can you turn
> this into an array of atomic64 values and get rid of the spinlock? My
> head spins how many locks we have in btrfs.

OK, that won't hurt, since they are updated separately, there is indeed
no need for a spinlock.

Thanks,
Qu
>
>>  };
>>
>>  #define BTRFS_BIO_INLINE_CSUM_SIZE	64
>>
Nikolay Borisov Feb. 20, 2020, 1:19 p.m. UTC | #4
On 20.02.20 г. 14:59 ч., Qu Wenruo wrote:
> 
> 
> On 2020/2/20 下午8:49, Nikolay Borisov wrote:
>> <snip>
>>
>>>
>>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>>>  fs/btrfs/volumes.h |  11 +++
>>>  2 files changed, 207 insertions(+), 20 deletions(-)
>>>
>>

<snip>

>>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +	     btrfs_cmp_device_info, NULL);
>>> +	ndevs -= ndevs % raid_attr->devs_increment;
>>
>> nit: ndevs = rounddown(ndevs, raid_attr->devs_increment);
> 
> IIRC round_down() can only be used when the alignment is power of 2.
> 
> Don't forget we have RAID1C3 now.

Sure, but I'm referring to rounddown and not round_down :)

> 
>> makes it more clear what's going on. Since you are working with at most
>> int it's not a problem for 32bits.
>>
>>
>>> +	if (ndevs < raid_attr->devs_min)
>>> +		return -ENOSPC;
>>> +	if (raid_attr->devs_max)
>>> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
>>> +	else
>>> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
>>
>> Instead of casting simply use min_t(int, ndevs, BTRFS_MAX_DEVS...)
> 
> That looks the same to me...

I guess it's a matter of preference so I will defer to David to decide.

> 
>>
>>> +
>>> +	/*
>>> +	 * Now allocate a virtual chunk using the unallocate space of the
>>
>> nit: missing d after 'unallocate'
>>
>>> +	 * device with the least unallocated space.
>>> +	 */
>>> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
>>> +				 fs_info->sectorsize);
>>> +	if (stripe_size == 0)
>>> +		return -ENOSPC;
>>
>> Isn't this check redundant - in the loop where you iterate the devices
>> you always ensure total_avail is at least a sector size, this guarantees
>> that stripe_size cannot be 0 at this point.
>>
>>> +
>>> +	for (i = 0; i < ndevs; i++)
>>> +		devices_info[i].dev->virtual_allocated += stripe_size;
>>> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
>>> +		     raid_attr->ncopies;
>>> +	return 0;
>>> +}
>>> +
>>> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
>>> +				  enum btrfs_raid_types type,
>>> +				  u64 *result_ret)
>>> +{
>>> +	struct btrfs_device_info *devices_info = NULL;
>>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> +	struct btrfs_device *device;
>>> +	u64 allocated;
>>> +	u64 result = 0;
>>> +	int ret = 0;
>>> +
>>
>> lockdep assert that chunk mutex is held since you access alloc_list.
>>
>>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>>> +
>>> +	/* Not enough devices, quick exit, just update the result */
>>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>>> +		goto out;
>>
>> You can directly return.
>>
>>> +
>>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>>> +			       GFP_NOFS);
>>> +	if (!devices_info) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>
>> Same here.
>>
>>> +	}
>>> +	/* Clear virtual chunk used space for each device */
>>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>>> +		device->virtual_allocated = 0;
>>> +	while (ret == 0) {
>>> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
>>> +					  &allocated);
>> The 'allocated' variable is used only in this loop so declare it in the
>> loop. Ideally we want variables to have the shortest possible lifecycle.
>>
>>> +		if (ret == 0)
>>> +			result += allocated;
>>> +	}
>>

After the explanation on IRC I think it's better if this is written as:

while(!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
  result += allocated

That way it's easier (at least for me) to spot you are "draining"
something. IN this case you try to allocate/simulate multiple
allocations to calculate the real available space.

> 
> For this case, we must go several loops:
> Dev1: 10G
> Dev2: 5G
> Dev3: 5G
> Type: RAID1.
> 
> The first loop will use 5G from dev1, 5G from dev2.
> Then the 2nd loop will use the remaining 5G from dev1, 5G from dev3.
> 
> And that's the core problem per-profile available space system want to
> address.
> 

<snip>
Qu Wenruo Feb. 21, 2020, 5:16 a.m. UTC | #5
On 2020/2/20 下午8:49, Nikolay Borisov wrote:
> <snip>
>
>>
>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>>  fs/btrfs/volumes.h |  11 +++
>>  2 files changed, 207 insertions(+), 20 deletions(-)
>>
>
> <snip>
>
>> +
>> +/*
>> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
>> + * @allocated_size
>> + * Return -ENOSPC if we have no more space to allocate virtual chunk
>> + *
>> + * *: virtual chunk is a space holder for per-profile available space
>> + *    calculator.
>> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
>> + *    only affects per-profile available space calulation.
>> + */
>
> Document that the last parameter is an output value which contains the
> size of the allocated virtual chunk.

Isn't that mentioned in the 3rd line of the comment? (Although the
parameter name doesn't match after some updates).

...
> lockdep assert that chunk mutex is held since you access alloc_list.
>
>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>> +
>> +	/* Not enough devices, quick exit, just update the result */
>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>> +		goto out;
>
> You can directly return.

Then you won't update the @result_ret.
And don't forget that, @result_ret is from an uninitialized local array.
Thus we must go out tag.

>
>> +
>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> +			       GFP_NOFS);
>> +	if (!devices_info) {
>> +		ret = -ENOMEM;
>> +		goto out;
>
> Same here.

Same as above.

>
>> +	}
>> +	/* Clear virtual chunk used space for each device */
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +	while (ret == 0) {
>> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
>> +					  &allocated);
> The 'allocated' variable is used only in this loop so declare it in the
> loop. Ideally we want variables to have the shortest possible lifecycle.

Then your new while() loop idea will still need @allocated declared at
at the beginning of the function.

Thanks,
Qu

>
>> +		if (ret == 0)
>> +			result += allocated;
>> +	}
>
> Why do you need to call this in a loop calling alloc_virtual_chunk once
> simply calculate the available space for the given raid type.
>
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +out:
>> +	kfree(devices_info);
>> +	if (ret < 0 && ret != -ENOSPC)
>> +		return ret;
>> +	*result_ret = result;
>> +	return 0;
>> +}
>
> <snip>
>
>> @@ -259,6 +266,10 @@ struct btrfs_fs_devices {
>>  	struct kobject fsid_kobj;
>>  	struct kobject *devices_kobj;
>>  	struct completion kobj_unregister;
>> +
>> +	/* Records per-type available space */
>> +	spinlock_t per_profile_lock;
>> +	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
>
> Since every avail quantity is independent of any other, can you turn
> this into an array of atomic64 values and get rid of the spinlock? My
> head spins how many locks we have in btrfs.
>
>>  };
>>
>>  #define BTRFS_BIO_INLINE_CSUM_SIZE	64
>>

Patch
diff mbox series

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfc668f91f4..1886fa20530d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -352,6 +352,7 @@  static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	INIT_LIST_HEAD(&fs_devs->devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 	if (fsid)
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
@@ -2669,6 +2670,177 @@  static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated any virtual(*) chunk, and restore the size to
+ * @allocated_size
+ * Return -ENOSPC if we have no more space to allocate virtual chunk
+ *
+ * *: virtual chunk is a space holder for per-profile available space
+ *    calculator.
+ *    Such virtual chunks won't take on-disk space, thus called virtual, and
+ *    only affects per-profile available space calulation.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->virtual_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->virtual_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail < fs_info->sectorsize)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs -= ndevs % raid_attr->devs_increment;
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocate space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+				 fs_info->sectorsize);
+	if (stripe_size == 0)
+		return -ENOSPC;
+
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->virtual_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type,
+				  u64 *result_ret)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	while (ret == 0) {
+		ret = alloc_virtual_chunk(fs_info, devices_info, type,
+					  &allocated);
+		if (ret == 0)
+			result += allocated;
+	}
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	*result_ret = result;
+	return 0;
+}
+
+/*
+ * Calculate the per-profile available space array.
+ *
+ * Return 0 if we succeeded updating the array.
+ * Return <0 if something went wrong (ENOMEM), and the array is not
+ * updated.
+ */
+static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	u64 results[BTRFS_NR_RAID_TYPES];
+	int i;
+	int ret;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i, &results[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+		fs_info->fs_devices->per_profile_avail[i] =
+			results[i];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
+	return ret;
+}
+
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size)
 {
@@ -2676,6 +2848,7 @@  int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2702,7 +2875,10 @@  int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,
 			      &trans->transaction->dev_update_list);
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret < 0)
+		return ret;
 
 	return btrfs_update_device(trans, device);
 }
@@ -2872,7 +3048,13 @@  int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 					device->bytes_used - dev_extent_len);
 			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
 			btrfs_clear_space_info_full(fs_info);
+			ret = calc_per_profile_avail(fs_info);
 			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret < 0) {
+				mutex_unlock(&fs_devices->device_list_mutex);
+				btrfs_abort_transaction(trans, ret);
+				goto out;
+			}
 		}
 
 		ret = btrfs_update_device(trans, device);
@@ -4578,6 +4760,11 @@  int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		atomic64_sub(diff, &fs_info->free_chunk_space);
 	}
 
+	ret = calc_per_profile_avail(fs_info);
+	if (ret < 0) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		goto done;
+	}
 	/*
 	 * Once the device's size has been set to the new size, ensure all
 	 * in-memory chunks are synced to disk so that the loop below sees them
@@ -4742,25 +4929,6 @@  static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -5038,6 +5206,7 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			list_add_tail(&dev->post_commit_list,
 				      &trans->transaction->dev_update_list);
 	}
+	ret = calc_per_profile_avail(info);
 
 	atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);
 
@@ -5046,7 +5215,7 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	check_raid1c34_incompat_flag(info, type);
 
 	kfree(devices_info);
-	return 0;
+	return ret;
 
 error_del_extent:
 	write_lock(&em_tree->lock);
@@ -7609,6 +7778,13 @@  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->chunk_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 409f4816fb89..c1e4a32393b0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -140,6 +140,13 @@  struct btrfs_device {
 	struct completion kobj_unregister;
 	/* For sysfs/FSID/devinfo/devid/ */
 	struct kobject devid_kobj;
+
+	/*
+	 * the "virtual" allocated space by virtual chunk allocator, which
+	 * is used to do accurate estimation on available space.
+	 * Doesn't affect real chunk allocator.
+	 */
+	u64 virtual_allocated;
 };
 
 /*
@@ -259,6 +266,10 @@  struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *devices_kobj;
 	struct completion kobj_unregister;
+
+	/* Records per-type available space */
+	spinlock_t per_profile_lock;
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64