
[2/2] btrfs: create chunk device type aware

Message ID 4ac12d6661470b18e1145f98c355bc1a93ebf214.1642518245.git.anand.jain@oracle.com (mailing list archive)
State New, archived
Series device type and create chunk

Commit Message

Anand Jain Jan. 18, 2022, 3:18 p.m. UTC
Mixed device type configurations exist. Most commonly, HDDs are mixed
with SSD/NVMe devices. This use case prefers that data chunks are
allocated on the HDDs and metadata chunks on the SSD/NVMe devices.

As of now, in gather_device_info(), called from btrfs_create_chunk(),
we sort the devices based on unallocated space only.

After this patch, the function checks for mixed device types and sorts
the devices based on enum btrfs_device_types: in ascending order if the
allocation type is metadata and in descending order if the allocation
type is data.

The advantage of this method is that the data/metadata allocation is
distributed according to device type automatically, without any manual
configuration.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

Comments

David Sterba Jan. 26, 2022, 5:01 p.m. UTC | #1
On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
> Mixed device types configuration exist. Most commonly, HDD mixed with
> SSD/NVME device types. This use case prefers that the data chunk
> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
> 
> As of now, in the function gather_device_info() called from
> btrfs_create_chunk(), we sort the devices based on unallocated space
> only.
> 
> After this patch, this function will check for mixed device types. And
> will sort the devices based on enum btrfs_device_types. That is, sort if
> the allocation type is metadata and reverse-sort if the allocation type
> is data.
> 
> The advantage of this method is that data/metadata allocation distribution
> based on the device type happens automatically without any manual
> configuration.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da3d6d0f5bc3..77fba78555d7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> +/*
> + * Sort the devices in its ascending order of latency value.
> + */
> +static int btrfs_cmp_device_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)

It's strange to see dev_type being compared while the function compares
the latency. As pointed out in the first patch, this could simply refer
to an array of device types sorted by latency.

> +		return 1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return -1;
> +	return 0;
> +}
> +
> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)

Regarding the naming, I think we've been using _desc (descending) as
suffix for the reverse sort functions, while ascending is the default.

> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return -1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return 1;
> +	return 0;

To avoid code duplication this could be

	return -btrfs_cmp_device_latency(a, b);

> +}
> +
>  /*
>   * sort the devices in descending order by max_avail, total_avail
>   */
> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> +	/*
> +	 * Sort devices by their latency. Ascending order of latency for
> +	 * metadata and descending order of latency for the data chunks for
> +	 * mixed device types.
> +	 */
> +	if (fs_devices->mixed_dev_types) {
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_rev_latency, NULL);
> +		else
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_latency, NULL);
> +	}
> +
>  	return 0;
>  }
>  
> -- 
> 2.33.1
David Sterba Jan. 26, 2022, 5:38 p.m. UTC | #2
On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
> Mixed device types configuration exist. Most commonly, HDD mixed with
> SSD/NVME device types. This use case prefers that the data chunk
> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
> 
> As of now, in the function gather_device_info() called from
> btrfs_create_chunk(), we sort the devices based on unallocated space
> only.

Yes, and this guarantees maximizing the used space for the raid
profiles.

> After this patch, this function will check for mixed device types. And
> will sort the devices based on enum btrfs_device_types. That is, sort if
> the allocation type is metadata and reverse-sort if the allocation type
> is data.

And this changes the above, because a small fast device can be depleted
first.

> The advantage of this method is that data/metadata allocation distribution
> based on the device type happens automatically without any manual
> configuration.

Yeah, but the default behaviour may not be suitable for all users so
some policy will have to be done anyway.

I vaguely remember some comments regarding mixed setups, along lines
that "if there's a fast flash device I'd rather see ENOSPC and either
delete files or add more devices than to let everything work but with
the risk of storing metadata on the slow devices."

> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da3d6d0f5bc3..77fba78555d7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> +/*
> + * Sort the devices in its ascending order of latency value.
> + */
> +static int btrfs_cmp_device_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return 1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return -1;
> +	return 0;
> +}
> +
> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return -1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return 1;
> +	return 0;
> +}
> +
>  /*
>   * sort the devices in descending order by max_avail, total_avail
>   */
> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> +	/*
> +	 * Sort devices by their latency. Ascending order of latency for
> +	 * metadata and descending order of latency for the data chunks for
> +	 * mixed device types.
> +	 */
> +	if (fs_devices->mixed_dev_types) {
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_rev_latency, NULL);
> +		else
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_latency, NULL);

In case there are mixed devices the sort happens twice, and because
sort() as implemented in the kernel is not stable, even devices with
the same amount of data can get reordered wildly. The remaining space
is still a factor we need to take into account to avoid ENOSPC at the
chunk level.
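
One way to address both points is a single comparator that keys on
dev_type first and then falls back to the existing space keys, so only
one sort() call is needed and devices of the same type keep their
max_avail/total_avail ordering. A minimal sketch, not part of the posted
series, assuming the dev_type field introduced in patch 1/2:

/*
 * Sketch only, not from the posted series: compare by dev_type first
 * (a lower value is assumed to mean lower latency), then by the same
 * space keys as btrfs_cmp_device_info(), so one sort() call is enough
 * and ties only remain between devices equal on every key.
 */
static int btrfs_cmp_device_type_then_space(const void *a, const void *b)
{
        const struct btrfs_device_info *di_a = a;
        const struct btrfs_device_info *di_b = b;

        if (di_a->dev->dev_type != di_b->dev->dev_type)
                return di_a->dev->dev_type < di_b->dev->dev_type ? -1 : 1;
        if (di_a->max_avail != di_b->max_avail)
                return di_a->max_avail > di_b->max_avail ? -1 : 1;
        if (di_a->total_avail != di_b->total_avail)
                return di_a->total_avail > di_b->total_avail ? -1 : 1;
        return 0;
}

For data chunks the dev_type comparison would simply be inverted while
the space keys stay as they are.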
Anand Jain Jan. 29, 2022, 4:24 p.m. UTC | #3
On 27/01/2022 01:01, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>> Mixed device types configuration exist. Most commonly, HDD mixed with
>> SSD/NVME device types. This use case prefers that the data chunk
>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>
>> As of now, in the function gather_device_info() called from
>> btrfs_create_chunk(), we sort the devices based on unallocated space
>> only.
>>
>> After this patch, this function will check for mixed device types. And
>> will sort the devices based on enum btrfs_device_types. That is, sort if
>> the allocation type is metadata and reverse-sort if the allocation type
>> is data.
>>
>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>   fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index da3d6d0f5bc3..77fba78555d7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Sort the devices in its ascending order of latency value.
>> + */
>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
> 
> It's strange to see dev_type being compared while the function compares
> the latency. As pointed out in the first patch, this could simply refer
> to an array of device types sorted by latency.
> 
  Agreed. This will change.

>> +		return 1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
> 
> Regarding the naming, I think we've been using _desc (descending) as
> suffix for the reverse sort functions, while ascending is the default.
> 

  Ok. Noted.

>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return -1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return 1;
>> +	return 0;
> 
> To avoid code duplication this could be
> 
> 	return -btrfs_cmp_device_latency(a, b);

  Oh right. I will do that.

Thanks, Anand


>> +}
>> +
>>   /*
>>    * sort the devices in descending order by max_avail, total_avail
>>    */
>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * Sort devices by their latency. Ascending order of latency for
>> +	 * metadata and descending order of latency for the data chunks for
>> +	 * mixed device types.
>> +	 */
>> +	if (fs_devices->mixed_dev_types) {
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_rev_latency, NULL);
>> +		else
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_latency, NULL);
>> +	}
>> +
>>   	return 0;
>>   }
>>   
>> -- 
>> 2.33.1
Anand Jain Jan. 29, 2022, 4:46 p.m. UTC | #4
On 27/01/2022 01:38, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>> Mixed device types configuration exist. Most commonly, HDD mixed with
>> SSD/NVME device types. This use case prefers that the data chunk
>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>
>> As of now, in the function gather_device_info() called from
>> btrfs_create_chunk(), we sort the devices based on unallocated space
>> only.
> 

> Yes, and this guarantees maximizing the used space for the raid
> profiles.

  Oops. Thanks for reminding me of that. More below.

>> After this patch, this function will check for mixed device types. And
>> will sort the devices based on enum btrfs_device_types. That is, sort if
>> the allocation type is metadata and reverse-sort if the allocation type
>> is data.
> 
> And this changes the above, because a small fast device can be depleted
> first.


Sorting by size or by latency does _not_ help if
given_raid.devs_max == 0 (raid0, raid5, raid6) OR
given_raid.devs_max == num_devices.

It helps only when given_raid.devs_max != 0 and
given_raid.devs_max < num_devices.

Sorting by size does not help the single and dup profiles.

So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) {

Mixed device types with different sizes, if sorted by free size:
  is pro for raid1, raid1c3, raid1c4, raid10
  doesn't matter for single, dup

Mixed device types with different sizes, if sorted by latency:
  is pro for single and dup
  is con for raid1, raid1c3, raid1c4, raid10 (depends)
}


So,

If given_raid.devs_max == num_devices we don't need any sorting.

If given_raid.devs_max == 0 (raid0, raid5, raid6) we don't need any
sorting.

And sort devices by latency for the single and dup profiles only.

For the rest of the profiles, sort devices by size only if
given_raid.devs_max < num_devices.
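
A sketch of that condition (the helper itself is hypothetical; it only
reuses the devs_max field of the existing struct btrfs_raid_attr):

/*
 * Hypothetical helper sketching the rule above: reordering the devices
 * can only change which ones are picked when the profile uses a strict
 * subset of the available devices.
 */
static bool device_order_matters(const struct btrfs_raid_attr *attr,
                                 int num_devices)
{
        /* devs_max == 0 means "use as many devices as available" (raid0/5/6) */
        if (attr->devs_max == 0)
                return false;
        return attr->devs_max < num_devices;
}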


>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
> 
> Yeah, but the default behaviour may not be suitable for all users so
> some policy will have to be done anyway.

  Right. If nothing is configured, even when the option is provided, it
  should fall back to the default behaviour.

> I vaguely remember some comments regarding mixed setups, along lines
> that "if there's a fast flash device I'd rather see ENOSPC and either
> delete files or add more devices than to let everything work but with
> the risk of storing metadata on the slow devices."

  It entirely depends on the use case. An option like the following
  would solve it better:
    mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error>

  If metadata_nospc_on_faster_devs=error is preferred, then it also
  implies data_nospc_on_slower_devs=error.

  Also, the use cases which prefer the error option should remember
  that it is difficult to estimate the data/metadata ratio beforehand.

>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>   fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index da3d6d0f5bc3..77fba78555d7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Sort the devices in its ascending order of latency value.
>> + */
>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return 1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return -1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>>   /*
>>    * sort the devices in descending order by max_avail, total_avail
>>    */
>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * Sort devices by their latency. Ascending order of latency for
>> +	 * metadata and descending order of latency for the data chunks for
>> +	 * mixed device types.
>> +	 */
>> +	if (fs_devices->mixed_dev_types) {
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_rev_latency, NULL);
>> +		else
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_latency, NULL);
> 
> In case there are mixed devices the sort happens twice, and because
> sort() as implemented in the kernel is not stable, even devices with
> the same amount of data can get reordered wildly. The remaining space
> is still a factor we need to take into account to avoid ENOSPC at the
> chunk level.


I didn't get this part. How about doing it this way:

    if (mixed && metadata && (single || dup)) {
      ndevs=0
      pick all non-rotational ndevs++ with free space >= required space
      if (ndevs == 0) {
        if (user_option->metadata_nospc_on_faster_devs == error)
             return -ENOSPC;
        pick all rotational
      }
      sort-by-size-select-top
    }

    if (mixed && data && (single || dup)) {
      ndevs=0
      pick all rotational ndevs++ with free space >= required space
      if (ndevs == 0) {
        if (user_option->data_nospc_on_faster_devs == error)
             return -ENOSPC;
        pick all non-rotational
      }
      sort-by-size-select-top
    }

Thanks, Anand
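
A rough sketch of that flow (illustration only; device_is_rotational()
and the nospc policy flag are hypothetical, and the per-device free
space check is left out):

/*
 * Illustration only, not from the posted series. Keep only the devices
 * of the preferred class; if none qualify, either return -ENOSPC (when
 * the policy asks for it) or fall back to the full list. The caller
 * then runs the existing sort-by-size and picks the top devices.
 */
static int filter_devices_by_class(struct btrfs_device_info *devices_info,
                                   int total, bool want_rotational,
                                   bool enospc_on_fallback, int *ndevs_ret)
{
        int i, ndevs = 0;

        for (i = 0; i < total; i++)
                if (device_is_rotational(devices_info[i].dev) == want_rotational)
                        devices_info[ndevs++] = devices_info[i];

        if (ndevs == 0) {
                if (enospc_on_fallback)
                        return -ENOSPC;
                ndevs = total;  /* nothing matched, the array is untouched */
        }

        *ndevs_ret = ndevs;
        return 0;
}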
Goffredo Baroncelli Jan. 30, 2022, 10:15 p.m. UTC | #5
On 29/01/2022 17.46, Anand Jain wrote:
> 
> 
> On 27/01/2022 01:38, David Sterba wrote:
>> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>>> Mixed device types configuration exist. Most commonly, HDD mixed with
>>> SSD/NVME device types. This use case prefers that the data chunk
>>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>>
>>> As of now, in the function gather_device_info() called from
>>> btrfs_create_chunk(), we sort the devices based on unallocated space
>>> only.
>>
> 
>> Yes, and this guarantees maximizing the used space for the raid
>> profiles.
> 
>   Oops. Thanks for reminding me of that. More below.
> 
>>> After this patch, this function will check for mixed device types. And
>>> will sort the devices based on enum btrfs_device_types. That is, sort if
>>> the allocation type is metadata and reverse-sort if the allocation type
>>> is data.
>>
>> And this changes the above, because a small fast device can be depleted
>> first.
> 
> 
> Both sort by size or latency do _not_ help if
>   given_raid.devs_max == 0 (raid0, raid5, raid6) OR given_raid.devs_max == num_devices.
> 
> It helps only when given_raid.devs_max != 0 and given_raid.devs_max < num_devices.
> 
> Sort by size does not help Single and Dup profiles.
> 
> So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) {
> 
> Mixed devs types with different sizes if sorted by free size:
>   is pro for  raid1, raid1c3, raid1c4, raid10
>   doesn't matter for single, dup
> 
> Mixed devs types with different sizes if sorted by latency:
>   is pro for single and dup
>   is con for raid1, raid1c3, raid1c4, raid10 (depends)
> }

Maybe I remember wrong, but in the "single" case a new BG is allocated
on the emptier disks. The same goes for "dup". In this respect it is not
different from RAID1xx. Only in the case of RAID5/6 does the size not
matter, because a new chunk is allocated on each disk anyway.

Pay attention that you can have a "mixed" environment with different
disk sizes:
  2 x SSD (100GB + 200GB)
  2 x HDD (1T + 2T)

So it could make sense to spread the metadata by priority (only to the
SSDs) AND by "emptiness" (i.e. out of every 3 BGs, one is allocated on
the smaller disk and two on the bigger one).

> 
> So,
> 
> If given_raid.devs_max == num_devices we don't need any type of sorting.
> 
> If given_raid.devs_max = 0 (raid0, raid5, raid6) we don't need any type of sorting.
> 
> And sort devs by latency for Single and Dup profiles only.
> 
> For rest of profiles sort devs by size only if given_raid.devs_max < num_devices.
> 
> 
>>> The advantage of this method is that data/metadata allocation distribution
>>> based on the device type happens automatically without any manual
>>> configuration.
>>
>> Yeah, but the default behaviour may not be suitable for all users so
>> some policy will have to be done anyway.
> 
>   Right. If nothing is configured even when provided then also it should
>   fallback to the default behaviour.
> 
>> I vaguely remember some comments regarding mixed setups, along lines
>> that "if there's a fast flash device I'd rather see ENOSPC and either
>> delete files or add more devices than to let everything work but with
>> the risk of storing metadata on the slow devices."
> 
>   It entirely depends on the use-case. An option like following will
>   solve it better:
>     mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error>
> 
>   If metadata_nospc_on_faster_devs=error is preferred then it also
>   implies that data_nospc_on_slower_devs=error.
> 
>   Also, the use cases which prefer to use the error option should
>   remember it is difficult to estimate the data/metadata ratio
>   beforehand.
> 
>>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>>> ---
>>>   fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 45 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index da3d6d0f5bc3..77fba78555d7 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>>       return 0;
>>>   }
>>> +/*
>>> + * Sort the devices in its ascending order of latency value.
>>> + */
>>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>>> +{
>>> +    const struct btrfs_device_info *di_a = a;
>>> +    const struct btrfs_device_info *di_b = b;
>>> +    struct btrfs_device *dev_a = di_a->dev;
>>> +    struct btrfs_device *dev_b = di_b->dev;
>>> +
>>> +    if (dev_a->dev_type > dev_b->dev_type)
>>> +        return 1;
>>> +    if (dev_a->dev_type < dev_b->dev_type)
>>> +        return -1;
>>> +    return 0;
>>> +}
>>> +
>>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>>> +{
>>> +    const struct btrfs_device_info *di_a = a;
>>> +    const struct btrfs_device_info *di_b = b;
>>> +    struct btrfs_device *dev_a = di_a->dev;
>>> +    struct btrfs_device *dev_b = di_b->dev;
>>> +
>>> +    if (dev_a->dev_type > dev_b->dev_type)
>>> +        return -1;
>>> +    if (dev_a->dev_type < dev_b->dev_type)
>>> +        return 1;
>>> +    return 0;
>>> +}
>>> +
>>>   /*
>>>    * sort the devices in descending order by max_avail, total_avail
>>>    */
>>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>>       sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>>            btrfs_cmp_device_info, NULL);
>>> +    /*
>>> +     * Sort devices by their latency. Ascending order of latency for
>>> +     * metadata and descending order of latency for the data chunks for
>>> +     * mixed device types.
>>> +     */
>>> +    if (fs_devices->mixed_dev_types) {
>>> +        if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>>> +            sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +                 btrfs_cmp_device_rev_latency, NULL);
>>> +        else
>>> +            sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +                 btrfs_cmp_device_latency, NULL);
>>
>> In case there are mixed devices the sort happens twice, and because
>> sort() as implemented in the kernel is not stable, even devices with
>> the same amount of data can get reordered wildly. The remaining space
>> is still a factor we need to take into account to avoid ENOSPC at the
>> chunk level.
> 
> 
> I didn't get this part. How about if it is this way:
> 
>     if (mixed && metadata && (single || dup)) {
>       ndevs=0
>       pick all non-rotational ndevs++ with free space >= required space
>       if (ndevs == 0) {
>         if (user_option->metadata_nospc_on_faster_devs == error)
>              return -ENOSPC;
>         pick all rotational
>       }
>       sort-by-size-select-top
>     }
> 
>     if (mixed && data && (single || dup)) {
>       ndevs=0
>       pick all rotational ndevs++ with free space >= required space
>       if (ndevs == 0) {
>         if (user_option->data_nospc_on_faster_devs == error)
>              return -ENOSPC;
>         pick all non-rotational
>       }
>       sort-by-size-select-top
>     }
> 
> Thanks, Anand

I think that you have to behave as the allocation_hint patch set does:
there is a single sort, and it sorts
- by priority
- if the disks have the same priority, by free space
- if the disks have the same priority and free space, by max avail
...
Goffredo Baroncelli Jan. 30, 2022, 10:28 p.m. UTC | #6
On 26/01/2022 18.38, David Sterba wrote:
>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
> Yeah, but the default behaviour may not be suitable for all users so
> some policy will have to be done anyway.
> 
> I vaguely remember some comments regarding mixed setups, along lines
> that "if there's a fast flash device I'd rather see ENOSPC and either
> delete files or add more devices than to let everything work but with
> the risk of storing metadata on the slow devices."
> 

I confirm that. There are two aspects that impacted the "allocation_hint"
patch set discussions:
1) the criteria to order the disks
2) allow/permit/deny a kind of BG to be hosted by a device

Regarding 1), initially the first set of patches considered an automatic
behavior based on the "rotational" attribute. Soon it was pointed out
that the "non-rotational" class contains quite different device kinds
(like SSD, NVMe...). The conclusion was that the "priority" should not be
hard-coded in the btrfs code (e.g. should we also consider reliability?).

Regarding 2), my initial patch set only ordered the disks and allowed any
disk to be used. Some users asked me to prevent certain disks from being
used for (e.g.) data, to prevent the data BGs from consuming all the
available space [*]

These discussions led me to create 4 "classes" for disks:
- ONLY_METADATA
- PREFERRED_METADATA
- PREFERRED_DATA
- ONLY_DATA

The last one is not suitable for metadata, and the first one is not
suitable for data. The middle ones also allow the other type, but only if
the other disks are full.

Another difference between Anand's patches and mine is that, with
"allocation_hint", for striped profiles (like raid5, which spans all
the available disks), the devices involved should have the same class.

I.e., if btrfs has to allocate a RAID5 metadata chunk,
- first tries to use ONLY_METADATA disks.
- If the disks are not enough, then it tries to use ONLY_METADATA and
   PREFERRED_METADATA disks.
- If the disks are not enough, then it tries to use ONLY_METADATA,
   PREFERRED_METADATA and PREFERRED_DATA disks.
- If the disks are not enough, then -ENOSPC
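
A rough sketch of that widening fallback (the class values and the
per-device alloc_hint field are assumptions following the list above,
ordered from ONLY_METADATA to ONLY_DATA; none of this is existing btrfs
code):

/*
 * Assumed ordering: ONLY_METADATA < PREFERRED_METADATA < PREFERRED_DATA
 * < ONLY_DATA, stored in a per-device alloc_hint. For a metadata chunk,
 * start with the strictest class and widen until enough devices are
 * found; ONLY_DATA devices are never used for metadata.
 */
static int count_metadata_candidates(struct btrfs_device_info *info,
                                     int total, int devs_needed)
{
        int allowed_max;

        for (allowed_max = ONLY_METADATA; allowed_max <= PREFERRED_DATA;
             allowed_max++) {
                int i, found = 0;

                for (i = 0; i < total; i++)
                        if (info[i].dev->alloc_hint <= allowed_max)
                                found++;
                if (found >= devs_needed)
                        return found;
        }
        return -ENOSPC;
}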

What Anand's patches have over the allocation_hint patches is that
they handle the case of disks with different latencies, giving higher
priority to the disks with lower latency. If this is a requirement,
we can reserve some bits to add a per-disk priority, and then
we can sort the disks by:

- allocation_hint class
- priority
- max avail
- free space
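
A sketch of such a combined comparator (alloc_hint and priority are
assumed per-device fields, not existing btrfs members; the space keys
are the existing btrfs_device_info ones):

/*
 * Sketch of the ordering listed above: class, then priority (a lower
 * value is assumed to mean more preferred), then the existing space
 * keys used by btrfs_cmp_device_info().
 */
static int btrfs_cmp_device_hint(const void *a, const void *b)
{
        const struct btrfs_device_info *di_a = a;
        const struct btrfs_device_info *di_b = b;

        if (di_a->dev->alloc_hint != di_b->dev->alloc_hint)
                return di_a->dev->alloc_hint < di_b->dev->alloc_hint ? -1 : 1;
        if (di_a->dev->priority != di_b->dev->priority)
                return di_a->dev->priority < di_b->dev->priority ? -1 : 1;
        if (di_a->max_avail != di_b->max_avail)
                return di_a->max_avail > di_b->max_avail ? -1 : 1;
        if (di_a->total_avail != di_b->total_avail)
                return di_a->total_avail > di_b->total_avail ? -1 : 1;
        return 0;
}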

BR
G.Baroncelli

[*] I don't want to open another discussion, but this seems to me more a "quota"
problem than a "disk allocation" problem...

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index da3d6d0f5bc3..77fba78555d7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5060,6 +5060,37 @@  static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * Sort the devices in its ascending order of latency value.
+ */
+static int btrfs_cmp_device_latency(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	struct btrfs_device *dev_a = di_a->dev;
+	struct btrfs_device *dev_b = di_b->dev;
+
+	if (dev_a->dev_type > dev_b->dev_type)
+		return 1;
+	if (dev_a->dev_type < dev_b->dev_type)
+		return -1;
+	return 0;
+}
+
+static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	struct btrfs_device *dev_a = di_a->dev;
+	struct btrfs_device *dev_b = di_b->dev;
+
+	if (dev_a->dev_type > dev_b->dev_type)
+		return -1;
+	if (dev_a->dev_type < dev_b->dev_type)
+		return 1;
+	return 0;
+}
+
 /*
  * sort the devices in descending order by max_avail, total_avail
  */
@@ -5292,6 +5323,20 @@  static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * Sort devices by their latency. Ascending order of latency for
+	 * metadata and descending order of latency for the data chunks for
+	 * mixed device types.
+	 */
+	if (fs_devices->mixed_dev_types) {
+		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_rev_latency, NULL);
+		else
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_latency, NULL);
+	}
+
 	return 0;
 }