Message ID | 4ac12d6661470b18e1145f98c355bc1a93ebf214.1642518245.git.anand.jain@oracle.com (mailing list archive)
---|---
State | New, archived
Series | device type and create chunk
On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
> Mixed device types configuration exist. Most commonly, HDD mixed with
> SSD/NVME device types. This use case prefers that the data chunk
> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>
> As of now, in the function gather_device_info() called from
> btrfs_create_chunk(), we sort the devices based on unallocated space
> only.
>
> After this patch, this function will check for mixed device types. And
> will sort the devices based on enum btrfs_device_types. That is, sort if
> the allocation type is metadata and reverse-sort if the allocation type
> is data.
>
> The advantage of this method is that data/metadata allocation distribution
> based on the device type happens automatically without any manual
> configuration.
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da3d6d0f5bc3..77fba78555d7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>
> +/*
> + * Sort the devices in its ascending order of latency value.
> + */
> +static int btrfs_cmp_device_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)

It's strange to see dev_type being compared while the function claims to
compare latency. As pointed out in the first patch, this could simply
refer to an array of device types sorted by latency.

> +		return 1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return -1;
> +	return 0;
> +}
> +
> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)

Regarding the naming, I think we've been using _desc (descending) as the
suffix for the reverse sort functions, while ascending is the default.

> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return -1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return 1;
> +	return 0;

To avoid code duplication this could be

	return -btrfs_cmp_device_latency(a, b);

> +}
> +
>  /*
>   * sort the devices in descending order by max_avail, total_avail
>   */
> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>
> +	/*
> +	 * Sort devices by their latency. Ascending order of latency for
> +	 * metadata and descending order of latency for the data chunks for
> +	 * mixed device types.
> +	 */
> +	if (fs_devices->mixed_dev_types) {
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_rev_latency, NULL);
> +		else
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_latency, NULL);
> +	}
> +
>  	return 0;
>  }
>
> --
> 2.33.1
On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
> Mixed device types configuration exist. Most commonly, HDD mixed with
> SSD/NVME device types. This use case prefers that the data chunk
> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>
> As of now, in the function gather_device_info() called from
> btrfs_create_chunk(), we sort the devices based on unallocated space
> only.

Yes, and this guarantees maximizing the used space for the raid
profiles.

> After this patch, this function will check for mixed device types. And
> will sort the devices based on enum btrfs_device_types. That is, sort if
> the allocation type is metadata and reverse-sort if the allocation type
> is data.

And this changes the above, because a small fast device can be depleted
first.

> The advantage of this method is that data/metadata allocation distribution
> based on the device type happens automatically without any manual
> configuration.

Yeah, but the default behaviour may not be suitable for all users, so
some policy will have to be done anyway.

I vaguely remember some comments regarding mixed setups, along the lines
of "if there's a fast flash device I'd rather see ENOSPC and either
delete files or add more devices than to let everything work but with
the risk of storing metadata on the slow devices."

> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da3d6d0f5bc3..77fba78555d7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>
> +/*
> + * Sort the devices in its ascending order of latency value.
> + */
> +static int btrfs_cmp_device_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return 1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return -1;
> +	return 0;
> +}
> +
> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return -1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return 1;
> +	return 0;
> +}
> +
>  /*
>   * sort the devices in descending order by max_avail, total_avail
>   */
> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>
> +	/*
> +	 * Sort devices by their latency. Ascending order of latency for
> +	 * metadata and descending order of latency for the data chunks for
> +	 * mixed device types.
> +	 */
> +	if (fs_devices->mixed_dev_types) {
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_rev_latency, NULL);
> +		else
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_latency, NULL);

In case there are mixed devices the sort happens twice, and because
sort() as implemented in the kernel is not stable, devices with the same
amount of free space can get reordered wildly. The remaining space is
still a factor we need to take into account to avoid ENOSPC at the chunk
level.
On 27/01/2022 01:01, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>> Mixed device types configuration exist. Most commonly, HDD mixed with
>> SSD/NVME device types. This use case prefers that the data chunk
>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>
>> As of now, in the function gather_device_info() called from
>> btrfs_create_chunk(), we sort the devices based on unallocated space
>> only.
>>
>> After this patch, this function will check for mixed device types. And
>> will sort the devices based on enum btrfs_device_types. That is, sort if
>> the allocation type is metadata and reverse-sort if the allocation type
>> is data.
>>
>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index da3d6d0f5bc3..77fba78555d7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>  	return 0;
>>  }
>>
>> +/*
>> + * Sort the devices in its ascending order of latency value.
>> + */
>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>
> It's strange to see dev_type being compared while the function claims to
> compare latency. As pointed out in the first patch, this could simply
> refer to an array of device types sorted by latency.

Agreed. This will change.

>> +		return 1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>
> Regarding the naming, I think we've been using _desc (descending) as the
> suffix for the reverse sort functions, while ascending is the default.

Ok. Noted.

>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return -1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return 1;
>> +	return 0;
>
> To avoid code duplication this could be
>
> 	return -btrfs_cmp_device_latency(a, b);

Oh right. I will do that.

Thanks, Anand

>> +}
>> +
>>  /*
>>   * sort the devices in descending order by max_avail, total_avail
>>   */
>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>  	     btrfs_cmp_device_info, NULL);
>>
>> +	/*
>> +	 * Sort devices by their latency. Ascending order of latency for
>> +	 * metadata and descending order of latency for the data chunks for
>> +	 * mixed device types.
>> +	 */
>> +	if (fs_devices->mixed_dev_types) {
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_rev_latency, NULL);
>> +		else
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_latency, NULL);
>> +	}
>> +
>>  	return 0;
>>  }
>>
>> --
>> 2.33.1
On 27/01/2022 01:38, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>> Mixed device types configuration exist. Most commonly, HDD mixed with
>> SSD/NVME device types. This use case prefers that the data chunk
>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>
>> As of now, in the function gather_device_info() called from
>> btrfs_create_chunk(), we sort the devices based on unallocated space
>> only.
>
> Yes, and this guarantees maximizing the used space for the raid
> profiles.

Oops. Thanks for reminding me of that. More below.

>> After this patch, this function will check for mixed device types. And
>> will sort the devices based on enum btrfs_device_types. That is, sort if
>> the allocation type is metadata and reverse-sort if the allocation type
>> is data.
>
> And this changes the above, because a small fast device can be depleted
> first.

Both sorting by size and sorting by latency do _not_ help if
given_raid.devs_max == 0 (raid0, raid5, raid6) OR
given_raid.devs_max == num_devices.

It helps only when given_raid.devs_max != 0 and
given_raid.devs_max < num_devices.

Sorting by size does not help the single and dup profiles.

So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) {

	Mixed device types with different sizes, sorted by free size:
		is a pro for raid1, raid1c3, raid1c4, raid10
		doesn't matter for single, dup

	Mixed device types with different sizes, sorted by latency:
		is a pro for single and dup
		is a con for raid1, raid1c3, raid1c4, raid10 (depends)
}

So:

If given_raid.devs_max == num_devices we don't need any type of sorting.

If given_raid.devs_max == 0 (raid0, raid5, raid6) we don't need any type
of sorting.

And sort devices by latency for the single and dup profiles only.

For the rest of the profiles, sort devices by size only if
given_raid.devs_max < num_devices.

>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
>
> Yeah, but the default behaviour may not be suitable for all users, so
> some policy will have to be done anyway.

Right. If no policy is configured, even when one is available, it should
fall back to the default behaviour.

> I vaguely remember some comments regarding mixed setups, along the lines
> of "if there's a fast flash device I'd rather see ENOSPC and either
> delete files or add more devices than to let everything work but with
> the risk of storing metadata on the slow devices."

It entirely depends on the use-case. An option like the following would
solve it better:

	mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error>

If metadata_nospc_on_faster_devs=error is preferred, then it also
implies data_nospc_on_slower_devs=error.

Also, the use cases which prefer the error option should remember that
it is difficult to estimate the data/metadata ratio beforehand.

>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index da3d6d0f5bc3..77fba78555d7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>  	return 0;
>>  }
>>
>> +/*
>> + * Sort the devices in its ascending order of latency value.
>> + */
>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return 1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return -1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>>  /*
>>   * sort the devices in descending order by max_avail, total_avail
>>   */
>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>  	     btrfs_cmp_device_info, NULL);
>>
>> +	/*
>> +	 * Sort devices by their latency. Ascending order of latency for
>> +	 * metadata and descending order of latency for the data chunks for
>> +	 * mixed device types.
>> +	 */
>> +	if (fs_devices->mixed_dev_types) {
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_rev_latency, NULL);
>> +		else
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_latency, NULL);
>
> In case there are mixed devices the sort happens twice, and because
> sort() as implemented in the kernel is not stable, devices with the same
> amount of free space can get reordered wildly. The remaining space is
> still a factor we need to take into account to avoid ENOSPC at the chunk
> level.

I didn't get this part. How about if it is this way:

	if (mixed && metadata && (single || dup)) {
		ndevs = 0
		pick all non-rotational, ndevs++, with free space >= required space
		if (ndevs == 0) {
			if (user_option->metadata_nospc_on_faster_devs == error)
				return -ENOSPC;
			pick all rotational
		}
		sort-by-size-select-top
	}

	if (mixed && data && (single || dup)) {
		ndevs = 0
		pick all rotational, ndevs++, with free space >= required space
		if (ndevs == 0) {
			if (user_option->data_nospc_on_slower_devs == error)
				return -ENOSPC;
			pick all non-rotational
		}
		sort-by-size-select-top
	}

Thanks, Anand
On 29/01/2022 17.46, Anand Jain wrote:
> On 27/01/2022 01:38, David Sterba wrote:
>> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>>> Mixed device types configuration exist. Most commonly, HDD mixed with
>>> SSD/NVME device types. This use case prefers that the data chunk
>>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>>
>>> As of now, in the function gather_device_info() called from
>>> btrfs_create_chunk(), we sort the devices based on unallocated space
>>> only.
>>
>> Yes, and this guarantees maximizing the used space for the raid
>> profiles.
>
> Oops. Thanks for reminding me of that. More below.
>
>>> After this patch, this function will check for mixed device types. And
>>> will sort the devices based on enum btrfs_device_types. That is, sort if
>>> the allocation type is metadata and reverse-sort if the allocation type
>>> is data.
>>
>> And this changes the above, because a small fast device can be depleted
>> first.
>
> Both sorting by size and sorting by latency do _not_ help if
> given_raid.devs_max == 0 (raid0, raid5, raid6) OR
> given_raid.devs_max == num_devices.
>
> It helps only when given_raid.devs_max != 0 and
> given_raid.devs_max < num_devices.
>
> Sorting by size does not help the single and dup profiles.
>
> So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) {
>
> 	Mixed device types with different sizes, sorted by free size:
> 		is a pro for raid1, raid1c3, raid1c4, raid10
> 		doesn't matter for single, dup
>
> 	Mixed device types with different sizes, sorted by latency:
> 		is a pro for single and dup
> 		is a con for raid1, raid1c3, raid1c4, raid10 (depends)
> }

Maybe I remember wrong, but in the "single" case a new BG is allocated
on the "most empty" disks. The same for "dup". In this it is not
different from RAID1xx. Only in the case of RAID5/6 does the size not
matter, because a new chunk is allocated on each disk anyway.

Pay attention that you can have a "mixed" environment with different
disk sizes as well:

	2 x SSD (100GB + 200GB)
	2 x HDD (1T + 2T)

So it could make sense to spread the metadata by priority (only to the
SSDs) AND by "emptiness" (i.e. every 3 BGs, one is allocated on the
smaller disk and two on the bigger one).

> So:
>
> If given_raid.devs_max == num_devices we don't need any type of sorting.
>
> If given_raid.devs_max == 0 (raid0, raid5, raid6) we don't need any type
> of sorting.
>
> And sort devices by latency for the single and dup profiles only.
>
> For the rest of the profiles, sort devices by size only if
> given_raid.devs_max < num_devices.
>
>>> The advantage of this method is that data/metadata allocation distribution
>>> based on the device type happens automatically without any manual
>>> configuration.
>>
>> Yeah, but the default behaviour may not be suitable for all users, so
>> some policy will have to be done anyway.
>
> Right. If no policy is configured, even when one is available, it should
> fall back to the default behaviour.
>
>> I vaguely remember some comments regarding mixed setups, along the lines
>> of "if there's a fast flash device I'd rather see ENOSPC and either
>> delete files or add more devices than to let everything work but with
>> the risk of storing metadata on the slow devices."
>
> It entirely depends on the use-case. An option like the following would
> solve it better:
>
> 	mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error>
>
> If metadata_nospc_on_faster_devs=error is preferred, then it also
> implies data_nospc_on_slower_devs=error.
>
> Also, the use cases which prefer the error option should remember that
> it is difficult to estimate the data/metadata ratio beforehand.
>
>>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>>> ---
>>>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 45 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index da3d6d0f5bc3..77fba78555d7 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>>  	return 0;
>>>  }
>>>
>>> +/*
>>> + * Sort the devices in its ascending order of latency value.
>>> + */
>>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>>> +{
>>> +	const struct btrfs_device_info *di_a = a;
>>> +	const struct btrfs_device_info *di_b = b;
>>> +	struct btrfs_device *dev_a = di_a->dev;
>>> +	struct btrfs_device *dev_b = di_b->dev;
>>> +
>>> +	if (dev_a->dev_type > dev_b->dev_type)
>>> +		return 1;
>>> +	if (dev_a->dev_type < dev_b->dev_type)
>>> +		return -1;
>>> +	return 0;
>>> +}
>>> +
>>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>>> +{
>>> +	const struct btrfs_device_info *di_a = a;
>>> +	const struct btrfs_device_info *di_b = b;
>>> +	struct btrfs_device *dev_a = di_a->dev;
>>> +	struct btrfs_device *dev_b = di_b->dev;
>>> +
>>> +	if (dev_a->dev_type > dev_b->dev_type)
>>> +		return -1;
>>> +	if (dev_a->dev_type < dev_b->dev_type)
>>> +		return 1;
>>> +	return 0;
>>> +}
>>> +
>>>  /*
>>>   * sort the devices in descending order by max_avail, total_avail
>>>   */
>>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>>  	     btrfs_cmp_device_info, NULL);
>>>
>>> +	/*
>>> +	 * Sort devices by their latency. Ascending order of latency for
>>> +	 * metadata and descending order of latency for the data chunks for
>>> +	 * mixed device types.
>>> +	 */
>>> +	if (fs_devices->mixed_dev_types) {
>>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +			     btrfs_cmp_device_rev_latency, NULL);
>>> +		else
>>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +			     btrfs_cmp_device_latency, NULL);
>>
>> In case there are mixed devices the sort happens twice, and because
>> sort() as implemented in the kernel is not stable, devices with the same
>> amount of free space can get reordered wildly. The remaining space is
>> still a factor we need to take into account to avoid ENOSPC at the chunk
>> level.
>
> I didn't get this part. How about if it is this way:
>
> 	if (mixed && metadata && (single || dup)) {
> 		ndevs = 0
> 		pick all non-rotational, ndevs++, with free space >= required space
> 		if (ndevs == 0) {
> 			if (user_option->metadata_nospc_on_faster_devs == error)
> 				return -ENOSPC;
> 			pick all rotational
> 		}
> 		sort-by-size-select-top
> 	}
>
> 	if (mixed && data && (single || dup)) {
> 		ndevs = 0
> 		pick all rotational, ndevs++, with free space >= required space
> 		if (ndevs == 0) {
> 			if (user_option->data_nospc_on_slower_devs == error)
> 				return -ENOSPC;
> 			pick all non-rotational
> 		}
> 		sort-by-size-select-top
> 	}
>
> Thanks, Anand

I think that you have to behave like the allocation_hint patch set: the
sort is a single one, and it sorts:

- by priority
- if the disks have the same priority, by free space
- if the disks have the same priority and free space, by max avail
...
On 26/01/2022 18.38, David Sterba wrote:
>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
>
> Yeah, but the default behaviour may not be suitable for all users, so
> some policy will have to be done anyway.
>
> I vaguely remember some comments regarding mixed setups, along the lines
> of "if there's a fast flash device I'd rather see ENOSPC and either
> delete files or add more devices than to let everything work but with
> the risk of storing metadata on the slow devices."

I confirm that. There are two aspects that impacted the discussions
around the "allocation_hint" patch set:

1) the criteria used to order the disks
2) allowing/denying a kind of BG to be hosted on a device

Regarding 1), initially the first set of patches considered an automatic
behaviour based on the "rotational" attribute. Soon it was pointed out
that within the "non-rotational" class there are different options (like
SSD, NVME...). The conclusion was that the "priority" should not be hard
coded into the btrfs code. (E.g., should we also consider reliability?)

Regarding 2), my initial patch set only ordered the disks and allowed
any disk to be used. Some users asked me to prevent certain disks from
being used for (e.g.) data, to stop the data BGs from consuming all the
available space. [*]

These discussions led me to create 4 "classes" for disks:

- ONLY_METADATA
- PREFERRED_METADATA
- PREFERRED_DATA
- ONLY_DATA

The last one is not suitable for metadata, and the first one is not
suitable for data. The middle ones allow the other type as well, but
only when the preferred disks are full.

Another difference between Anand's patches and mine is that with
"allocation_hint", for striped profiles (like raid5, which spans all the
available disks), the devices involved should have the same classes.
I.e., if btrfs has to allocate a RAID5 metadata chunk:

- first it tries to use ONLY_METADATA disks.
- If the disks are not enough, then it tries to use ONLY_METADATA and
  PREFERRED_METADATA disks.
- If the disks are not enough, then it tries to use ONLY_METADATA,
  PREFERRED_METADATA and PREFERRED_DATA disks.
- If the disks are not enough, then -ENOSPC.

What Anand's patches have beyond the allocation_hint patches is that
they handle the case of disks with different latency, giving higher
priority to the disks with lower latency. If this is a requirement, we
can reserve some bits to add a priority to a disk, and then sort the
disks by:

- allocation_hint class
- priority
- max avail
- free space

BR
G.Baroncelli

[*] I don't want to open another discussion, but this seems to me more a
"quota" problem than a "disk allocation" problem...
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index da3d6d0f5bc3..77fba78555d7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * Sort the devices in its ascending order of latency value.
+ */
+static int btrfs_cmp_device_latency(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	struct btrfs_device *dev_a = di_a->dev;
+	struct btrfs_device *dev_b = di_b->dev;
+
+	if (dev_a->dev_type > dev_b->dev_type)
+		return 1;
+	if (dev_a->dev_type < dev_b->dev_type)
+		return -1;
+	return 0;
+}
+
+static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	struct btrfs_device *dev_a = di_a->dev;
+	struct btrfs_device *dev_b = di_b->dev;
+
+	if (dev_a->dev_type > dev_b->dev_type)
+		return -1;
+	if (dev_a->dev_type < dev_b->dev_type)
+		return 1;
+	return 0;
+}
+
 /*
  * sort the devices in descending order by max_avail, total_avail
  */
@@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * Sort devices by their latency. Ascending order of latency for
+	 * metadata and descending order of latency for the data chunks for
+	 * mixed device types.
+	 */
+	if (fs_devices->mixed_dev_types) {
+		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_rev_latency, NULL);
+		else
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_latency, NULL);
+	}
+
 	return 0;
 }
Mixed device types configuration exist. Most commonly, HDD mixed with
SSD/NVME device types. This use case prefers that the data chunk
allocates on HDD and the metadata chunk allocates on the SSD/NVME.

As of now, in the function gather_device_info() called from
btrfs_create_chunk(), we sort the devices based on unallocated space
only.

After this patch, this function will check for mixed device types. And
will sort the devices based on enum btrfs_device_types. That is, sort if
the allocation type is metadata and reverse-sort if the allocation type
is data.

The advantage of this method is that data/metadata allocation distribution
based on the device type happens automatically without any manual
configuration.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)