
btrfs: add ssd_metadata mode

Message ID 20200405082636.18016-2-kreijack@libero.it (mailing list archive)
State New, archived
Headers show
Series btrfs: add ssd_metadata mode | expand

Commit Message

Goffredo Baroncelli April 5, 2020, 8:26 a.m. UTC
From: Goffredo Baroncelli <kreijack@inwind.it>

When this mode is enabled, the chunk allocation policy is modified
as follows:
- metadata chunk allocation: priority is given to SSDs.
- data chunk allocation: priority is given to rotational disks.

When a striped profile is involved (like RAID0/5/6), the logic
is a bit more complex. If there are enough disks, data chunks
are stored on the rotational disks only, while metadata chunks
are stored on the non-rotational disks only.
If there are not enough disks of the preferred kind, the chunks
are stored on all the disks.

Example: assume that sda, sdb and sdc are SSDs, and sde and sdf are
rotational disks.
A RAID6 data chunk will be stored on sda, sdb, sdc, sde and sdf (sde
and sdf alone are not enough to host a RAID6 profile).
A RAID6 metadata chunk will be stored on sda, sdb and sdc only (these
are enough to host a RAID6 profile).

To enable this mode, pass -o ssd_metadata at mount time.
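
The rule above boils down to the following sketch (illustrative only, not
part of the patch; devs_min is the per-profile minimum number of devices,
e.g. 3 for RAID6):

/*
 * Illustrative sketch: prefer rotational devices for data and
 * non-rotational devices for metadata, but restrict the allocation
 * only when the preferred kind alone can satisfy the profile.
 */
static int limit_ndevs(int ndevs, int nr_preferred, int devs_min)
{
	if (nr_preferred >= devs_min)
		return nr_preferred;	/* preferred kind is enough */
	return ndevs;			/* fall back to all devices */
}

Since the device list is sorted with the preferred kind first, truncating
ndevs this way confines the new chunk to that kind.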

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  8 +++++
 fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h |  1 +
 4 files changed, 97 insertions(+), 2 deletions(-)

Comments

Paul Jones April 14, 2020, 5:24 a.m. UTC | #1
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Goffredo Baroncelli
> Sent: Sunday, 5 April 2020 6:27 PM
> To: linux-btrfs@vger.kernel.org
> Cc: Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Wang Yugui <wangyugui@e16-
> tech.com>; Goffredo Baroncelli <kreijack@inwind.it>
> Subject: [PATCH] btrfs: add ssd_metadata mode
> 
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> When this mode is enabled, the chunk allocation policy is modified as follows:
> - metadata chunk allocation: priority is given to SSDs.
> - data chunk allocation: priority is given to rotational disks.
> 
> When a striped profile is involved (like RAID0/5/6), the logic is a bit more
> complex. If there are enough disks, data chunks are stored on the rotational
> disks only, while metadata chunks are stored on the non-rotational disks
> only.
> If there are not enough disks of the preferred kind, the chunks are stored on
> all the disks.
> 
> Example: assume that sda, sdb and sdc are SSDs, and sde and sdf are rotational
> disks.
> A RAID6 data chunk will be stored on sda, sdb, sdc, sde and sdf (sde and sdf
> alone are not enough to host a RAID6 profile).
> A RAID6 metadata chunk will be stored on sda, sdb and sdc only (these are
> enough to host a RAID6 profile).
> 
> To enable this mode, pass -o ssd_metadata at mount time.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>

Tested-By: Paul Jones <paul@pauljones.id.au>

Using RAID1. It makes a surprising difference in speed, nearly as fast as when I was using dm-cache.


Paul.
Wang Yugui Oct. 23, 2020, 7:23 a.m. UTC | #2
Hi, Goffredo Baroncelli 

We could move 'rotational' of struct btrfs_device_info to a 'bool
rotating' of struct btrfs_device.

1, it would be closer to the 'bool rotating' of struct btrfs_fs_devices.

2, it could also be used to enhance the patch '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t
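
A rough sketch of that move (field name and placement are only for
illustration):

/* in struct btrfs_device (fs/btrfs/volumes.h), illustration only */
	bool rotating;

/* filled in once, e.g. while the device is opened */
	device->rotating =
		!blk_queue_nonrot(bdev_get_queue(device->bdev));

/* __btrfs_alloc_chunk() would then only need to copy it */
	devices_info[ndevs].rotational = device->rotating;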

Best Regards
Wang Yugui
2020/10/23

> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> When this mode is enabled, the chunk allocation policy is modified
> as follows:
> - metadata chunk allocation: priority is given to SSDs.
> - data chunk allocation: priority is given to rotational disks.
> 
> When a striped profile is involved (like RAID0/5/6), the logic
> is a bit more complex. If there are enough disks, data chunks
> are stored on the rotational disks only, while metadata chunks
> are stored on the non-rotational disks only.
> If there are not enough disks of the preferred kind, the chunks
> are stored on all the disks.
> 
> Example: assume that sda, sdb and sdc are SSDs, and sde and sdf are
> rotational disks.
> A RAID6 data chunk will be stored on sda, sdb, sdc, sde and sdf (sde
> and sdf alone are not enough to host a RAID6 profile).
> A RAID6 metadata chunk will be stored on sda, sdb and sdc only (these
> are enough to host a RAID6 profile).
> 
> To enable this mode, pass -o ssd_metadata at mount time.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/super.c   |  8 +++++
>  fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h |  1 +
>  4 files changed, 97 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 36df977b64d9..773c7f8b0b0d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1236,6 +1236,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
>  #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
> +#define BTRFS_MOUNT_SSD_METADATA	(1 << 30)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 67c63858812a..4ad14b0a57b3 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -350,6 +350,7 @@ enum {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	Opt_ref_verify,
>  #endif
> +	Opt_ssd_metadata,
>  	Opt_err,
>  };
>  
> @@ -421,6 +422,7 @@ static const match_table_t tokens = {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	{Opt_ref_verify, "ref_verify"},
>  #endif
> +	{Opt_ssd_metadata, "ssd_metadata"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -872,6 +874,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  			btrfs_set_opt(info->mount_opt, REF_VERIFY);
>  			break;
>  #endif
> +		case Opt_ssd_metadata:
> +			btrfs_set_and_info(info, SSD_METADATA,
> +					"enabling ssd_metadata");
> +			break;
>  		case Opt_err:
>  			btrfs_info(info, "unrecognized mount option '%s'", p);
>  			ret = -EINVAL;
> @@ -1390,6 +1396,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>  #endif
>  	if (btrfs_test_opt(info, REF_VERIFY))
>  		seq_puts(seq, ",ref_verify");
> +	if (btrfs_test_opt(info, SSD_METADATA))
> +		seq_puts(seq, ",ssd_metadata");
>  	seq_printf(seq, ",subvolid=%llu",
>  		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
>  	seq_puts(seq, ",subvol=");
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9cfc668f91f4..ffb2bc912c43 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4761,6 +4761,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
>  	return 0;
>  }
>  
> +/*
> + * sort the devices in descending order by !rotational
> + * (non-rotational first), max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	/* metadata -> non rotational first */
> +	if (!di_a->rotational && di_b->rotational)
> +		return -1;
> +	if (di_a->rotational && !di_b->rotational)
> +		return 1;
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * sort the devices in descending order by rotational
> + * (rotational first), max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_data(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	/* data -> non rotational last */
> +	if (!di_a->rotational && di_b->rotational)
> +		return 1;
> +	if (di_a->rotational && !di_b->rotational)
> +		return -1;
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
> +	return 0;
> +}
> +
> +
> +
>  static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
>  {
>  	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> @@ -4808,6 +4860,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	int i;
>  	int j;
>  	int index;
> +	int nr_rotational;
>  
>  	BUG_ON(!alloc_profile_is_valid(type, 0));
>  
> @@ -4863,6 +4916,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	 * about the available holes on each device.
>  	 */
>  	ndevs = 0;
> +	nr_rotational = 0;
>  	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
>  		u64 max_avail;
>  		u64 dev_offset;
> @@ -4914,14 +4968,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +		devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> +				&(bdev_get_queue(device->bdev)->queue_flags));
> +		if (devices_info[ndevs].rotational)
> +			nr_rotational++;
>  		++ndevs;
>  	}
>  
> +	BUG_ON(nr_rotational > ndevs);
>  	/*
>  	 * now sort the devices by hole size / available space
>  	 */
> -	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> -	     btrfs_cmp_device_info, NULL);
> +	if (((type & BTRFS_BLOCK_GROUP_DATA) &&
> +	     (type & BTRFS_BLOCK_GROUP_METADATA)) ||
> +	    !btrfs_test_opt(info, SSD_METADATA)) {
> +		/* mixed bg or SSD_METADATA not set */
> +		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_info, NULL);
> +	} else {
> +		/*
> +		 * if SSD_METADATA is set, sort the device considering also the
> +		 * kind (ssd or not). Limit the availables devices to the ones
> +		 * of the same kind, to avoid that a striped profile like raid5
> +		 * spread to all kind of devices (ssd and rotational).
> +		 * It is allowed to use different kinds of devices if the ones
> +		 * of the same kind are not enough alone.
> +		 */
> +		if (type & BTRFS_BLOCK_GROUP_DATA) {
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +				     btrfs_cmp_device_info_data, NULL);
> +			if (nr_rotational >= devs_min)
> +				ndevs = nr_rotational;
> +		} else {
> +			int nr_norot = ndevs - nr_rotational;
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +				     btrfs_cmp_device_info_metadata, NULL);
> +			if (nr_norot >= devs_min)
> +				ndevs = nr_norot;
> +		}
> +	}
>  
>  	/*
>  	 * Round down to number of usable stripes, devs_increment can be any
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index f01552a0785e..285d71d54a03 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -343,6 +343,7 @@ struct btrfs_device_info {
>  	u64 dev_offset;
>  	u64 max_avail;
>  	u64 total_avail;
> +	int rotational:1;
>  };
>  
>  struct btrfs_raid_attr {
> -- 
> 2.26.0

--------------------------------------
Beijing Jinggai Technology Co., Ltd.
Wang Yugui	wangyugui@e16-tech.com
Tel: +86-136-71123776
Adam Borowski Oct. 23, 2020, 10:11 a.m. UTC | #3
On Fri, Oct 23, 2020 at 03:23:30PM +0800, Wang Yugui wrote:
> Hi, Goffredo Baroncelli 
> 
> We could move 'rotational' of struct btrfs_device_info to a 'bool
> rotating' of struct btrfs_device.
> 
> 1, it would be closer to the 'bool rotating' of struct btrfs_fs_devices.
> 
> 2, it could also be used to enhance the patch '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
> https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t

I don't think it should be a bool -- or at least, turned into a bool
late in the processing.

There are many storage tiers; rotational applies only to one of the
coldest.  In my use case, at least, I've added the following patchlet:

-               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_DAX,

Or, you may want Optane NVMe vs legacy (ie, NAND) NVMe.

The tiers look like:
* DIMM-connected Optane (dax=1)
* NVMe-connected Optane
* NVMe-connected flash
* SATA-connected flash
* SATA-connected spinning rust (rotational=1)
* IDE-connected spinning rust (rotational=1)
* SD cards
* floppies?

And even that is just for local storage only.

Thus, please don't hardcode the notion of "rotational", what we want is
"faster but smaller" vs "slower but bigger".

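For the little that the block layer exposes directly, a small ranking
helper instead of a single bool could look like this (sketch only; it
can only tell apart DAX, non-rotational and rotational devices):

/* sketch: higher value = faster (and usually smaller) tier */
static int btrfs_device_tier(struct btrfs_device *device)
{
	struct request_queue *q = bdev_get_queue(device->bdev);

	if (test_bit(QUEUE_FLAG_DAX, &q->queue_flags))
		return 2;	/* DIMM-connected (dax=1) */
	if (test_bit(QUEUE_FLAG_NONROT, &q->queue_flags))
		return 1;	/* flash of some kind */
	return 0;		/* spinning rust */
}

The comparators would then compare tiers instead of a rotational bit.
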
> > From: Goffredo Baroncelli <kreijack@inwind.it>
> > 
> > When this mode is enabled, the chunk allocation policy is modified
> > as follows:
> > - metadata chunk allocation: priority is given to SSDs.
> > - data chunk allocation: priority is given to rotational disks.
> > 
> > When a striped profile is involved (like RAID0/5/6), the logic
> > is a bit more complex. If there are enough disks, data chunks
> > are stored on the rotational disks only, while metadata chunks
> > are stored on the non-rotational disks only.
> > If there are not enough disks of the preferred kind, the chunks
> > are stored on all the disks.

And, a newer version of Goffredo's patchset already had
"preferred_metadata".  It did not assign the preference automatically,
but if we want good defaults, they should be smarter than just rotationality.


Meow!
Qu Wenruo Oct. 23, 2020, 11:25 a.m. UTC | #4
On 2020/10/23 6:11 PM, Adam Borowski wrote:
> On Fri, Oct 23, 2020 at 03:23:30PM +0800, Wang Yugui wrote:
>> Hi, Goffredo Baroncelli
>>
>> We could move 'rotational' of struct btrfs_device_info to a 'bool
>> rotating' of struct btrfs_device.
>>
>> 1, it would be closer to the 'bool rotating' of struct btrfs_fs_devices.
>>
>> 2, it could also be used to enhance the patch '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
>> https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t
>
> I don't think it should be a bool -- or at least, turned into a bool
> late in the processing.
>
> There are many storage tiers; rotational applies only to one of the
> coldest.  In my use case, at least, I've added the following patchlet:
>
> -               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> +               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_DAX,
>
> Or, you may want Optane NVMe vs legacy (ie, NAND) NVMe.

A little off topic here, but btrfs in fact has better ways to model
storage, and definitely not simply rotational or not.

In btrfs_dev_item we have bandwidth and seek_speed to describe the
characteristics, although they're never really utilized.

So if we're really going to dig into the rabbit hole, we need more
characteristics to describe a device: from basic bandwidth for
large-block IO, to things like IOPS for small random IO, and even
possibly multi-level performance characteristics for cases like the
multi-level cache used in current NVMe SSDs, and future SMR + CMR
mixed devices.

Although computers are binary, performance characteristics are never
binary. :)
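
Just to illustrate: if those hints were ever filled in and plumbed into
btrfs_device_info (the seek_speed/bandwidth members below are
hypothetical), a finer-grained comparator could look like:

/* sketch only: rank devices by the (currently unused) dev_item hints */
static int btrfs_cmp_device_info_speed(const void *a, const void *b)
{
	const struct btrfs_device_info *di_a = a;
	const struct btrfs_device_info *di_b = b;
	/* made-up weighting of the two u8 hints from btrfs_dev_item */
	int score_a = di_a->seek_speed * 16 + di_a->bandwidth;
	int score_b = di_b->seek_speed * 16 + di_b->bandwidth;

	return score_b - score_a;	/* faster device first */
}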

Thanks,
Qu
>
> The tiers look like:
> * DIMM-connected Optane (dax=1)
> * NVMe-connected Optane
> * NVMe-connected flash
> * SATA-connected flash
> * SATA-connected spinning rust (rotational=1)
> * IDE-connected spinning rust (rotational=1)
> * SD cards
> * floppies?
>
> And even that is just for local storage only.
>
> Thus, please don't hardcode the notion of "rotational", what we want is
> "faster but smaller" vs "slower but bigger".
>
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>
>>> When this mode is enabled, the chunk allocation policy is modified
>>> as follows:
>>> - metadata chunk allocation: priority is given to SSDs.
>>> - data chunk allocation: priority is given to rotational disks.
>>>
>>> When a striped profile is involved (like RAID0/5/6), the logic
>>> is a bit more complex. If there are enough disks, data chunks
>>> are stored on the rotational disks only, while metadata chunks
>>> are stored on the non-rotational disks only.
>>> If there are not enough disks of the preferred kind, the chunks
>>> are stored on all the disks.
>
> And, a newer version of Goffredo's patchset already had
> "preferred_metadata".  It did not assign the preference automatically,
> but if we want good defaults, they should be smarter than just rotationality.
>
>
> Meow!
>
Wang Yugui Oct. 23, 2020, 12:37 p.m. UTC | #5
Hi,

Can we add the feature of 'Storage Tiering' to btrfs for these use cases?

1) use faster tier firstly for metadata

2) only the subvol with higher tier can save data to 
    the higher tier disk?

3) use faster tier firstly for mirror selection of RAID1/RAID10

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23
Qu Wenruo Oct. 23, 2020, 12:45 p.m. UTC | #6
On 2020/10/23 8:37 PM, Wang Yugui wrote:
> Hi,
>
> Can we add the feature of 'Storage Tiering' to btrfs for these use cases?

Feel free to contribute.

Although it seems pretty hard already: too many factors are involved,
from the extent allocator to the chunk allocator.

Just consider how complex things are for bcache; it won't be a simple
feature at all.

Thanks,
Qu
>
> 1) use faster tier firstly for metadata
>
> 2) only the subvol with higher tier can save data to
>     the higher tier disk?
>
> 3) use faster tier firstly for mirror selection of RAID1/RAID10
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
>
>
Steven Davies Oct. 23, 2020, 1:10 p.m. UTC | #7
On 2020-10-23 13:37, Wang Yugui wrote:
> Hi,
> 
> Can we add the feature of 'Storage Tiering' to btrfs for these use 
> cases?
> 
> 1) use faster tier firstly for metadata
> 
> 2) only the subvol with higher tier can save data to
>     the higher tier disk?
> 
> 3) use faster tier firstly for mirror selection of RAID1/RAID10

I'd support user-configurable tiering by specifying which device IDs are 
allowed to be used for

a) storing metadata
b) reading data from RAID1/RAID10

That would fit into both this patch and Anand's read policy patchset. It
could be a mount option, a sysfs tunable and/or a btrfs device command.

e.g. for sysfs
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_store [0..15]
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_read [0..15]
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_data_read [0..15]

Getting the user to specify the devices' priorities would be more 
reliable than looking at the rotational attribute.
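
On the allocator side such a knob could be consumed roughly like this
(prio_metadata_store below is a hypothetical per-device field copied from
the sysfs attribute; sketch only):

/* sketch: sort by a user-set 0..15 priority instead of the rotational bit */
static int btrfs_cmp_device_info_prio(const void *a, const void *b)
{
	const struct btrfs_device_info *di_a = a;
	const struct btrfs_device_info *di_b = b;

	if (di_a->prio_metadata_store > di_b->prio_metadata_store)
		return -1;
	if (di_a->prio_metadata_store < di_b->prio_metadata_store)
		return 1;
	/* fall back to the existing max_avail ordering */
	if (di_a->max_avail > di_b->max_avail)
		return -1;
	if (di_a->max_avail < di_b->max_avail)
		return 1;
	return 0;
}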
Wang Yugui Oct. 23, 2020, 1:49 p.m. UTC | #8
Hi,

We could define an 'int tier' for the different kinds of devices, such as
* dax=1
* NVMe
* SSD
* rotational=1

but in most cases, just two of these tiers are used at the same time,
so we can treat them simply as two tiers (faster tier, slower tier).

For phase 1, we would support:
1) use the faster tier first for metadata
3) use the faster tier first for mirror selection of RAID1/RAID10

and in phase 2 (TODO), we would support:
2) let some subvolumes save data to the faster-tier disk

For metadata, the policy is:
1) faster-tier-first (default)
2) faster-tier-only

For data, the policy is:
4) slower-tier-first (default)
3) slower-tier-only
1) faster-tier-first (phase 2 TODO)
2) faster-tier-only (phase 2 TODO)

Now we use a metadata profile (RAID type) and a data profile (RAID type);
in phase 2, we would rename them to 'faster tier profile' and
'slower tier profile'.

The key point is that 'in most cases, just two tiers are used at the same
time', so we can support the following cases in phase 1 without user
configuration:
1) use the faster tier first for metadata
3) use the faster tier first for mirror selection of RAID1/RAID10
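
Expressed as code, the policies above could be as small as this (all
names are illustrative only):

/* sketch: per chunk type allocation policy for a two-tier filesystem */
enum btrfs_tier_policy {
	TIER_FASTER_FIRST,	/* prefer the faster tier, fall back if needed */
	TIER_FASTER_ONLY,
	TIER_SLOWER_FIRST,	/* prefer the slower tier, fall back if needed */
	TIER_SLOWER_ONLY,
};

/* phase 1 defaults: no user configuration needed */
static enum btrfs_tier_policy default_tier_policy(u64 type)
{
	if (type & (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM))
		return TIER_FASTER_FIRST;
	return TIER_SLOWER_FIRST;	/* data */
}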

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23


> On 2020-10-23 13:37, Wang Yugui wrote:
> > Hi,
> >
> > Can we add the feature of 'Storage Tiering' to btrfs for these use
> > cases?
> >
> > 1) use faster tier firstly for metadata
> >
> > 2) only the subvol with higher tier can save data to
> >     the higher tier disk?
> >
> > 3) use faster tier firstly for mirror selection of RAID1/RAID10
> 
> I'd support user-configurable tiering by specifying which device IDs are allowed to be used for
> 
> a) storing metadata
> b) reading data from RAID1/RAID10
> 
> that would fit into both this patch and Anand's read policy patchset. It could be a mount option, sysfs tunable and/or btrfs-device command.
> 
> e.g. for sysfs
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_store [0..15]
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_read [0..15]
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_data_read [0..15]
> 
> Getting the user to specify the devices' priorities would be more reliable than looking at the rotational attribute.
Goffredo Baroncelli Oct. 23, 2020, 6:03 p.m. UTC | #9
On 10/23/20 2:37 PM, Wang Yugui wrote:
> Hi,
> 
> Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
> 
> 1) use faster tier firstly for metadata

My tests revealed that a BTRFS filesystem stacked over bcache has better performance, so I am not sure that putting the metadata on dedicated storage is a good thing.

> 
> 2) only the subvol with higher tier can save data to
>      the higher tier disk?
> 
> 3) use faster tier firstly for mirror selection of RAID1/RAID10

If you want to put a subvolume in a "higher tier", it is simpler to use two filesystems: one on the "higher tier" and one on the "slower" one.

Also, what should be the semantics of cp --reflink between subvolumes on different tiers? From a technical point of view it is defined, but the users' expectations can vary...

The same is true about assigning different RAID profiles to different subvolumes...

Let's keep things simple.
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
> 

BR
GB
>
Paul Jones Oct. 24, 2020, 3:26 a.m. UTC | #10
> -----Original Message-----
> From: Goffredo Baroncelli <kreijack@libero.it>
> Sent: Saturday, 24 October 2020 5:04 AM
> To: Wang Yugui <wangyugui@e16-tech.com>; Qu Wenruo
> <quwenruo.btrfs@gmx.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-btrfs@vger.kernel.org;
> Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Goffredo Baroncelli
> <kreijack@inwind.it>
> Subject: Re: [PATCH] btrfs: add ssd_metadata mode
> 
> On 10/23/20 2:37 PM, Wang Yugui wrote:
> > Hi,
> >
> > Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
> >
> > 1) use faster tier firstly for metadata
> 
> My tests revealed that a BTRFS filesystem stacked over bcache has better
> performance, so I am not sure that putting the metadata on dedicated
> storage is a good thing.

There is a balance between ultimate speed and simplicity. I used to use dm-cache under btrfs, which worked very well but is complicated and error-prone to set up, fragile, and slow to restore after an error. Now I'm using your ssd_metadata patch, which is almost as fast but far more robust and quick/easy to restore after errors. It's night-and-day better imho, especially for a production system.

Paul.

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 36df977b64d9..773c7f8b0b0d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1236,6 +1236,7 @@  static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
+#define BTRFS_MOUNT_SSD_METADATA	(1 << 30)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 67c63858812a..4ad14b0a57b3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -350,6 +350,7 @@  enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	Opt_ref_verify,
 #endif
+	Opt_ssd_metadata,
 	Opt_err,
 };
 
@@ -421,6 +422,7 @@  static const match_table_t tokens = {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	{Opt_ref_verify, "ref_verify"},
 #endif
+	{Opt_ssd_metadata, "ssd_metadata"},
 	{Opt_err, NULL},
 };
 
@@ -872,6 +874,10 @@  int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, REF_VERIFY);
 			break;
 #endif
+		case Opt_ssd_metadata:
+			btrfs_set_and_info(info, SSD_METADATA,
+					"enabling ssd_metadata");
+			break;
 		case Opt_err:
 			btrfs_info(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
@@ -1390,6 +1396,8 @@  static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 #endif
 	if (btrfs_test_opt(info, REF_VERIFY))
 		seq_puts(seq, ",ref_verify");
+	if (btrfs_test_opt(info, SSD_METADATA))
+		seq_puts(seq, ",ssd_metadata");
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfc668f91f4..ffb2bc912c43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4761,6 +4761,58 @@  static int btrfs_cmp_device_info(const void *a, const void *b)
 	return 0;
 }
 
+/*
+ * sort the devices in descending order by !rotational
+ * (non-rotational first), max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* metadata -> non rotational first */
+	if (!di_a->rotational && di_b->rotational)
+		return -1;
+	if (di_a->rotational && !di_b->rotational)
+		return 1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * sort the devices in descending order by !rotational,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* data -> non rotational last */
+	if (!di_a->rotational && di_b->rotational)
+		return 1;
+	if (di_a->rotational && !di_b->rotational)
+		return -1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+
+
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4808,6 +4860,7 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int i;
 	int j;
 	int index;
+	int nr_rotational;
 
 	BUG_ON(!alloc_profile_is_valid(type, 0));
 
@@ -4863,6 +4916,7 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * about the available holes on each device.
 	 */
 	ndevs = 0;
+	nr_rotational = 0;
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		u64 max_avail;
 		u64 dev_offset;
@@ -4914,14 +4968,45 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+		devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+				&(bdev_get_queue(device->bdev)->queue_flags));
+		if (devices_info[ndevs].rotational)
+			nr_rotational++;
 		++ndevs;
 	}
 
+	BUG_ON(nr_rotational > ndevs);
 	/*
 	 * now sort the devices by hole size / available space
 	 */
-	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_info, NULL);
+	if (((type & BTRFS_BLOCK_GROUP_DATA) &&
+	     (type & BTRFS_BLOCK_GROUP_METADATA)) ||
+	    !btrfs_test_opt(info, SSD_METADATA)) {
+		/* mixed bg or SSD_METADATA not set */
+		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info, NULL);
+	} else {
+		/*
+		 * if SSD_METADATA is set, sort the device considering also the
+		 * kind (ssd or not). Limit the availables devices to the ones
+		 * of the same kind, to avoid that a striped profile like raid5
+		 * spread to all kind of devices (ssd and rotational).
+		 * It is allowed to use different kinds of devices if the ones
+		 * of the same kind are not enough alone.
+		 */
+		if (type & BTRFS_BLOCK_GROUP_DATA) {
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_data, NULL);
+			if (nr_rotational >= devs_min)
+				ndevs = nr_rotational;
+		} else {
+			int nr_norot = ndevs - nr_rotational;
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_metadata, NULL);
+			if (nr_norot >= devs_min)
+				ndevs = nr_norot;
+		}
+	}
 
 	/*
 	 * Round down to number of usable stripes, devs_increment can be any
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f01552a0785e..285d71d54a03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -343,6 +343,7 @@  struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int rotational:1;
 };
 
 struct btrfs_raid_attr {
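
For anyone who wants to see the selection logic without building a kernel,
here is a small userspace re-implementation of the two comparators and of
the ndevs limiting, fed with the example from the commit message (three
SSDs, two rotational disks; the sizes are arbitrary):

#include <stdio.h>
#include <stdlib.h>

struct dev_info {
	const char *name;
	int rotational;
	unsigned long long max_avail;
};

/* same ordering as btrfs_cmp_device_info_metadata: non-rotational first */
static int cmp_metadata(const void *a, const void *b)
{
	const struct dev_info *da = a, *db = b;

	if (da->rotational != db->rotational)
		return da->rotational - db->rotational;
	if (da->max_avail != db->max_avail)
		return db->max_avail > da->max_avail ? 1 : -1;
	return 0;
}

/* same ordering as btrfs_cmp_device_info_data: rotational first */
static int cmp_data(const void *a, const void *b)
{
	const struct dev_info *da = a, *db = b;

	if (da->rotational != db->rotational)
		return db->rotational - da->rotational;
	if (da->max_avail != db->max_avail)
		return db->max_avail > da->max_avail ? 1 : -1;
	return 0;
}

int main(void)
{
	struct dev_info devs[] = {
		{ "sda", 0, 100 }, { "sdb", 0, 100 }, { "sdc", 0, 100 },
		{ "sde", 1, 400 }, { "sdf", 1, 400 },
	};
	const int ndevs = 5, nr_rotational = 2, devs_min_raid6 = 3;
	int i, n;

	/* data: only 2 rotational disks, less than devs_min, so use all 5 */
	qsort(devs, ndevs, sizeof(devs[0]), cmp_data);
	n = nr_rotational >= devs_min_raid6 ? nr_rotational : ndevs;
	printf("raid6 data    :");
	for (i = 0; i < n; i++)
		printf(" %s", devs[i].name);
	printf("\n");

	/* metadata: 3 SSDs, enough for raid6, so restrict to the SSDs */
	qsort(devs, ndevs, sizeof(devs[0]), cmp_metadata);
	n = ndevs - nr_rotational >= devs_min_raid6 ?
	    ndevs - nr_rotational : ndevs;
	printf("raid6 metadata:");
	for (i = 0; i < n; i++)
		printf(" %s", devs[i].name);
	printf("\n");

	return 0;
}

Running it prints the same split as the example in the commit message: the
RAID6 data chunk uses all five devices, the RAID6 metadata chunk only sda,
sdb and sdc.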