diff mbox series

[1/2] btrfs: Introduce the virtual_fsid feature

Message ID 20230504170708.787361-2-gpiccoli@igalia.com (mailing list archive)
State New, archived
Headers show
Series Supporting same fsid filesystems mounting on btrfs | expand

Commit Message

Guilherme G. Piccoli May 4, 2023, 5:07 p.m. UTC
Btrfs doesn't currently support to mount 2 different devices holding the
same filesystem - the fsid is used as a unique identifier in the driver.
This case is supported though in some other common filesystems, like
ext4; one of the reasons for which is not trivial supporting this case
on btrfs is due to its multi-device filesystem nature, native RAID, etc.

Supporting the same-fsid mounts has the advantage of allowing btrfs to
be used in A/B partitioned devices, like mobile phones or the Steam Deck
for example. Without this support, it's not safe for users to keep the
same "image version" in both A and B partitions, a setup that is quite
common for development, for example. Also, as a big bonus, it allows fs
integrity check based on block devices for RO devices (whereas currently
it is required that both have different fsid, breaking the block device
hash comparison).

Such same-fsid mounting is hereby added through the usage of the
mount option "virtual_fsid" - when such option is used, btrfs generates
a random fsid for the filesystem and leverages the metadata_uuid
infrastructure (introduced by [0]) to enable the usage of this secondary
virtual fsid. But differently from the regular metadata_uuid flag, this
is not written into the disk superblock - effectively, this is a spoofed
fsid approach that enables the same filesystem in different devices to
appear as different filesystems to btrfs on runtime.

In order to prevent more code complexity and potential corner cases,
given the usage is aimed to single-devices / partitions, virtual_fsid
is not allowed when the metadata_uuid flag is already present on the fs,
or if the device is on fsid-change state. Device removal/replace is also
disabled for devices mounted with the virtual_fsid option.

[0] 7239ff4b2be8 ("btrfs: Introduce support for FSID change without metadata rewrite)

Cc: Nikolay Borisov <nborisov@suse.com>
Suggested-by: John Schoenick <johns@valvesoftware.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>

---

Hi folks, first of all thanks in advance for reviews / suggestions!
Some design choices that worth a discussion:

(1) The first choice was for the random fsid versus user-passed id.
Initially we considered that the flag could be "virtual_fsid=%s", hence
userspace would be responsible to generate the fsid. But seems the
collision probability of fsids in near zero, also the random code
hereby proposed checks if any other fsid/metadata_uuid present in
the btrfs structures is a match, and if so, new random uuids are
created until they prove the unique within btrfs.

(2) About disabling device removal/replace: these cases could be
handled I guess, but with increased code complexity. If there is a
need for that, we could implement. Also worth mentioning that any
advice is appreciated with regards to potential incompatibilities
between "virtual_fsid" and any other btrfs feature / mount option.

(3) There is no proposed documentation about the "virtual_fsid" here;
seems the kernel docs points to a wiki page, so appreciate suggestions
on how to edit such wiki and how to coordinate this with the patch
development cycle.


 fs/btrfs/disk-io.c | 22 ++++++++++--
 fs/btrfs/ioctl.c   | 18 ++++++++++
 fs/btrfs/super.c   | 32 +++++++++++------
 fs/btrfs/volumes.c | 86 +++++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/volumes.h |  8 ++++-
 5 files changed, 139 insertions(+), 27 deletions(-)

Comments

Qu Wenruo May 5, 2023, 7:21 a.m. UTC | #1
On 2023/5/5 01:07, Guilherme G. Piccoli wrote:
> Btrfs doesn't currently support to mount 2 different devices holding the
> same filesystem - the fsid is used as a unique identifier in the driver.
> This case is supported though in some other common filesystems, like
> ext4; one of the reasons for which is not trivial supporting this case
> on btrfs is due to its multi-device filesystem nature, native RAID, etc.

Exactly, the biggest problem is the multi-device support.

Btrfs needs to search and assemble all devices of a multi-device
filesystem, which is normally handled by things like LVM/DMraid, thus
other traditional fses won't need to bother that.

>
> Supporting the same-fsid mounts has the advantage of allowing btrfs to
> be used in A/B partitioned devices, like mobile phones or the Steam Deck
> for example. Without this support, it's not safe for users to keep the
> same "image version" in both A and B partitions, a setup that is quite
> common for development, for example. Also, as a big bonus, it allows fs
> integrity check based on block devices for RO devices (whereas currently
> it is required that both have different fsid, breaking the block device
> hash comparison).
>
> Such same-fsid mounting is hereby added through the usage of the
> mount option "virtual_fsid" - when such option is used, btrfs generates
> a random fsid for the filesystem and leverages the metadata_uuid
> infrastructure (introduced by [0]) to enable the usage of this secondary
> virtual fsid. But differently from the regular metadata_uuid flag, this
> is not written into the disk superblock - effectively, this is a spoofed
> fsid approach that enables the same filesystem in different devices to
> appear as different filesystems to btrfs on runtime.

I would prefer a much simpler but more explicit method.

Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.

By this, we can avoid multiple meanings of the same super member, nor
need any special mount option.
Remember, mount option is never a good way to enable/disable a new feature.

The better method to enable/disable a feature should be mkfs and btrfstune.

Then go mostly the same of your patch, but maybe with something extra:

- Disbale multi-dev code
   Include device add/replace/removal, this is already done in your
   patch.

- Completely skip device scanning
   I see no reason to keep btrfs with SINGLE_DEV feature to be added to
   the device list at all.
   It only needs to be scanned at mount time, and never be kept in the
   in-memory device list.

Thanks,
Qu

>
> In order to prevent more code complexity and potential corner cases,
> given the usage is aimed to single-devices / partitions, virtual_fsid
> is not allowed when the metadata_uuid flag is already present on the fs,
> or if the device is on fsid-change state. Device removal/replace is also
> disabled for devices mounted with the virtual_fsid option.
>
> [0] 7239ff4b2be8 ("btrfs: Introduce support for FSID change without metadata rewrite)
>
> Cc: Nikolay Borisov <nborisov@suse.com>
> Suggested-by: John Schoenick <johns@valvesoftware.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
>
> ---
>
> Hi folks, first of all thanks in advance for reviews / suggestions!
> Some design choices that worth a discussion:
>
> (1) The first choice was for the random fsid versus user-passed id.
> Initially we considered that the flag could be "virtual_fsid=%s", hence
> userspace would be responsible to generate the fsid. But seems the
> collision probability of fsids in near zero, also the random code
> hereby proposed checks if any other fsid/metadata_uuid present in
> the btrfs structures is a match, and if so, new random uuids are
> created until they prove the unique within btrfs.
>
> (2) About disabling device removal/replace: these cases could be
> handled I guess, but with increased code complexity. If there is a
> need for that, we could implement. Also worth mentioning that any
> advice is appreciated with regards to potential incompatibilities
> between "virtual_fsid" and any other btrfs feature / mount option.
>
> (3) There is no proposed documentation about the "virtual_fsid" here;
> seems the kernel docs points to a wiki page, so appreciate suggestions
> on how to edit such wiki and how to coordinate this with the patch
> development cycle.
>
>
>   fs/btrfs/disk-io.c | 22 ++++++++++--
>   fs/btrfs/ioctl.c   | 18 ++++++++++
>   fs/btrfs/super.c   | 32 +++++++++++------
>   fs/btrfs/volumes.c | 86 +++++++++++++++++++++++++++++++++++++++-------
>   fs/btrfs/volumes.h |  8 ++++-
>   5 files changed, 139 insertions(+), 27 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 9e1596bb208d..66c2bac343b8 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -468,7 +468,8 @@ static int check_tree_block_fsid(struct extent_buffer *eb)
>   	 * seed devices it's forbidden to have their uuid changed so reading
>   	 * ->fsid in this case is fine
>   	 */
> -	if (btrfs_fs_incompat(fs_info, METADATA_UUID))
> +	if (btrfs_fs_incompat(fs_info, METADATA_UUID) ||
> +	    fs_devices->virtual_fsid)
>   		metadata_uuid = fs_devices->metadata_uuid;
>   	else
>   		metadata_uuid = fs_devices->fsid;
> @@ -2539,6 +2540,7 @@ int btrfs_validate_super(struct btrfs_fs_info *fs_info,
>   {
>   	u64 nodesize = btrfs_super_nodesize(sb);
>   	u64 sectorsize = btrfs_super_sectorsize(sb);
> +	u8 *fsid;
>   	int ret = 0;
>
>   	if (btrfs_super_magic(sb) != BTRFS_MAGIC) {
> @@ -2619,8 +2621,22 @@ int btrfs_validate_super(struct btrfs_fs_info *fs_info,
>   		ret = -EINVAL;
>   	}
>
> -	if (memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
> -		   BTRFS_FSID_SIZE)) {
> +	/*
> +	 * When the virtual_fsid flag is passed at mount time, btrfs
> +	 * creates a random fsid and makes use of the metadata_uuid
> +	 * infrastructure in order to allow, for example, two devices
> +	 * with same fsid getting mounted at the same time. But notice
> +	 * no changes happen at the disk level, so the random fsid is a
> +	 * driver abstraction for that single mount, not to be written
> +	 * in the disk. That's the reason we're required here to compare
> +	 * the fsid with the metadata_uuid if virtual_fsid was set.
> +	 */
> +	if (fs_info->fs_devices->virtual_fsid)
> +		fsid = fs_info->fs_devices->metadata_uuid;
> +	else
> +		fsid = fs_info->fs_devices->fsid;
> +
> +	if (memcmp(fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE)) {
>   		btrfs_err(fs_info,
>   		"superblock fsid doesn't match fsid of fs_devices: %pU != %pU",
>   			fs_info->super_copy->fsid, fs_info->fs_devices->fsid);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index ba769a1eb87a..35e3a23f8c83 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2677,6 +2677,12 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
>   	if (!capable(CAP_SYS_ADMIN))
>   		return -EPERM;
>
> +	if (fs_info->fs_devices->virtual_fsid) {
> +		btrfs_err(fs_info,
> +			  "device removal is unsupported on devices mounted with virtual fsid\n");
> +		return -EINVAL;
> +	}
> +
>   	vol_args = memdup_user(arg, sizeof(*vol_args));
>   	if (IS_ERR(vol_args))
>   		return PTR_ERR(vol_args);
> @@ -2743,6 +2749,12 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
>   	if (!capable(CAP_SYS_ADMIN))
>   		return -EPERM;
>
> +	if (fs_info->fs_devices->virtual_fsid) {
> +		btrfs_err(fs_info,
> +			  "device removal is unsupported on devices mounted with virtual fsid\n");
> +		return -EINVAL;
> +	}
> +
>   	vol_args = memdup_user(arg, sizeof(*vol_args));
>   	if (IS_ERR(vol_args))
>   		return PTR_ERR(vol_args);
> @@ -3261,6 +3273,12 @@ static long btrfs_ioctl_dev_replace(struct btrfs_fs_info *fs_info,
>   		return -EINVAL;
>   	}
>
> +	if (fs_info->fs_devices->virtual_fsid) {
> +		btrfs_err(fs_info,
> +			  "device replace is unsupported on devices mounted with virtual fsid\n");
> +		return -EINVAL;
> +	}
> +
>   	p = memdup_user(arg, sizeof(*p));
>   	if (IS_ERR(p))
>   		return PTR_ERR(p);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 366fb4cde145..8d9df169107a 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -115,6 +115,7 @@ enum {
>   	Opt_thread_pool,
>   	Opt_treelog, Opt_notreelog,
>   	Opt_user_subvol_rm_allowed,
> +	Opt_virtual_fsid,
>
>   	/* Rescue options */
>   	Opt_rescue,
> @@ -188,6 +189,7 @@ static const match_table_t tokens = {
>   	{Opt_treelog, "treelog"},
>   	{Opt_notreelog, "notreelog"},
>   	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> +	{Opt_virtual_fsid, "virtual_fsid"},
>
>   	/* Rescue options */
>   	{Opt_rescue, "rescue=%s"},
> @@ -352,9 +354,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   		case Opt_subvol_empty:
>   		case Opt_subvolid:
>   		case Opt_device:
> +		case Opt_virtual_fsid:
>   			/*
>   			 * These are parsed by btrfs_parse_subvol_options or
> -			 * btrfs_parse_device_options and can be ignored here.
> +			 * btrfs_parse_early_options and can be ignored here.
>   			 */
>   			break;
>   		case Opt_nodatasum:
> @@ -845,9 +848,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>    * All other options will be parsed on much later in the mount process and
>    * only when we need to allocate a new super block.
>    */
> -static int btrfs_parse_device_options(const char *options, fmode_t flags,
> -				      void *holder)
> +static int btrfs_parse_early_options(const char *options, fmode_t flags,
> +				    bool *virtual_fsid, void *holder)
>   {
> +	struct btrfs_scan_info info = { .vfsid = false };
>   	substring_t args[MAX_OPT_ARGS];
>   	char *device_name, *opts, *orig, *p;
>   	struct btrfs_device *device = NULL;
> @@ -874,14 +878,18 @@ static int btrfs_parse_device_options(const char *options, fmode_t flags,
>   			continue;
>
>   		token = match_token(p, tokens, args);
> +
> +		if (token == Opt_virtual_fsid)
> +			(*virtual_fsid) = true;
> +
>   		if (token == Opt_device) {
>   			device_name = match_strdup(&args[0]);
>   			if (!device_name) {
>   				error = -ENOMEM;
>   				goto out;
>   			}
> -			device = btrfs_scan_one_device(device_name, flags,
> -					holder);
> +			info.path = device_name;
> +			device = btrfs_scan_one_device(&info, flags, holder);
>   			kfree(device_name);
>   			if (IS_ERR(device)) {
>   				error = PTR_ERR(device);
> @@ -913,7 +921,7 @@ static int btrfs_parse_subvol_options(const char *options, char **subvol_name,
>
>   	/*
>   	 * strsep changes the string, duplicate it because
> -	 * btrfs_parse_device_options gets called later
> +	 * btrfs_parse_early_options gets called later
>   	 */
>   	opts = kstrdup(options, GFP_KERNEL);
>   	if (!opts)
> @@ -1431,6 +1439,7 @@ static struct dentry *mount_subvol(const char *subvol_name, u64 subvol_objectid,
>   static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
>   		int flags, const char *device_name, void *data)
>   {
> +	struct btrfs_scan_info info = { .path = NULL, .vfsid = false};
>   	struct block_device *bdev = NULL;
>   	struct super_block *s;
>   	struct btrfs_device *device = NULL;
> @@ -1472,13 +1481,14 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
>   	}
>
>   	mutex_lock(&uuid_mutex);
> -	error = btrfs_parse_device_options(data, mode, fs_type);
> +	error = btrfs_parse_early_options(data, mode, &info.vfsid, fs_type);
>   	if (error) {
>   		mutex_unlock(&uuid_mutex);
>   		goto error_fs_info;
>   	}
>
> -	device = btrfs_scan_one_device(device_name, mode, fs_type);
> +	info.path = device_name;
> +	device = btrfs_scan_one_device(&info, mode, fs_type);
>   	if (IS_ERR(device)) {
>   		mutex_unlock(&uuid_mutex);
>   		error = PTR_ERR(device);
> @@ -2169,6 +2179,7 @@ static int btrfs_control_open(struct inode *inode, struct file *file)
>   static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>   				unsigned long arg)
>   {
> +	struct btrfs_scan_info info = { .vfsid = false };
>   	struct btrfs_ioctl_vol_args *vol;
>   	struct btrfs_device *device = NULL;
>   	dev_t devt = 0;
> @@ -2182,10 +2193,11 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>   		return PTR_ERR(vol);
>   	vol->name[BTRFS_PATH_NAME_MAX] = '\0';
>
> +	info.path = vol->name;
>   	switch (cmd) {
>   	case BTRFS_IOC_SCAN_DEV:
>   		mutex_lock(&uuid_mutex);
> -		device = btrfs_scan_one_device(vol->name, FMODE_READ,
> +		device = btrfs_scan_one_device(&info, FMODE_READ,
>   					       &btrfs_root_fs_type);
>   		ret = PTR_ERR_OR_ZERO(device);
>   		mutex_unlock(&uuid_mutex);
> @@ -2200,7 +2212,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
>   		break;
>   	case BTRFS_IOC_DEVICES_READY:
>   		mutex_lock(&uuid_mutex);
> -		device = btrfs_scan_one_device(vol->name, FMODE_READ,
> +		device = btrfs_scan_one_device(&info, FMODE_READ,
>   					       &btrfs_root_fs_type);
>   		if (IS_ERR(device)) {
>   			mutex_unlock(&uuid_mutex);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c6d592870400..5a38b3482ec5 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -745,6 +745,33 @@ static struct btrfs_fs_devices *find_fsid_reverted_metadata(
>
>   	return NULL;
>   }
> +
> +static void prepare_virtual_fsid(struct btrfs_super_block *disk_super,
> +				 const char *path)
> +{
> +	struct btrfs_fs_devices *fs_devices;
> +	u8 vfsid[BTRFS_FSID_SIZE];
> +	bool dup_fsid = true;
> +
> +	while (dup_fsid) {
> +		dup_fsid = false;
> +		generate_random_uuid(vfsid);
> +
> +		list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
> +			if (!memcmp(vfsid, fs_devices->fsid, BTRFS_FSID_SIZE) ||
> +			    !memcmp(vfsid, fs_devices->metadata_uuid,
> +				    BTRFS_FSID_SIZE))
> +				dup_fsid = true;
> +		}
> +	}
> +
> +	memcpy(disk_super->metadata_uuid, disk_super->fsid, BTRFS_FSID_SIZE);
> +	memcpy(disk_super->fsid, vfsid, BTRFS_FSID_SIZE);
> +
> +	pr_info("BTRFS: virtual fsid (%pU) set for device %s (real fsid %pU)\n",
> +		disk_super->fsid, path, disk_super->metadata_uuid);
> +}
> +
>   /*
>    * Add new device to list of registered devices
>    *
> @@ -752,12 +779,15 @@ static struct btrfs_fs_devices *find_fsid_reverted_metadata(
>    * device pointer which was just added or updated when successful
>    * error pointer when failed
>    */
> -static noinline struct btrfs_device *device_list_add(const char *path,
> -			   struct btrfs_super_block *disk_super,
> -			   bool *new_device_added)
> +static noinline
> +struct btrfs_device *device_list_add(const struct btrfs_scan_info *info,
> +				     struct btrfs_super_block *disk_super,
> +				     bool *new_device_added)
>   {
>   	struct btrfs_device *device;
>   	struct btrfs_fs_devices *fs_devices = NULL;
> +	const char *path = info->path;
> +	const bool virtual_fsid = info->vfsid;
>   	struct rcu_string *name;
>   	u64 found_transid = btrfs_super_generation(disk_super);
>   	u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
> @@ -775,12 +805,25 @@ static noinline struct btrfs_device *device_list_add(const char *path,
>   		return ERR_PTR(error);
>   	}
>
> +	if (virtual_fsid) {
> +		if (has_metadata_uuid || fsid_change_in_progress) {
> +			btrfs_err(NULL,
> +			  "failed to add device %s with virtual fsid (%s)\n",
> +				  path, (has_metadata_uuid ?
> +					 "the fs already has a metadata_uuid" :
> +					 "fsid change in progress"));
> +			return ERR_PTR(-EINVAL);
> +		}
> +
> +		prepare_virtual_fsid(disk_super, path);
> +	}
> +
>   	if (fsid_change_in_progress) {
>   		if (!has_metadata_uuid)
>   			fs_devices = find_fsid_inprogress(disk_super);
>   		else
>   			fs_devices = find_fsid_changed(disk_super);
> -	} else if (has_metadata_uuid) {
> +	} else if (has_metadata_uuid || virtual_fsid) {
>   		fs_devices = find_fsid_with_metadata_uuid(disk_super);
>   	} else {
>   		fs_devices = find_fsid_reverted_metadata(disk_super);
> @@ -790,7 +833,7 @@ static noinline struct btrfs_device *device_list_add(const char *path,
>
>
>   	if (!fs_devices) {
> -		if (has_metadata_uuid)
> +		if (has_metadata_uuid || virtual_fsid)
>   			fs_devices = alloc_fs_devices(disk_super->fsid,
>   						      disk_super->metadata_uuid);
>   		else
> @@ -799,6 +842,7 @@ static noinline struct btrfs_device *device_list_add(const char *path,
>   		if (IS_ERR(fs_devices))
>   			return ERR_CAST(fs_devices);
>
> +		fs_devices->virtual_fsid = virtual_fsid;
>   		fs_devices->fsid_change = fsid_change_in_progress;
>
>   		mutex_lock(&fs_devices->device_list_mutex);
> @@ -870,11 +914,21 @@ static noinline struct btrfs_device *device_list_add(const char *path,
>   	"BTRFS: device label %s devid %llu transid %llu %s scanned by %s (%d)\n",
>   				disk_super->label, devid, found_transid, path,
>   				current->comm, task_pid_nr(current));
> -		else
> -			pr_info(
> +		else {
> +			if (virtual_fsid)
> +				pr_info(
> +"BTRFS: device with virtual fsid %pU (real fsid %pU) devid %llu transid %llu %s scanned by %s (%d)\n",
> +					disk_super->fsid,
> +					disk_super->metadata_uuid, devid,
> +					found_transid, path, current->comm,
> +					task_pid_nr(current));
> +			else
> +				pr_info(
>   	"BTRFS: device fsid %pU devid %llu transid %llu %s scanned by %s (%d)\n",
> -				disk_super->fsid, devid, found_transid, path,
> -				current->comm, task_pid_nr(current));
> +					disk_super->fsid, devid, found_transid,
> +					path, current->comm,
> +					task_pid_nr(current));
> +		}
>
>   	} else if (!device->name || strcmp(device->name->str, path)) {
>   		/*
> @@ -1348,8 +1402,8 @@ int btrfs_forget_devices(dev_t devt)
>    * and we are not allowed to call set_blocksize during the scan. The superblock
>    * is read via pagecache
>    */
> -struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
> -					   void *holder)
> +struct btrfs_device *btrfs_scan_one_device(const struct btrfs_scan_info *info,
> +					   fmode_t flags, void *holder)
>   {
>   	struct btrfs_super_block *disk_super;
>   	bool new_device_added = false;
> @@ -1377,7 +1431,7 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
>   	 * values temporarily, as the device paths of the fsid are the only
>   	 * required information for assembling the volume.
>   	 */
> -	bdev = blkdev_get_by_path(path, flags, holder);
> +	bdev = blkdev_get_by_path(info->path, flags, holder);
>   	if (IS_ERR(bdev))
>   		return ERR_CAST(bdev);
>
> @@ -1394,7 +1448,7 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
>   		goto error_bdev_put;
>   	}
>
> -	device = device_list_add(path, disk_super, &new_device_added);
> +	device = device_list_add(info, disk_super, &new_device_added);
>   	if (!IS_ERR(device) && new_device_added)
>   		btrfs_free_stale_devices(device->devt, device);
>
> @@ -2390,6 +2444,12 @@ int btrfs_get_dev_args_from_path(struct btrfs_fs_info *fs_info,
>
>   	args->devid = btrfs_stack_device_id(&disk_super->dev_item);
>   	memcpy(args->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE);
> +
> +	/*
> +	 * Notice that the virtual_fsid feature is not handled here; device
> +	 * removal/replace is instead forbidden when such feature is in-use,
> +	 * but this note is for future users of btrfs_get_dev_args_from_path().
> +	 */
>   	if (btrfs_fs_incompat(fs_info, METADATA_UUID))
>   		memcpy(args->fsid, disk_super->metadata_uuid, BTRFS_FSID_SIZE);
>   	else
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 7e51f2238f72..f2354e8288f9 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -73,6 +73,11 @@ enum btrfs_raid_types {
>   #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>   #define BTRFS_DEV_STATE_NO_READA	(5)
>
> +struct btrfs_scan_info {
> +	const char *path;
> +	bool vfsid;
> +};
> +
>   struct btrfs_zoned_device_info;
>
>   struct btrfs_device {
> @@ -278,6 +283,7 @@ struct btrfs_fs_devices {
>   	u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
>   	u8 metadata_uuid[BTRFS_FSID_SIZE];
>   	bool fsid_change;
> +	bool virtual_fsid;
>   	struct list_head fs_list;
>
>   	/*
> @@ -537,7 +543,7 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
>   void btrfs_mapping_tree_free(struct extent_map_tree *tree);
>   int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
>   		       fmode_t flags, void *holder);
> -struct btrfs_device *btrfs_scan_one_device(const char *path,
> +struct btrfs_device *btrfs_scan_one_device(const struct btrfs_scan_info *info,
>   					   fmode_t flags, void *holder);
>   int btrfs_forget_devices(dev_t devt);
>   void btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
David Sterba May 5, 2023, 1:18 p.m. UTC | #2
On Thu, May 04, 2023 at 02:07:07PM -0300, Guilherme G. Piccoli wrote:
> Btrfs doesn't currently support to mount 2 different devices holding the
> same filesystem - the fsid is used as a unique identifier in the driver.
> This case is supported though in some other common filesystems, like
> ext4; one of the reasons for which is not trivial supporting this case
> on btrfs is due to its multi-device filesystem nature, native RAID, etc.
> 
> Supporting the same-fsid mounts has the advantage of allowing btrfs to
> be used in A/B partitioned devices, like mobile phones or the Steam Deck
> for example. Without this support, it's not safe for users to keep the
> same "image version" in both A and B partitions, a setup that is quite
> common for development, for example. Also, as a big bonus, it allows fs
> integrity check based on block devices for RO devices (whereas currently
> it is required that both have different fsid, breaking the block device
> hash comparison).
> 
> Such same-fsid mounting is hereby added through the usage of the
> mount option "virtual_fsid" - when such option is used, btrfs generates
> a random fsid for the filesystem and leverages the metadata_uuid
> infrastructure (introduced by [0]) to enable the usage of this secondary
> virtual fsid. But differently from the regular metadata_uuid flag, this
> is not written into the disk superblock - effectively, this is a spoofed
> fsid approach that enables the same filesystem in different devices to
> appear as different filesystems to btrfs on runtime.

Have you evaluated if the metadata_uuid could be used for that? It is
stored on the filesystem image, but could you adapt the usecase to set
the UUID before mount manually (in tooling)?

The metadata_uuid is lightweight and meant to change the appearance of
the fs regarding uuids, verly close to what you describe. Adding yet
another uuid does not seem right so I'm first asking if and in what way
the metadata_uuid could be extended.
David Sterba May 5, 2023, 1:38 p.m. UTC | #3
On Fri, May 05, 2023 at 03:21:35PM +0800, Qu Wenruo wrote:
> On 2023/5/5 01:07, Guilherme G. Piccoli wrote:
> > Btrfs doesn't currently support to mount 2 different devices holding the
> > same filesystem - the fsid is used as a unique identifier in the driver.
> > This case is supported though in some other common filesystems, like
> > ext4; one of the reasons for which is not trivial supporting this case
> > on btrfs is due to its multi-device filesystem nature, native RAID, etc.
> 
> Exactly, the biggest problem is the multi-device support.
> 
> Btrfs needs to search and assemble all devices of a multi-device
> filesystem, which is normally handled by things like LVM/DMraid, thus
> other traditional fses won't need to bother that.
> 
> >
> > Supporting the same-fsid mounts has the advantage of allowing btrfs to
> > be used in A/B partitioned devices, like mobile phones or the Steam Deck
> > for example. Without this support, it's not safe for users to keep the
> > same "image version" in both A and B partitions, a setup that is quite
> > common for development, for example. Also, as a big bonus, it allows fs
> > integrity check based on block devices for RO devices (whereas currently
> > it is required that both have different fsid, breaking the block device
> > hash comparison).
> >
> > Such same-fsid mounting is hereby added through the usage of the
> > mount option "virtual_fsid" - when such option is used, btrfs generates
> > a random fsid for the filesystem and leverages the metadata_uuid
> > infrastructure (introduced by [0]) to enable the usage of this secondary
> > virtual fsid. But differently from the regular metadata_uuid flag, this
> > is not written into the disk superblock - effectively, this is a spoofed
> > fsid approach that enables the same filesystem in different devices to
> > appear as different filesystems to btrfs on runtime.
> 
> I would prefer a much simpler but more explicit method.
> 
> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
> 
> By this, we can avoid multiple meanings of the same super member, nor
> need any special mount option.
> Remember, mount option is never a good way to enable/disable a new feature.
> 
> The better method to enable/disable a feature should be mkfs and btrfstune.
> 
> Then go mostly the same of your patch, but maybe with something extra:
> 
> - Disbale multi-dev code
>    Include device add/replace/removal, this is already done in your
>    patch.
> 
> - Completely skip device scanning
>    I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>    the device list at all.
>    It only needs to be scanned at mount time, and never be kept in the
>    in-memory device list.

This is actually a good point, we can do that already. As a conterpart
to 5f58d783fd7823 ("btrfs: free device in btrfs_close_devices for a
single device filesystem") that drops single device from the list,
single fs devices wouldn't be added to the list but some checks could be
still done like superblock validation for eventual error reporting.
Guilherme G. Piccoli May 5, 2023, 3:51 p.m. UTC | #4
On 05/05/2023 04:21, Qu Wenruo wrote:
> [...]
> Exactly, the biggest problem is the multi-device support.
> 
> Btrfs needs to search and assemble all devices of a multi-device
> filesystem, which is normally handled by things like LVM/DMraid, thus
> other traditional fses won't need to bother that.

Hi Qu, thanks a bunch for your feedback, and for validating my
understanding of the issue!


>  [...]
> 
> I would prefer a much simpler but more explicit method.
> 
> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
> 
> By this, we can avoid multiple meanings of the same super member, nor
> need any special mount option.
> Remember, mount option is never a good way to enable/disable a new feature.
> 
> The better method to enable/disable a feature should be mkfs and btrfstune.
> 
> Then go mostly the same of your patch, but maybe with something extra:
> 
> - Disbale multi-dev code
>    Include device add/replace/removal, this is already done in your
>    patch.
> 
> - Completely skip device scanning
>    I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>    the device list at all.
>    It only needs to be scanned at mount time, and never be kept in the
>    in-memory device list.
> 

This seems very interesting, but I'm a bit confused on how that would
work with 2 identical filesystem images mounted at the same time.

Imagine you have 2 devices, /dev/sda1 and /dev/sda2 holding the exact
same image, with the SINGLE_DEV feature set. They are identical, and
IIUC no matter if we skip scanning or disable any multi-device approach,
in the end both have the *same* fsid. How do we track this in the btrfs
code now? Once we try to mount the second one, it'll try to add the same
entity to the fs_uuids list...

That's the problem I faced when investigating the code and why the
proposal is to "spoof" the fsid with some random generated one.

Also, one more question: why do you say "Remember, mount option is never
a good way to enable/disable a new feature"? I'm not expert in
filesystems (by far heh), so I'm curious to fully understand your
point-of-view.

From my naive viewpoint, seems a mount option is "cheaper" than
introducing a new feature in the OS that requires changes on btrfs
userspace applications as well as to track incompatibilities in
different kernel versions.

Thanks again,


Guilherme
Guilherme G. Piccoli May 5, 2023, 4:18 p.m. UTC | #5
On 05/05/2023 10:18, David Sterba wrote:
> [...]
> Have you evaluated if the metadata_uuid could be used for that? It is
> stored on the filesystem image, but could you adapt the usecase to set
> the UUID before mount manually (in tooling)?
> 
> The metadata_uuid is lightweight and meant to change the appearance of
> the fs regarding uuids, verly close to what you describe. Adding yet
> another uuid does not seem right so I'm first asking if and in what way
> the metadata_uuid could be extended.

Hi David, thanks for your suggestion!

It might be possible, it seems a valid suggestion. But worth notice that
we cannot modify the FS at all. That's why I've implemented the feature
in a way it "fakes" the fsid for the driver, as a mount option, but
nothing changes in the FS.

The images on Deck are read-only. So, by using the metadata_uuid purely,
can we mount 2 identical images at the same time *not modifying* the
filesystem in any way? If it's possible, then we have only to implement
the skip scanning idea from Qu in the other thread (or else ioclt scans
would prevent mounting them).

Cheers,


Guilherme
Goffredo Baroncelli May 5, 2023, 5:34 p.m. UTC | #6
On 05/05/2023 09.21, Qu Wenruo wrote:
> 
> I would prefer a much simpler but more explicit method.
> 
> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.

It is not clear to me if we need that.

I don't understand in what checking for SINGLE_DEV is different from
btrfs_super_block.disks_num == 1.

Let me to argument:

I see two scenarios:
1) mount two different fs with the same UUID NOT at the same time:
This could be done now with small change in the kernel:
- we need to NOT store the data of a filesystem when a disk is
   scanned IF it is composed by only one disk
- after the unmount we need to discard the data too (checking again
   that the filesystem is composed by only one disk)

No limit is needed to add/replace a disk. Of course after a disk is
added a filesystem with the same UUID cannot be mounted without a
full cycle of --forget.

I have to point out that this problem would be easily solved in
userspace if we switch from the current model where the disks are
scanned asynchronously (udev which call btrfs dev scan) to a model
where the disk are scanned at mount time by a mount.btrfs helper.

A mount.btrfs helper, also could be a place to put some more clear error
message like "we cannot mount this filesystem because one disk of a
raid5 is missing, try passing -o degraded"
or "we cannot mount this filesystem because we detect a brain split
problem" ....

2) mount two different fs with the same UUID at the SAME time:
This is a bit more complicated; we need to store a virtual UUID
somewhere.

However sometime we need to use the real fsid (during a write),
and sometime we need to use the virtual_uuid (e.g. for /sys/fs/btrfs/<uuid>)

Both in 1) and 2) we need to/it is enough to have btrfs_super_block.disks_num == 1
In the case 2) using a virtual_uuid mount option will prevent
to add a disk.
Qu Wenruo May 5, 2023, 10:15 p.m. UTC | #7
On 2023/5/5 23:51, Guilherme G. Piccoli wrote:
> On 05/05/2023 04:21, Qu Wenruo wrote:
>> [...]
>> Exactly, the biggest problem is the multi-device support.
>>
>> Btrfs needs to search and assemble all devices of a multi-device
>> filesystem, which is normally handled by things like LVM/DMraid, thus
>> other traditional fses won't need to bother that.
> 
> Hi Qu, thanks a bunch for your feedback, and for validating my
> understanding of the issue!
> 
> 
>>   [...]
>>
>> I would prefer a much simpler but more explicit method.
>>
>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
>>
>> By this, we can avoid multiple meanings of the same super member, nor
>> need any special mount option.
>> Remember, mount option is never a good way to enable/disable a new feature.
>>
>> The better method to enable/disable a feature should be mkfs and btrfstune.
>>
>> Then go mostly the same of your patch, but maybe with something extra:
>>
>> - Disbale multi-dev code
>>     Include device add/replace/removal, this is already done in your
>>     patch.
>>
>> - Completely skip device scanning
>>     I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>>     the device list at all.
>>     It only needs to be scanned at mount time, and never be kept in the
>>     in-memory device list.
>>
> 
> This seems very interesting, but I'm a bit confused on how that would
> work with 2 identical filesystem images mounted at the same time.
> 
> Imagine you have 2 devices, /dev/sda1 and /dev/sda2 holding the exact
> same image, with the SINGLE_DEV feature set. They are identical, and
> IIUC no matter if we skip scanning or disable any multi-device approach,
> in the end both have the *same* fsid. How do we track this in the btrfs
> code now? Once we try to mount the second one, it'll try to add the same
> entity to the fs_uuids list...

My bad, I forgot to mention that, if we hit such SINGLE_DEV fses, we 
should also not add them to the fs_uuids list either.

So the fs_uuids list conflicts would not be a problem at all.

> 
> That's the problem I faced when investigating the code and why the
> proposal is to "spoof" the fsid with some random generated one.
> 
> Also, one more question: why do you say "Remember, mount option is never
> a good way to enable/disable a new feature"? I'm not expert in
> filesystems (by far heh), so I'm curious to fully understand your
> point-of-view.

We had a bad example in the past, free space tree (aka, v2 space cache).

It's initially a pretty convenient way to enable the new feature, but 
now it's making a lot of new features, which can depend on v2 cache, 
more complex to handle those old mount options.

The compatibility matrix would only grow, and all the (mostly one-time) 
logic need to be implemented in kernel.

So in the long run, we prefer offline convert tool.

Thanks,
Qu

> 
>  From my naive viewpoint, seems a mount option is "cheaper" than
> introducing a new feature in the OS that requires changes on btrfs
> userspace applications as well as to track incompatibilities in
> different kernel versions.
> 
> Thanks again,
> 
> 
> Guilherme
Qu Wenruo May 5, 2023, 10:31 p.m. UTC | #8
On 2023/5/6 01:34, Goffredo Baroncelli wrote:
> On 05/05/2023 09.21, Qu Wenruo wrote:
>>
>> I would prefer a much simpler but more explicit method.
>>
>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
> 
> It is not clear to me if we need that.
> 
> I don't understand in what checking for SINGLE_DEV is different from
> btrfs_super_block.disks_num == 1.

Because disks_num == 1 doesn't exclude the ability to add new disks in 
the future.

Without that new SINGLE_DEV compat_ro, we should still do the regular 
device scan.

> 
> Let me to argument:
> 
> I see two scenarios:
> 1) mount two different fs with the same UUID NOT at the same time:
> This could be done now with small change in the kernel:
> - we need to NOT store the data of a filesystem when a disk is
>    scanned IF it is composed by only one disk
> - after the unmount we need to discard the data too (checking again
>    that the filesystem is composed by only one disk)
> 
> No limit is needed to add/replace a disk. Of course after a disk is
> added a filesystem with the same UUID cannot be mounted without a
> full cycle of --forget.

The problem is, what if:

- Both btrfs have single disk
- Both btrfs have the same fsid
- Both btrfs have been mounted
- Then one of btrfs is going to add a new disk

We either:

- Don't scan nor trace the device/fsid anyway
   Then after unmount, the new two disks btrfs can not be mounted
   and need extra scan/forgot/whatever to reassemble list.
   And that would also cause fsid conflicts if not properly handled
   between the single and multi disk btrfs.

- Scan and record the fsid/device at device add time
   This means we should reject the device add.
   This can sometimes cause confusion to the end user, just because they
   have mounted another fs, now they can not add a new device.

   And this is going to change device add code path quite hugely.
   We currently expects all device scan/trace thing done way before
   mount.
   Such huge change can lead to hidden bugs.

To me, neither is good to the end users.

A SINGLE_DEV feature would reject the corner case in a way more 
user-friendly and clear way.

   With SINGLE_DEV feature, just no dev add/replace/delete no matter
   what.


> 
> I have to point out that this problem would be easily solved in
> userspace if we switch from the current model where the disks are
> scanned asynchronously (udev which call btrfs dev scan) to a model
> where the disk are scanned at mount time by a mount.btrfs helper.
> 
> A mount.btrfs helper, also could be a place to put some more clear error
> message like "we cannot mount this filesystem because one disk of a
> raid5 is missing, try passing -o degraded"
> or "we cannot mount this filesystem because we detect a brain split
> problem" ....
> 
> 2) mount two different fs with the same UUID at the SAME time:
> This is a bit more complicated; we need to store a virtual UUID
> somewhere.
> 
> However sometime we need to use the real fsid (during a write),
> and sometime we need to use the virtual_uuid (e.g. for 
> /sys/fs/btrfs/<uuid>)

Another thing is, we already have too many uuids.

Some are unavoidable like fsid and device uuid.

But I still prefer not to add a new layer of unnecessary uuids.

Thanks,
Qu

> 
> Both in 1) and 2) we need to/it is enough to have 
> btrfs_super_block.disks_num == 1
> In the case 2) using a virtual_uuid mount option will prevent
> to add a disk.
David Sterba May 5, 2023, 11 p.m. UTC | #9
On Fri, May 05, 2023 at 01:18:56PM -0300, Guilherme G. Piccoli wrote:
> On 05/05/2023 10:18, David Sterba wrote:
> > [...]
> > Have you evaluated if the metadata_uuid could be used for that? It is
> > stored on the filesystem image, but could you adapt the usecase to set
> > the UUID before mount manually (in tooling)?
> > 
> > The metadata_uuid is lightweight and meant to change the appearance of
> > the fs regarding uuids, verly close to what you describe. Adding yet
> > another uuid does not seem right so I'm first asking if and in what way
> > the metadata_uuid could be extended.
> 
> Hi David, thanks for your suggestion!
> 
> It might be possible, it seems a valid suggestion. But worth notice that
> we cannot modify the FS at all. That's why I've implemented the feature
> in a way it "fakes" the fsid for the driver, as a mount option, but
> nothing changes in the FS.
> 
> The images on Deck are read-only. So, by using the metadata_uuid purely,
> can we mount 2 identical images at the same time *not modifying* the
> filesystem in any way? If it's possible, then we have only to implement
> the skip scanning idea from Qu in the other thread (or else ioclt scans
> would prevent mounting them).

Ok, I see, the device is read-only. The metadata_uuid is now set on an
unmounted filesystem and we don't have any semantics for a mount option.

If there's an equivalent mount option (let's say metadata_uuid for
compatibility) with the same semantics as if set offline, on the first
commit the metadata_uuid would be written.

The question is if this would be sane for read-only devices. You've
implemented the uuid on the metadata_uuid base but named it differently,
but this effectively means that metadata_uuid could work on read-only
devices too, but with some necessary updates to the device scanning.

From the use case perspective this should work, the virtual uuid would
basically be the metadata_uuid set and on a read-only device. The
problems start in the state transitions in the device tracking, we had
some bugs there and the code is hard to grasp. For that I'd very much
vote for using the metadata_uuid but we can provide an interface on top
of that to make it work.
Goffredo Baroncelli May 6, 2023, 5:30 p.m. UTC | #10
On 06/05/2023 00.31, Qu Wenruo wrote:
> 
> 
> On 2023/5/6 01:34, Goffredo Baroncelli wrote:
>> On 05/05/2023 09.21, Qu Wenruo wrote:
>>>
>>> I would prefer a much simpler but more explicit method.
>>>
>>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
>>
>> It is not clear to me if we need that.
>>
>> I don't understand in what checking for SINGLE_DEV is different from
>> btrfs_super_block.disks_num == 1.
> 
> Because disks_num == 1 doesn't exclude the ability to add new disks in the future.
> 
> Without that new SINGLE_DEV compat_ro, we should still do the regular device scan.
> 
>>
>> Let me to argument:
>>
>> I see two scenarios:
>> 1) mount two different fs with the same UUID NOT at the same time:
>> This could be done now with small change in the kernel:
>> - we need to NOT store the data of a filesystem when a disk is
>>    scanned IF it is composed by only one disk
>> - after the unmount we need to discard the data too (checking again
>>    that the filesystem is composed by only one disk)
>>
>> No limit is needed to add/replace a disk. Of course after a disk is
>> added a filesystem with the same UUID cannot be mounted without a
>> full cycle of --forget.
> 
> The problem is, what if:
> 
> - Both btrfs have single disk
> - Both btrfs have the same fsid
> - Both btrfs have been mounted
> - Then one of btrfs is going to add a new disk
> 

Why the user should be prevented to add a disk. It may
a aware user that want to do that, knowing the possible consequence.


[...]

> 
> - Scan and record the fsid/device at device add time
>    This means we should reject the device add.
>    This can sometimes cause confusion to the end user, just because they
>    have mounted another fs, now they can not add a new device.

I agree about the confusion. But not about the cause.
The confusion is due to the poor communication between the kernel (where the error is
detected) and the user. Now the only solution is to look at dmesg.

Allowing to mount two filesystem with the same UUID is technically possible.
There are some constraints bat are well defined; there are some corner case
but are well defined (like add a device to a single device filesystem).

However when we hit one of these corner case, now it is difficult to inform
the user about the problem. Because now the user has to look at the dmesg
to understand what is the problem.

This is the real problem. The communication. And we have a lot of these
problem (like mount a multi device filesystem without some disk, or with a
brain slip problem, or better inform the user if it is possible the
mount -o degraded).

Look this in another way; what if we had a mount.btrfs helper that:

- look for the devices which compose the filesystem at mounting time
- check if these devices are consistent:
	- if the fs is one-device, we don't need further check; otherwise check
	- if all the devices are present
	- if all the device have the same transaction id
	- if ...
   if any of the check above fails, write an error message; otherwise
- register the device(s) in the kernel or (better) pass it in the mount command
   line
- finally mount the filesystem


No need of strange flag; all the corner case can be handle safely and avoid
any confusion to the user.






> 
>    And this is going to change device add code path quite hugely.
>    We currently expects all device scan/trace thing done way before
>    mount.
>    Such huge change can lead to hidden bugs.
> 
> To me, neither is good to the end users.
> 
> A SINGLE_DEV feature would reject the corner case in a way more user-friendly and clear way.
> 
>    With SINGLE_DEV feature, just no dev add/replace/delete no matter
>    what.
> 
> 
>>
>> I have to point out that this problem would be easily solved in
>> userspace if we switch from the current model where the disks are
>> scanned asynchronously (udev which call btrfs dev scan) to a model
>> where the disk are scanned at mount time by a mount.btrfs helper.
>>
>> A mount.btrfs helper, also could be a place to put some more clear error
>> message like "we cannot mount this filesystem because one disk of a
>> raid5 is missing, try passing -o degraded"
>> or "we cannot mount this filesystem because we detect a brain split
>> problem" ....
>>
>> 2) mount two different fs with the same UUID at the SAME time:
>> This is a bit more complicated; we need to store a virtual UUID
>> somewhere.
>>
>> However sometime we need to use the real fsid (during a write),
>> and sometime we need to use the virtual_uuid (e.g. for /sys/fs/btrfs/<uuid>)
> 
> Another thing is, we already have too many uuids.
> 
> Some are unavoidable like fsid and device uuid.
> 
> But I still prefer not to add a new layer of unnecessary uuids.
> 
> Thanks,
> Qu
> 
>>
>> Both in 1) and 2) we need to/it is enough to have btrfs_super_block.disks_num == 1
>> In the case 2) using a virtual_uuid mount option will prevent
>> to add a disk.
>
Qu Wenruo May 6, 2023, 11 p.m. UTC | #11
On 2023/5/7 01:30, Goffredo Baroncelli wrote:
> On 06/05/2023 00.31, Qu Wenruo wrote:
>>
>>
>> On 2023/5/6 01:34, Goffredo Baroncelli wrote:
>>> On 05/05/2023 09.21, Qu Wenruo wrote:
>>>>
>>>> I would prefer a much simpler but more explicit method.
>>>>
>>>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
>>>
>>> It is not clear to me if we need that.
>>>
>>> I don't understand in what checking for SINGLE_DEV is different from
>>> btrfs_super_block.disks_num == 1.
>>
>> Because disks_num == 1 doesn't exclude the ability to add new disks in
>> the future.
>>
>> Without that new SINGLE_DEV compat_ro, we should still do the regular
>> device scan.
>>
>>>
>>> Let me to argument:
>>>
>>> I see two scenarios:
>>> 1) mount two different fs with the same UUID NOT at the same time:
>>> This could be done now with small change in the kernel:
>>> - we need to NOT store the data of a filesystem when a disk is
>>>    scanned IF it is composed by only one disk
>>> - after the unmount we need to discard the data too (checking again
>>>    that the filesystem is composed by only one disk)
>>>
>>> No limit is needed to add/replace a disk. Of course after a disk is
>>> added a filesystem with the same UUID cannot be mounted without a
>>> full cycle of --forget.
>>
>> The problem is, what if:
>>
>> - Both btrfs have single disk
>> - Both btrfs have the same fsid
>> - Both btrfs have been mounted
>> - Then one of btrfs is going to add a new disk
>>
>
> Why the user should be prevented to add a disk. It may
> a aware user that want to do that, knowing the possible consequence.
>
>
> [...]
>
>>
>> - Scan and record the fsid/device at device add time
>>    This means we should reject the device add.
>>    This can sometimes cause confusion to the end user, just because they
>>    have mounted another fs, now they can not add a new device.
>
> I agree about the confusion. But not about the cause.
> The confusion is due to the poor communication between the kernel (where
> the error is
> detected) and the user. Now the only solution is to look at dmesg.
>
> Allowing to mount two filesystem with the same UUID is technically
> possible.
> There are some constraints bat are well defined; there are some corner case
> but are well defined (like add a device to a single device filesystem).
>
> However when we hit one of these corner case, now it is difficult to inform
> the user about the problem. Because now the user has to look at the dmesg
> to understand what is the problem.
>
> This is the real problem. The communication. And we have a lot of these
> problem (like mount a multi device filesystem without some disk, or with a
> brain slip problem, or better inform the user if it is possible the
> mount -o degraded).
>
> Look this in another way; what if we had a mount.btrfs helper that:
>
> - look for the devices which compose the filesystem at mounting time
> - check if these devices are consistent:
>      - if the fs is one-device, we don't need further check; otherwise
> check
>      - if all the devices are present
>      - if all the device have the same transaction id
>      - if ...
>    if any of the check above fails, write an error message; otherwise
> - register the device(s) in the kernel or (better) pass it in the mount
> command
>    line
> - finally mount the filesystem
>
>
> No need of strange flag; all the corner case can be handle safely and avoid
> any confusion to the user.

Just handling corner cases is not good for maintaining already.

So nope, I don't believe this is the good way to go.

Furthermore, if someone really found the same fsid limits is a problem,
they should go with the SINGLE_DEV features, not relying on some extra
corner cases handling, which may or may not be supported on certain kernels.

On the other handle, a compat_ro flag is clear dedicated way to tell the
compatibility.

In fact, if you're going to introduce an unexpected behavior change,
it's way more strange to do without any compat_ro/ro/incompact flags.

Thanks,
Qu
>
>
>
>
>
>
>>
>>    And this is going to change device add code path quite hugely.
>>    We currently expects all device scan/trace thing done way before
>>    mount.
>>    Such huge change can lead to hidden bugs.
>>
>> To me, neither is good to the end users.
>>
>> A SINGLE_DEV feature would reject the corner case in a way more
>> user-friendly and clear way.
>>
>>    With SINGLE_DEV feature, just no dev add/replace/delete no matter
>>    what.
>>
>>
>>>
>>> I have to point out that this problem would be easily solved in
>>> userspace if we switch from the current model where the disks are
>>> scanned asynchronously (udev which call btrfs dev scan) to a model
>>> where the disk are scanned at mount time by a mount.btrfs helper.
>>>
>>> A mount.btrfs helper, also could be a place to put some more clear error
>>> message like "we cannot mount this filesystem because one disk of a
>>> raid5 is missing, try passing -o degraded"
>>> or "we cannot mount this filesystem because we detect a brain split
>>> problem" ....
>>>
>>> 2) mount two different fs with the same UUID at the SAME time:
>>> This is a bit more complicated; we need to store a virtual UUID
>>> somewhere.
>>>
>>> However sometime we need to use the real fsid (during a write),
>>> and sometime we need to use the virtual_uuid (e.g. for
>>> /sys/fs/btrfs/<uuid>)
>>
>> Another thing is, we already have too many uuids.
>>
>> Some are unavoidable like fsid and device uuid.
>>
>> But I still prefer not to add a new layer of unnecessary uuids.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Both in 1) and 2) we need to/it is enough to have
>>> btrfs_super_block.disks_num == 1
>>> In the case 2) using a virtual_uuid mount option will prevent
>>> to add a disk.
>>
>
Anand Jain May 8, 2023, 11:27 a.m. UTC | #12
On 05/05/2023 21:38, David Sterba wrote:
> On Fri, May 05, 2023 at 03:21:35PM +0800, Qu Wenruo wrote:
>> On 2023/5/5 01:07, Guilherme G. Piccoli wrote:
>>> Btrfs doesn't currently support to mount 2 different devices holding the
>>> same filesystem - the fsid is used as a unique identifier in the driver.
>>> This case is supported though in some other common filesystems, like
>>> ext4; one of the reasons for which is not trivial supporting this case
>>> on btrfs is due to its multi-device filesystem nature, native RAID, etc.
>>
>> Exactly, the biggest problem is the multi-device support.
>>
>> Btrfs needs to search and assemble all devices of a multi-device
>> filesystem, which is normally handled by things like LVM/DMraid, thus
>> other traditional fses won't need to bother that.
>>
>>>
>>> Supporting the same-fsid mounts has the advantage of allowing btrfs to
>>> be used in A/B partitioned devices, like mobile phones or the Steam Deck
>>> for example. Without this support, it's not safe for users to keep the
>>> same "image version" in both A and B partitions, a setup that is quite
>>> common for development, for example. Also, as a big bonus, it allows fs
>>> integrity check based on block devices for RO devices (whereas currently
>>> it is required that both have different fsid, breaking the block device
>>> hash comparison).
>>>
>>> Such same-fsid mounting is hereby added through the usage of the
>>> mount option "virtual_fsid" - when such option is used, btrfs generates
>>> a random fsid for the filesystem and leverages the metadata_uuid
>>> infrastructure (introduced by [0]) to enable the usage of this secondary
>>> virtual fsid. But differently from the regular metadata_uuid flag, this
>>> is not written into the disk superblock - effectively, this is a spoofed
>>> fsid approach that enables the same filesystem in different devices to
>>> appear as different filesystems to btrfs on runtime.
>>
>> I would prefer a much simpler but more explicit method.
>>
>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
>>
>> By this, we can avoid multiple meanings of the same super member, nor
>> need any special mount option.
>> Remember, mount option is never a good way to enable/disable a new feature.
>>
>> The better method to enable/disable a feature should be mkfs and btrfstune.
>>
>> Then go mostly the same of your patch, but maybe with something extra:
>>
>> - Disbale multi-dev code
>>     Include device add/replace/removal, this is already done in your
>>     patch.
>>
>> - Completely skip device scanning
>>     I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>>     the device list at all.
>>     It only needs to be scanned at mount time, and never be kept in the
>>     in-memory device list.
> 
> This is actually a good point, we can do that already. As a conterpart
> to 5f58d783fd7823 ("btrfs: free device in btrfs_close_devices for a
> single device filesystem") that drops single device from the list,
> single fs devices wouldn't be added to the list but some checks could be
> still done like superblock validation for eventual error reporting.

Something similar occurred to me earlier. However, even for a single
device, we need to perform the scan because there may be an unfinished
replace target from a previous reboot, or a sprout Btrfs filesystem may
have a single seed device. If we were to make an exception for replace
targets and seed devices, it would only complicate the scan logic, which
goes against our attempt to simplify it.

Thanks, Anand
Qu Wenruo May 8, 2023, 11:50 a.m. UTC | #13
On 2023/5/8 19:27, Anand Jain wrote:
> On 05/05/2023 21:38, David Sterba wrote:
>> On Fri, May 05, 2023 at 03:21:35PM +0800, Qu Wenruo wrote:
>>> On 2023/5/5 01:07, Guilherme G. Piccoli wrote:
>>>> Btrfs doesn't currently support to mount 2 different devices holding
>>>> the
>>>> same filesystem - the fsid is used as a unique identifier in the
>>>> driver.
>>>> This case is supported though in some other common filesystems, like
>>>> ext4; one of the reasons for which is not trivial supporting this case
>>>> on btrfs is due to its multi-device filesystem nature, native RAID,
>>>> etc.
>>>
>>> Exactly, the biggest problem is the multi-device support.
>>>
>>> Btrfs needs to search and assemble all devices of a multi-device
>>> filesystem, which is normally handled by things like LVM/DMraid, thus
>>> other traditional fses won't need to bother that.
>>>
>>>>
>>>> Supporting the same-fsid mounts has the advantage of allowing btrfs to
>>>> be used in A/B partitioned devices, like mobile phones or the Steam
>>>> Deck
>>>> for example. Without this support, it's not safe for users to keep the
>>>> same "image version" in both A and B partitions, a setup that is quite
>>>> common for development, for example. Also, as a big bonus, it allows fs
>>>> integrity check based on block devices for RO devices (whereas
>>>> currently
>>>> it is required that both have different fsid, breaking the block device
>>>> hash comparison).
>>>>
>>>> Such same-fsid mounting is hereby added through the usage of the
>>>> mount option "virtual_fsid" - when such option is used, btrfs generates
>>>> a random fsid for the filesystem and leverages the metadata_uuid
>>>> infrastructure (introduced by [0]) to enable the usage of this
>>>> secondary
>>>> virtual fsid. But differently from the regular metadata_uuid flag, this
>>>> is not written into the disk superblock - effectively, this is a
>>>> spoofed
>>>> fsid approach that enables the same filesystem in different devices to
>>>> appear as different filesystems to btrfs on runtime.
>>>
>>> I would prefer a much simpler but more explicit method.
>>>
>>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
>>>
>>> By this, we can avoid multiple meanings of the same super member, nor
>>> need any special mount option.
>>> Remember, mount option is never a good way to enable/disable a new
>>> feature.
>>>
>>> The better method to enable/disable a feature should be mkfs and
>>> btrfstune.
>>>
>>> Then go mostly the same of your patch, but maybe with something extra:
>>>
>>> - Disbale multi-dev code
>>>     Include device add/replace/removal, this is already done in your
>>>     patch.
>>>
>>> - Completely skip device scanning
>>>     I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>>>     the device list at all.
>>>     It only needs to be scanned at mount time, and never be kept in the
>>>     in-memory device list.
>>
>> This is actually a good point, we can do that already. As a conterpart
>> to 5f58d783fd7823 ("btrfs: free device in btrfs_close_devices for a
>> single device filesystem") that drops single device from the list,
>> single fs devices wouldn't be added to the list but some checks could be
>> still done like superblock validation for eventual error reporting.
>
> Something similar occurred to me earlier. However, even for a single
> device, we need to perform the scan because there may be an unfinished
> replace target from a previous reboot, or a sprout Btrfs filesystem may
> have a single seed device. If we were to make an exception for replace
> targets and seed devices, it would only complicate the scan logic, which
> goes against our attempt to simplify it.

If we go SINGLE_DEV compat_ro flags, then no such problem at all, we can
easily reject any multi-dev features from such SINGLE_DEV fs.

Thanks,
Qu

>
> Thanks, Anand
>
>
Guilherme G. Piccoli May 8, 2023, 10:49 p.m. UTC | #14
On 05/05/2023 19:15, Qu Wenruo wrote:
> [...] 
>> Imagine you have 2 devices, /dev/sda1 and /dev/sda2 holding the exact
>> same image, with the SINGLE_DEV feature set. They are identical, and
>> IIUC no matter if we skip scanning or disable any multi-device approach,
>> in the end both have the *same* fsid. How do we track this in the btrfs
>> code now? Once we try to mount the second one, it'll try to add the same
>> entity to the fs_uuids list...
> 
> My bad, I forgot to mention that, if we hit such SINGLE_DEV fses, we 
> should also not add them to the fs_uuids list either.
> 
> So the fs_uuids list conflicts would not be a problem at all.

Awesome, thanks for clarifying Qu! Now I understand it =)

> [...]
>> Also, one more question: why do you say "Remember, mount option is never
>> a good way to enable/disable a new feature"? I'm not expert in
>> filesystems (by far heh), so I'm curious to fully understand your
>> point-of-view.
> 
> We had a bad example in the past, free space tree (aka, v2 space cache).
> 
> It's initially a pretty convenient way to enable the new feature, but 
> now it's making a lot of new features, which can depend on v2 cache, 
> more complex to handle those old mount options.
> 
> The compatibility matrix would only grow, and all the (mostly one-time) 
> logic need to be implemented in kernel.
> 
> So in the long run, we prefer offline convert tool.

OK, I understand your point. I guess I could rewrite it to make use of
such compat_ro flag, it'd be fine. *Personally* (thinking as an user), I
much rather have mount options, I think it's consistent with other
filesystems and doesn't require specific btrfs tooling usage...

But of course I'll defer the decision to the maintainers!

Cheers,


Guilherme
Guilherme G. Piccoli May 8, 2023, 10:59 p.m. UTC | #15
On 05/05/2023 20:00, David Sterba wrote:
> [...]
>> Hi David, thanks for your suggestion!
>>
>> It might be possible, it seems a valid suggestion. But worth notice that
>> we cannot modify the FS at all. That's why I've implemented the feature
>> in a way it "fakes" the fsid for the driver, as a mount option, but
>> nothing changes in the FS.
>>
>> The images on Deck are read-only. So, by using the metadata_uuid purely,
>> can we mount 2 identical images at the same time *not modifying* the
>> filesystem in any way? If it's possible, then we have only to implement
>> the skip scanning idea from Qu in the other thread (or else ioclt scans
>> would prevent mounting them).
> 
> Ok, I see, the device is read-only. The metadata_uuid is now set on an
> unmounted filesystem and we don't have any semantics for a mount option.
> 
> If there's an equivalent mount option (let's say metadata_uuid for
> compatibility) with the same semantics as if set offline, on the first
> commit the metadata_uuid would be written.
> 
> The question is if this would be sane for read-only devices. You've
> implemented the uuid on the metadata_uuid base but named it differently,
> but this effectively means that metadata_uuid could work on read-only
> devices too, but with some necessary updates to the device scanning.
> 
> From the use case perspective this should work, the virtual uuid would
> basically be the metadata_uuid set and on a read-only device. The
> problems start in the state transitions in the device tracking, we had
> some bugs there and the code is hard to grasp. For that I'd very much
> vote for using the metadata_uuid but we can provide an interface on top
> of that to make it work.

OK, being completely honest here, I couldn't parse fully what you're
proposing - I blame it to my lack of knowledge on btrfs, so apologies heh

Could you clarify it a bit more? Are you suggesting we somewhat rework
"metadata_uuid", to kinda overload its meaning to be able to accomplish
this same-fsid mounting using "metadata_uuid" purely?

I see that we seem to have 3 proposals here:

(a) The compat_ro flag from Qu;

(b) Your idea (that requires some clarification for my fully
understanding - thanks in advance!);

(c) Renaming the mount option "virtual_fsid" to "nouuid" to keep
filesystem consistency, like XFS (courtesy of Dave Chinner) - please
correct me here if I misunderstood you Dave =)

I'd like to thank you all for the suggestions, and I'm willing to follow
the preferred one - as long we have a consensus / blessing from the
maintainers, I'm happy to rework this as the best possible approach for
btrfs.

Also, what about patch 2, does it make sense or should we kinda "embed"
the idea of scan skipping into the same-fsid mounting? Per my current
understanding, the idea (a) from Qu includes/fixes the scan thing and
makes patch 2 unnecessary.

Thanks,


Guilherme
Qu Wenruo May 8, 2023, 11:18 p.m. UTC | #16
On 2023/5/9 06:59, Guilherme G. Piccoli wrote:
> On 05/05/2023 20:00, David Sterba wrote:
>> [...]
>>> Hi David, thanks for your suggestion!
>>>
>>> It might be possible, it seems a valid suggestion. But worth notice that
>>> we cannot modify the FS at all. That's why I've implemented the feature
>>> in a way it "fakes" the fsid for the driver, as a mount option, but
>>> nothing changes in the FS.
>>>
>>> The images on Deck are read-only. So, by using the metadata_uuid purely,
>>> can we mount 2 identical images at the same time *not modifying* the
>>> filesystem in any way? If it's possible, then we have only to implement
>>> the skip scanning idea from Qu in the other thread (or else ioclt scans
>>> would prevent mounting them).
>>
>> Ok, I see, the device is read-only. The metadata_uuid is now set on an
>> unmounted filesystem and we don't have any semantics for a mount option.
>>
>> If there's an equivalent mount option (let's say metadata_uuid for
>> compatibility) with the same semantics as if set offline, on the first
>> commit the metadata_uuid would be written.
>>
>> The question is if this would be sane for read-only devices. You've
>> implemented the uuid on the metadata_uuid base but named it differently,
>> but this effectively means that metadata_uuid could work on read-only
>> devices too, but with some necessary updates to the device scanning.
>>
>>  From the use case perspective this should work, the virtual uuid would
>> basically be the metadata_uuid set and on a read-only device. The
>> problems start in the state transitions in the device tracking, we had
>> some bugs there and the code is hard to grasp. For that I'd very much
>> vote for using the metadata_uuid but we can provide an interface on top
>> of that to make it work.
>
> OK, being completely honest here, I couldn't parse fully what you're
> proposing - I blame it to my lack of knowledge on btrfs, so apologies heh
>
> Could you clarify it a bit more? Are you suggesting we somewhat rework
> "metadata_uuid", to kinda overload its meaning to be able to accomplish
> this same-fsid mounting using "metadata_uuid" purely?
>
> I see that we seem to have 3 proposals here:
>
> (a) The compat_ro flag from Qu;
>
> (b) Your idea (that requires some clarification for my fully
> understanding - thanks in advance!);
>
> (c) Renaming the mount option "virtual_fsid" to "nouuid" to keep
> filesystem consistency, like XFS (courtesy of Dave Chinner) - please
> correct me here if I misunderstood you Dave =)

To me, (a) and (c) don't conflict at all.

We can allow "nouuid" only to work with SINGLE_DEV compat_ro.

That compat_ro flags is more like a better guarantee that the fs will
never have more disks.

As even with SINGLE_DEV compat_ro flags, we may still want some checks
to prevent the same fs being RW mounted at different instances, which
can cause other problems, thus dedicated "nouuid" may still be needed.

Thanks,
Qu

>
> I'd like to thank you all for the suggestions, and I'm willing to follow
> the preferred one - as long we have a consensus / blessing from the
> maintainers, I'm happy to rework this as the best possible approach for
> btrfs.
>
> Also, what about patch 2, does it make sense or should we kinda "embed"
> the idea of scan skipping into the same-fsid mounting? Per my current
> understanding, the idea (a) from Qu includes/fixes the scan thing and
> makes patch 2 unnecessary.
>
> Thanks,
>
>
> Guilherme
Guilherme G. Piccoli May 8, 2023, 11:49 p.m. UTC | #17
On 08/05/2023 20:18, Qu Wenruo wrote:
> [...]
>> I see that we seem to have 3 proposals here:
>>
>> (a) The compat_ro flag from Qu;
>>
>> (b) Your idea (that requires some clarification for my fully
>> understanding - thanks in advance!);
>>
>> (c) Renaming the mount option "virtual_fsid" to "nouuid" to keep
>> filesystem consistency, like XFS (courtesy of Dave Chinner) - please
>> correct me here if I misunderstood you Dave =)
> 
> To me, (a) and (c) don't conflict at all.
> 
> We can allow "nouuid" only to work with SINGLE_DEV compat_ro.
> 
> That compat_ro flags is more like a better guarantee that the fs will
> never have more disks.
> 
> As even with SINGLE_DEV compat_ro flags, we may still want some checks
> to prevent the same fs being RW mounted at different instances, which
> can cause other problems, thus dedicated "nouuid" may still be needed.
> 
> Thanks,
> Qu

Hey Qu, I confess now I'm a bit confused heh

The whole idea of (a) was to *not* use a mount option, right?! Per my
understanding of your objections in this thread, you're not into a mount
option for this same-fsid feature (based on a bad previous experience,
as you explained).

If we're keeping the "nouuid" mount option, why we'd require the
compat_ro flag? Or vice-versa: having the compat_ro flag, why we'd need
the mount option?

Thanks in advance for clarifications,


Guilherme
Qu Wenruo May 9, 2023, 12:02 a.m. UTC | #18
On 2023/5/9 07:49, Guilherme G. Piccoli wrote:
> On 08/05/2023 20:18, Qu Wenruo wrote:
>> [...]
>>> I see that we seem to have 3 proposals here:
>>>
>>> (a) The compat_ro flag from Qu;
>>>
>>> (b) Your idea (that requires some clarification for my fully
>>> understanding - thanks in advance!);
>>>
>>> (c) Renaming the mount option "virtual_fsid" to "nouuid" to keep
>>> filesystem consistency, like XFS (courtesy of Dave Chinner) - please
>>> correct me here if I misunderstood you Dave =)
>>
>> To me, (a) and (c) don't conflict at all.
>>
>> We can allow "nouuid" only to work with SINGLE_DEV compat_ro.
>>
>> That compat_ro flags is more like a better guarantee that the fs will
>> never have more disks.
>>
>> As even with SINGLE_DEV compat_ro flags, we may still want some checks
>> to prevent the same fs being RW mounted at different instances, which
>> can cause other problems, thus dedicated "nouuid" may still be needed.
>>
>> Thanks,
>> Qu
>
> Hey Qu, I confess now I'm a bit confused heh
>
> The whole idea of (a) was to *not* use a mount option, right?! Per my
> understanding of your objections in this thread, you're not into a mount
> option for this same-fsid feature (based on a bad previous experience,
> as you explained).

My bad, I initially thought there would be some extra checks inside VFS
layer rejecting the same fs.

But after checking the xfs code, it looks like it's handling the nouuid
internally.

Thus no need for nouuid mount option if we go compat_ro.

Although I would still like a nouuid mount option, which is only valid
if the fs has the compat_ro flags, to provide a generic behavior just
like XFS.

Thanks,
Qu
>
> If we're keeping the "nouuid" mount option, why we'd require the
> compat_ro flag? Or vice-versa: having the compat_ro flag, why we'd need
> the mount option? >
> Thanks in advance for clarifications,
>
>
> Guilherme
David Sterba May 11, 2023, 11:51 a.m. UTC | #19
On Mon, May 08, 2023 at 07:50:55PM +0800, Qu Wenruo wrote:
> On 2023/5/8 19:27, Anand Jain wrote:
> > On 05/05/2023 21:38, David Sterba wrote:
> >> On Fri, May 05, 2023 at 03:21:35PM +0800, Qu Wenruo wrote:
> >>> On 2023/5/5 01:07, Guilherme G. Piccoli wrote:
> >> This is actually a good point, we can do that already. As a conterpart
> >> to 5f58d783fd7823 ("btrfs: free device in btrfs_close_devices for a
> >> single device filesystem") that drops single device from the list,
> >> single fs devices wouldn't be added to the list but some checks could be
> >> still done like superblock validation for eventual error reporting.
> >
> > Something similar occurred to me earlier. However, even for a single
> > device, we need to perform the scan because there may be an unfinished
> > replace target from a previous reboot, or a sprout Btrfs filesystem may
> > have a single seed device. If we were to make an exception for replace
> > targets and seed devices, it would only complicate the scan logic, which
> > goes against our attempt to simplify it.
> 
> If we go SINGLE_DEV compat_ro flags, then no such problem at all, we can
> easily reject any multi-dev features from such SINGLE_DEV fs.

With the scanning complications that Anand mentions the compat_ro flag
might make more sense, with all the limitations but allowing a safe use
of the duplicated UUIDs.

The flag would have to be set at mkfs time or by btrfsune on an
unmounted filesystem. Doing that on a mounted filesystem is possible too
but brings problems with updating the state of scanned device,
potentially ongoing operations like dev-replace and more.
Anand Jain May 11, 2023, 2:12 p.m. UTC | #20
On 11/5/23 19:51, David Sterba wrote:
> On Mon, May 08, 2023 at 07:50:55PM +0800, Qu Wenruo wrote:
>> On 2023/5/8 19:27, Anand Jain wrote:
>>> On 05/05/2023 21:38, David Sterba wrote:
>>>> On Fri, May 05, 2023 at 03:21:35PM +0800, Qu Wenruo wrote:
>>>>> On 2023/5/5 01:07, Guilherme G. Piccoli wrote:
>>>> This is actually a good point, we can do that already. As a conterpart
>>>> to 5f58d783fd7823 ("btrfs: free device in btrfs_close_devices for a
>>>> single device filesystem") that drops single device from the list,
>>>> single fs devices wouldn't be added to the list but some checks could be
>>>> still done like superblock validation for eventual error reporting.
>>>
>>> Something similar occurred to me earlier. However, even for a single
>>> device, we need to perform the scan because there may be an unfinished
>>> replace target from a previous reboot, or a sprout Btrfs filesystem may
>>> have a single seed device. If we were to make an exception for replace
>>> targets and seed devices, it would only complicate the scan logic, which
>>> goes against our attempt to simplify it.
>>
>> If we go SINGLE_DEV compat_ro flags, then no such problem at all, we can
>> easily reject any multi-dev features from such SINGLE_DEV fs.
> 

For a single device, if we remove device replacement and seeding for a
specific flag, scanning is unnecessary.

> With the scanning complications that Anand mentions the compat_ro flag
> might make more sense, with all the limitations but allowing a safe use
> of the duplicated UUIDs.
> 
> The flag would have to be set at mkfs time or by btrfsune on an
> unmounted filesystem. Doing that on a mounted filesystem is possible too
> but brings problems with updating the state of scanned device,
> potentially ongoing operations like dev-replace and more.

Setting the flag at mkfs time is preferable, in my opinion. We could
still support the btrfstune method and/or online method later.

While we are here, I'm taking the opportunity to consolidate the
scattered metadata UUID checking, which has been a long-standing
goal of mine to clean up. This will make adding multi-UUID support
cleaner. If you have any ideas, please share.

Thanks, Anand
Guilherme G. Piccoli May 14, 2023, 9:25 p.m. UTC | #21
On 11/05/2023 08:51, David Sterba wrote:
> [...]
> With the scanning complications that Anand mentions the compat_ro flag
> might make more sense, with all the limitations but allowing a safe use
> of the duplicated UUIDs.
> 
> The flag would have to be set at mkfs time or by btrfsune on an
> unmounted filesystem. Doing that on a mounted filesystem is possible too
> but brings problems with updating the state of scanned device,
> potentially ongoing operations like dev-replace and more.

Hi David, thank you! So, would it make sense to also have a
"nouuid"-like mount option along with the compat_ro flag? I'm saying
this because I'm considering the "old"/existing SteamOS images heh

If we go only with the compat_ro flag, we'll only be able to mount 2
images at same time iff they have it set, meaning it'll only work for
newer images.

Anyway, I'm glad to implement the compat_ro flag code - I'll be out some
weeks on holidays, and will retake this work as soon as I'm back.

Thanks all that provided feedback / suggestions in this thread!
Cheers,


Guilherme
Guilherme G. Piccoli July 5, 2023, 10:52 p.m. UTC | #22
On 05/05/2023 04:21, Qu Wenruo wrote:
> [...]
> I would prefer a much simpler but more explicit method.
> 
> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
> 
> By this, we can avoid multiple meanings of the same super member, nor
> need any special mount option.
> Remember, mount option is never a good way to enable/disable a new feature.
> 
> The better method to enable/disable a feature should be mkfs and btrfstune.
> 
> Then go mostly the same of your patch, but maybe with something extra:
> 
> - Disbale multi-dev code
>    Include device add/replace/removal, this is already done in your
>    patch.
> 
> - Completely skip device scanning
>    I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>    the device list at all.
>    It only needs to be scanned at mount time, and never be kept in the
>    in-memory device list.
> 

Hi Qu, I'm implementing this compat_ro idea of yours, but I'd like to
ask your input in some "design decisions" I'm facing here.

(a) I've skipped the device_list_add() step of adding the recent created
fs_devices struct to fs_uuids list, but I kept the btrfs_device creation
step. With that, the mount of two filesystems with same fsid fails..at
sysfs directory creation!

Of course - because it tries to add the same folder name to
/sys/fs/btrfs/ !!! I have some options here:

(I) Should I keep using a random generated fsid for single_dev devices,
in order we can mount many of them while not messing too much with the
code? I'd continue "piggybacking" on metadata_uuid idea if (I) is the
proper choice.

(II) Or maybe track down all fsid usage in the code (like this sysfs
case) and deal with that? In the sysfs case, we could change that folder
name to some other format, like fsid.NUM for single_dev devices, whereas
NUM is an incremental value for devices mounted with same fsid.

I'm not too fond of this alternative due to its complexity and "API"
breakage - userspace already expects /sys/fs/btrfs/ entries to be the fsid.

(III) Should we hide the filesystem from sysfs (and other potential
conflicts that come from same fsid mounting points)? Seems a hideous
approach, due to API breakage and bug potentials.

Maybe there are other choices, better than mine - lemme know if you have
some ideas!

Also, one last question/confirmation: you mentioned that "The better
method to enable/disable a feature should be mkfs" - you mean the same
way mkfs could be used to set features like "raid56" or "no-holes"?

By checking "mkfs.btrfs -O list-all", I don't see metadata_uuid for
example, which is confined to btrfstune it seems. I'm already modifying
btrfs-progs/mkfs, but since I'm emailing you, why not confirm, right? heh

Thanks again for the advice and suggestions - much appreciated!
Cheers,


Guilherme
Qu Wenruo July 6, 2023, 12:53 a.m. UTC | #23
On 2023/7/6 06:52, Guilherme G. Piccoli wrote:
> On 05/05/2023 04:21, Qu Wenruo wrote:
>> [...]
>> I would prefer a much simpler but more explicit method.
>>
>> Just introduce a new compat_ro feature, maybe call it SINGLE_DEV.
>>
>> By this, we can avoid multiple meanings of the same super member, nor
>> need any special mount option.
>> Remember, mount option is never a good way to enable/disable a new feature.
>>
>> The better method to enable/disable a feature should be mkfs and btrfstune.
>>
>> Then go mostly the same of your patch, but maybe with something extra:
>>
>> - Disbale multi-dev code
>>     Include device add/replace/removal, this is already done in your
>>     patch.
>>
>> - Completely skip device scanning
>>     I see no reason to keep btrfs with SINGLE_DEV feature to be added to
>>     the device list at all.
>>     It only needs to be scanned at mount time, and never be kept in the
>>     in-memory device list.
>>
>
> Hi Qu, I'm implementing this compat_ro idea of yours, but I'd like to
> ask your input in some "design decisions" I'm facing here.
>
> (a) I've skipped the device_list_add() step of adding the recent created
> fs_devices struct to fs_uuids list, but I kept the btrfs_device creation
> step. With that, the mount of two filesystems with same fsid fails..at
> sysfs directory creation!
>
> Of course - because it tries to add the same folder name to
> /sys/fs/btrfs/ !!! I have some options here:
>
> (I) Should I keep using a random generated fsid for single_dev devices,
> in order we can mount many of them while not messing too much with the
> code? I'd continue "piggybacking" on metadata_uuid idea if (I) is the
> proper choice.
>
> (II) Or maybe track down all fsid usage in the code (like this sysfs
> case) and deal with that? In the sysfs case, we could change that folder
> name to some other format, like fsid.NUM for single_dev devices, whereas
> NUM is an incremental value for devices mounted with same fsid.
>
> I'm not too fond of this alternative due to its complexity and "API"
> breakage - userspace already expects /sys/fs/btrfs/ entries to be the fsid.
>
> (III) Should we hide the filesystem from sysfs (and other potential
> conflicts that come from same fsid mounting points)? Seems a hideous
> approach, due to API breakage and bug potentials.

Personally speaking, I would go one of the following solution:

- Keep the sysfs, but adds a refcount to the sysfs related structures
   If we still go register the sysfs interface, then we have to keep a
   refcount, or any of the same fsid got unmounted, we would remove the
   whole sysfs entry.

- Skip the sysfs entry completely for any fs with the new compat_ro flag
   This would your idea (III), but the sysfs interface itself is not that
   critical (we add and remove entries from time to time), so I believe
   it's feasible to hide the sysfs for certain features.

>
> Maybe there are other choices, better than mine - lemme know if you have
> some ideas!
>
> Also, one last question/confirmation: you mentioned that "The better
> method to enable/disable a feature should be mkfs" - you mean the same
> way mkfs could be used to set features like "raid56" or "no-holes"?

Yes.

>
> By checking "mkfs.btrfs -O list-all", I don't see metadata_uuid for
> example, which is confined to btrfstune it seems. I'm already modifying
> btrfs-progs/mkfs, but since I'm emailing you, why not confirm, right? heh

I'm not familiar with metadata_uuid, but there are similar features like
seeding, which is only available in btrfstune, but not in mkfs.

It's not that uncommon, but yeah, you have found something we should
improve on.

Thanks,
Qu

>
> Thanks again for the advice and suggestions - much appreciated!
> Cheers,
>
>
> Guilherme
Guilherme G. Piccoli July 6, 2023, 10:32 p.m. UTC | #24
On 05/07/2023 21:53, Qu Wenruo wrote:
> [...]
> Personally speaking, I would go one of the following solution:
> 
> - Keep the sysfs, but adds a refcount to the sysfs related structures
>    If we still go register the sysfs interface, then we have to keep a
>    refcount, or any of the same fsid got unmounted, we would remove the
>    whole sysfs entry.
> 
> - Skip the sysfs entry completely for any fs with the new compat_ro flag
>    This would your idea (III), but the sysfs interface itself is not that
>    critical (we add and remove entries from time to time), so I believe
>    it's feasible to hide the sysfs for certain features.
> 

Hi Qu, thanks for you prompt response.
I've been trying to mess with btrfs sysfs to allow two same fsid
co-existing, without success. For each corner case I handle, two more
show-up heh

Seems quite tricky and error-prone to have this "special-casing" of
sysfs just to accommodate this feature.

Are you strongly against keeping the previous idea, of a spoofed/virtual
fsid, but applied to the compat_ro single_dev idea? This way, all of
this sysfs situation (and other potentially hidden corner cases) would
be avoided. That's like my suggestion (I).

David / Anand, any thoughts/ideas? Thanks in advance!


>> [...]
>> Also, one last question/confirmation: you mentioned that "The better
>> method to enable/disable a feature should be mkfs" - you mean the same
>> way mkfs could be used to set features like "raid56" or "no-holes"?
> 
> Yes.
> 
>> [...]
> I'm not familiar with metadata_uuid, but there are similar features like
> seeding, which is only available in btrfstune, but not in mkfs.
> 
> It's not that uncommon, but yeah, you have found something we should
> improve on.
> 

Thanks for confirming, I could implement it in both mkfs and btrfstune -
seems the more consistent path.
Cheers,


Guilherme
diff mbox series

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9e1596bb208d..66c2bac343b8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -468,7 +468,8 @@  static int check_tree_block_fsid(struct extent_buffer *eb)
 	 * seed devices it's forbidden to have their uuid changed so reading
 	 * ->fsid in this case is fine
 	 */
-	if (btrfs_fs_incompat(fs_info, METADATA_UUID))
+	if (btrfs_fs_incompat(fs_info, METADATA_UUID) ||
+	    fs_devices->virtual_fsid)
 		metadata_uuid = fs_devices->metadata_uuid;
 	else
 		metadata_uuid = fs_devices->fsid;
@@ -2539,6 +2540,7 @@  int btrfs_validate_super(struct btrfs_fs_info *fs_info,
 {
 	u64 nodesize = btrfs_super_nodesize(sb);
 	u64 sectorsize = btrfs_super_sectorsize(sb);
+	u8 *fsid;
 	int ret = 0;
 
 	if (btrfs_super_magic(sb) != BTRFS_MAGIC) {
@@ -2619,8 +2621,22 @@  int btrfs_validate_super(struct btrfs_fs_info *fs_info,
 		ret = -EINVAL;
 	}
 
-	if (memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
-		   BTRFS_FSID_SIZE)) {
+	/*
+	 * When the virtual_fsid flag is passed at mount time, btrfs
+	 * creates a random fsid and makes use of the metadata_uuid
+	 * infrastructure in order to allow, for example, two devices
+	 * with same fsid getting mounted at the same time. But notice
+	 * no changes happen at the disk level, so the random fsid is a
+	 * driver abstraction for that single mount, not to be written
+	 * in the disk. That's the reason we're required here to compare
+	 * the fsid with the metadata_uuid if virtual_fsid was set.
+	 */
+	if (fs_info->fs_devices->virtual_fsid)
+		fsid = fs_info->fs_devices->metadata_uuid;
+	else
+		fsid = fs_info->fs_devices->fsid;
+
+	if (memcmp(fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE)) {
 		btrfs_err(fs_info,
 		"superblock fsid doesn't match fsid of fs_devices: %pU != %pU",
 			fs_info->super_copy->fsid, fs_info->fs_devices->fsid);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ba769a1eb87a..35e3a23f8c83 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2677,6 +2677,12 @@  static long btrfs_ioctl_rm_dev_v2(struct file *file, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (fs_info->fs_devices->virtual_fsid) {
+		btrfs_err(fs_info,
+			  "device removal is unsupported on devices mounted with virtual fsid\n");
+		return -EINVAL;
+	}
+
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args))
 		return PTR_ERR(vol_args);
@@ -2743,6 +2749,12 @@  static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (fs_info->fs_devices->virtual_fsid) {
+		btrfs_err(fs_info,
+			  "device removal is unsupported on devices mounted with virtual fsid\n");
+		return -EINVAL;
+	}
+
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args))
 		return PTR_ERR(vol_args);
@@ -3261,6 +3273,12 @@  static long btrfs_ioctl_dev_replace(struct btrfs_fs_info *fs_info,
 		return -EINVAL;
 	}
 
+	if (fs_info->fs_devices->virtual_fsid) {
+		btrfs_err(fs_info,
+			  "device replace is unsupported on devices mounted with virtual fsid\n");
+		return -EINVAL;
+	}
+
 	p = memdup_user(arg, sizeof(*p));
 	if (IS_ERR(p))
 		return PTR_ERR(p);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 366fb4cde145..8d9df169107a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -115,6 +115,7 @@  enum {
 	Opt_thread_pool,
 	Opt_treelog, Opt_notreelog,
 	Opt_user_subvol_rm_allowed,
+	Opt_virtual_fsid,
 
 	/* Rescue options */
 	Opt_rescue,
@@ -188,6 +189,7 @@  static const match_table_t tokens = {
 	{Opt_treelog, "treelog"},
 	{Opt_notreelog, "notreelog"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_virtual_fsid, "virtual_fsid"},
 
 	/* Rescue options */
 	{Opt_rescue, "rescue=%s"},
@@ -352,9 +354,10 @@  int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_subvol_empty:
 		case Opt_subvolid:
 		case Opt_device:
+		case Opt_virtual_fsid:
 			/*
 			 * These are parsed by btrfs_parse_subvol_options or
-			 * btrfs_parse_device_options and can be ignored here.
+			 * btrfs_parse_early_options and can be ignored here.
 			 */
 			break;
 		case Opt_nodatasum:
@@ -845,9 +848,10 @@  int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
  * All other options will be parsed on much later in the mount process and
  * only when we need to allocate a new super block.
  */
-static int btrfs_parse_device_options(const char *options, fmode_t flags,
-				      void *holder)
+static int btrfs_parse_early_options(const char *options, fmode_t flags,
+				    bool *virtual_fsid, void *holder)
 {
+	struct btrfs_scan_info info = { .vfsid = false };
 	substring_t args[MAX_OPT_ARGS];
 	char *device_name, *opts, *orig, *p;
 	struct btrfs_device *device = NULL;
@@ -874,14 +878,18 @@  static int btrfs_parse_device_options(const char *options, fmode_t flags,
 			continue;
 
 		token = match_token(p, tokens, args);
+
+		if (token == Opt_virtual_fsid)
+			(*virtual_fsid) = true;
+
 		if (token == Opt_device) {
 			device_name = match_strdup(&args[0]);
 			if (!device_name) {
 				error = -ENOMEM;
 				goto out;
 			}
-			device = btrfs_scan_one_device(device_name, flags,
-					holder);
+			info.path = device_name;
+			device = btrfs_scan_one_device(&info, flags, holder);
 			kfree(device_name);
 			if (IS_ERR(device)) {
 				error = PTR_ERR(device);
@@ -913,7 +921,7 @@  static int btrfs_parse_subvol_options(const char *options, char **subvol_name,
 
 	/*
 	 * strsep changes the string, duplicate it because
-	 * btrfs_parse_device_options gets called later
+	 * btrfs_parse_early_options gets called later
 	 */
 	opts = kstrdup(options, GFP_KERNEL);
 	if (!opts)
@@ -1431,6 +1439,7 @@  static struct dentry *mount_subvol(const char *subvol_name, u64 subvol_objectid,
 static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
 		int flags, const char *device_name, void *data)
 {
+	struct btrfs_scan_info info = { .path = NULL, .vfsid = false};
 	struct block_device *bdev = NULL;
 	struct super_block *s;
 	struct btrfs_device *device = NULL;
@@ -1472,13 +1481,14 @@  static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
 	}
 
 	mutex_lock(&uuid_mutex);
-	error = btrfs_parse_device_options(data, mode, fs_type);
+	error = btrfs_parse_early_options(data, mode, &info.vfsid, fs_type);
 	if (error) {
 		mutex_unlock(&uuid_mutex);
 		goto error_fs_info;
 	}
 
-	device = btrfs_scan_one_device(device_name, mode, fs_type);
+	info.path = device_name;
+	device = btrfs_scan_one_device(&info, mode, fs_type);
 	if (IS_ERR(device)) {
 		mutex_unlock(&uuid_mutex);
 		error = PTR_ERR(device);
@@ -2169,6 +2179,7 @@  static int btrfs_control_open(struct inode *inode, struct file *file)
 static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
 				unsigned long arg)
 {
+	struct btrfs_scan_info info = { .vfsid = false };
 	struct btrfs_ioctl_vol_args *vol;
 	struct btrfs_device *device = NULL;
 	dev_t devt = 0;
@@ -2182,10 +2193,11 @@  static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
 		return PTR_ERR(vol);
 	vol->name[BTRFS_PATH_NAME_MAX] = '\0';
 
+	info.path = vol->name;
 	switch (cmd) {
 	case BTRFS_IOC_SCAN_DEV:
 		mutex_lock(&uuid_mutex);
-		device = btrfs_scan_one_device(vol->name, FMODE_READ,
+		device = btrfs_scan_one_device(&info, FMODE_READ,
 					       &btrfs_root_fs_type);
 		ret = PTR_ERR_OR_ZERO(device);
 		mutex_unlock(&uuid_mutex);
@@ -2200,7 +2212,7 @@  static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
 		break;
 	case BTRFS_IOC_DEVICES_READY:
 		mutex_lock(&uuid_mutex);
-		device = btrfs_scan_one_device(vol->name, FMODE_READ,
+		device = btrfs_scan_one_device(&info, FMODE_READ,
 					       &btrfs_root_fs_type);
 		if (IS_ERR(device)) {
 			mutex_unlock(&uuid_mutex);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c6d592870400..5a38b3482ec5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -745,6 +745,33 @@  static struct btrfs_fs_devices *find_fsid_reverted_metadata(
 
 	return NULL;
 }
+
+static void prepare_virtual_fsid(struct btrfs_super_block *disk_super,
+				 const char *path)
+{
+	struct btrfs_fs_devices *fs_devices;
+	u8 vfsid[BTRFS_FSID_SIZE];
+	bool dup_fsid = true;
+
+	while (dup_fsid) {
+		dup_fsid = false;
+		generate_random_uuid(vfsid);
+
+		list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+			if (!memcmp(vfsid, fs_devices->fsid, BTRFS_FSID_SIZE) ||
+			    !memcmp(vfsid, fs_devices->metadata_uuid,
+				    BTRFS_FSID_SIZE))
+				dup_fsid = true;
+		}
+	}
+
+	memcpy(disk_super->metadata_uuid, disk_super->fsid, BTRFS_FSID_SIZE);
+	memcpy(disk_super->fsid, vfsid, BTRFS_FSID_SIZE);
+
+	pr_info("BTRFS: virtual fsid (%pU) set for device %s (real fsid %pU)\n",
+		disk_super->fsid, path, disk_super->metadata_uuid);
+}
+
 /*
  * Add new device to list of registered devices
  *
@@ -752,12 +779,15 @@  static struct btrfs_fs_devices *find_fsid_reverted_metadata(
  * device pointer which was just added or updated when successful
  * error pointer when failed
  */
-static noinline struct btrfs_device *device_list_add(const char *path,
-			   struct btrfs_super_block *disk_super,
-			   bool *new_device_added)
+static noinline
+struct btrfs_device *device_list_add(const struct btrfs_scan_info *info,
+				     struct btrfs_super_block *disk_super,
+				     bool *new_device_added)
 {
 	struct btrfs_device *device;
 	struct btrfs_fs_devices *fs_devices = NULL;
+	const char *path = info->path;
+	const bool virtual_fsid = info->vfsid;
 	struct rcu_string *name;
 	u64 found_transid = btrfs_super_generation(disk_super);
 	u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
@@ -775,12 +805,25 @@  static noinline struct btrfs_device *device_list_add(const char *path,
 		return ERR_PTR(error);
 	}
 
+	if (virtual_fsid) {
+		if (has_metadata_uuid || fsid_change_in_progress) {
+			btrfs_err(NULL,
+			  "failed to add device %s with virtual fsid (%s)\n",
+				  path, (has_metadata_uuid ?
+					 "the fs already has a metadata_uuid" :
+					 "fsid change in progress"));
+			return ERR_PTR(-EINVAL);
+		}
+
+		prepare_virtual_fsid(disk_super, path);
+	}
+
 	if (fsid_change_in_progress) {
 		if (!has_metadata_uuid)
 			fs_devices = find_fsid_inprogress(disk_super);
 		else
 			fs_devices = find_fsid_changed(disk_super);
-	} else if (has_metadata_uuid) {
+	} else if (has_metadata_uuid || virtual_fsid) {
 		fs_devices = find_fsid_with_metadata_uuid(disk_super);
 	} else {
 		fs_devices = find_fsid_reverted_metadata(disk_super);
@@ -790,7 +833,7 @@  static noinline struct btrfs_device *device_list_add(const char *path,
 
 
 	if (!fs_devices) {
-		if (has_metadata_uuid)
+		if (has_metadata_uuid || virtual_fsid)
 			fs_devices = alloc_fs_devices(disk_super->fsid,
 						      disk_super->metadata_uuid);
 		else
@@ -799,6 +842,7 @@  static noinline struct btrfs_device *device_list_add(const char *path,
 		if (IS_ERR(fs_devices))
 			return ERR_CAST(fs_devices);
 
+		fs_devices->virtual_fsid = virtual_fsid;
 		fs_devices->fsid_change = fsid_change_in_progress;
 
 		mutex_lock(&fs_devices->device_list_mutex);
@@ -870,11 +914,21 @@  static noinline struct btrfs_device *device_list_add(const char *path,
 	"BTRFS: device label %s devid %llu transid %llu %s scanned by %s (%d)\n",
 				disk_super->label, devid, found_transid, path,
 				current->comm, task_pid_nr(current));
-		else
-			pr_info(
+		else {
+			if (virtual_fsid)
+				pr_info(
+"BTRFS: device with virtual fsid %pU (real fsid %pU) devid %llu transid %llu %s scanned by %s (%d)\n",
+					disk_super->fsid,
+					disk_super->metadata_uuid, devid,
+					found_transid, path, current->comm,
+					task_pid_nr(current));
+			else
+				pr_info(
 	"BTRFS: device fsid %pU devid %llu transid %llu %s scanned by %s (%d)\n",
-				disk_super->fsid, devid, found_transid, path,
-				current->comm, task_pid_nr(current));
+					disk_super->fsid, devid, found_transid,
+					path, current->comm,
+					task_pid_nr(current));
+		}
 
 	} else if (!device->name || strcmp(device->name->str, path)) {
 		/*
@@ -1348,8 +1402,8 @@  int btrfs_forget_devices(dev_t devt)
  * and we are not allowed to call set_blocksize during the scan. The superblock
  * is read via pagecache
  */
-struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
-					   void *holder)
+struct btrfs_device *btrfs_scan_one_device(const struct btrfs_scan_info *info,
+					   fmode_t flags, void *holder)
 {
 	struct btrfs_super_block *disk_super;
 	bool new_device_added = false;
@@ -1377,7 +1431,7 @@  struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 	 * values temporarily, as the device paths of the fsid are the only
 	 * required information for assembling the volume.
 	 */
-	bdev = blkdev_get_by_path(path, flags, holder);
+	bdev = blkdev_get_by_path(info->path, flags, holder);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
 
@@ -1394,7 +1448,7 @@  struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 		goto error_bdev_put;
 	}
 
-	device = device_list_add(path, disk_super, &new_device_added);
+	device = device_list_add(info, disk_super, &new_device_added);
 	if (!IS_ERR(device) && new_device_added)
 		btrfs_free_stale_devices(device->devt, device);
 
@@ -2390,6 +2444,12 @@  int btrfs_get_dev_args_from_path(struct btrfs_fs_info *fs_info,
 
 	args->devid = btrfs_stack_device_id(&disk_super->dev_item);
 	memcpy(args->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE);
+
+	/*
+	 * Notice that the virtual_fsid feature is not handled here; device
+	 * removal/replace is instead forbidden when such feature is in-use,
+	 * but this note is for future users of btrfs_get_dev_args_from_path().
+	 */
 	if (btrfs_fs_incompat(fs_info, METADATA_UUID))
 		memcpy(args->fsid, disk_super->metadata_uuid, BTRFS_FSID_SIZE);
 	else
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7e51f2238f72..f2354e8288f9 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -73,6 +73,11 @@  enum btrfs_raid_types {
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 #define BTRFS_DEV_STATE_NO_READA	(5)
 
+struct btrfs_scan_info {
+	const char *path;
+	bool vfsid;
+};
+
 struct btrfs_zoned_device_info;
 
 struct btrfs_device {
@@ -278,6 +283,7 @@  struct btrfs_fs_devices {
 	u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
 	u8 metadata_uuid[BTRFS_FSID_SIZE];
 	bool fsid_change;
+	bool virtual_fsid;
 	struct list_head fs_list;
 
 	/*
@@ -537,7 +543,7 @@  struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 void btrfs_mapping_tree_free(struct extent_map_tree *tree);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		       fmode_t flags, void *holder);
-struct btrfs_device *btrfs_scan_one_device(const char *path,
+struct btrfs_device *btrfs_scan_one_device(const struct btrfs_scan_info *info,
 					   fmode_t flags, void *holder);
 int btrfs_forget_devices(dev_t devt);
 void btrfs_close_devices(struct btrfs_fs_devices *fs_devices);