diff mbox

[RESEND,1/2] btrfs: handle volume split brain scenario

Message ID 20171217135247.1838-1-anand.jain@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Anand Jain Dec. 17, 2017, 1:52 p.m. UTC
In two device configs of RAID1/RAID5 where one device can be missing
in the degraded mount, or in the configs such as four devices RAID6
where two devices can be missing, in these type of configs it can form
two separate set of devices where each of the set can be mounted without
the other set. And if user does that and writes the data, one set would
have to loose the data. As of now we just loose the data without the
user consent depending on the SB generation number which user won't have
any meaningful interpretations.

This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which
will be set when RAID group is mounted with missing device, so when RAID
is mounted with no missing device and if all the devices contains this
flag then we know that each of these set are mounted without the other
set. In this scenario this patch will fail the mount so that user can
decide which set would they prefer to keep the data.

Now the procedure to assemble the disks would be to continue to mount
the good set first without the device set on which new data can be
ignored, and later run btrfs device scan to bring in the missing device
and complete the RAID group which then shall reset the flag
BTRFS_SUPER_FLAG_VOL_MOVED_ON.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
On top of kdave misc-next.

 fs/btrfs/disk-io.c              | 56 ++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c              | 14 +++++++++--
 include/uapi/linux/btrfs_tree.h |  1 +
 3 files changed, 68 insertions(+), 3 deletions(-)

Comments

Austin S. Hemmelgarn Dec. 18, 2017, 12:36 p.m. UTC | #1
On 2017-12-17 08:52, Anand Jain wrote:
> In two device configs of RAID1/RAID5 where one device can be missing
> in the degraded mount, or in the configs such as four devices RAID6
> where two devices can be missing, in these type of configs it can form
> two separate set of devices where each of the set can be mounted without
> the other set. And if user does that and writes the data, one set would
> have to loose the data. As of now we just loose the data without the
> user consent depending on the SB generation number which user won't have
> any meaningful interpretations.
> 
> This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which
> will be set when RAID group is mounted with missing device, so when RAID
> is mounted with no missing device and if all the devices contains this
> flag then we know that each of these set are mounted without the other
> set. In this scenario this patch will fail the mount so that user can
> decide which set would they prefer to keep the data.
> 
> Now the procedure to assemble the disks would be to continue to mount
> the good set first without the device set on which new data can be
> ignored, and later run btrfs device scan to bring in the missing device
> and complete the RAID group which then shall reset the flag
> BTRFS_SUPER_FLAG_VOL_MOVED_ON.
Couple of thoughts on this:

1. This needs to go in _after_ the patches to allow the kernel to forget 
devices.  Otherwise, the only way to handle things is to physically 
disconnect the device you don't want, mount the other one, and then 
reconnect the disconnected device, because pretty much everything uses 
udev, which by default scans for BTRFS on every device that gets 
connected, and handling it like that is not an option for some people 
(potentially for quite a few people).
2. How exactly is this supposed to work for people who don't have new 
enough btrfs-progs to tell the kernel to forget the device?  Ideally, 
there should be some other way to tell the kernel which one to use 
(perhaps a one-shot mount option?) without needing any userspace support 
(and if that happens, then I retract my first comment).  I'm not saying 
that the currently proposed method is a bad solution, just that it's not 
a complete solution from a usability perspective.
3. How does this get handled with volumes using the raid1 profile and 
more than 2 devices?  In that case, things get complicated fast. 
Handling an array of N devices in raid1 mode where N-1 devices have this 
flag set is in theory easy, except you have no verification that the N-1 
devices were mounted separately or as a group (in the first case, the 
user needs to pick one, in the second, it should automatically recover 
since they all agree on the state of the filesystem).
4. Having some way to do this solely from btrfs-progs would be nice too 
(so that you can just drop to a shell during boot, or drop to the 
initramfs, and fix things without having to mess with device scanning), 
but is secondary to the kernel-side IMHO.

Other than all of that, I do think this is a good idea.  Being able to 
more sanely inform the user about the situation and help them figure out 
how to fix it correctly is something that's desperately needed here.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> On top of kdave misc-next.
> 
>   fs/btrfs/disk-io.c              | 56 ++++++++++++++++++++++++++++++++++++++++-
>   fs/btrfs/volumes.c              | 14 +++++++++--
>   include/uapi/linux/btrfs_tree.h |  1 +
>   3 files changed, 68 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a3e9b74f6006..55bc6c846671 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -61,7 +61,8 @@
>   				 BTRFS_HEADER_FLAG_RELOC |\
>   				 BTRFS_SUPER_FLAG_ERROR |\
>   				 BTRFS_SUPER_FLAG_SEEDING |\
> -				 BTRFS_SUPER_FLAG_METADUMP)
> +				 BTRFS_SUPER_FLAG_METADUMP|\
> +				 BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>   
>   static const struct extent_io_ops btree_extent_io_ops;
>   static void end_workqueue_fn(struct btrfs_work *work);
> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>   	return 0;
>   }
>   
> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
> +{
> +	unsigned long devs_moved_on = 0;
> +	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
> +	struct list_head *head = &fs_devs->devices;
> +	struct btrfs_device *dev;
> +
> +again:
> +	list_for_each_entry(dev, head, dev_list) {
> +		struct buffer_head *bh;
> +		struct btrfs_super_block *sb;
> +
> +		if (!dev->devid)
> +			continue;
> +
> +		bh = btrfs_read_dev_super(dev->bdev);
> +		if (IS_ERR(bh))
> +			continue;
> +
> +		sb = (struct btrfs_super_block *)bh->b_data;
> +		if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
> +			devs_moved_on++;
> +		brelse(bh);
> +	}
> +
> +	fs_devs = fs_devs->seed;
> +	if (fs_devs) {
> +		head = &fs_devs->devices;
> +		goto again;
> +	}
> +
> +	if (devs_moved_on == fs_info->fs_devices->total_devices)
> +		return true;
> +	else
> +		return false;
> +}
> +
>   int open_ctree(struct super_block *sb,
>   	       struct btrfs_fs_devices *fs_devices,
>   	       char *options)
> @@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
>   		goto fail_sysfs;
>   	}
>   
> +	if (fs_info->fs_devices->missing_devices) {
> +		btrfs_set_super_flags(fs_info->super_copy,
> +			fs_info->super_copy->flags |
> +			BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	} else if (fs_info->super_copy->flags &
> +			BTRFS_SUPER_FLAG_VOL_MOVED_ON) {
> +		if (volume_has_split_brain(fs_info)) {
> +			btrfs_err(fs_info,
> +				"Detected 'moved_on' flag on all device");
> +			goto fail_sysfs;
> +		}
> +		btrfs_set_super_flags(fs_info->super_copy,
> +			fs_info->super_copy->flags &
> +			~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	}
> +
>   	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>   					       "btrfs-cleaner");
>   	if (IS_ERR(fs_info->cleaner_kthread))
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 33e814ef992f..c08b9b89e285 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2057,8 +2057,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
>   	device->fs_devices->num_devices--;
>   	device->fs_devices->total_devices--;
>   
> -	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
> +	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
>   		device->fs_devices->missing_devices--;
> +		if (!device->fs_devices->missing_devices)
> +			btrfs_set_super_flags(fs_info->super_copy,
> +				fs_info->super_copy->flags &
> +				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	}
>   
>   	btrfs_assign_next_active_device(fs_info, device, NULL);
>   
> @@ -2132,8 +2137,13 @@ void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_fs_info *fs_info,
>   	list_del_rcu(&srcdev->dev_list);
>   	list_del(&srcdev->dev_alloc_list);
>   	fs_devices->num_devices--;
> -	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
> +	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state)) {
>   		fs_devices->missing_devices--;
> +		if (!fs_devices->missing_devices)
> +			btrfs_set_super_flags(fs_info->super_copy,
> +				fs_info->super_copy->flags &
> +				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	}
>   
>   	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
>   		fs_devices->rw_devices--;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 6d6e5da51527..a9f625be0589 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -456,6 +456,7 @@ struct btrfs_free_space_header {
>   
>   #define BTRFS_SUPER_FLAG_SEEDING	(1ULL << 32)
>   #define BTRFS_SUPER_FLAG_METADUMP	(1ULL << 33)
> +#define BTRFS_SUPER_FLAG_VOL_MOVED_ON	(1ULL << 36)
>   
>   
>   /*
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov Dec. 18, 2017, 12:52 p.m. UTC | #2
On 17.12.2017 15:52, Anand Jain wrote:
> In two device configs of RAID1/RAID5 where one device can be missing
> in the degraded mount, or in the configs such as four devices RAID6
> where two devices can be missing, in these type of configs it can form
> two separate set of devices where each of the set can be mounted without
> the other set. And if user does that and writes the data, one set would
> have to loose the data. As of now we just loose the data without the
> user consent depending on the SB generation number which user won't have
> any meaningful interpretations.

This is rather hard to parse something like:

In raid configs RAID1/RAID5/RAID6 it's possible to have some devices
missing which would render btrfs to be mounted in degraded state but
still be operational. In those cases it's possible (albeit highly
undesirable) that the degraded and missing parts of the filesystem are
mounted independently. When writes occur such split-brain scenarios
(caused by intentional user action) then one of the sides of the RAID
config will have to be blown away when bringing it back to the
consistent state.

> 
> This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which
MOVED_ON sounds way too colloquial. something like VOL_FORCED_MOUNT,
VOL_PARTIAL_MOUNT.

Another thing which comes to mind is why not piggyback on the degraded
mount capabilities of the filesystem - is it just a mount option or do
we have a flag which indicates this volume was mounted in degraded mode?
If there is not flag perhaps name the flag VOL_DEGRADED_MOUNT?

> will be set when RAID group is mounted with missing device, so when RAID
> is mounted with no missing device and if all the devices contains this
> flag then we know that each of these set are mounted without the other
> set. In this scenario this patch will fail the mount so that user can
> decide which set would they prefer to keep the data.

This is also hard to parse

> 
> Now the procedure to assemble the disks would be to continue to mount
> the good set first without the device set on which new data can be
> ignored, and later run btrfs device scan to bring in the missing device
> and complete the RAID group which then shall reset the flag
> BTRFS_SUPER_FLAG_VOL_MOVED_ON.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> On top of kdave misc-next.
> 
>  fs/btrfs/disk-io.c              | 56 ++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/volumes.c              | 14 +++++++++--
>  include/uapi/linux/btrfs_tree.h |  1 +
>  3 files changed, 68 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a3e9b74f6006..55bc6c846671 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -61,7 +61,8 @@
>  				 BTRFS_HEADER_FLAG_RELOC |\
>  				 BTRFS_SUPER_FLAG_ERROR |\
>  				 BTRFS_SUPER_FLAG_SEEDING |\
> -				 BTRFS_SUPER_FLAG_METADUMP)
> +				 BTRFS_SUPER_FLAG_METADUMP|\
> +				 BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>  
>  static const struct extent_io_ops btree_extent_io_ops;
>  static void end_workqueue_fn(struct btrfs_work *work);
> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>  	return 0;
>  }
>  
> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
> +{
> +	unsigned long devs_moved_on = 0;
> +	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;

almost every other instance of the which iterates fs_devices uses the
full name of the temp variable fs_devices, let's do that here as well
for the sake of grep

> +	struct list_head *head = &fs_devs->devices;
> +	struct btrfs_device *dev;
> +
> +again:
> +	list_for_each_entry(dev, head, dev_list) {
> +		struct buffer_head *bh;
> +		struct btrfs_super_block *sb;
> +
> +		if (!dev->devid)
> +			continue;
> +
> +		bh = btrfs_read_dev_super(dev->bdev);
> +		if (IS_ERR(bh))
> +			continue;
> +
> +		sb = (struct btrfs_super_block *)bh->b_data;
> +		if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
> +			devs_moved_on++;
> +		brelse(bh);
> +	}
> +
> +	fs_devs = fs_devs->seed;
> +	if (fs_devs) {
> +		head = &fs_devs->devices;
> +		goto again;
> +	}
> +
> +	if (devs_moved_on == fs_info->fs_devices->total_devices)
> +		return true;
> +	else
> +		return false;
> +}
> +
>  int open_ctree(struct super_block *sb,
>  	       struct btrfs_fs_devices *fs_devices,
>  	       char *options)
> @@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
>  		goto fail_sysfs;
>  	}
>  
> +	if (fs_info->fs_devices->missing_devices) {
> +		btrfs_set_super_flags(fs_info->super_copy,
> +			fs_info->super_copy->flags |
> +			BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	} else if (fs_info->super_copy->flags &
> +			BTRFS_SUPER_FLAG_VOL_MOVED_ON) {
> +		if (volume_has_split_brain(fs_info)) {
> +			btrfs_err(fs_info,
> +				"Detected 'moved_on' flag on all device");
> +			goto fail_sysfs;
> +		}
> +		btrfs_set_super_flags(fs_info->super_copy,
> +			fs_info->super_copy->flags &
> +			~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	}

Is there any specific reason why you place this check right before the
cleaner_kthread? If not I think it makes sense to be as close as
possible to btrfs_read_chunk_tree, since the information it uses is
populated by that function.

> +
>  	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>  					       "btrfs-cleaner");
>  	if (IS_ERR(fs_info->cleaner_kthread))
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 33e814ef992f..c08b9b89e285 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2057,8 +2057,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
>  	device->fs_devices->num_devices--;
>  	device->fs_devices->total_devices--;
>  
> -	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
> +	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
>  		device->fs_devices->missing_devices--;
> +		if (!device->fs_devices->missing_devices)
> +			btrfs_set_super_flags(fs_info->super_copy,
> +				fs_info->super_copy->flags &
> +				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	}
>  
>  	btrfs_assign_next_active_device(fs_info, device, NULL);
>  
> @@ -2132,8 +2137,13 @@ void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_fs_info *fs_info,
>  	list_del_rcu(&srcdev->dev_list);
>  	list_del(&srcdev->dev_alloc_list);
>  	fs_devices->num_devices--;
> -	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
> +	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state)) {
>  		fs_devices->missing_devices--;
> +		if (!fs_devices->missing_devices)
> +			btrfs_set_super_flags(fs_info->super_copy,
> +				fs_info->super_copy->flags &
> +				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
> +	}
>  
>  	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
>  		fs_devices->rw_devices--;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 6d6e5da51527..a9f625be0589 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -456,6 +456,7 @@ struct btrfs_free_space_header {
>  
>  #define BTRFS_SUPER_FLAG_SEEDING	(1ULL << 32)
>  #define BTRFS_SUPER_FLAG_METADUMP	(1ULL << 33)
> +#define BTRFS_SUPER_FLAG_VOL_MOVED_ON	(1ULL << 36)
>  
>  
>  /*
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Anand Jain Dec. 18, 2017, 2:39 p.m. UTC | #3
>> Now the procedure to assemble the disks would be to continue to mount
>> the good set first without the device set on which new data can be
>> ignored, and later run btrfs device scan to bring in the missing device
>> and complete the RAID group which then shall reset the flag
>> BTRFS_SUPER_FLAG_VOL_MOVED_ON.
> Couple of thoughts on this:
> 
> 1. This needs to go in _after_ the patches to allow the kernel to forget 
> devices.

  Right will mention that explicitly or it can use kernel module reload
  if its a choice.

>  Otherwise, the only way to handle things is to physically 
> disconnect the device you don't want, mount the other one, and then 
> reconnect the disconnected device, because pretty much everything uses 
> udev, which by default scans for BTRFS on every device that gets 
> connected, and handling it like that is not an option for some people 
> (potentially for quite a few people).

  Yeah. That's a problem. If its a choice then they can reload
  the module.

> 2. How exactly is this supposed to work for people who don't have new 
> enough btrfs-progs to tell the kernel to forget the device?  Ideally, 
> there should be some other way to tell the kernel which one to use 
> (perhaps a one-shot mount option?) without needing any userspace support 
> (and if that happens, then I retract my first comment).  I'm not saying 
> that the currently proposed method is a bad solution, just that it's not 
> a complete solution from a usability perspective.

  Oh. I got it. But the current solution using forget cli OR kernel
  module reload is a kind of general solution, also split brain problem
  arises from wrong usage.

  Your concern of not having forget cli is only a short term concern,
  developing a new mount-option for a short term concern is not
  sure if its right. I am ok either way, if more people prefer that
  way I don't mind writing one.

> 3. How does this get handled with volumes using the raid1 profile and 
> more than 2 devices?
> In that case, things get complicated fast. 
> Handling an array of N devices in raid1 mode where N-1 devices have this 
> flag set is in theory easy, except you have no verification that the N-1 
> devices were mounted separately or as a group (in the first case, the 
> user needs to pick one, in the second, it should automatically recover 
> since they all agree on the state of the filesystem).

  Even in N device raid1, the number of device that can be missing
  is still one. There will be at least one or more devices which
  are common to both the sets. So split brain can't happen.

> 4. Having some way to do this solely from btrfs-progs would be nice too 
> (so that you can just drop to a shell during boot, or drop to the 
> initramfs, and fix things without having to mess with device scanning), 
> but is secondary to the kernel-side IMHO.

  Pls. Note this patch only detects and avoids the split brain devices
  to loose data, by failing the mount. However the fix is out side of
  this patch. Next, cli 'btrfs dev forget' is anyway isn't purposely
  made for this purpose. Its a generic un-scan code which helps here
  otherwise as well.

  If we need a purposely built cli (btrfstune ?) to fix the SB, it
  can be done. But I think it will be complicated.

> Other than all of that, I do think this is a good idea.  Being able to 
> more sanely inform the user about the situation and help them figure out 
> how to fix it correctly is something that's desperately needed here.

Thanks, Anand

>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>> On top of kdave misc-next.
>>
>>   fs/btrfs/disk-io.c              | 56 
>> ++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.c              | 14 +++++++++--
>>   include/uapi/linux/btrfs_tree.h |  1 +
>>   3 files changed, 68 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index a3e9b74f6006..55bc6c846671 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -61,7 +61,8 @@
>>                    BTRFS_HEADER_FLAG_RELOC |\
>>                    BTRFS_SUPER_FLAG_ERROR |\
>>                    BTRFS_SUPER_FLAG_SEEDING |\
>> -                 BTRFS_SUPER_FLAG_METADUMP)
>> +                 BTRFS_SUPER_FLAG_METADUMP|\
>> +                 BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>>   static const struct extent_io_ops btree_extent_io_ops;
>>   static void end_workqueue_fn(struct btrfs_work *work);
>> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct 
>> btrfs_fs_info *fs_info)
>>       return 0;
>>   }
>> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
>> +{
>> +    unsigned long devs_moved_on = 0;
>> +    struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
>> +    struct list_head *head = &fs_devs->devices;
>> +    struct btrfs_device *dev;
>> +
>> +again:
>> +    list_for_each_entry(dev, head, dev_list) {
>> +        struct buffer_head *bh;
>> +        struct btrfs_super_block *sb;
>> +
>> +        if (!dev->devid)
>> +            continue;
>> +
>> +        bh = btrfs_read_dev_super(dev->bdev);
>> +        if (IS_ERR(bh))
>> +            continue;
>> +
>> +        sb = (struct btrfs_super_block *)bh->b_data;
>> +        if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>> +            devs_moved_on++;
>> +        brelse(bh);
>> +    }
>> +
>> +    fs_devs = fs_devs->seed;
>> +    if (fs_devs) {
>> +        head = &fs_devs->devices;
>> +        goto again;
>> +    }
>> +
>> +    if (devs_moved_on == fs_info->fs_devices->total_devices)
>> +        return true;
>> +    else
>> +        return false;
>> +}
>> +
>>   int open_ctree(struct super_block *sb,
>>              struct btrfs_fs_devices *fs_devices,
>>              char *options)
>> @@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
>>           goto fail_sysfs;
>>       }
>> +    if (fs_info->fs_devices->missing_devices) {
>> +        btrfs_set_super_flags(fs_info->super_copy,
>> +            fs_info->super_copy->flags |
>> +            BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +    } else if (fs_info->super_copy->flags &
>> +            BTRFS_SUPER_FLAG_VOL_MOVED_ON) {
>> +        if (volume_has_split_brain(fs_info)) {
>> +            btrfs_err(fs_info,
>> +                "Detected 'moved_on' flag on all device");
>> +            goto fail_sysfs;
>> +        }
>> +        btrfs_set_super_flags(fs_info->super_copy,
>> +            fs_info->super_copy->flags &
>> +            ~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +    }
>> +
>>       fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>>                              "btrfs-cleaner");
>>       if (IS_ERR(fs_info->cleaner_kthread))
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 33e814ef992f..c08b9b89e285 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2057,8 +2057,13 @@ int btrfs_rm_device(struct btrfs_fs_info 
>> *fs_info, const char *device_path,
>>       device->fs_devices->num_devices--;
>>       device->fs_devices->total_devices--;
>> -    if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>> +    if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
>>           device->fs_devices->missing_devices--;
>> +        if (!device->fs_devices->missing_devices)
>> +            btrfs_set_super_flags(fs_info->super_copy,
>> +                fs_info->super_copy->flags &
>> +                ~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +    }
>>       btrfs_assign_next_active_device(fs_info, device, NULL);
>> @@ -2132,8 +2137,13 @@ void btrfs_rm_dev_replace_remove_srcdev(struct 
>> btrfs_fs_info *fs_info,
>>       list_del_rcu(&srcdev->dev_list);
>>       list_del(&srcdev->dev_alloc_list);
>>       fs_devices->num_devices--;
>> -    if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
>> +    if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state)) {
>>           fs_devices->missing_devices--;
>> +        if (!fs_devices->missing_devices)
>> +            btrfs_set_super_flags(fs_info->super_copy,
>> +                fs_info->super_copy->flags &
>> +                ~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +    }
>>       if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
>>           fs_devices->rw_devices--;
>> diff --git a/include/uapi/linux/btrfs_tree.h 
>> b/include/uapi/linux/btrfs_tree.h
>> index 6d6e5da51527..a9f625be0589 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -456,6 +456,7 @@ struct btrfs_free_space_header {
>>   #define BTRFS_SUPER_FLAG_SEEDING    (1ULL << 32)
>>   #define BTRFS_SUPER_FLAG_METADUMP    (1ULL << 33)
>> +#define BTRFS_SUPER_FLAG_VOL_MOVED_ON    (1ULL << 36)
>>   /*
>>
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Anand Jain Dec. 18, 2017, 3:03 p.m. UTC | #4
On 12/18/2017 08:52 PM, Nikolay Borisov wrote:
> 
> 
> On 17.12.2017 15:52, Anand Jain wrote:
>> In two device configs of RAID1/RAID5 where one device can be missing
>> in the degraded mount, or in the configs such as four devices RAID6
>> where two devices can be missing, in these type of configs it can form
>> two separate set of devices where each of the set can be mounted without
>> the other set. And if user does that and writes the data, one set would
>> have to loose the data. As of now we just loose the data without the
>> user consent depending on the SB generation number which user won't have
>> any meaningful interpretations.
> 
> This is rather hard to parse something like:
> 
> In raid configs RAID1/RAID5/RAID6 it's possible to have some devices
> missing which would render btrfs to be mounted in degraded state but
> still be operational. In those cases it's possible (albeit highly
> undesirable) that the degraded and missing parts of the filesystem are
> mounted independently. When writes occur such split-brain scenarios
> (caused by intentional user action) then one of the sides of the RAID
> config will have to be blown away when bringing it back to the
> consistent state.

  It make sense to put it like this. Thanks.

>>
>> This patch introduces a new SB flag BTRFS_SUPER_FLAG_VOL_MOVED_ON which
> MOVED_ON sounds way too colloquial. something like VOL_FORCED_MOUNT,
> VOL_PARTIAL_MOUNT.
> 
> Another thing which comes to mind is why not piggyback on the degraded
> mount capabilities of the filesystem - is it just a mount option or do
> we have a flag which indicates this volume was mounted in degraded mode?
> If there is not flag perhaps name the flag VOL_DEGRADED_MOUNT?

  I struggled to name it. Thanks for the suggestion.
  How about just that BTRFS_SUPER_FLAG_DEGRADED.
  It matches BTRFS_MOUNT_DEGRADED.

>> will be set when RAID group is mounted with missing device, so when RAID
>> is mounted with no missing device and if all the devices contains this
>> flag then we know that each of these set are mounted without the other
>> set. In this scenario this patch will fail the mount so that user can
>> decide which set would they prefer to keep the data.
> 
> This is also hard to parse

  Sure will update.

>>
>> Now the procedure to assemble the disks would be to continue to mount
>> the good set first without the device set on which new data can be
>> ignored, and later run btrfs device scan to bring in the missing device
>> and complete the RAID group which then shall reset the flag
>> BTRFS_SUPER_FLAG_VOL_MOVED_ON.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>> On top of kdave misc-next.
>>
>>   fs/btrfs/disk-io.c              | 56 ++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.c              | 14 +++++++++--
>>   include/uapi/linux/btrfs_tree.h |  1 +
>>   3 files changed, 68 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index a3e9b74f6006..55bc6c846671 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -61,7 +61,8 @@
>>   				 BTRFS_HEADER_FLAG_RELOC |\
>>   				 BTRFS_SUPER_FLAG_ERROR |\
>>   				 BTRFS_SUPER_FLAG_SEEDING |\
>> -				 BTRFS_SUPER_FLAG_METADUMP)
>> +				 BTRFS_SUPER_FLAG_METADUMP|\
>> +				 BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>>   
>>   static const struct extent_io_ops btree_extent_io_ops;
>>   static void end_workqueue_fn(struct btrfs_work *work);
>> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
>>   	return 0;
>>   }
>>   
>> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
>> +{
>> +	unsigned long devs_moved_on = 0;
>> +	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
> 
> almost every other instance of the which iterates fs_devices uses the
> full name of the temp variable fs_devices, let's do that here as well
> for the sake of grep

  Ok.

>> +	struct list_head *head = &fs_devs->devices;
>> +	struct btrfs_device *dev;
>> +
>> +again:
>> +	list_for_each_entry(dev, head, dev_list) {
>> +		struct buffer_head *bh;
>> +		struct btrfs_super_block *sb;
>> +
>> +		if (!dev->devid)
>> +			continue;
>> +
>> +		bh = btrfs_read_dev_super(dev->bdev);
>> +		if (IS_ERR(bh))
>> +			continue;
>> +
>> +		sb = (struct btrfs_super_block *)bh->b_data;
>> +		if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>> +			devs_moved_on++;
>> +		brelse(bh);
>> +	}
>> +
>> +	fs_devs = fs_devs->seed;
>> +	if (fs_devs) {
>> +		head = &fs_devs->devices;
>> +		goto again;
>> +	}
>> +
>> +	if (devs_moved_on == fs_info->fs_devices->total_devices)
>> +		return true;
>> +	else
>> +		return false;
>> +}
>> +
>>   int open_ctree(struct super_block *sb,
>>   	       struct btrfs_fs_devices *fs_devices,
>>   	       char *options)
>> @@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
>>   		goto fail_sysfs;
>>   	}
>>   
>> +	if (fs_info->fs_devices->missing_devices) {
>> +		btrfs_set_super_flags(fs_info->super_copy,
>> +			fs_info->super_copy->flags |
>> +			BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +	} else if (fs_info->super_copy->flags &
>> +			BTRFS_SUPER_FLAG_VOL_MOVED_ON) {
>> +		if (volume_has_split_brain(fs_info)) {
>> +			btrfs_err(fs_info,
>> +				"Detected 'moved_on' flag on all device");
>> +			goto fail_sysfs;
>> +		}
>> +		btrfs_set_super_flags(fs_info->super_copy,
>> +			fs_info->super_copy->flags &
>> +			~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +	}
> 
> Is there any specific reason why you place this check right before the
> cleaner_kthread? If not I think it makes sense to be as close as
> possible to btrfs_read_chunk_tree, since the information it uses is
> populated by that function.

  It can be there  will move.

Thanks, Anand

>> +
>>   	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>>   					       "btrfs-cleaner");
>>   	if (IS_ERR(fs_info->cleaner_kthread))
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 33e814ef992f..c08b9b89e285 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2057,8 +2057,13 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
>>   	device->fs_devices->num_devices--;
>>   	device->fs_devices->total_devices--;
>>   
>> -	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>> +	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
>>   		device->fs_devices->missing_devices--;
>> +		if (!device->fs_devices->missing_devices)
>> +			btrfs_set_super_flags(fs_info->super_copy,
>> +				fs_info->super_copy->flags &
>> +				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +	}
>>   
>>   	btrfs_assign_next_active_device(fs_info, device, NULL);
>>   
>> @@ -2132,8 +2137,13 @@ void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_fs_info *fs_info,
>>   	list_del_rcu(&srcdev->dev_list);
>>   	list_del(&srcdev->dev_alloc_list);
>>   	fs_devices->num_devices--;
>> -	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
>> +	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state)) {
>>   		fs_devices->missing_devices--;
>> +		if (!fs_devices->missing_devices)
>> +			btrfs_set_super_flags(fs_info->super_copy,
>> +				fs_info->super_copy->flags &
>> +				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>> +	}
>>   
>>   	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
>>   		fs_devices->rw_devices--;
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index 6d6e5da51527..a9f625be0589 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -456,6 +456,7 @@ struct btrfs_free_space_header {
>>   
>>   #define BTRFS_SUPER_FLAG_SEEDING	(1ULL << 32)
>>   #define BTRFS_SUPER_FLAG_METADUMP	(1ULL << 33)
>> +#define BTRFS_SUPER_FLAG_VOL_MOVED_ON	(1ULL << 36)
>>   
>>   
>>   /*
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Austin S. Hemmelgarn Dec. 18, 2017, 4:29 p.m. UTC | #5
On 2017-12-18 09:39, Anand Jain wrote:
> 
> 
> 
>>> Now the procedure to assemble the disks would be to continue to mount
>>> the good set first without the device set on which new data can be
>>> ignored, and later run btrfs device scan to bring in the missing device
>>> and complete the RAID group which then shall reset the flag
>>> BTRFS_SUPER_FLAG_VOL_MOVED_ON.
>> Couple of thoughts on this:
>>
>> 1. This needs to go in _after_ the patches to allow the kernel to 
>> forget devices.
> 
>   Right will mention that explicitly or it can use kernel module reload
>   if its a choice.
> 
>>   Otherwise, the only way to handle things is to physically disconnect 
>> the device you don't want, mount the other one, and then reconnect the 
>> disconnected device, because pretty much everything uses udev, which 
>> by default scans for BTRFS on every device that gets connected, and 
>> handling it like that is not an option for some people (potentially 
>> for quite a few people).
> 
>   Yeah. That's a problem. If its a choice then they can reload
>   the module.
Which may not always be a choice (though I sincerely hope people aren't 
messing around with stuff like this with system-critical filesystems 
like / and /usr...).
> 
>> 2. How exactly is this supposed to work for people who don't have new 
>> enough btrfs-progs to tell the kernel to forget the device?  Ideally, 
>> there should be some other way to tell the kernel which one to use 
>> (perhaps a one-shot mount option?) without needing any userspace 
>> support (and if that happens, then I retract my first comment).  I'm 
>> not saying that the currently proposed method is a bad solution, just 
>> that it's not a complete solution from a usability perspective.
> 
>   Oh. I got it. But the current solution using forget cli OR kernel
>   module reload is a kind of general solution, also split brain problem
>   arises from wrong usage.
> 
>   Your concern of not having forget cli is only a short term concern,
>   developing a new mount-option for a short term concern is not
>   sure if its right. I am ok either way, if more people prefer that
>   way I don't mind writing one.
> 
>> 3. How does this get handled with volumes using the raid1 profile and 
>> more than 2 devices?
>> In that case, things get complicated fast. Handling an array of N 
>> devices in raid1 mode where N-1 devices have this flag set is in 
>> theory easy, except you have no verification that the N-1 devices were 
>> mounted separately or as a group (in the first case, the user needs to 
>> pick one, in the second, it should automatically recover since they 
>> all agree on the state of the filesystem).
> 
>   Even in N device raid1, the number of device that can be missing
>   is still one. There will be at least one or more devices which
>   are common to both the sets. So split brain can't happen.
No, there isn't a guarantee that that's the case.  Suppose you have 3 
devices in a raid1 and one goes missing.  The user can now, in trying to 
fix this, mount each of the two remaining devices separately (by 
accident of course), in which case you can still have a split brain 
situation.  The same logic applies for all values of N greater than 3 as 
well, with a greater number of possible situations for each higher N 
(for N=4, you have at least 4 ways the user can create a split-brain, 
for N=5 you have at least 11, etc).

I have no issue with there being on handling for this (it should be a 
reasonably rare case among people who don't have existing knowledge of 
how to fix it as things are already), but if that's the case it should 
ideally be explicitly stated that you don't handle this.
> 
>> 4. Having some way to do this solely from btrfs-progs would be nice 
>> too (so that you can just drop to a shell during boot, or drop to the 
>> initramfs, and fix things without having to mess with device 
>> scanning), but is secondary to the kernel-side IMHO.
> 
>   Pls. Note this patch only detects and avoids the split brain devices
>   to loose data, by failing the mount. However the fix is out side of
>   this patch. Next, cli 'btrfs dev forget' is anyway isn't purposely
>   made for this purpose. Its a generic un-scan code which helps here
>   otherwise as well.
> 
>   If we need a purposely built cli (btrfstune ?) to fix the SB, it
>   can be done. But I think it will be complicated.
It's arguably easier to make a (possibly single) call to `btrfstune` to 
unset the flag on the device you want to drop and then mount again than 
it is to make the kernel forget it, mount degraded, and then re-scan 
devices (and sounds a whole lot safer and less error-prone too).

Even if that does become the case, having kernel code to mark the 
devices mounted degraded and refuse the mount in some circumstances is 
still needed though.
> 
>> Other than all of that, I do think this is a good idea.  Being able to 
>> more sanely inform the user about the situation and help them figure 
>> out how to fix it correctly is something that's desperately needed here.
> 
> Thanks, Anand
> 
>>>
>>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>>> ---
>>> On top of kdave misc-next.
>>>
>>>   fs/btrfs/disk-io.c              | 56 
>>> ++++++++++++++++++++++++++++++++++++++++-
>>>   fs/btrfs/volumes.c              | 14 +++++++++--
>>>   include/uapi/linux/btrfs_tree.h |  1 +
>>>   3 files changed, 68 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index a3e9b74f6006..55bc6c846671 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -61,7 +61,8 @@
>>>                    BTRFS_HEADER_FLAG_RELOC |\
>>>                    BTRFS_SUPER_FLAG_ERROR |\
>>>                    BTRFS_SUPER_FLAG_SEEDING |\
>>> -                 BTRFS_SUPER_FLAG_METADUMP)
>>> +                 BTRFS_SUPER_FLAG_METADUMP|\
>>> +                 BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>>>   static const struct extent_io_ops btree_extent_io_ops;
>>>   static void end_workqueue_fn(struct btrfs_work *work);
>>> @@ -2383,6 +2384,43 @@ static int btrfs_read_roots(struct 
>>> btrfs_fs_info *fs_info)
>>>       return 0;
>>>   }
>>> +bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
>>> +{
>>> +    unsigned long devs_moved_on = 0;
>>> +    struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
>>> +    struct list_head *head = &fs_devs->devices;
>>> +    struct btrfs_device *dev;
>>> +
>>> +again:
>>> +    list_for_each_entry(dev, head, dev_list) {
>>> +        struct buffer_head *bh;
>>> +        struct btrfs_super_block *sb;
>>> +
>>> +        if (!dev->devid)
>>> +            continue;
>>> +
>>> +        bh = btrfs_read_dev_super(dev->bdev);
>>> +        if (IS_ERR(bh))
>>> +            continue;
>>> +
>>> +        sb = (struct btrfs_super_block *)bh->b_data;
>>> +        if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
>>> +            devs_moved_on++;
>>> +        brelse(bh);
>>> +    }
>>> +
>>> +    fs_devs = fs_devs->seed;
>>> +    if (fs_devs) {
>>> +        head = &fs_devs->devices;
>>> +        goto again;
>>> +    }
>>> +
>>> +    if (devs_moved_on == fs_info->fs_devices->total_devices)
>>> +        return true;
>>> +    else
>>> +        return false;
>>> +}
>>> +
>>>   int open_ctree(struct super_block *sb,
>>>              struct btrfs_fs_devices *fs_devices,
>>>              char *options)
>>> @@ -2872,6 +2910,22 @@ int open_ctree(struct super_block *sb,
>>>           goto fail_sysfs;
>>>       }
>>> +    if (fs_info->fs_devices->missing_devices) {
>>> +        btrfs_set_super_flags(fs_info->super_copy,
>>> +            fs_info->super_copy->flags |
>>> +            BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>>> +    } else if (fs_info->super_copy->flags &
>>> +            BTRFS_SUPER_FLAG_VOL_MOVED_ON) {
>>> +        if (volume_has_split_brain(fs_info)) {
>>> +            btrfs_err(fs_info,
>>> +                "Detected 'moved_on' flag on all device");
>>> +            goto fail_sysfs;
>>> +        }
>>> +        btrfs_set_super_flags(fs_info->super_copy,
>>> +            fs_info->super_copy->flags &
>>> +            ~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>>> +    }
>>> +
>>>       fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
>>>                              "btrfs-cleaner");
>>>       if (IS_ERR(fs_info->cleaner_kthread))
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 33e814ef992f..c08b9b89e285 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2057,8 +2057,13 @@ int btrfs_rm_device(struct btrfs_fs_info 
>>> *fs_info, const char *device_path,
>>>       device->fs_devices->num_devices--;
>>>       device->fs_devices->total_devices--;
>>> -    if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>>> +    if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
>>>           device->fs_devices->missing_devices--;
>>> +        if (!device->fs_devices->missing_devices)
>>> +            btrfs_set_super_flags(fs_info->super_copy,
>>> +                fs_info->super_copy->flags &
>>> +                ~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>>> +    }
>>>       btrfs_assign_next_active_device(fs_info, device, NULL);
>>> @@ -2132,8 +2137,13 @@ void btrfs_rm_dev_replace_remove_srcdev(struct 
>>> btrfs_fs_info *fs_info,
>>>       list_del_rcu(&srcdev->dev_list);
>>>       list_del(&srcdev->dev_alloc_list);
>>>       fs_devices->num_devices--;
>>> -    if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
>>> +    if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state)) {
>>>           fs_devices->missing_devices--;
>>> +        if (!fs_devices->missing_devices)
>>> +            btrfs_set_super_flags(fs_info->super_copy,
>>> +                fs_info->super_copy->flags &
>>> +                ~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
>>> +    }
>>>       if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
>>>           fs_devices->rw_devices--;
>>> diff --git a/include/uapi/linux/btrfs_tree.h 
>>> b/include/uapi/linux/btrfs_tree.h
>>> index 6d6e5da51527..a9f625be0589 100644
>>> --- a/include/uapi/linux/btrfs_tree.h
>>> +++ b/include/uapi/linux/btrfs_tree.h
>>> @@ -456,6 +456,7 @@ struct btrfs_free_space_header {
>>>   #define BTRFS_SUPER_FLAG_SEEDING    (1ULL << 32)
>>>   #define BTRFS_SUPER_FLAG_METADUMP    (1ULL << 33)
>>> +#define BTRFS_SUPER_FLAG_VOL_MOVED_ON    (1ULL << 36)
>>>   /*
>>>
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Anand Jain Dec. 20, 2017, 4:46 a.m. UTC | #6
> Suppose you have 3 
> devices in a raid1 and one goes missing.  The user can now, in trying to 
> fix this, mount each of the two remaining devices separately (by 
> accident of course), in which case you can still have a split brain 
> situation.

Austin, that's not possible a N disk raid1 can still miss only one 
device. Sorry if I had confused you by the hard to parse commit log.
Which I have fixed it in v2.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a3e9b74f6006..55bc6c846671 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -61,7 +61,8 @@ 
 				 BTRFS_HEADER_FLAG_RELOC |\
 				 BTRFS_SUPER_FLAG_ERROR |\
 				 BTRFS_SUPER_FLAG_SEEDING |\
-				 BTRFS_SUPER_FLAG_METADUMP)
+				 BTRFS_SUPER_FLAG_METADUMP|\
+				 BTRFS_SUPER_FLAG_VOL_MOVED_ON)
 
 static const struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
@@ -2383,6 +2384,43 @@  static int btrfs_read_roots(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
+bool volume_has_split_brain(struct btrfs_fs_info *fs_info)
+{
+	unsigned long devs_moved_on = 0;
+	struct btrfs_fs_devices *fs_devs = fs_info->fs_devices;
+	struct list_head *head = &fs_devs->devices;
+	struct btrfs_device *dev;
+
+again:
+	list_for_each_entry(dev, head, dev_list) {
+		struct buffer_head *bh;
+		struct btrfs_super_block *sb;
+
+		if (!dev->devid)
+			continue;
+
+		bh = btrfs_read_dev_super(dev->bdev);
+		if (IS_ERR(bh))
+			continue;
+
+		sb = (struct btrfs_super_block *)bh->b_data;
+		if (btrfs_super_flags(sb) & BTRFS_SUPER_FLAG_VOL_MOVED_ON)
+			devs_moved_on++;
+		brelse(bh);
+	}
+
+	fs_devs = fs_devs->seed;
+	if (fs_devs) {
+		head = &fs_devs->devices;
+		goto again;
+	}
+
+	if (devs_moved_on == fs_info->fs_devices->total_devices)
+		return true;
+	else
+		return false;
+}
+
 int open_ctree(struct super_block *sb,
 	       struct btrfs_fs_devices *fs_devices,
 	       char *options)
@@ -2872,6 +2910,22 @@  int open_ctree(struct super_block *sb,
 		goto fail_sysfs;
 	}
 
+	if (fs_info->fs_devices->missing_devices) {
+		btrfs_set_super_flags(fs_info->super_copy,
+			fs_info->super_copy->flags |
+			BTRFS_SUPER_FLAG_VOL_MOVED_ON);
+	} else if (fs_info->super_copy->flags &
+			BTRFS_SUPER_FLAG_VOL_MOVED_ON) {
+		if (volume_has_split_brain(fs_info)) {
+			btrfs_err(fs_info,
+				"Detected 'moved_on' flag on all device");
+			goto fail_sysfs;
+		}
+		btrfs_set_super_flags(fs_info->super_copy,
+			fs_info->super_copy->flags &
+			~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
+	}
+
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
 	if (IS_ERR(fs_info->cleaner_kthread))
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 33e814ef992f..c08b9b89e285 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2057,8 +2057,13 @@  int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 	device->fs_devices->num_devices--;
 	device->fs_devices->total_devices--;
 
-	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
+	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) {
 		device->fs_devices->missing_devices--;
+		if (!device->fs_devices->missing_devices)
+			btrfs_set_super_flags(fs_info->super_copy,
+				fs_info->super_copy->flags &
+				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
+	}
 
 	btrfs_assign_next_active_device(fs_info, device, NULL);
 
@@ -2132,8 +2137,13 @@  void btrfs_rm_dev_replace_remove_srcdev(struct btrfs_fs_info *fs_info,
 	list_del_rcu(&srcdev->dev_list);
 	list_del(&srcdev->dev_alloc_list);
 	fs_devices->num_devices--;
-	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state))
+	if (test_bit(BTRFS_DEV_STATE_MISSING, &srcdev->dev_state)) {
 		fs_devices->missing_devices--;
+		if (!fs_devices->missing_devices)
+			btrfs_set_super_flags(fs_info->super_copy,
+				fs_info->super_copy->flags &
+				~BTRFS_SUPER_FLAG_VOL_MOVED_ON);
+	}
 
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &srcdev->dev_state))
 		fs_devices->rw_devices--;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 6d6e5da51527..a9f625be0589 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -456,6 +456,7 @@  struct btrfs_free_space_header {
 
 #define BTRFS_SUPER_FLAG_SEEDING	(1ULL << 32)
 #define BTRFS_SUPER_FLAG_METADUMP	(1ULL << 33)
+#define BTRFS_SUPER_FLAG_VOL_MOVED_ON	(1ULL << 36)
 
 
 /*