[v2] btrfs: handle dynamically reappearing missing device
diff mbox

Message ID 20171117123614.11822-1-anand.jain@oracle.com
State New
Headers show

Commit Message

Anand Jain Nov. 17, 2017, 12:36 p.m. UTC
If the device is not present at the time of (-o degrade) mount,
the mount context will create a dummy missing struct btrfs_device.
Later this device may reappear after the FS is mounted and
then device is included in the device list but it missed the
open_device part. So this patch handles that case by going
through the open_device steps which this device missed and finally
adds to the device alloc list.

So now with this patch, to bring back the missing device user can run,

   btrfs dev scan <path-of-missing-device>

Without this kernel patch, even though 'btrfs fi show' and 'btrfs
dev ready' would tell you that missing device has reappeared
successfully but actually in kernel FS layer it didn't.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
This patch needs:
 [PATCH 0/4]  factor __btrfs_open_devices()

v2:
Add more comments.
Add more change log.
Add to check if device missing is set, to handle the case
dev open fail and user will rerun the dev scan.

 fs/btrfs/volumes.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 3 deletions(-)

Comments

Goffredo Baroncelli Nov. 18, 2017, 1:52 p.m. UTC | #1
On 11/17/2017 01:36 PM, Anand Jain wrote:
> If the device is not present at the time of (-o degrade) mount,
> the mount context will create a dummy missing struct btrfs_device.
> Later this device may reappear after the FS is mounted and
> then device is included in the device list but it missed the
> open_device part. So this patch handles that case by going
> through the open_device steps which this device missed and finally
> adds to the device alloc list.

What happens if the first devices got writes before the "last device" is joined ?

Supposing to have a raid1 pair of devices: sda, sdb

- sda is dissappeared (usb detached ?)
- the filesystem is mounted as
	mount -o degraded /dev/sdb
- some writes happens on /dev/sdb
- the user reattach /dev/sda
- udev run "btrfs dev scan"
- the system joins /dev/sda to the filesystem disks pool

Because the filesystem is a raid1, btrfs may read from /dev/sdb (updated data) or /dev/sda (old data), or worse read something from the former and something from the later (metadata from sda and data from sdb ) ???

Am I missing something ?

BR
G.Baroncelli
Anand Jain Nov. 20, 2017, 8:19 a.m. UTC | #2
On 11/18/2017 09:52 PM, Goffredo Baroncelli wrote:
> On 11/17/2017 01:36 PM, Anand Jain wrote:
>> If the device is not present at the time of (-o degrade) mount,
>> the mount context will create a dummy missing struct btrfs_device.
>> Later this device may reappear after the FS is mounted and
>> then device is included in the device list but it missed the
>> open_device part. So this patch handles that case by going
>> through the open_device steps which this device missed and finally
>> adds to the device alloc list.
> 
> What happens if the first devices got writes before the "last device" is joined ?
> 
> Supposing to have a raid1 pair of devices: sda, sdb
> 
> - sda is dissappeared (usb detached ?)
> - the filesystem is mounted as
> 	mount -o degraded /dev/sdb
> - some writes happens on /dev/sdb
> - the user reattach /dev/sda
> - udev run "btrfs dev scan"
> - the system joins /dev/sda to the filesystem disks pool
> 
> Because the filesystem is a raid1, btrfs may read from /dev/sdb (updated data) or /dev/sda (old data), or worse read something from the former and something from the later (metadata from sda and data from sdb ) ???

  Thanks for the test scenario, this case is fine as its read from
  the disk having highest generation number, so we pick sdb in the
  case above.

Thanks, Anand


> Am I missing something ?
> 
> BR
> G.Baroncelli
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli Nov. 20, 2017, 7:28 p.m. UTC | #3
On 11/20/2017 09:19 AM, Anand Jain wrote:
> 
> 
> On 11/18/2017 09:52 PM, Goffredo Baroncelli wrote:
>> On 11/17/2017 01:36 PM, Anand Jain wrote:
>>> If the device is not present at the time of (-o degrade) mount,
>>> the mount context will create a dummy missing struct btrfs_device.
>>> Later this device may reappear after the FS is mounted and
>>> then device is included in the device list but it missed the
>>> open_device part. So this patch handles that case by going
>>> through the open_device steps which this device missed and finally
>>> adds to the device alloc list.
>>
>> What happens if the first devices got writes before the "last device" is joined ?
>>
>> Supposing to have a raid1 pair of devices: sda, sdb
>>
>> - sda is dissappeared (usb detached ?)
>> - the filesystem is mounted as
>>     mount -o degraded /dev/sdb
>> - some writes happens on /dev/sdb
>> - the user reattach /dev/sda
>> - udev run "btrfs dev scan"
>> - the system joins /dev/sda to the filesystem disks pool
>>
>> Because the filesystem is a raid1, btrfs may read from /dev/sdb (updated data) or /dev/sda (old data), or worse read something from the former and something from the later (metadata from sda and data from sdb ) ???
> 
>  Thanks for the test scenario, this case is fine as its read from
>  the disk having highest generation number, so we pick sdb in the
>  case above.

Ok, so in this case which is the benefit to add a disk ? With a lover number generation, the added will be used at all ?
In this case, would be better an explicit user intervention instead of an automatic one ?


> 
> Thanks, Anand
> 
> 
>> Am I missing something ?
>>
>> BR
>> G.Baroncelli
>>
>
Anand Jain Nov. 20, 2017, 9:14 p.m. UTC | #4
On 11/21/2017 03:28 AM, Goffredo Baroncelli wrote:
> On 11/20/2017 09:19 AM, Anand Jain wrote:
>>
>>
>> On 11/18/2017 09:52 PM, Goffredo Baroncelli wrote:
>>> On 11/17/2017 01:36 PM, Anand Jain wrote:
>>>> If the device is not present at the time of (-o degrade) mount,
>>>> the mount context will create a dummy missing struct btrfs_device.
>>>> Later this device may reappear after the FS is mounted and
>>>> then device is included in the device list but it missed the
>>>> open_device part. So this patch handles that case by going
>>>> through the open_device steps which this device missed and finally
>>>> adds to the device alloc list.
>>>
>>> What happens if the first devices got writes before the "last device" is joined ?
>>>
>>> Supposing to have a raid1 pair of devices: sda, sdb
>>>
>>> - sda is dissappeared (usb detached ?)
>>> - the filesystem is mounted as
>>>      mount -o degraded /dev/sdb
>>> - some writes happens on /dev/sdb
>>> - the user reattach /dev/sda
>>> - udev run "btrfs dev scan"
>>> - the system joins /dev/sda to the filesystem disks pool
>>>
>>> Because the filesystem is a raid1, btrfs may read from /dev/sdb (updated data) or /dev/sda (old data), or worse read something from the former and something from the later (metadata from sda and data from sdb ) ???
>>
>>   Thanks for the test scenario, this case is fine as its read from
>>   the disk having highest generation number, so we pick sdb in the
>>   case above.
> 
> Ok, so in this case which is the benefit to add a disk ?
> With a lover number generation, the added will be used at all ?

  It will be used for new writes, it will stripe across both the disks.
  Balance is only for the older single chunks, which in the long term
  would go away when we are able to create a degraded raid1 chunks.

> In this case, would be better an explicit user intervention instead of an automatic one ?

   Without this patch the disk already joins the raid group.

   But it had a bug that it missed joining the alloc group.

Thanks, Anand
> 
>>
>> Thanks, Anand
>>
>>
>>> Am I missing something ?
>>>
>>> BR
>>> G.Baroncelli
>>>
>>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Anand Jain Dec. 4, 2017, 7:09 a.m. UTC | #5
> [..]  would be better an explicit user intervention instead of an automatic one ?

  What is the user intervention method steps that you have in mind ?
  Just curious. Pls remember downtime is not a choice of recovery
  from this context which means FS should be available for the
  applications to perform read/write.

  -o remount was one this which I had skipped for sometime as
  theoretically it can't solve the problem (to bring back missing
  device) as well, but now I have verified it, it won't.

  There is no record of what is the original idea to perform this
  basic volume manager step.

Thanks, Anand

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5bd73edc2602..b3224baa6db4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -725,7 +725,9 @@  static noinline int device_list_add(const char *path,
 
 		ret = 1;
 		device->fs_devices = fs_devices;
-	} else if (!device->name || strcmp(device->name->str, path)) {
+	} else if (!device->name ||
+			strcmp(device->name->str, path) ||
+			device->missing) {
 		/*
 		 * When FS is already mounted.
 		 * 1. If you are here and if the device->name is NULL that
@@ -769,8 +771,63 @@  static noinline int device_list_add(const char *path,
 		rcu_string_free(device->name);
 		rcu_assign_pointer(device->name, name);
 		if (device->missing) {
-			fs_devices->missing_devices--;
-			device->missing = 0;
+			int ret;
+			struct btrfs_fs_info *fs_info = fs_devices->fs_info;
+			fmode_t fmode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+			if (btrfs_super_flags(disk_super) &
+					BTRFS_SUPER_FLAG_SEEDING)
+				fmode &= ~FMODE_WRITE;
+
+			/*
+			 * Missing can be set only when FS is mounted.
+			 * So here its always fs_devices->opened > 0 and most
+			 * of the struct device members are already updated by
+			 * the mount process even if this device was missing, so
+			 * now follow the normal open device procedure for this
+			 * device. The scrub will take care of filling the
+			 * missing stripes for raid56 and balance for raid1 and
+			 * raid10.
+			 */
+			ASSERT(fs_devices->opened);
+			mutex_lock(&fs_devices->device_list_mutex);
+			mutex_lock(&fs_info->chunk_mutex);
+			/*
+			 * By not failing for the reason that
+			 * btrfs_open_one_device() could fail, it will
+			 * keep the original behaviors as it is for now.
+			 * That's fix me for later.
+			 */
+			ret = btrfs_open_one_device(fs_devices, device, fmode,
+							fs_info->bdev_holder);
+			if (!ret) {
+				/*
+				 * Making sure missing is set to 0
+				 * only when bdev != NULL is as expected
+				 * at most of the places in the code.
+				 * Further, what if user fixes the
+				 * dev and reruns the dev scan, then again
+				 * we need to come here to reset missing.
+				 */
+				fs_devices->missing_devices--;
+				device->missing = 0;
+				btrfs_clear_opt(fs_info->mount_opt, DEGRADED);
+				btrfs_warn(fs_info,
+					"BTRFS: device %s devid %llu uuid %pU joined\n",
+					path, devid, device->uuid);
+			}
+
+			if (device->writeable &&
+					!device->is_tgtdev_for_dev_replace) {
+				fs_devices->total_rw_bytes +=
+							device->total_bytes;
+				atomic64_add(device->total_bytes -
+						device->bytes_used,
+						&fs_info->free_chunk_space);
+			}
+			device->in_fs_metadata = 1;
+			mutex_unlock(&fs_info->chunk_mutex);
+			mutex_unlock(&fs_devices->device_list_mutex);
 		}
 	}