Missing device handling (was: 'unable to mount btrfs pool...')

Message ID 20160408195259.GA23661@jeknote.loshitsa1.net
State New

Commit Message

Yauhen Kharuzhy April 8, 2016, 7:53 p.m. UTC
On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
> On 2016-04-08 12:17, Chris Murphy wrote:
> 
> I would personally suggest adding a per-filesystem node in sysfs to handle
> both 2 and 5. Having it open tells BTRFS to not automatically attempt
> countermeasures when degraded, select/epoll on it will return when state
> changes, reads will return (at minimum): what devices comprise the FS, per
> disk state (is it working, failed, missing, a hot-spare, etc), and what
> effective redundancy we have (how many devices we can lose and still be
> mountable, so 1 for raid1, raid10, and raid5, 2 for raid6, and 0 for
> raid0/single/dup, possibly higher for n-way replication (n-1), n-order
> parity (n), or erasure coding). This would make it trivial to write a daemon
> to monitor the filesystem, react when something happens, and handle all the
> policy decisions.

Hm, good proposal. Personally I tried to use uevents for this but they
cause locking troubles, and I didn't continue this attempt.

In any case, we need an interface for btrfs-progs to pass FS state
information (for example, the presence and IDs of missing devices, and the
degraded/good state of RAID).

As a first attempt, I implemented the following interface for testing. It
still doesn't seem right to me, but it is acceptable as a starting point.
In addition, I changed the missing device name reported by
btrfs_ioctl_dev_info() to 'missing' to avoid interference with block
devices inserted after the failed device is closed (adding a 'missing'
field to struct btrfs_ioctl_dev_info_args may be the better way). What is
your opinion?

Comments

Duncan April 9, 2016, 7:24 a.m. UTC | #1
Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:

> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>> On 2016-04-08 12:17, Chris Murphy wrote:
>> 
>> I would personally suggest adding a per-filesystem node in sysfs to
>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>> attempt countermeasures when degraded, select/epoll on it will return
>> when state changes, reads will return (at minimum): what devices
>> comprise the FS, per disk state (is it working, failed, missing, a
>> hot-spare, etc), and what effective redundancy we have (how many
>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>> would make it trivial to write a daemon to monitor the filesystem,
>> react when something happens, and handle all the policy decisions.
> 
> Hm, good proposal. Personally I tried to use uevents for this but they
> cause locking troubles, and I didn't continue this attempt.

Except that... in sysfs (unlike proc) there's a rather strictly enforced 
rule of one property per file.

So you could NOT hold a single sysfs file open, that upon read would 
return 1) what devices comprise the FS, 2) per device (um, disk in the 
original, except that it can be a non-disk device, so changed to device 
here) state, 3) effective number of can-be-lost devices.

The sysfs style interface would be a filesystem directory containing a 
devices subdir, with (read-only?) per-device state-files in that subdir.  
The listing of per-device state-files would thus provide #1, with the 
contents of each state-file being the status of that device, therefore 
providing #2.  Back in the main filesystem dir, there'd be a
devices-losable file, which would provide #3.

There could also be a filesystem-level state file which could be read for 
the current state of the filesystem as a whole or selected/epolled for 
state-changes, and probably yet another file, we'll call it leave-be here 
simply because I don't have a better name, that would be read/write 
allowing reading or setting the no-countermeasures property.


Actually, after looking at the existing /sys/fs/btrfs layout, we already 
have filesystem directories, each with a devices subdir, though the symlinks 
therein point to the /sys/devices tree device dirs.  The listing thereof 
already provides #1, at least for operational devices.

I'm not going to go testing what happens to the current sysfs devices 
listings when a device goes missing, but we already know btrfs doesn't 
dynamically use that information.  Presumably, once it does, the symlinks 
could be replaced with subdirs for missing devices, with the still known 
information in the subdir (which could then be named as either the btrfs 
device ID or as missing-N), and the status of the device being detectable 
by whether it's a symlink to a devices tree device (device online) or a 
subdir (device offline).

The per-filesystem devices-losable, fs-status, and leave-be files could 
be added to the existing sysfs btrfs interface.
Austin S. Hemmelgarn April 11, 2016, 11:32 a.m. UTC | #2
On 2016-04-09 03:24, Duncan wrote:
> Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:
>
>> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>>> On 2016-04-08 12:17, Chris Murphy wrote:
>>>
>>> I would personally suggest adding a per-filesystem node in sysfs to
>>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>>> attempt countermeasures when degraded, select/epoll on it will return
>>> when state changes, reads will return (at minimum): what devices
>>> comprise the FS, per disk state (is it working, failed, missing, a
>>> hot-spare, etc), and what effective redundancy we have (how many
>>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>>> would make it trivial to write a daemon to monitor the filesystem,
>>> react when something happens, and handle all the policy decisions.
>>
>> Hm, good proposal. Personally I tried to use uevents for this but they
>> cause locking troubles, and I didn't continue this attempt.
>
> Except that... in sysfs (unlike proc) there's a rather strictly enforced
> rule of one property per file.
Good point, I had forgotten about this.
>
> So you could NOT hold a single sysfs file open, that upon read would
> return 1) what devices comprise the FS, 2) per device (um, disk in the
> original, except that it can be a non-disk device, so changed to device
> here) state, 3) effective number of can-be-lost devices.
>
> The sysfs style interface would be a filesystem directory containing a
> devices subdir, with (read-only?) per-device state-files in that subdir.
> The listing of per-device state-files would thus provide #1, with the
> contents of each state-file being the status of that device, therefore
> providing #2.  Back in the main filesystem dir, there'd be a devices-
> loseable file, which would provide #3.
>
> There could also be a filesystem-level state file which could be read for
> the current state of the filesystem as a whole or selected/epolled for
> state-changes, and probably yet another file, we'll call it leave-be here
> simply because I don't have a better name, that would be read/write
> allowing reading or setting the no-countermeasures property.
I actually rather like this suggestion, with the caveat that we ideally 
should have multiple options for the auto-recovery mode:
1. Full auto-recovery, go read-only when an error is detected.
2. Go read-only when an error is detected but don't do auto-recovery 
(probably not very useful).
3. Do auto-recovery, but don't go read-only when an error is detected.
4. Don't do auto-recovery, and don't go read-only when an error is detected.
5-8. Same as the above, but require that the process that set the state 
keep the file open to maintain it (useful for cases where we want some 
kind of recovery if at all possible, but would prefer that the monitoring 
tool handle it).

In theory, we could do it as a bit-field to control what gets recovered 
and what doesn't.
>
>
> Actually, after looking at the existing /sys/fs/btrfs layout, we already
> have filesystem directories, each with a devices subdir, tho the symlinks
> therein point to the /sys/devices tree device dirs.  The listing thereof
> already provides #1, at least for operational devices.
>
> I'm not going to go testing what happens to the current sysfs devices
> listings when a device goes missing, but we already know btrfs doesn't
> dynamically use that information.  Presumably, once it does, the symlinks
> could be replaced with subdirs for missing devices, with the still known
> information in the subdir (which could then be named as either the btrfs
> device ID or as missing-N), and the status of the device being detectable
> by whether it's a symlink to a devices tree device (device online) or a
> subdir (device offline).
IIRC, under the current implementation, the symlink stays around as long 
as the device node in /dev stays around (so usually until the filesystem 
gets unmounted).

That said, there are issues inherent in trying to do something like 
replacing a symlink with a directory in sysfs, especially if the new 
directory contains a different layout than the one the symlink was 
pointing at:
1. You horribly break compatibility with existing tools.
2. You break the expectations of stability that are supposed to be 
guaranteed by sysfs for a given mount of it.
3. Sysfs isn't designed in a way that this could be done atomically, 
which severely limits usability (because the node might not be there, or 
it might be an empty directory).
This means we would need a separate directory to report device state.
>
> The per-filesystem devices-losable, fs-status, and leave-be files could
> be added to the existing syfs btrfs interface.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Murphy April 18, 2016, 12:55 a.m. UTC | #3
On Mon, Apr 11, 2016 at 5:32 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-09 03:24, Duncan wrote:
>>
>> Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:
>>
>>> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>>>>
>>>>
>>>> I would personally suggest adding a per-filesystem node in sysfs to
>>>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>>>> attempt countermeasures when degraded, select/epoll on it will return
>>>> when state changes, reads will return (at minimum): what devices
>>>> comprise the FS, per disk state (is it working, failed, missing, a
>>>> hot-spare, etc), and what effective redundancy we have (how many
>>>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>>>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>>>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>>>> would make it trivial to write a daemon to monitor the filesystem,
>>>> react when something happens, and handle all the policy decisions.
>>>
>>>
>>> Hm, good proposal. Personally I tried to use uevents for this but they
>>> cause locking troubles, and I didn't continue this attempt.
>>
>>
>> Except that... in sysfs (unlike proc) there's a rather strictly enforced
>> rule of one property per file.
>
> Good point, I had forgotten about this.

I just ran across this:
https://www.kernel.org/doc/Documentation/block/stat.txt

Q. Why are there multiple statistics in a single file?  Doesn't sysfs
   normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
   represent a consistent snapshot of the state of the device.

So there might be an exception. I'm using a zram device as a sprout
for a Btrfs seed. And this is what I'm seeing:

[root@f23m 0]# cat /sys/block/zram0/stat
   64258        0   514064       19    19949        0   159592      214        0      233      233

Anyway there might be a plausible exception, if there's a good reason,
for the one property per file rule.
Austin S. Hemmelgarn April 18, 2016, 12:18 p.m. UTC | #4
On 2016-04-17 20:55, Chris Murphy wrote:
> On Mon, Apr 11, 2016 at 5:32 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-09 03:24, Duncan wrote:
>>>
>>> Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:
>>>
>>>> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>>>>>
>>>>>
>>>>> I would personally suggest adding a per-filesystem node in sysfs to
>>>>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>>>>> attempt countermeasures when degraded, select/epoll on it will return
>>>>> when state changes, reads will return (at minimum): what devices
>>>>> comprise the FS, per disk state (is it working, failed, missing, a
>>>>> hot-spare, etc), and what effective redundancy we have (how many
>>>>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>>>>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>>>>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>>>>> would make it trivial to write a daemon to monitor the filesystem,
>>>>> react when something happens, and handle all the policy decisions.
>>>>
>>>>
>>>> Hm, good proposal. Personally I tried to use uevents for this but they
>>>> cause locking troubles, and I didn't continue this attempt.
>>>
>>>
>>> Except that... in sysfs (unlike proc) there's a rather strictly enforced
>>> rule of one property per file.
>>
>> Good point, I had forgotten about this.
>
> I just ran across this:
> https://www.kernel.org/doc/Documentation/block/stat.txt
>
> Q. Why are there multiple statistics in a single file?  Doesn't sysfs
>     normally contain a single value per file?
> A. By having a single file, the kernel can guarantee that the statistics
>     represent a consistent snapshot of the state of the device.
>
> So there might be an exception. I'm using a zram device as a sprout
> for a Btrfs seed. And this is what I'm seeing:
>
> [root@f23m 0]# cat /sys/block/zram0/stat
>    64258        0   514064       19    19949        0   159592      214        0      233      233
>
> Anyway there might be a plausible exception, if there's a good reason,
> for the one property per file rule.
Part of the requirement for that though is that we have to provide a 
consistent set of info.  IOW, we would probably need to use something 
like RCU or locks to handle the data so we could get a consistent 
snapshot of the state.


Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d9b147f..f9a2fa6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2716,12 +2716,17 @@  static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
 
        mutex_lock(&fs_devices->device_list_mutex);
        fi_args->num_devices = fs_devices->num_devices;
+       fi_args->missing_devices = fs_devices->missing_devices;
+       fi_args->open_devices = fs_devices->open_devices;
+       fi_args->rw_devices = fs_devices->rw_devices;
+       fi_args->total_devices = fs_devices->total_devices;
        memcpy(&fi_args->fsid, root->fs_info->fsid, sizeof(fi_args->fsid));
 
        list_for_each_entry(device, &fs_devices->devices, dev_list) {
                if (device->devid > fi_args->max_id)
                        fi_args->max_id = device->devid;
        }
+       fi_args->state = root->fs_info->fs_state;
        mutex_unlock(&fs_devices->device_list_mutex);
 
        fi_args->nodesize = root->fs_info->super_copy->nodesize;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..6808bf2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -186,8 +186,12 @@  struct btrfs_ioctl_fs_info_args {
        __u32 nodesize;                         /* out */
        __u32 sectorsize;                       /* out */
        __u32 clone_alignment;                  /* out */
-       __u32 reserved32;
-       __u64 reserved[122];                    /* pad to 1k */
+       __u32 state;                            /* out */
+       __u64 missing_devices;                  /* out */
+       __u64 open_devices;                     /* out */
+       __u64 rw_devices;                       /* out */
+       __u64 total_devices;                    /* out */
+       __u64 reserved[118];                    /* pad to 1k */
 };
 
 struct btrfs_ioctl_feature_flags {