Message ID | 20191205112706.8125-5-anand.jain@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs, sysfs cleanup and add dev_state | expand |
On Thu, Dec 05, 2019 at 07:27:06PM +0800, Anand Jain wrote: > +static void btrfs_dev_state_to_str(struct btrfs_device *device, char *dev_state_str, size_t n) > +{ > + int state; > + const char *btrfs_dev_states[] = { "WRITEABLE", "IN_FS_METADATA", > + "MISSING", "REPLACE_TGT", "FLUSHING" > + }; > + > + n = n - strlen(dev_state_str) - 1; > + > + for (state = 0; state < ARRAY_SIZE(btrfs_dev_states); state++) { > + if (test_bit(state, &device->dev_state)) { > + if (strlen(dev_state_str)) > + strncat(dev_state_str, " ", n); > + strncat(dev_state_str, btrfs_dev_states[state], n); > + } > + } > + strncat(dev_state_str, "\n", n); Please use the snprintf way as in supported_checksums_show and not the str* functions.
On Thu, Dec 05, 2019 at 07:27:06PM +0800, Anand Jain wrote: > A new sysfs RW attribute Why RW? Most attributes/files in sysfs are supposed to be read-only, but you actually define the attribute as read-only with the macro BTRFS_ATTR. > UUID/devinfo/<devid>/dev_state > is added here. The dev_state here reflects the state of the device from > the kernel parameter btrfs_device::dev_state and this attribute is born > after the device is mounted and goes along with the dynamic nature of the > device add and delete. This attribute gets deleted at unmount. > > For example: > pwd > /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo > cat 1/dev_state > IN_FS_METADATA MISSING > cat 2/dev_state > WRITABLE IN_FS_METADATA So the values copy the device state macros, that's probably ok. > +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, > + struct kobj_attribute *a, char *buf) > +{ > + struct btrfs_device *device = container_of(kobj, struct btrfs_device, > + devid_kobj); > + > + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); The device access is unprotected, you need at least RCU but that still does not prevent from the device being freed by deletion. The device_list_mutex is quite heavy and allowing a DoS by repeatedly reading the file contents is not something we want to allow.
On 12/5/19 10:21 PM, David Sterba wrote: > On Thu, Dec 05, 2019 at 07:27:06PM +0800, Anand Jain wrote: >> A new sysfs RW attribute > > Why RW? Most attributes/files in sysfs are supposed to be read-only, but > you actually define the attribute as read-only with the macro > BTRFS_ATTR. > Oh. Change log has to be fixed here. But I am also working on a new patch to make this RW, to set the new read_preferred state on the device. >> UUID/devinfo/<devid>/dev_state >> is added here. The dev_state here reflects the state of the device from >> the kernel parameter btrfs_device::dev_state and this attribute is born >> after the device is mounted and goes along with the dynamic nature of the >> device add and delete. This attribute gets deleted at unmount. >> >> For example: >> pwd >> /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo >> cat 1/dev_state >> IN_FS_METADATA MISSING >> cat 2/dev_state >> WRITABLE IN_FS_METADATA > > So the values copy the device state macros, that's probably ok. > Yep. >> +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, >> + struct kobj_attribute *a, char *buf) >> +{ >> + struct btrfs_device *device = container_of(kobj, struct btrfs_device, >> + devid_kobj); >> + >> + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); > > The device access is unprotected, you need at least RCU but that still > does not prevent from the device being freed by deletion. We need RCU let me fix. Device being deleted is fine, there is nothing to lose, another directory lookup will show that UUID/devinfo/<devid> is gone is the device is deleted. > The > device_list_mutex is quite heavy and allowing a DoS by repeatedly > reading the file contents is not something we want to allow. > Yes we don't have to use device_list_mutex here, as its read, a refresh/re-read will refresh the dev_state. Thanks, Anand
On 12/5/19 10:10 PM, David Sterba wrote: > On Thu, Dec 05, 2019 at 07:27:06PM +0800, Anand Jain wrote: >> +static void btrfs_dev_state_to_str(struct btrfs_device *device, char *dev_state_str, size_t n) >> +{ >> + int state; >> + const char *btrfs_dev_states[] = { "WRITEABLE", "IN_FS_METADATA", >> + "MISSING", "REPLACE_TGT", "FLUSHING" >> + }; >> + >> + n = n - strlen(dev_state_str) - 1; >> + >> + for (state = 0; state < ARRAY_SIZE(btrfs_dev_states); state++) { >> + if (test_bit(state, &device->dev_state)) { >> + if (strlen(dev_state_str)) >> + strncat(dev_state_str, " ", n); >> + strncat(dev_state_str, btrfs_dev_states[state], n); >> + } >> + } >> + strncat(dev_state_str, "\n", n); > > Please use the snprintf way as in supported_checksums_show and not the > str* functions. > Ok. Will fix. Thanks, Anand
On Thu, Dec 05, 2019 at 10:38:15PM +0800, Anand Jain wrote: > > So the values copy the device state macros, that's probably ok. > Yep. Although, sysfs files should print one value per file, which makes sense in many cases, so eg. missing should exist separately too for quick checks for the most common device states. The dev_state reflects the internal state and is likely useful only for debugging. > >> +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, > >> + struct kobj_attribute *a, char *buf) > >> +{ > >> + struct btrfs_device *device = container_of(kobj, struct btrfs_device, > >> + devid_kobj); > >> + > >> + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); > > > > The device access is unprotected, you need at least RCU but that still > > does not prevent from the device being freed by deletion. > > We need RCU let me fix. Device being deleted is fine, there > is nothing to lose, another directory lookup will show that > UUID/devinfo/<devid> is gone is the device is deleted. The device can be gone from the list but the sysfs files can still exist. CPU1 CPU2 btrfs_rm_device open file btrfs_sysfs_rm_device_link btrfs_free_device kfree(device) call read, access freed device > > The > > device_list_mutex is quite heavy and allowing a DoS by repeatedly > > reading the file contents is not something we want to allow. > > > > Yes we don't have to use device_list_mutex here, as its read, > a refresh/re-read will refresh the dev_state. The point is not to synchronize the device state values but the device object itself.
On 5/12/19 11:14 PM, David Sterba wrote: > On Thu, Dec 05, 2019 at 10:38:15PM +0800, Anand Jain wrote: >>> So the values copy the device state macros, that's probably ok. >> Yep. > > Although, sysfs files should print one value per file, which makes sense > in many cases, so eg. missing should exist separately too for quick > checks for the most common device states. The dev_state reflects the > internal state and is likely useful only for debugging. > I agree. Its better to create an individual attribute for each of the device states. For instance.. under the 'UUID/devinfo/<devid>' kobject attributes will be: missing in_fs_metadata replace_target cat missing 0 cat in_fs_metadata 1 ..etc which seems to be more or less standard for block devices. Will fix it in v2. >>>> +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, >>>> + struct kobj_attribute *a, char *buf) >>>> +{ >>>> + struct btrfs_device *device = container_of(kobj, struct btrfs_device, >>>> + devid_kobj); >>>> + >>>> + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); >>> >>> The device access is unprotected, you need at least RCU but that still >>> does not prevent from the device being freed by deletion. >> >> We need RCU let me fix. Device being deleted is fine, there >> is nothing to lose, another directory lookup will show that >> UUID/devinfo/<devid> is gone is the device is deleted. > > The device can be gone from the list but the sysfs files can still > exist. > > CPU1 CPU2 > > btrfs_rm_device > open file > btrfs_sysfs_rm_device_link > btrfs_free_device > kfree(device) > call read, access freed device > I completely missed the sysfs synchronization with device delete. As in the other email discussion, I think a new rwlock shall suffice. And as its lock is only between device delete and sysfs so it shall be light weight without affecting the other device_list_mutex holders. Thanks, Anand >>> The >>> device_list_mutex is quite heavy and allowing a DoS by repeatedly >>> reading the file contents is not something we want to allow. >>> >> >> Yes we don't have to use device_list_mutex here, as its read, >> a refresh/re-read will refresh the dev_state. > > The point is not to synchronize the device state values but the device > object itself.
On 12/6/19 9:49 PM, Anand Jain wrote: > > > On 5/12/19 11:14 PM, David Sterba wrote: >> On Thu, Dec 05, 2019 at 10:38:15PM +0800, Anand Jain wrote: >>>> So the values copy the device state macros, that's probably ok. >>> Yep. >> >> Although, sysfs files should print one value per file, which makes sense >> in many cases, so eg. missing should exist separately too for quick >> checks for the most common device states. The dev_state reflects the >> internal state and is likely useful only for debugging. >> > > I agree. Its better to create an individual attribute for each of the > device states. For instance.. > > under the 'UUID/devinfo/<devid>' kobject > attributes will be: > missing > in_fs_metadata > replace_target > > cat missing > 0 > cat in_fs_metadata > 1 > > ..etc > > which seems to be more or less standard for block devices. > > Will fix it in v2. This is fixed in v2. > >>>>> +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, >>>>> + struct kobj_attribute *a, char *buf) >>>>> +{ >>>>> + struct btrfs_device *device = container_of(kobj, struct >>>>> btrfs_device, >>>>> + devid_kobj); >>>>> + >>>>> + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); >>>> >>>> The device access is unprotected, you need at least RCU but that still >>>> does not prevent from the device being freed by deletion. >>> >>> We need RCU let me fix. Device being deleted is fine, there >>> is nothing to lose, another directory lookup will show that >>> UUID/devinfo/<devid> is gone is the device is deleted. >> >> The device can be gone from the list but the sysfs files can still >> exist. >> >> CPU1 CPU2 >> >> btrfs_rm_device >> open file >> btrfs_sysfs_rm_device_link >> btrfs_free_device >> kfree(device) >> call read, access freed device >> > > I completely missed the sysfs synchronization with device delete. > As in the other email discussion, I think a new rwlock shall suffice. > And as its lock is only between device delete and sysfs so it shall > be light weight without affecting the other device_list_mutex holders. Looked into this further, actually we don't need any lock here the device delete thread which calls kobject_put() makes sure sysfs read is closed. So an existing sysfs read thread will have to complete before device free. CPU1 CPU2 btrfs_rm_device open file btrfs_sysfs_rm_device_link call read, access freed device sysfs waits for the open file to close. btrfs_free_device kfree(device) Thanks Anand > Thanks, > Anand > >>>> The >>>> device_list_mutex is quite heavy and allowing a DoS by repeatedly >>>> reading the file contents is not something we want to allow. >>>> >>> >>> Yes we don't have to use device_list_mutex here, as its read, >>> a refresh/re-read will refresh the dev_state. >> >> The point is not to synchronize the device state values but the device >> object itself. > > >
On Mon, Dec 09, 2019 at 10:05:39PM +0800, Anand Jain wrote: > Looked into this further, actually we don't need any lock here > the device delete thread which calls kobject_put() makes sure > sysfs read is closed. So an existing sysfs read thread will have > to complete before device free. > > > CPU1 CPU2 > > btrfs_rm_device > open file > btrfs_sysfs_rm_device_link > call read, access freed device > sysfs waits for the open file > to close. > btrfs_free_device > kfree(device) Ok, as long as this holds and sysfs file is removed before the device is freed, the locking is not necessary.
On Mon, Dec 09, 2019 at 10:05:39PM +0800, Anand Jain wrote: > On 12/6/19 9:49 PM, Anand Jain wrote: > > > > > > On 5/12/19 11:14 PM, David Sterba wrote: > >> On Thu, Dec 05, 2019 at 10:38:15PM +0800, Anand Jain wrote: > >>>> So the values copy the device state macros, that's probably ok. > >>> Yep. > >> > >> Although, sysfs files should print one value per file, which makes sense > >> in many cases, so eg. missing should exist separately too for quick > >> checks for the most common device states. The dev_state reflects the > >> internal state and is likely useful only for debugging. > >> > > > > I agree. Its better to create an individual attribute for each of the > > device states. For instance.. > > > > under the 'UUID/devinfo/<devid>' kobject > > attributes will be: > > missing > > in_fs_metadata > > replace_target > > > > cat missing > > 0 > > cat in_fs_metadata > > 1 > > > > ..etc > > > > which seems to be more or less standard for block devices. > > > > Will fix it in v2. > > This is fixed in v2. > > > > > >>>>> +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, > >>>>> + struct kobj_attribute *a, char *buf) > >>>>> +{ > >>>>> + struct btrfs_device *device = container_of(kobj, struct > >>>>> btrfs_device, > >>>>> + devid_kobj); > >>>>> + > >>>>> + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); > >>>> > >>>> The device access is unprotected, you need at least RCU but that still > >>>> does not prevent from the device being freed by deletion. > >>> > >>> We need RCU let me fix. Device being deleted is fine, there > >>> is nothing to lose, another directory lookup will show that > >>> UUID/devinfo/<devid> is gone is the device is deleted. > >> > >> The device can be gone from the list but the sysfs files can still > >> exist. > >> > >> CPU1 CPU2 > >> > >> btrfs_rm_device > >> open file > >> btrfs_sysfs_rm_device_link > >> btrfs_free_device > >> kfree(device) > >> call read, access freed device > >> > > > > I completely missed the sysfs synchronization with device delete. > > As in the other email discussion, I think a new rwlock shall suffice. > > And as its lock is only between device delete and sysfs so it shall > > be light weight without affecting the other device_list_mutex holders. > > Looked into this further, actually we don't need any lock here > the device delete thread which calls kobject_put() makes sure > sysfs read is closed. So an existing sysfs read thread will have > to complete before device free. > > > CPU1 CPU2 > > btrfs_rm_device > open file > btrfs_sysfs_rm_device_link > call read, access freed device > sysfs waits for the open file > to close. How exactly does sysfs wait for the device? Is it eg wait_event checking number of references? If the file stays open by an evil process is it going to block the device removal indefinitelly?
On Fri, Dec 13, 2019 at 05:43:32PM +0100, David Sterba wrote: > > Looked into this further, actually we don't need any lock here > > the device delete thread which calls kobject_put() makes sure > > sysfs read is closed. So an existing sysfs read thread will have > > to complete before device free. > > > > > > CPU1 CPU2 > > > > btrfs_rm_device > > open file > > btrfs_sysfs_rm_device_link > > call read, access freed device > > sysfs waits for the open file > > to close. > > How exactly does sysfs wait for the device? Is it eg wait_event checking > number of references? If the file stays open by an evil process is it > going to block the device removal indefinitelly? Yeah, sysfs waits until the file is closed. Eg. umount can be stalled that way too.
On 14/12/19 1:02 AM, David Sterba wrote: > On Fri, Dec 13, 2019 at 05:43:32PM +0100, David Sterba wrote: >>> Looked into this further, actually we don't need any lock here >>> the device delete thread which calls kobject_put() makes sure >>> sysfs read is closed. So an existing sysfs read thread will have >>> to complete before device free. >>> >>> >>> CPU1 CPU2 >>> >>> btrfs_rm_device >>> open file >>> btrfs_sysfs_rm_device_link >>> call read, access freed device >>> sysfs waits for the open file >>> to close. >> >> How exactly does sysfs wait for the device? Is it eg wait_event checking >> number of references? If the file stays open by an evil process is it >> going to block the device removal indefinitelly? > > Yeah, sysfs waits until the file is closed. Eg. umount can be stalled > that way too. > And similar to umount, I don't think we should return EBUSY for btrfs_rm_device if the device sysfs attribute is opened, as sysfs show attributes are non blocking and would be completed in the timely manner. regards, Anand
On Sat, Dec 14, 2019 at 08:26:24AM +0800, Anand Jain wrote: > > > On 14/12/19 1:02 AM, David Sterba wrote: > > On Fri, Dec 13, 2019 at 05:43:32PM +0100, David Sterba wrote: > >>> Looked into this further, actually we don't need any lock here > >>> the device delete thread which calls kobject_put() makes sure > >>> sysfs read is closed. So an existing sysfs read thread will have > >>> to complete before device free. > >>> > >>> > >>> CPU1 CPU2 > >>> > >>> btrfs_rm_device > >>> open file > >>> btrfs_sysfs_rm_device_link > >>> call read, access freed device > >>> sysfs waits for the open file > >>> to close. > >> > >> How exactly does sysfs wait for the device? Is it eg wait_event checking > >> number of references? If the file stays open by an evil process is it > >> going to block the device removal indefinitelly? > > > > Yeah, sysfs waits until the file is closed. Eg. umount can be stalled > > that way too. > > And similar to umount, I don't think we should return EBUSY > for btrfs_rm_device if the device sysfs attribute is opened, > as sysfs show attributes are non blocking and would be completed > in the timely manner. While I don't think the blocking behaviour is totally OK, returning EBUSY could be confusing without any other explanation. Also leaving sysfs files open but not read is kind of strange on itself.
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 834f712ed60c..99c7269e0524 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -978,29 +978,76 @@ int btrfs_sysfs_remove_devices_attr(struct btrfs_fs_devices *fs_devices, if (!fs_devices->devices_kobj) return -EINVAL; - if (one_device && one_device->bdev) { - disk = one_device->bdev->bd_part; - disk_kobj = &part_to_dev(disk)->kobj; + if (one_device) { + if (one_device->bdev) { + disk = one_device->bdev->bd_part; + disk_kobj = &part_to_dev(disk)->kobj; + sysfs_remove_link(fs_devices->devices_kobj, disk_kobj->name); + } - sysfs_remove_link(fs_devices->devices_kobj, disk_kobj->name); - } + kobject_del(&one_device->devid_kobj); + kobject_put(&one_device->devid_kobj); - if (one_device) return 0; + } - list_for_each_entry(one_device, - &fs_devices->devices, dev_list) { - if (!one_device->bdev) - continue; - disk = one_device->bdev->bd_part; - disk_kobj = &part_to_dev(disk)->kobj; + list_for_each_entry(one_device, &fs_devices->devices, dev_list) { - sysfs_remove_link(fs_devices->devices_kobj, disk_kobj->name); + if (one_device->bdev) { + disk = one_device->bdev->bd_part; + disk_kobj = &part_to_dev(disk)->kobj; + sysfs_remove_link(fs_devices->devices_kobj, disk_kobj->name); + } + kobject_del(&one_device->devid_kobj); + kobject_put(&one_device->devid_kobj); } return 0; } +static void btrfs_dev_state_to_str(struct btrfs_device *device, char *dev_state_str, size_t n) +{ + int state; + const char *btrfs_dev_states[] = { "WRITEABLE", "IN_FS_METADATA", + "MISSING", "REPLACE_TGT", "FLUSHING" + }; + + n = n - strlen(dev_state_str) - 1; + + for (state = 0; state < ARRAY_SIZE(btrfs_dev_states); state++) { + if (test_bit(state, &device->dev_state)) { + if (strlen(dev_state_str)) + strncat(dev_state_str, " ", n); + strncat(dev_state_str, btrfs_dev_states[state], n); + } + } + strncat(dev_state_str, "\n", n); +} + +static ssize_t btrfs_sysfs_dev_state_show(struct kobject *kobj, + struct kobj_attribute *a, char *buf) +{ + struct btrfs_device *device = container_of(kobj, struct btrfs_device, + devid_kobj); + + btrfs_dev_state_to_str(device, buf, PAGE_SIZE); + return strlen(buf); +} + +BTRFS_ATTR(, dev_state, btrfs_sysfs_dev_state_show); + +static struct attribute *devinfo_attrs[] = { + BTRFS_ATTR_PTR(, dev_state), + NULL +}; + +ATTRIBUTE_GROUPS(devinfo); + +static struct kobj_type devinfo_ktype = { + .sysfs_ops = &kobj_sysfs_ops, + .default_groups = devinfo_groups, +}; + int btrfs_sysfs_add_devices_attr(struct btrfs_fs_devices *fs_devices, struct btrfs_device *one_device) { @@ -1008,22 +1055,30 @@ int btrfs_sysfs_add_devices_attr(struct btrfs_fs_devices *fs_devices, struct btrfs_device *dev; list_for_each_entry(dev, &fs_devices->devices, dev_list) { - struct hd_struct *disk; - struct kobject *disk_kobj; - - if (!dev->bdev) - continue; if (one_device && one_device != dev) continue; - disk = dev->bdev->bd_part; - disk_kobj = &part_to_dev(disk)->kobj; + if (dev->bdev) { + struct hd_struct *disk; + struct kobject *disk_kobj; + + disk = dev->bdev->bd_part; + disk_kobj = &part_to_dev(disk)->kobj; - error = sysfs_create_link(fs_devices->devices_kobj, - disk_kobj, disk_kobj->name); - if (error) + error = sysfs_create_link(fs_devices->devices_kobj, + disk_kobj, disk_kobj->name); + if (error) + break; + } + + error = kobject_init_and_add(&dev->devid_kobj, &devinfo_ktype, + fs_devices->devinfo_kobj, "%llu", + dev->devid); + if (error) { + kobject_put(&dev->devid_kobj); break; + } } return error; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 38f2e8437b68..68021d1ee216 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -138,6 +138,8 @@ struct btrfs_device { atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX]; struct extent_io_tree alloc_state; + + struct kobject devid_kobj; }; /*
A new sysfs RW attribute UUID/devinfo/<devid>/dev_state is added here. The dev_state here reflects the state of the device from the kernel parameter btrfs_device::dev_state and this attribute is born after the device is mounted and goes along with the dynamic nature of the device add and delete. This attribute gets deleted at unmount. For example: pwd /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo cat 1/dev_state IN_FS_METADATA MISSING cat 2/dev_state WRITABLE IN_FS_METADATA Signed-off-by: Anand Jain <anand.jain@oracle.com> --- fs/btrfs/sysfs.c | 101 ++++++++++++++++++++++++++++++++++----------- fs/btrfs/volumes.h | 2 + 2 files changed, 80 insertions(+), 23 deletions(-)