Message ID | 20210623003630.274804-1-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] drivers/base/core: refcount kobject and bus on device attribute read / store | expand |
On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote: > It's possible today to have a device attribute read or store > race against device removal. When this happens there is a small > chance that the dereference for the private data area of the driver > is NULL. > > Let's consider the zram driver as an example. Its possible to run into > a race where a sysfs knob is being used, we get preempted, and a zram > device is removed before we complete use of the sysfs knob. This can happen > for instance on block devices, where for instance the zram block devices > just part of the private data of the block device. > > For instance this can happen in the following two situations > as examples to illustrate this better: > > CPU 1 CPU 2 > destroy_devices > ... > compact_store() > zram = dev_to_zram(dev); > idr_for_each(zram_remove_cb > zram_remove > ... > kfree(zram) > down_read(&zram->init_lock); > > CPU 1 CPU 2 > hot_remove_store > compact_store() > zram = dev_to_zram(dev); > zram_remove > kfree(zram) > down_read(&zram->init_lock); > > To ensure the private data pointer is valid we could use bdget() / bdput() > in between access, however that would mean doing that in all sysfs > reads/stores on the driver. Instead a generic solution for all drivers > is to ensure the device kobject is still valid and also the bus, if > a bus is present. > > This issue does not fix a known crash, however this race was > spotted by Minchan Kim through code inspection upon code review > of another zram patch. > > Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > > This v3 is the same as v2, but since the I got an email asking me to > clarify the differences between the first iteration and the second, this > v3 just describes what I did not explain in the v2. Namely: > > * I removed the checks from get_device() calls as suggested > * The functions which are now being used outside of bus.c have > forwared declarations stuffed into base.h > > drivers/base/base.h | 2 ++ > drivers/base/bus.c | 4 ++-- > drivers/base/core.c | 36 ++++++++++++++++++++++++++++++++---- > 3 files changed, 36 insertions(+), 6 deletions(-) > > diff --git a/drivers/base/base.h b/drivers/base/base.h > index e5f9b7e656c3..3f95b125b667 100644 > --- a/drivers/base/base.h > +++ b/drivers/base/base.h > @@ -127,6 +127,8 @@ static inline void auxiliary_bus_init(void) { } > > struct kobject *virtual_device_parent(struct device *dev); > > +extern struct bus_type *bus_get(struct bus_type *bus); > +extern void bus_put(struct bus_type *bus); > extern int bus_add_device(struct device *dev); > extern void bus_probe_device(struct device *dev); > extern void bus_remove_device(struct device *dev); > diff --git a/drivers/base/bus.c b/drivers/base/bus.c > index 36d0c654ea61..21c80d7d6433 100644 > --- a/drivers/base/bus.c > +++ b/drivers/base/bus.c > @@ -39,7 +39,7 @@ static struct kset *system_kset; > static int __must_check bus_rescan_devices_helper(struct device *dev, > void *data); > > -static struct bus_type *bus_get(struct bus_type *bus) > +struct bus_type *bus_get(struct bus_type *bus) > { > if (bus) { > kset_get(&bus->p->subsys); > @@ -48,7 +48,7 @@ static struct bus_type *bus_get(struct bus_type *bus) > return NULL; > } > > -static void bus_put(struct bus_type *bus) > +void bus_put(struct bus_type *bus) > { > if (bus) > kset_put(&bus->p->subsys); > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 4a8bf8cda52b..f69aa040b56d 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string); > static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, > char *buf) > { > - struct device_attribute *dev_attr = to_dev_attr(attr); > - struct device *dev = kobj_to_dev(kobj); > + struct device_attribute *dev_attr; > + struct device *dev; > + struct bus_type *bus = NULL; > ssize_t ret = -EIO; > > + dev = get_device(kobj_to_dev(kobj)); > + if (dev->bus) { No need to test for this, right? > + bus = bus_get(dev->bus); > + if (!bus) > + goto out; How can this happen? > + } > + > + dev_attr = to_dev_attr(attr); Why did you move this call to way down here? It's fine in the first line of the function above as-is. > if (dev_attr->show) > ret = dev_attr->show(dev, dev_attr, buf); > if (ret >= (ssize_t)PAGE_SIZE) { > printk("dev_attr_show: %pS returned bad count\n", > dev_attr->show); > } > + > + bus_put(bus); You are incrementing the bus, which is nice, but I do not understand why it is needed. What is causing the bus to go away _before_ the devices are going away? Busses almost never are removed from the system, and if they are, all devices associated with them are removed first. So I do not think you need to increment anything with that here. But step back, what prevented the kobject from decrementing between the call to dev_attr_show() and the first line of the function? The kobject needs to be incremented before show is called, right? Or does kernfs handle this properly already? I don't have the time at the moment to dig into this, but maybe this isn't an issue? Look at how sysfs handles the kobject lookups for kernfs nodes to see if all is sane there, maybe there isn't an issue? thanks, greg k-h
On Wed, Jun 23, 2021 at 10:32:53AM +0200, Greg KH wrote: > On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote: > > diff --git a/drivers/base/core.c b/drivers/base/core.c > > index 4a8bf8cda52b..f69aa040b56d 100644 > > --- a/drivers/base/core.c > > +++ b/drivers/base/core.c > > @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string); > > static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, > > char *buf) > > { > > - struct device_attribute *dev_attr = to_dev_attr(attr); > > - struct device *dev = kobj_to_dev(kobj); > > + struct device_attribute *dev_attr; > > + struct device *dev; > > + struct bus_type *bus = NULL; > > ssize_t ret = -EIO; > > > > + dev = get_device(kobj_to_dev(kobj)); > > + if (dev->bus) { > > No need to test for this, right? dev_uevent() checks for dev->bus, so I thought that was a clear indication this isn't always set. > > > + bus = bus_get(dev->bus); > > + if (!bus) > > + goto out; > > How can this happen? device_add() calls bus_add_device(), and the bus_add_device() implementation seems to have a check for the bus returned from bus_get() as well? > > + } > > + > > + dev_attr = to_dev_attr(attr); > > Why did you move this call to way down here? It's fine in the first > line of the function above as-is. I had done that when I had the device kobject reference incremented. I'll move it back up. > > if (dev_attr->show) > > ret = dev_attr->show(dev, dev_attr, buf); > > if (ret >= (ssize_t)PAGE_SIZE) { > > printk("dev_attr_show: %pS returned bad count\n", > > dev_attr->show); > > } > > + > > + bus_put(bus); > > You are incrementing the bus, which is nice, but I do not understand why > it is needed. What is causing the bus to go away _before_ the devices > are going away? Busses almost never are removed from the system, and if > they are, all devices associated with them are removed first. So I do > not think you need to increment anything with that here. You tell me. It was your suggestion as a replacement for the type specific lock, in the zram case, its a block device so I was using bdgrab(). > But step back, what prevented the kobject from decrementing between the > call to dev_attr_show() and the first line of the function? > > The kobject needs to be incremented before show is called, right? Or > does kernfs handle this properly already? kernfs / sysfs is in not struct device specific, it has no semantics for modules or even devices. The aspect of kernfs which deals with locking a struct kernfs_node is kernfs_get_active(). The refcount there of importance is the call to atomic_inc_unless_negative(&kn->active). struct kernfs_node *kernfs_get_active(struct kernfs_node *kn) { if (unlikely(!kn)) return NULL; if (!atomic_inc_unless_negative(&kn->active)) return NULL; if (kernfs_lockdep(kn)) rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_); return kn; } BTW this was also precisely where I had suggested to extend the kernfs_node with an optional kn->owner and if set we try_module_get() to prevent the deadlock case if the module exit routine also happens to use a lock also used by a sysfs attribute store/read. The flow would be (from a real live gdb crash backtrace from an original bug report from a customer): write system call --> ksys_write () vfs_write() --> __vfs_write() --> kernfs_fop_write() (note now this is kernfs_fop_write_iter()) --> sysfs_kf_write() --> dev_attr_store() --> null reference The dereference is because the dev_attr_store() call which we are modifying LINE-001 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr, LINE-002 const char *buf, size_t count) LINE-003 { LINE-004 struct device_attribute *dev_attr = to_dev_attr(attr); LINE-005 struct device *dev = kobj_to_dev(kobj); LINE-006 ssize_t ret = -EIO; LINE-007 LINE-008 if (dev_attr->store) LINE-009 ret = dev_attr->store(dev, dev_attr, buf, count); ... } The race happens because a sysfs store / read can be triggered, the CPU could be preempted after LINE-008, and the ->store is gone by LINE-009. This begs the question if kernfs_fop_write_iter() or sysfs protects this somehow? Let's see. For kernfs we have: static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) { struct kernfs_open_file *of = kernfs_of(iocb->ki_filp); ... mutex_lock(&of->mutex); if (!kernfs_get_active(of->kn)) { mutex_unlock(&of->mutex); len = -ENODEV; goto out_free; } ops = kernfs_ops(of->kn); if (ops->write) len = ops->write(of, buf, len, iocb->ki_pos); else len = -EINVAL; kernfs_put_active(of->kn); mutex_unlock(&of->mutex); ... } And the write call here is a syfs calls: static ssize_t sysfs_kf_write(struct kernfs_open_file *of, char *buf, size_t count, loff_t pos) { const struct sysfs_ops *ops = sysfs_file_ops(of->kn); struct kobject *kobj = of->kn->parent->priv; if (!count) return 0; return ops->store(kobj, of->kn->priv, buf, count); } As we have observed already, the active reference obtained through kernfs_get_active() was for the struct kernfs_node. Sure, the ops->write() is valid, in this case it sysfs_kf_write(). sysfs isn't doing any active reference check for the kobject device attribute as it doesn't care for them. So syfs calls dev_attr_store(), but the dev_attr_store() is not preventing the device attribute ops to go fishing, and we destroy them while we're destroying the device on module removal. In the zram case, the abstraction is worse given the device attributes are are created on behalf of the driver as group attributes. The zram disksize_store() for instance will use: static inline struct zram *dev_to_zram(struct device *dev) { return (struct zram *)dev_to_disk(dev)->private_data; } static ssize_t disksize_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t len) { ... struct zram *zram = dev_to_zram(dev); ... } Nothing is preventing an active block group sysfs attribute from going fishing, in the dev_to_disk() is the struct gendisk, and that can get free'd on module exit while the sysfs attribute is about to be poked. > I don't have the time at the > moment to dig into this, but maybe this isn't an issue? Look at how > sysfs handles the kobject lookups for kernfs nodes to see if all is sane > there, maybe there isn't an issue? The deadlock issue I noted and am fixing with the zram driver was one real live situation a customer reported, however a null dereference illustrated above turned out to also be in the logs of the same bug report, just that it happened in other situations, so what Minchan claimed as theoretical actually indeed was a real issue and I hope that the above illustrates this better. Luis
On Wed, Jun 23, 2021 at 09:14:34AM -0700, Luis Chamberlain wrote: > On Wed, Jun 23, 2021 at 10:32:53AM +0200, Greg KH wrote: > > On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote: > > > diff --git a/drivers/base/core.c b/drivers/base/core.c > > > index 4a8bf8cda52b..f69aa040b56d 100644 > > > --- a/drivers/base/core.c > > > +++ b/drivers/base/core.c > > > @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string); > > > static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, > > > char *buf) > > > { > > > - struct device_attribute *dev_attr = to_dev_attr(attr); > > > - struct device *dev = kobj_to_dev(kobj); > > > + struct device_attribute *dev_attr; > > > + struct device *dev; > > > + struct bus_type *bus = NULL; > > > ssize_t ret = -EIO; > > > > > > + dev = get_device(kobj_to_dev(kobj)); > > > + if (dev->bus) { > > > > No need to test for this, right? > > dev_uevent() checks for dev->bus, so I thought that was a clear > indication this isn't always set. > > > > > > + bus = bus_get(dev->bus); > > > + if (!bus) > > > + goto out; The point is that even if dev->bus is NULL, then bus_get(NULL) is NULL. That's the only way that bus_get() can return NULL, which means this check too is not needed. > > > if (dev_attr->show) > > > ret = dev_attr->show(dev, dev_attr, buf); > > > if (ret >= (ssize_t)PAGE_SIZE) { > > > printk("dev_attr_show: %pS returned bad count\n", > > > dev_attr->show); > > > } > > > + > > > + bus_put(bus); > > > > You are incrementing the bus, which is nice, but I do not understand why > > it is needed. What is causing the bus to go away _before_ the devices > > are going away? Busses almost never are removed from the system, and if > > they are, all devices associated with them are removed first. So I do > > not think you need to increment anything with that here. > > You tell me. It was your suggestion as a replacement for the type > specific lock, in the zram case, its a block device so I was using > bdgrab(). I did? Sorry, I do not remember, but this is not a lock, nor does it protect anything. I'll respond to the rest later... greg k-h
On Wed, Jun 23, 2021 at 09:14:34AM -0700, Luis Chamberlain wrote: > kernfs / sysfs is in not struct device specific, it has no semantics for > modules or even devices. The aspect of kernfs which deals with locking > a struct kernfs_node is kernfs_get_active(). The refcount there of > importance is the call to atomic_inc_unless_negative(&kn->active). > > struct kernfs_node *kernfs_get_active(struct kernfs_node *kn) > { > if (unlikely(!kn)) > return NULL; > > if (!atomic_inc_unless_negative(&kn->active)) > return NULL; > > if (kernfs_lockdep(kn)) > rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_); > return kn; > } > > BTW this was also precisely where I had suggested to extend the > kernfs_node with an optional kn->owner and if set we try_module_get() to > prevent the deadlock case if the module exit routine also happens to use > a lock also used by a sysfs attribute store/read. > > The flow would be (from a real live gdb crash backtrace from an original > bug report from a customer): > > write system call --> > ksys_write () > vfs_write() --> > __vfs_write() --> > kernfs_fop_write() (note now this is kernfs_fop_write_iter()) --> > sysfs_kf_write() --> > dev_attr_store() --> > null reference > > The dereference is because the dev_attr_store() call which we are > modifying > > LINE-001 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr, > LINE-002 const char *buf, size_t count) > LINE-003 { > LINE-004 struct device_attribute *dev_attr = to_dev_attr(attr); > LINE-005 struct device *dev = kobj_to_dev(kobj); > LINE-006 ssize_t ret = -EIO; > LINE-007 > LINE-008 if (dev_attr->store) > LINE-009 ret = dev_attr->store(dev, dev_attr, buf, count); > ... > } > > The race happens because a sysfs store / read can be triggered, the CPU > could be preempted after LINE-008, and the ->store is gone by LINE-009. > This begs the question if kernfs_fop_write_iter() or sysfs protects this > somehow? Let's see. > > For kernfs we have: > > static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter) > { > struct kernfs_open_file *of = kernfs_of(iocb->ki_filp); > ... > mutex_lock(&of->mutex); > if (!kernfs_get_active(of->kn)) { > mutex_unlock(&of->mutex); > len = -ENODEV; > goto out_free; > } > > ops = kernfs_ops(of->kn); > if (ops->write) > len = ops->write(of, buf, len, iocb->ki_pos); > else > len = -EINVAL; > > kernfs_put_active(of->kn); > mutex_unlock(&of->mutex); > ... > } > > And the write call here is a syfs calls: > > static ssize_t sysfs_kf_write(struct kernfs_open_file *of, char *buf, > size_t count, loff_t pos) > { > const struct sysfs_ops *ops = sysfs_file_ops(of->kn); > struct kobject *kobj = of->kn->parent->priv; > > if (!count) > return 0; > > return ops->store(kobj, of->kn->priv, buf, count); > } > > As we have observed already, the active reference obtained through > kernfs_get_active() was for the struct kernfs_node. Sure, the > ops->write() is valid, in this case it sysfs_kf_write(). > > sysfs isn't doing any active reference check for the kobject device > attribute as it doesn't care for them. So syfs calls > dev_attr_store(), but the dev_attr_store() is not preventing the device > attribute ops to go fishing, and we destroy them while we're destroying > the device on module removal. Ah, but sysfs _should_ be doing this properly. I think the issue is that when we store the kobject pointer in kernfs, it is NOT incremented. Look at sysfs_create_dir_ns(), if we call kobject_get(kobj) right before we call kernfs_create_dir_ns(), and then properly clean things up later on when we remove the sysfs directory (i.e. the kobject), it _should_ fix this problem. Then, we know, whenever show/store/whatever is called, when we cast out of kernfs the private pointer to a kobject, that the kobject really is still alive, so we can use it properly. Can you try that, it should be a much "simpler" change here. thanks, greg k-h
On Wed, Jun 23, 2021 at 06:51:42PM +0200, Greg KH wrote: > On Wed, Jun 23, 2021 at 09:14:34AM -0700, Luis Chamberlain wrote: > > sysfs isn't doing any active reference check for the kobject device > > attribute as it doesn't care for them. So syfs calls > > dev_attr_store(), but the dev_attr_store() is not preventing the device > > attribute ops to go fishing, and we destroy them while we're destroying > > the device on module removal. > > Ah, but sysfs _should_ be doing this properly. > > I think the issue is that when we store the kobject pointer in kernfs, > it is NOT incremented. Look at sysfs_create_dir_ns(), if we call > kobject_get(kobj) right before we call kernfs_create_dir_ns(), and then > properly clean things up later on when we remove the sysfs directory > (i.e. the kobject), it _should_ fix this problem. > > Then, we know, whenever show/store/whatever is called, when we cast out > of kernfs the private pointer to a kobject, that the kobject really is > still alive, so we can use it properly. > > Can you try that, it should be a much "simpler" change here. Agreed, its cleaner. It should also address the type race consideration I had, given that in the zram case, for instance, we will call device_del() on del_gendisk() and the order of freeing typically is something like: del_gendisk(zram->disk); blk_cleanup_queue(zram->disk->queue); put_disk(zram->disk); The put_disk() is what would make our gendisk->private_data invalid. Will spin up a v4 with your suggestion. Luis
diff --git a/drivers/base/base.h b/drivers/base/base.h index e5f9b7e656c3..3f95b125b667 100644 --- a/drivers/base/base.h +++ b/drivers/base/base.h @@ -127,6 +127,8 @@ static inline void auxiliary_bus_init(void) { } struct kobject *virtual_device_parent(struct device *dev); +extern struct bus_type *bus_get(struct bus_type *bus); +extern void bus_put(struct bus_type *bus); extern int bus_add_device(struct device *dev); extern void bus_probe_device(struct device *dev); extern void bus_remove_device(struct device *dev); diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 36d0c654ea61..21c80d7d6433 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -39,7 +39,7 @@ static struct kset *system_kset; static int __must_check bus_rescan_devices_helper(struct device *dev, void *data); -static struct bus_type *bus_get(struct bus_type *bus) +struct bus_type *bus_get(struct bus_type *bus) { if (bus) { kset_get(&bus->p->subsys); @@ -48,7 +48,7 @@ static struct bus_type *bus_get(struct bus_type *bus) return NULL; } -static void bus_put(struct bus_type *bus) +void bus_put(struct bus_type *bus) { if (bus) kset_put(&bus->p->subsys); diff --git a/drivers/base/core.c b/drivers/base/core.c index 4a8bf8cda52b..f69aa040b56d 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string); static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, char *buf) { - struct device_attribute *dev_attr = to_dev_attr(attr); - struct device *dev = kobj_to_dev(kobj); + struct device_attribute *dev_attr; + struct device *dev; + struct bus_type *bus = NULL; ssize_t ret = -EIO; + dev = get_device(kobj_to_dev(kobj)); + if (dev->bus) { + bus = bus_get(dev->bus); + if (!bus) + goto out; + } + + dev_attr = to_dev_attr(attr); if (dev_attr->show) ret = dev_attr->show(dev, dev_attr, buf); if (ret >= (ssize_t)PAGE_SIZE) { printk("dev_attr_show: %pS returned bad count\n", dev_attr->show); } + + bus_put(bus); +out: + put_device(dev); + return ret; } static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) { - struct device_attribute *dev_attr = to_dev_attr(attr); - struct device *dev = kobj_to_dev(kobj); + struct device_attribute *dev_attr; + struct device *dev; + struct bus_type *bus = NULL; ssize_t ret = -EIO; + dev = get_device(kobj_to_dev(kobj)); + if (dev->bus) { + bus = bus_get(dev->bus); + if (!bus) + goto out; + } + + dev_attr = to_dev_attr(attr); if (dev_attr->store) ret = dev_attr->store(dev, dev_attr, buf, count); + + bus_put(bus); +out: + put_device(dev); + return ret; }
It's possible today to have a device attribute read or store race against device removal. When this happens there is a small chance that the dereference for the private data area of the driver is NULL. Let's consider the zram driver as an example. Its possible to run into a race where a sysfs knob is being used, we get preempted, and a zram device is removed before we complete use of the sysfs knob. This can happen for instance on block devices, where for instance the zram block devices just part of the private data of the block device. For instance this can happen in the following two situations as examples to illustrate this better: CPU 1 CPU 2 destroy_devices ... compact_store() zram = dev_to_zram(dev); idr_for_each(zram_remove_cb zram_remove ... kfree(zram) down_read(&zram->init_lock); CPU 1 CPU 2 hot_remove_store compact_store() zram = dev_to_zram(dev); zram_remove kfree(zram) down_read(&zram->init_lock); To ensure the private data pointer is valid we could use bdget() / bdput() in between access, however that would mean doing that in all sysfs reads/stores on the driver. Instead a generic solution for all drivers is to ensure the device kobject is still valid and also the bus, if a bus is present. This issue does not fix a known crash, however this race was spotted by Minchan Kim through code inspection upon code review of another zram patch. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- This v3 is the same as v2, but since the I got an email asking me to clarify the differences between the first iteration and the second, this v3 just describes what I did not explain in the v2. Namely: * I removed the checks from get_device() calls as suggested * The functions which are now being used outside of bus.c have forwared declarations stuffed into base.h drivers/base/base.h | 2 ++ drivers/base/bus.c | 4 ++-- drivers/base/core.c | 36 ++++++++++++++++++++++++++++++++---- 3 files changed, 36 insertions(+), 6 deletions(-)