diff mbox series

[v8,09/12] sysfs: fix deadlock race with module removal

Message ID 20210927163805.808907-10-mcgrof@kernel.org (mailing list archive)
State New
Headers show
Series syfs: generic deadlock fix with module removal | expand

Commit Message

Luis Chamberlain Sept. 27, 2021, 4:38 p.m. UTC
When driver sysfs attributes use a lock also used on module removal we
can race to deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have module removal call
trigger. The module removal call code holds a lock, and then the
driver's sysfs file entry waits for the same lock. While holding the
lock the module removal tries to remove the sysfs entries, but these
cannot be removed yet as one is waiting for a lock. This won't complete
as the lock is already held. Likewise module removal cannot complete,
and so we deadlock.

This can now be easily reproducible with our sysfs selftest as follows:

./tools/testing/selftests/sysfs/sysfs.sh -t 0027

This uses a local driver lock. Test 0028 can also be used, that uses
the rtnl_lock():

./tools/testing/selftests/sysfs/sysfs.sh -t 0028

To fix this we extend the struct kernfs_node with a module reference
and use the try_module_get() after kernfs_get_active() is called. As
documented in the prior patch, we now know that once kernfs_get_active()
is called the module is implicitly guarded to exist and cannot be removed.
This is because the module is the one in charge of removing the same
sysfs file it created, and removal of sysfs files on module exit will wait
until they don't have any active references. By using a try_module_get()
after kernfs_get_active() we yield to let module removal trump calls to
process a sysfs operation, while also preventing module removal if a sysfs
operation is in already progress. This prevents the deadlock.

This deadlock was first reported with the zram driver, however the live
patching folks have acknowledged they have observed this as well with
live patching, when a live patch is removed. I was then able to
reproduce easily by creating a dedicated selftest for it.

A sketch of how this can happen follows, consider foo a local mutex
part of a driver, and used on the driver's module exit routine and
on one of its sysfs ops:

foo.c:
static DEFINE_MUTEX(foo);
static ssize_t foo_store(struct device *dev,
			 struct device_attribute *attr,
			 const char *buf, size_t count)
{
	...
	mutex_lock(&foo);
	...
	mutex_lock(&foo);
	...
}
static DEVICE_ATTR_RW(foo);
...
void foo_exit(void)
{
	mutex_lock(&foo);
	...
	mutex_unlock(&foo);
}
module_exit(foo_exit);

And this can lead to this condition:

CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
   del_gendisk(some_struct->disk);
     device_del()
       device_remove_groups()

In this situation foo_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked at
completes which is waiting for a lock already held.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
 fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
 fs/kernfs/file.c                       |  6 ++-
 fs/kernfs/kernfs-internal.h            |  3 +-
 fs/kernfs/symlink.c                    |  3 +-
 fs/sysfs/dir.c                         |  2 +-
 fs/sysfs/file.c                        |  6 ++-
 fs/sysfs/group.c                       |  3 +-
 include/linux/kernfs.h                 | 14 ++++---
 include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
 kernel/cgroup/cgroup.c                 |  2 +-
 11 files changed, 105 insertions(+), 34 deletions(-)

Comments

Ming Lei Oct. 5, 2021, 9:24 a.m. UTC | #1
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> When driver sysfs attributes use a lock also used on module removal we
> can race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the
> driver's sysfs file entry waits for the same lock. While holding the
> lock the module removal tries to remove the sysfs entries, but these
> cannot be removed yet as one is waiting for a lock. This won't complete
> as the lock is already held. Likewise module removal cannot complete,
> and so we deadlock.
> 
> This can now be easily reproducible with our sysfs selftest as follows:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> 
> This uses a local driver lock. Test 0028 can also be used, that uses
> the rtnl_lock():
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> 
> To fix this we extend the struct kernfs_node with a module reference
> and use the try_module_get() after kernfs_get_active() is called. As
> documented in the prior patch, we now know that once kernfs_get_active()
> is called the module is implicitly guarded to exist and cannot be removed.
> This is because the module is the one in charge of removing the same
> sysfs file it created, and removal of sysfs files on module exit will wait
> until they don't have any active references. By using a try_module_get()
> after kernfs_get_active() we yield to let module removal trump calls to
> process a sysfs operation, while also preventing module removal if a sysfs
> operation is in already progress. This prevents the deadlock.
> 
> This deadlock was first reported with the zram driver, however the live

Looks not see the lock pattern you mentioned in zram driver, can you
share the related zram code?

> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftest for it.
> 
> A sketch of how this can happen follows, consider foo a local mutex
> part of a driver, and used on the driver's module exit routine and
> on one of its sysfs ops:
> 
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> 			 struct device_attribute *attr,
> 			 const char *buf, size_t count)
> {
> 	...
> 	mutex_lock(&foo);
> 	...
> 	mutex_lock(&foo);
> 	...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> }
> module_exit(foo_exit);
> 
> And this can lead to this condition:
> 
> CPU A                              CPU B
>                                    foo_store()
> foo_exit()
>   mutex_lock(&foo)
>                                    mutex_lock(&foo)
>    del_gendisk(some_struct->disk);
>      device_del()
>        device_remove_groups()

I guess the deadlock exists if foo_exit() is called anywhere. If yes,
look the issue may not be related with removing module directly, right?



Thanks,
Ming
Kees Cook Oct. 5, 2021, 8:50 p.m. UTC | #2
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> When driver sysfs attributes use a lock also used on module removal we
> can race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the
> driver's sysfs file entry waits for the same lock. While holding the
> lock the module removal tries to remove the sysfs entries, but these
> cannot be removed yet as one is waiting for a lock. This won't complete
> as the lock is already held. Likewise module removal cannot complete,
> and so we deadlock.
> 
> This can now be easily reproducible with our sysfs selftest as follows:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> 
> This uses a local driver lock. Test 0028 can also be used, that uses
> the rtnl_lock():
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> 
> To fix this we extend the struct kernfs_node with a module reference
> and use the try_module_get() after kernfs_get_active() is called. As

I would agree: kernfs must know about the module containing the ops
structure it has been given. (Without this, there are, at the very least,
removal races for looking at kernfs_ops structures.)

In other places in the kernel, function callback dependencies are more
explicit in that if code is holding such things, it has already taken a
module reference, etc. But kernfs is special in the sense that just
because a kernfs entry exists, we don't want to pin the module use count
too.

But simple locking isn't workable to solve this because kernfs_remove()
must be able to be called from a module_exit routine without deadlocking.
(i.e. we would create exactly the situation that caused this condition
to get noticed in the first place.)

> documented in the prior patch, we now know that once kernfs_get_active()
> is called the module is implicitly guarded to exist and cannot be removed.
> This is because the module is the one in charge of removing the same
> sysfs file it created, and removal of sysfs files on module exit will wait
> until they don't have any active references. By using a try_module_get()
> after kernfs_get_active() we yield to let module removal trump calls to
> process a sysfs operation, while also preventing module removal if a sysfs
> operation is in already progress. This prevents the deadlock.
> 
> This deadlock was first reported with the zram driver, however the live
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftest for it.
> 
> A sketch of how this can happen follows, consider foo a local mutex
> part of a driver, and used on the driver's module exit routine and
> on one of its sysfs ops:
> 
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> 			 struct device_attribute *attr,
> 			 const char *buf, size_t count)
> {
> 	...
> 	mutex_lock(&foo);
> 	...
> 	mutex_lock(&foo);
> 	...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> 	mutex_lock(&foo);
> 	...
> 	mutex_unlock(&foo);
> }
> module_exit(foo_exit);
> 
> And this can lead to this condition:
> 
> CPU A                              CPU B
>                                    foo_store()
> foo_exit()
>   mutex_lock(&foo)
>                                    mutex_lock(&foo)
>    del_gendisk(some_struct->disk);
>      device_del()
>        device_remove_groups()

Please expand this further, where does device_remove_groups() end up
waiting for that never happens?

> 
> In this situation foo_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked at
> completes which is waiting for a lock already held.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
>  fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
>  fs/kernfs/file.c                       |  6 ++-
>  fs/kernfs/kernfs-internal.h            |  3 +-
>  fs/kernfs/symlink.c                    |  3 +-
>  fs/sysfs/dir.c                         |  2 +-
>  fs/sysfs/file.c                        |  6 ++-
>  fs/sysfs/group.c                       |  3 +-
>  include/linux/kernfs.h                 | 14 ++++---
>  include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
>  kernel/cgroup/cgroup.c                 |  2 +-
>  11 files changed, 105 insertions(+), 34 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b57b3db9a6a7..4edf3b37fd2c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
>  
>  	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> -				  0, rft->kf_ops, rft, NULL, NULL);
> +				  0, rft->kf_ops, rft, NULL, NULL, NULL);
>  	if (IS_ERR(kn))
>  		return PTR_ERR(kn);
>  
> @@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>  
>  	kn = __kernfs_create_file(parent_kn, name, 0444,
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> -				  &kf_mondata_ops, priv, NULL, NULL);
> +				  &kf_mondata_ops, priv, NULL, NULL, NULL);
>  	if (IS_ERR(kn))
>  		return PTR_ERR(kn);
>  
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index ba581429bf7b..e841201fd11b 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -14,6 +14,7 @@
>  #include <linux/slab.h>
>  #include <linux/security.h>
>  #include <linux/hash.h>
> +#include <linux/module.h>
>  
>  #include "kernfs-internal.h"
>  
> @@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
>   */
>  struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
>  {
> +	int v;
> +
>  	if (unlikely(!kn))
>  		return NULL;
>  
>  	if (!atomic_inc_unless_negative(&kn->active))
>  		return NULL;
>  
> +	/*
> +	 * If a module created the kernfs_node, the module cannot possibly be
> +	 * removed if the above atomic_inc_unless_negative() succeeded. So the
> +	 * try_module_get() below is not to protect the lifetime of the module
> +	 * as that is already guaranteed. The try_module_get() below is used
> +	 * to ensure that we don't deadlock in case a kernfs operation and
> +	 * module removal used a shared lock.
> +	 */
> +	if (!try_module_get(kn->owner)) {
> +		v = atomic_dec_return(&kn->active);
> +		if (unlikely(v == KN_DEACTIVATED_BIAS))
> +			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
> +		return NULL;
> +	}

The special casing in here makes me think this isn't happening the right
place. (i.e this looks like an open-coded version of kernfs_put_active())

> +
>  	if (kernfs_lockdep(kn))
>  		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
>  	return kn;
> @@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
>  	if (kernfs_lockdep(kn))
>  		rwsem_release(&kn->dep_map, _RET_IP_);
>  	v = atomic_dec_return(&kn->active);
> +
> +	/*
> +	 * We prevent module exit *until* we know for sure all possible
> +	 * kernfs ops are done.
> +	 */
> +	module_put(kn->owner);
> +
>  	if (likely(v != KN_DEACTIVATED_BIAS))
>  		return;

What I don't understand, however, is what kernfs_get/put_active() is
intending to do -- it looks like it's trying to provide an interruption
point for open kernfs file operations?

This all seems extremely complex for what seems like it should just be a
global "am I being removed?" bool?

Regardless, while I do see the logic of associating the module get/put
with get/put of kernfs "active", why is it not better tied to strictly
kernfs open/close? That would seem to be much simpler and not require
any special handling?

For example, why does this not work?


diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..e44502ac244d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -525,6 +525,9 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
 {
 	struct kernfs_open_node *on, *new_on = NULL;
 
+	if (!try_module_get(kn->owner))
+		return -ENODEV;
+
  retry:
 	mutex_lock(&kernfs_open_file_mutex);
 	spin_lock_irq(&kernfs_open_node_lock);
@@ -592,6 +595,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
 	mutex_unlock(&kernfs_open_file_mutex);
 
 	kfree(on);
+	module_put(kn->owner);
 }
 
 static int kernfs_fop_open(struct inode *inode, struct file *file)
@@ -719,6 +723,7 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
 	kfree(of);
 err_out:
 	kernfs_put_active(kn);
+	module_put(kn->owner);
 	return error;
 }
 


>  
> @@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  					     struct kernfs_node *parent,
>  					     const char *name, umode_t mode,
>  					     kuid_t uid, kgid_t gid,
> -					     unsigned flags)
> +					     unsigned flags,
> +					     struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  	u32 id_highbits;
> @@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  	kn->name = name;
>  	kn->mode = mode;
>  	kn->flags = flags;
> +	kn->owner = owner;
>  
>  	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
>  		struct iattr iattr = {
> @@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
>  				    const char *name, umode_t mode,
>  				    kuid_t uid, kgid_t gid,
> -				    unsigned flags)
> +				    unsigned flags,
> +				    struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  
>  	kn = __kernfs_new_node(kernfs_root(parent), parent,
> -			       name, mode, uid, gid, flags);
> +			       name, mode, uid, gid, flags, owner);
>  	if (kn) {
>  		kernfs_get(parent);
>  		kn->parent = parent;
> @@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>  
>  	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
>  			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> -			       KERNFS_DIR);
> +			       KERNFS_DIR, NULL);
>  	if (!kn) {
>  		idr_destroy(&root->ino_idr);
>  		kfree(root);
> @@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
>   * @gid: gid of the new directory
>   * @priv: opaque data associated with the new directory
>   * @ns: optional namespace tag of the directory
> + * @owner: if set, the module owner of this directory
>   *
>   * Returns the created node on success, ERR_PTR() value on failure.
>   */
>  struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
>  					 const char *name, umode_t mode,
>  					 kuid_t uid, kgid_t gid,
> -					 void *priv, const void *ns)
> +					 void *priv, const void *ns,
> +					 struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  	int rc;
>  
>  	/* allocate */
>  	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
> -			     uid, gid, KERNFS_DIR);
> +			     uid, gid, KERNFS_DIR, owner);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> @@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
>  
>  	/* allocate */
>  	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
> -			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
> +			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 4479c6580333..0e125287e050 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
>   * @priv: private data for the file
>   * @ns: optional namespace tag of the file
>   * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
> + * @owner: if set, the module owner of the file
>   *
>   * Returns the created node on success, ERR_PTR() value on error.
>   */
> @@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
>  					 loff_t size,
>  					 const struct kernfs_ops *ops,
>  					 void *priv, const void *ns,
> -					 struct lock_class_key *key)
> +					 struct lock_class_key *key,
> +					 struct module *owner)
>  {
>  	struct kernfs_node *kn;
>  	unsigned flags;
> @@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
>  	flags = KERNFS_FILE;
>  
>  	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
> -			     uid, gid, flags);
> +			     uid, gid, flags, owner);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 9e3abf597e2d..6d275d661987 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
>  				    const char *name, umode_t mode,
>  				    kuid_t uid, kgid_t gid,
> -				    unsigned flags);
> +				    unsigned flags,
> +				    struct module *owner);
>  
>  /*
>   * file.c
> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index 19a6c71c6ff5..5a053eebee52 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
>  		gid = target->iattr->ia_gid;
>  	}
>  
> -	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
> +	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
> +			     target->owner);
>  	if (!kn)
>  		return ERR_PTR(-ENOMEM);
>  
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index b6b6796e1616..9763c2fde3c7 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
>  	kobject_get_ownership(kobj, &uid, &gid);
>  
>  	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
> -				  kobj, ns);
> +				  kobj, ns, NULL);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
>  			sysfs_warn_dup(parent, kobject_name(kobj));
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 42dcf96881b6..af9e91fd1a92 100644
> --- a/fs/sysfs/file.c
> +++ b/fs/sysfs/file.c
> @@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
>  #endif
>  
>  	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
> -				  PAGE_SIZE, ops, (void *)attr, ns, key);
> +				  PAGE_SIZE, ops, (void *)attr, ns, key,
> +				  attr->owner);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
>  			sysfs_warn_dup(parent, attr->name);
> @@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
>  #endif
>  
>  	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
> -				  battr->size, ops, (void *)attr, ns, key);
> +				  battr->size, ops, (void *)attr, ns, key,
> +				  attr->owner);
>  	if (IS_ERR(kn)) {
>  		if (PTR_ERR(kn) == -EEXIST)
>  			sysfs_warn_dup(parent, attr->name);
> diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
> index eeb0e3099421..372864d1cb54 100644
> --- a/fs/sysfs/group.c
> +++ b/fs/sysfs/group.c
> @@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
>  		} else {
>  			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
>  						  S_IRWXU | S_IRUGO | S_IXUGO,
> -						  uid, gid, kobj, NULL);
> +						  uid, gid, kobj, NULL,
> +						  grp->owner);
>  			if (IS_ERR(kn)) {
>  				if (PTR_ERR(kn) == -EEXIST)
>  					sysfs_warn_dup(kobj->sd, grp->name);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index cd968ee2b503..818b00ebd107 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -161,6 +161,7 @@ struct kernfs_node {
>  	unsigned short		flags;
>  	umode_t			mode;
>  	struct kernfs_iattrs	*iattr;
> +	struct module           *owner;
>  };
>  
>  /*
> @@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
>  struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
>  					 const char *name, umode_t mode,
>  					 kuid_t uid, kgid_t gid,
> -					 void *priv, const void *ns);
> +					 void *priv, const void *ns,
> +					 struct module *owner);
>  struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
>  					    const char *name);
>  struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
> @@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
>  					 loff_t size,
>  					 const struct kernfs_ops *ops,
>  					 void *priv, const void *ns,
> -					 struct lock_class_key *key);
> +					 struct lock_class_key *key,
> +					 struct module *owner);
>  struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
>  				       const char *name,
>  				       struct kernfs_node *target);
> @@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
>  static inline struct kernfs_node *
>  kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
>  		     umode_t mode, kuid_t uid, kgid_t gid,
> -		     void *priv, const void *ns)
> +		     void *priv, const void *ns, struct module *owner)
>  { return ERR_PTR(-ENOSYS); }
>  
>  static inline struct kernfs_node *
>  __kernfs_create_file(struct kernfs_node *parent, const char *name,
>  		     umode_t mode, kuid_t uid, kgid_t gid,
>  		     loff_t size, const struct kernfs_ops *ops,
> -		     void *priv, const void *ns, struct lock_class_key *key)
> +		     void *priv, const void *ns, struct lock_class_key *key,
> +		     struct module *owner)
>  { return ERR_PTR(-ENOSYS); }
>  
>  static inline struct kernfs_node *
> @@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
>  {
>  	return kernfs_create_dir_ns(parent, name, mode,
>  				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> -				    priv, NULL);
> +				    priv, NULL, parent->owner);
>  }
>  
>  static inline int kernfs_remove_by_name(struct kernfs_node *parent,
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index e3f1e8ac1f85..babbabb460dc 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -30,6 +30,7 @@ enum kobj_ns_type;
>  struct attribute {
>  	const char		*name;
>  	umode_t			mode;
> +	struct module           *owner;
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  	bool			ignore_lockdep:1;
>  	struct lock_class_key	*key;
> @@ -80,6 +81,7 @@ do {							\
>   * @attrs:	Pointer to NULL terminated list of attributes.
>   * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
>   *		Either attrs or bin_attrs or both must be provided.
> + * @module:	If set, module responsible for this attribute group
>   */
>  struct attribute_group {
>  	const char		*name;
> @@ -89,6 +91,7 @@ struct attribute_group {
>  						  struct bin_attribute *, int);
>  	struct attribute	**attrs;
>  	struct bin_attribute	**bin_attrs;
> +	struct module           *owner;
>  };
>  
>  /*
> @@ -100,38 +103,52 @@ struct attribute_group {
>  
>  #define __ATTR(_name, _mode, _show, _store) {				\
>  	.attr = {.name = __stringify(_name),				\
> -		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
> +		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),               \
> +		 .owner  = THIS_MODULE,                                 \
> +	},                                                             \
>  	.show	= _show,						\
>  	.store	= _store,						\
>  }
>  
>  #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
>  	.attr = {.name = __stringify(_name),				\
> -		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
> +		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
> +		 .owner  = THIS_MODULE,                                 \
> +	},                                                              \
>  	.show	= _show,						\
>  	.store	= _store,						\
>  }
>  
>  #define __ATTR_RO(_name) {						\
> -	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0444,					\
> +		    .owner  = THIS_MODULE,				\
> +		},                                                     \
>  	.show	= _name##_show,						\
>  }
>  
>  #define __ATTR_RO_MODE(_name, _mode) {					\
>  	.attr	= { .name = __stringify(_name),				\
> -		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
> +		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
> +		    .owner  = THIS_MODULE,				\
> +	},                                                              \
>  	.show	= _name##_show,						\
>  }
>  
>  #define __ATTR_RW_MODE(_name, _mode) {					\
>  	.attr	= { .name = __stringify(_name),				\
> -		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
> +		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
> +		    .owner  = THIS_MODULE,                              \
> +	},								\
>  	.show	= _name##_show,						\
>  	.store	= _name##_store,					\
>  }
>  
>  #define __ATTR_WO(_name) {						\
> -	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0200,					\
> +		    .owner  = THIS_MODULE,				\
> +	},                                                              \
>  	.store	= _name##_store,					\
>  }
>  
> @@ -141,8 +158,11 @@ struct attribute_group {
>  
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
> -	.attr = {.name = __stringify(_name), .mode = _mode,	\
> -			.ignore_lockdep = true },		\
> +	.attr = {.name = __stringify(_name),                    \
> +		 .mode = _mode,					\
> +		 .ignore_lockdep = true,                        \
> +		 .owner  = THIS_MODULE,                         \
> +	},							\
>  	.show		= _show,				\
>  	.store		= _store,				\
>  }
> @@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = {	\
>  #define ATTRIBUTE_GROUPS(_name)					\
>  static const struct attribute_group _name##_group = {		\
>  	.attrs = _name##_attrs,					\
> +	.owner = THIS_MODULE,					\
>  };								\
>  __ATTRIBUTE_GROUPS(_name)
>  
> @@ -199,20 +220,29 @@ struct bin_attribute {
>  
>  /* macros to create static binary attributes easier */
>  #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
> -	.attr = { .name = __stringify(_name), .mode = _mode },		\
> +	.attr = { .name = __stringify(_name),                           \
> +		   .mode = _mode,					\
> +		   .owner = THIS_MODULE,				\
> +	},								\
>  	.read	= _read,						\
>  	.write	= _write,						\
>  	.size	= _size,						\
>  }
>  
>  #define __BIN_ATTR_RO(_name, _size) {					\
> -	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0444,					\
> +		    .owner = THIS_MODULE,				\
> +	},								\
>  	.read	= _name##_read,						\
>  	.size	= _size,						\
>  }
>  
>  #define __BIN_ATTR_WO(_name, _size) {					\
> -	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
> +	.attr	= { .name = __stringify(_name),                         \
> +		    .mode = 0200,					\
> +		    .owner = THIS_MODULE,				\
> +	},								\
>  	.write	= _name##_write,					\
>  	.size	= _size,						\
>  }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9e0390000025..c6b0a28f599c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
>  				  cgroup_file_mode(cft),
>  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
>  				  0, cft->kf_ops, cft,
> -				  NULL, key);
> +				  NULL, key, NULL);
>  	if (IS_ERR(kn))
>  		return PTR_ERR(kn);
>  
> -- 
> 2.30.2
>
Luis Chamberlain Oct. 11, 2021, 9:25 p.m. UTC | #3
On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > When driver sysfs attributes use a lock also used on module removal we
> > can race to deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the
> > driver's sysfs file entry waits for the same lock. While holding the
> > lock the module removal tries to remove the sysfs entries, but these
> > cannot be removed yet as one is waiting for a lock. This won't complete
> > as the lock is already held. Likewise module removal cannot complete,
> > and so we deadlock.
> > 
> > This can now be easily reproducible with our sysfs selftest as follows:
> > 
> > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > 
> > This uses a local driver lock. Test 0028 can also be used, that uses
> > the rtnl_lock():
> > 
> > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > 
> > To fix this we extend the struct kernfs_node with a module reference
> > and use the try_module_get() after kernfs_get_active() is called. As
> > documented in the prior patch, we now know that once kernfs_get_active()
> > is called the module is implicitly guarded to exist and cannot be removed.
> > This is because the module is the one in charge of removing the same
> > sysfs file it created, and removal of sysfs files on module exit will wait
> > until they don't have any active references. By using a try_module_get()
> > after kernfs_get_active() we yield to let module removal trump calls to
> > process a sysfs operation, while also preventing module removal if a sysfs
> > operation is in already progress. This prevents the deadlock.
> > 
> > This deadlock was first reported with the zram driver, however the live
> 
> Looks not see the lock pattern you mentioned in zram driver, can you
> share the related zram code?

I recommend to not look at the zram driver, instead look at the
test_sysfs driver as that abstracts the issue more clearly and uses
two different locks as an example. The point is that if on module
removal *any* lock is used which is *also* used on the sysfs file
created by the module, you can deadlock.

> > And this can lead to this condition:
> > 
> > CPU A                              CPU B
> >                                    foo_store()
> > foo_exit()
> >   mutex_lock(&foo)
> >                                    mutex_lock(&foo)
> >    del_gendisk(some_struct->disk);
> >      device_del()
> >        device_remove_groups()
> 
> I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> look the issue may not be related with removing module directly, right?

No, the reason this can deadlock is that the module exit routine will
patiently wait for the sysfs / kernfs files to be stop being used,
but clearly they cannot if the exit routine took the mutex also used
by the sysfs ops. That is, the special condition here is the removal of
the sysfs files, and the sysfs files using a lock also used on module
exit.

 Luis
Luis Chamberlain Oct. 11, 2021, 10:26 p.m. UTC | #4
On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > A sketch of how this can happen follows, consider foo a local mutex
> > part of a driver, and used on the driver's module exit routine and
> > on one of its sysfs ops:
> > 
> > foo.c:
> > static DEFINE_MUTEX(foo);
> > static ssize_t foo_store(struct device *dev,
> > 			 struct device_attribute *attr,
> > 			 const char *buf, size_t count)
> > {
> > 	...
> > 	mutex_lock(&foo);
> > 	...
> > 	mutex_lock(&foo);
> > 	...
> > }
> > static DEVICE_ATTR_RW(foo);
> > ...
> > void foo_exit(void)
> > {
> > 	mutex_lock(&foo);
> > 	...
> > 	mutex_unlock(&foo);
> > }
> > module_exit(foo_exit);
> > 
> > And this can lead to this condition:
> > 
> > CPU A                              CPU B
> >                                    foo_store()
> > foo_exit()
> >   mutex_lock(&foo)
> >                                    mutex_lock(&foo)
> >    del_gendisk(some_struct->disk);
> >      device_del()
> >        device_remove_groups()
> 
> Please expand this further, where does device_remove_groups() end up
> waiting for that never happens?

Sure. How about:

Furthermore, device_remove_groups() will just go on trying to remove
the sysfs files, which are kernfs entries. The way kernfs deals with
removal is that it will wait until all active references for the files
being removed are done. The active reference is obtained through
kernfs_get_active(). Removal ends up waiting through kernfs_drain()
for the active references to be done, and that only happens if the
kernfs file ops can complete. If these kernfs ops / sysfs files
are waiting for a mutex which taken by the module's exit routine
prior to trying to remove the sysfs files we deadlock.

> > In this situation foo_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked at
> > completes which is waiting for a lock already held.
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 +-
> >  fs/kernfs/dir.c                        | 44 ++++++++++++++++++----
> >  fs/kernfs/file.c                       |  6 ++-
> >  fs/kernfs/kernfs-internal.h            |  3 +-
> >  fs/kernfs/symlink.c                    |  3 +-
> >  fs/sysfs/dir.c                         |  2 +-
> >  fs/sysfs/file.c                        |  6 ++-
> >  fs/sysfs/group.c                       |  3 +-
> >  include/linux/kernfs.h                 | 14 ++++---
> >  include/linux/sysfs.h                  | 52 ++++++++++++++++++++------
> >  kernel/cgroup/cgroup.c                 |  2 +-
> >  11 files changed, 105 insertions(+), 34 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index b57b3db9a6a7..4edf3b37fd2c 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
> >  
> >  	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
> >  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> > -				  0, rft->kf_ops, rft, NULL, NULL);
> > +				  0, rft->kf_ops, rft, NULL, NULL, NULL);
> >  	if (IS_ERR(kn))
> >  		return PTR_ERR(kn);
> >  
> > @@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> >  
> >  	kn = __kernfs_create_file(parent_kn, name, 0444,
> >  				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> > -				  &kf_mondata_ops, priv, NULL, NULL);
> > +				  &kf_mondata_ops, priv, NULL, NULL, NULL);
> >  	if (IS_ERR(kn))
> >  		return PTR_ERR(kn);
> >  
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index ba581429bf7b..e841201fd11b 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/security.h>
> >  #include <linux/hash.h>
> > +#include <linux/module.h>
> >  
> >  #include "kernfs-internal.h"
> >  
> > @@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
> >   */
> >  struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
> >  {
> > +	int v;
> > +
> >  	if (unlikely(!kn))
> >  		return NULL;
> >  
> >  	if (!atomic_inc_unless_negative(&kn->active))
> >  		return NULL;
> >  
> > +	/*
> > +	 * If a module created the kernfs_node, the module cannot possibly be
> > +	 * removed if the above atomic_inc_unless_negative() succeeded. So the
> > +	 * try_module_get() below is not to protect the lifetime of the module
> > +	 * as that is already guaranteed. The try_module_get() below is used
> > +	 * to ensure that we don't deadlock in case a kernfs operation and
> > +	 * module removal used a shared lock.
> > +	 */
> > +	if (!try_module_get(kn->owner)) {
> > +		v = atomic_dec_return(&kn->active);
> > +		if (unlikely(v == KN_DEACTIVATED_BIAS))
> > +			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
> > +		return NULL;
> > +	}
> 
> The special casing in here makes me think this isn't happening the right
> place. (i.e this looks like an open-coded version of kernfs_put_active())

No, well you see, in effect the special care taken in
kernfs_put_active() *is* the right way to inform a waiter that
that the *taken* reference right above *also* is no longer active.

The special casing here is because we took the active reference
before the try_module_get() in the above atomic_inc_unless_negative()
call. Outside callers deal with this through kernfs_put_active().

We are special casing to deal with the deadlock case.

> > +
> >  	if (kernfs_lockdep(kn))
> >  		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
> >  	return kn;
> > @@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
> >  	if (kernfs_lockdep(kn))
> >  		rwsem_release(&kn->dep_map, _RET_IP_);
> >  	v = atomic_dec_return(&kn->active);
> > +
> > +	/*
> > +	 * We prevent module exit *until* we know for sure all possible
> > +	 * kernfs ops are done.
> > +	 */
> > +	module_put(kn->owner);
> > +
> >  	if (likely(v != KN_DEACTIVATED_BIAS))
> >  		return;
> 
> What I don't understand, however, is what kernfs_get/put_active() is
> intending to do -- it looks like it's trying to provide an interruption
> point for open kernfs file operations?

It is essentially ensuring that removal does not happen if any ops
are being used.

> This all seems extremely complex for what seems like it should just be a
> global "am I being removed?" bool?

It used to be worse :) And Tejun has cleaned this up over time. Yes,
perhaps we can improve that more but, given how sensible this code
is I think such improvements should be made separately.

> Regardless, while I do see the logic of associating the module get/put
> with get/put of kernfs "active", why is it not better tied to strictly
> kernfs open/close? 

It's not just files, consider kernfs_iop_mkdir() which also calls
kernfs_get_active(). How about kernfs_fop_mmap()? And so, the common
denominator is actually kernfs_get_active().

> That would seem to be much simpler and not require
> any special handling?

Yes true, but it I think this would still leave open some other possible
deadlocks.

> For example, why does this not work?

It does for the write case for sure, but I haven't written tests for the
other odd cases, but suspect that would deadlock as well.

 Luis
Ming Lei Oct. 12, 2021, 12:20 a.m. UTC | #5
On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > When driver sysfs attributes use a lock also used on module removal we
> > > can race to deadlock. This happens when for instance a sysfs file on
> > > a driver is used, then at the same time we have module removal call
> > > trigger. The module removal call code holds a lock, and then the
> > > driver's sysfs file entry waits for the same lock. While holding the
> > > lock the module removal tries to remove the sysfs entries, but these
> > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > as the lock is already held. Likewise module removal cannot complete,
> > > and so we deadlock.
> > > 
> > > This can now be easily reproducible with our sysfs selftest as follows:
> > > 
> > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > 
> > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > the rtnl_lock():
> > > 
> > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > 
> > > To fix this we extend the struct kernfs_node with a module reference
> > > and use the try_module_get() after kernfs_get_active() is called. As
> > > documented in the prior patch, we now know that once kernfs_get_active()
> > > is called the module is implicitly guarded to exist and cannot be removed.
> > > This is because the module is the one in charge of removing the same
> > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > until they don't have any active references. By using a try_module_get()
> > > after kernfs_get_active() we yield to let module removal trump calls to
> > > process a sysfs operation, while also preventing module removal if a sysfs
> > > operation is in already progress. This prevents the deadlock.
> > > 
> > > This deadlock was first reported with the zram driver, however the live
> > 
> > Looks not see the lock pattern you mentioned in zram driver, can you
> > share the related zram code?
> 
> I recommend to not look at the zram driver, instead look at the
> test_sysfs driver as that abstracts the issue more clearly and uses

Looks test_sysfs isn't in linus tree, where can I find it? Also please
update your commit log about this wrong info if it can't be applied on
zram.

> two different locks as an example. The point is that if on module
> removal *any* lock is used which is *also* used on the sysfs file
> created by the module, you can deadlock.
> 
> > > And this can lead to this condition:
> > > 
> > > CPU A                              CPU B
> > >                                    foo_store()
> > > foo_exit()
> > >   mutex_lock(&foo)
> > >                                    mutex_lock(&foo)
> > >    del_gendisk(some_struct->disk);
> > >      device_del()
> > >        device_remove_groups()
> > 
> > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > look the issue may not be related with removing module directly, right?
> 
> No, the reason this can deadlock is that the module exit routine will
> patiently wait for the sysfs / kernfs files to be stop being used,

Can you share the code which waits for the sysfs / kernfs files to be
stop being used? And why does it make a difference in case of being
called from module_exit()?



Thanks,
Ming
Luis Chamberlain Oct. 12, 2021, 9:18 p.m. UTC | #6
On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > When driver sysfs attributes use a lock also used on module removal we
> > > > can race to deadlock. This happens when for instance a sysfs file on
> > > > a driver is used, then at the same time we have module removal call
> > > > trigger. The module removal call code holds a lock, and then the
> > > > driver's sysfs file entry waits for the same lock. While holding the
> > > > lock the module removal tries to remove the sysfs entries, but these
> > > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > > as the lock is already held. Likewise module removal cannot complete,
> > > > and so we deadlock.
> > > > 
> > > > This can now be easily reproducible with our sysfs selftest as follows:
> > > > 
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > > 
> > > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > > the rtnl_lock():
> > > > 
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > > 
> > > > To fix this we extend the struct kernfs_node with a module reference
> > > > and use the try_module_get() after kernfs_get_active() is called. As
> > > > documented in the prior patch, we now know that once kernfs_get_active()
> > > > is called the module is implicitly guarded to exist and cannot be removed.
> > > > This is because the module is the one in charge of removing the same
> > > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > > until they don't have any active references. By using a try_module_get()
> > > > after kernfs_get_active() we yield to let module removal trump calls to
> > > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > operation is in already progress. This prevents the deadlock.
> > > > 
> > > > This deadlock was first reported with the zram driver, however the live
> > > 
> > > Looks not see the lock pattern you mentioned in zram driver, can you
> > > share the related zram code?
> > 
> > I recommend to not look at the zram driver, instead look at the
> > test_sysfs driver as that abstracts the issue more clearly and uses
> 
> Looks test_sysfs isn't in linus tree, where can I find it?

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix

> Also please
> update your commit log about this wrong info if it can't be applied on
> zram.

It does apply to zram, it is just that I have other fixes for zram in
my pipeline which will change the zram driver further, and so what makes
more sense is to abstract the issue into a selftest driver to
demonstrate the issue more clearly.

To reproduce the deadlock revert the patch in this thread and then run
either of these two tests as root:

./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028

You will need to enable the test_sysfs driver.

> > two different locks as an example. The point is that if on module
> > removal *any* lock is used which is *also* used on the sysfs file
> > created by the module, you can deadlock.
> > 
> > > > And this can lead to this condition:
> > > > 
> > > > CPU A                              CPU B
> > > >                                    foo_store()
> > > > foo_exit()
> > > >   mutex_lock(&foo)
> > > >                                    mutex_lock(&foo)
> > > >    del_gendisk(some_struct->disk);
> > > >      device_del()
> > > >        device_remove_groups()
> > > 
> > > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > look the issue may not be related with removing module directly, right?
> > 
> > No, the reason this can deadlock is that the module exit routine will
> > patiently wait for the sysfs / kernfs files to be stop being used,
> 
> Can you share the code which waits for the sysfs / kernfs files to be
> stop being used?

How about a call trace of the two tasks which deadlock, here is one of
running test 0027:

kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
than 120 seconds.
[  363.878341]       Tainted: G            E
5.15.0-rc3-next-20210927+ #83
[  363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  363.882255] task:sysfs.sh        state:D stack:    0 pid: 1271 ppid:
1 flags:0x00000004
[  363.882894] Call Trace:
[  363.883091]  <TASK>
[  363.883259]  __schedule+0x2fd/0x990
[  363.883551]  schedule+0x43/0xe0
[  363.883800]  schedule_preempt_disabled+0x14/0x20
[  363.884160]  __mutex_lock.constprop.0+0x249/0x470
[  363.884524]  test_dev_x_store+0xa5/0xc0 [test_sysfs]
[  363.884915]  kernfs_fop_write_iter+0x177/0x220
[  363.885257]  new_sync_write+0x11c/0x1b0
[  363.885556]  vfs_write+0x20d/0x2a0
[  363.885821]  ksys_write+0x5f/0xe0
[  363.886081]  do_syscall_64+0x38/0xc0
[  363.886359]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  363.886748] RIP: 0033:0x7fee00f8bf33
[  363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
[  363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
[  363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
[  363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
[  363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
[  363.890513]  </TASK>
[  363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
[  363.891185]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
[  363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.892353] task:modprobe        state:D stack:    0 pid: 1276 ppid: 1 flags:0x00004000
[  363.892955] Call Trace:
[  363.893141]  <TASK>
[  363.893457]  __schedule+0x2fd/0x990
[  363.893865]  schedule+0x43/0xe0
[  363.894246]  __kernfs_remove.part.0+0x21e/0x2a0
[  363.894704]  ? do_wait_intr_irq+0xa0/0xa0
[  363.895142]  kernfs_remove_by_name_ns+0x50/0x90
[  363.895632]  remove_files+0x2b/0x60
[  363.896035]  sysfs_remove_group+0x38/0x80
[  363.896470]  sysfs_remove_groups+0x29/0x40
[  363.896912]  device_remove_attrs+0x5b/0x90
[  363.897352]  device_del+0x183/0x400
[  363.897758]  unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
[  363.898317]  test_sysfs_exit+0x45/0xfb0 [test_sysfs]
[  363.898833]  __do_sys_delete_module+0x18d/0x2a0
[  363.899329]  ? fpregs_assert_state_consistent+0x1e/0x40
[  363.899868]  ? exit_to_user_mode_prepare+0x3a/0x180
[  363.900390]  do_syscall_64+0x38/0xc0
[  363.900810]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  363.901330] RIP: 0033:0x7f21915c57d7
[  363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
[  363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
[  363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
[  363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
[  363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
[  363.905797]  </TASK>


And gdb:

(gdb) l *(__kernfs_remove+0x21e)
0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
471                     if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
472                             lock_contended(&kn->dep_map, _RET_IP_);
473             }
474
475             /* but everyone should wait for draining */
476             wait_event(root->deactivate_waitq,
477                        atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
478
479             if (kernfs_lockdep(kn)) {
480                     lock_acquired(&kn->dep_map, _RET_IP_);

(gdb) l *(kernfs_remove_by_name_ns+0x50)
0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
1529
1530            kn = kernfs_find_ns(parent, name, ns);
1531            if (kn)
1532                    __kernfs_remove(kn);
1533
1534            up_write(&kernfs_rwsem);
1535
1536            if (kn)
1537                    return 0;
1538            else

The same happens for test 0028 except instead of a mutex
lock an rtnl_lock() is used.

Would this be better for the commit log?

> And why does it make a difference in case of being
> called from module_exit()?

Well because that is where we remove the sysfs files. *If*
a developer happens to use a lock on a sysfs op but it is
also used on module exit, this deadlock is bound to happen.

  Luis
Ming Lei Oct. 13, 2021, 1:07 a.m. UTC | #7
On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> > On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > > When driver sysfs attributes use a lock also used on module removal we
> > > > > can race to deadlock. This happens when for instance a sysfs file on
> > > > > a driver is used, then at the same time we have module removal call
> > > > > trigger. The module removal call code holds a lock, and then the
> > > > > driver's sysfs file entry waits for the same lock. While holding the
> > > > > lock the module removal tries to remove the sysfs entries, but these
> > > > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > > > as the lock is already held. Likewise module removal cannot complete,
> > > > > and so we deadlock.
> > > > > 
> > > > > This can now be easily reproducible with our sysfs selftest as follows:
> > > > > 
> > > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > > > 
> > > > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > > > the rtnl_lock():
> > > > > 
> > > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > > > 
> > > > > To fix this we extend the struct kernfs_node with a module reference
> > > > > and use the try_module_get() after kernfs_get_active() is called. As
> > > > > documented in the prior patch, we now know that once kernfs_get_active()
> > > > > is called the module is implicitly guarded to exist and cannot be removed.
> > > > > This is because the module is the one in charge of removing the same
> > > > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > > > until they don't have any active references. By using a try_module_get()
> > > > > after kernfs_get_active() we yield to let module removal trump calls to
> > > > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > > operation is in already progress. This prevents the deadlock.
> > > > > 
> > > > > This deadlock was first reported with the zram driver, however the live
> > > > 
> > > > Looks not see the lock pattern you mentioned in zram driver, can you
> > > > share the related zram code?
> > > 
> > > I recommend to not look at the zram driver, instead look at the
> > > test_sysfs driver as that abstracts the issue more clearly and uses
> > 
> > Looks test_sysfs isn't in linus tree, where can I find it?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> 
> > Also please
> > update your commit log about this wrong info if it can't be applied on
> > zram.
> 
> It does apply to zram, it is just that I have other fixes for zram in
> my pipeline which will change the zram driver further, and so what makes
> more sense is to abstract the issue into a selftest driver to
> demonstrate the issue more clearly.
> 
> To reproduce the deadlock revert the patch in this thread and then run
> either of these two tests as root:
> 
> ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> 
> You will need to enable the test_sysfs driver.
> 
> > > two different locks as an example. The point is that if on module
> > > removal *any* lock is used which is *also* used on the sysfs file
> > > created by the module, you can deadlock.
> > > 
> > > > > And this can lead to this condition:
> > > > > 
> > > > > CPU A                              CPU B
> > > > >                                    foo_store()
> > > > > foo_exit()
> > > > >   mutex_lock(&foo)
> > > > >                                    mutex_lock(&foo)
> > > > >    del_gendisk(some_struct->disk);
> > > > >      device_del()
> > > > >        device_remove_groups()
> > > > 
> > > > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > > look the issue may not be related with removing module directly, right?
> > > 
> > > No, the reason this can deadlock is that the module exit routine will
> > > patiently wait for the sysfs / kernfs files to be stop being used,
> > 
> > Can you share the code which waits for the sysfs / kernfs files to be
> > stop being used?
> 
> How about a call trace of the two tasks which deadlock, here is one of
> running test 0027:
> 
> kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> than 120 seconds.
> [  363.878341]       Tainted: G            E
> 5.15.0-rc3-next-20210927+ #83
> [  363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  363.882255] task:sysfs.sh        state:D stack:    0 pid: 1271 ppid:
> 1 flags:0x00000004
> [  363.882894] Call Trace:
> [  363.883091]  <TASK>
> [  363.883259]  __schedule+0x2fd/0x990
> [  363.883551]  schedule+0x43/0xe0
> [  363.883800]  schedule_preempt_disabled+0x14/0x20
> [  363.884160]  __mutex_lock.constprop.0+0x249/0x470
> [  363.884524]  test_dev_x_store+0xa5/0xc0 [test_sysfs]
> [  363.884915]  kernfs_fop_write_iter+0x177/0x220
> [  363.885257]  new_sync_write+0x11c/0x1b0
> [  363.885556]  vfs_write+0x20d/0x2a0
> [  363.885821]  ksys_write+0x5f/0xe0
> [  363.886081]  do_syscall_64+0x38/0xc0
> [  363.886359]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  363.886748] RIP: 0033:0x7fee00f8bf33
> [  363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
> [  363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
> [  363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
> [  363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
> [  363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
> [  363.890513]  </TASK>
> [  363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
> [  363.891185]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
> [  363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.892353] task:modprobe        state:D stack:    0 pid: 1276 ppid: 1 flags:0x00004000
> [  363.892955] Call Trace:
> [  363.893141]  <TASK>
> [  363.893457]  __schedule+0x2fd/0x990
> [  363.893865]  schedule+0x43/0xe0
> [  363.894246]  __kernfs_remove.part.0+0x21e/0x2a0
> [  363.894704]  ? do_wait_intr_irq+0xa0/0xa0
> [  363.895142]  kernfs_remove_by_name_ns+0x50/0x90
> [  363.895632]  remove_files+0x2b/0x60
> [  363.896035]  sysfs_remove_group+0x38/0x80
> [  363.896470]  sysfs_remove_groups+0x29/0x40
> [  363.896912]  device_remove_attrs+0x5b/0x90
> [  363.897352]  device_del+0x183/0x400
> [  363.897758]  unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
> [  363.898317]  test_sysfs_exit+0x45/0xfb0 [test_sysfs]
> [  363.898833]  __do_sys_delete_module+0x18d/0x2a0
> [  363.899329]  ? fpregs_assert_state_consistent+0x1e/0x40
> [  363.899868]  ? exit_to_user_mode_prepare+0x3a/0x180
> [  363.900390]  do_syscall_64+0x38/0xc0
> [  363.900810]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  363.901330] RIP: 0033:0x7f21915c57d7
> [  363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
> [  363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
> [  363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
> [  363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
> [  363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
> [  363.905797]  </TASK>

That doesn't show the deadlock is related with module_exit().

> 
> 
> And gdb:
> 
> (gdb) l *(__kernfs_remove+0x21e)
> 0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
> 471                     if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
> 472                             lock_contended(&kn->dep_map, _RET_IP_);
> 473             }
> 474
> 475             /* but everyone should wait for draining */
> 476             wait_event(root->deactivate_waitq,
> 477                        atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
> 478
> 479             if (kernfs_lockdep(kn)) {
> 480                     lock_acquired(&kn->dep_map, _RET_IP_);
> 
> (gdb) l *(kernfs_remove_by_name_ns+0x50)
> 0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
> 1529
> 1530            kn = kernfs_find_ns(parent, name, ns);
> 1531            if (kn)
> 1532                    __kernfs_remove(kn);
> 1533
> 1534            up_write(&kernfs_rwsem);
> 1535
> 1536            if (kn)
> 1537                    return 0;
> 1538            else
> 
> The same happens for test 0028 except instead of a mutex
> lock an rtnl_lock() is used.
> 
> Would this be better for the commit log?
> 
> > And why does it make a difference in case of being
> > called from module_exit()?
> 
> Well because that is where we remove the sysfs files. *If*
> a developer happens to use a lock on a sysfs op but it is
> also used on module exit, this deadlock is bound to happen.

It is clearly one AA deadlock, what I meant was that it isn't related with
module exit cause lock & device_del() isn't always done in module exit, so
I doubt your fix with grabbing module refcnt is good or generic enough.

Except for your cooked test_sys module, how many real drivers do suffer the
problem? What are they? Why can't we fix the exact driver?


Thanks,
Ming
Luis Chamberlain Oct. 13, 2021, 12:35 p.m. UTC | #8
On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > Looks test_sysfs isn't in linus tree, where can I find it?
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > 
> > To reproduce the deadlock revert the patch in this thread and then run
> > either of these two tests as root:
> > 
> > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > 
> > You will need to enable the test_sysfs driver.
> > > Can you share the code which waits for the sysfs / kernfs files to be
> > > stop being used?
> > 
> > How about a call trace of the two tasks which deadlock, here is one of
> > running test 0027:
> > 
> > kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> > than 120 seconds.

<-- snip -->

> That doesn't show the deadlock is related with module_exit().

Not directly no.

> It is clearly one AA deadlock, what I meant was that it isn't related with
> module exit cause lock & device_del() isn't always done in module exit, so
> I doubt your fix with grabbing module refcnt is good or generic enough.

A device_del() *can* happen in other areas other than module exit sure,
but the issue is if a shared lock is used *before* device_del() and also
used on a sysfs op. Typically this can happen on module exit, and the
other common use case in my experience is on sysfs ops, such is the case
with the zram driver. Both cases are covered then by this fix.

If there are other areas, that is still driver specific, but of the
things we *can* generalize, definitely module exit is a common path.

> Except for your cooked test_sys module, how many real drivers do suffer the
> problem? What are they?

I only really seriously considered trying to generalize this after it
was hinted to me live patching was also affected, and so clearly
something generic was desirable.

There may be other drivers for sure, but a hunt for that with semantics
would require a bit complex coccinelle patch with iteration support.

> Why can't we fix the exact driver?

You can try, the way the lock is used in zram is correct, specially
after my other fix in this series which addresses another unrelated bug
with cpu hotplug multistate support. So we then can proceed to either
take the position to say: "Thou shalt not use a shared lock on module
exit and a sysfs op" and try to fix all places, or we generalize a fix
for this. A generic fix seems more desirable.

  Luis
Luis Chamberlain Oct. 13, 2021, 12:41 p.m. UTC | #9
On Mon, Oct 11, 2021 at 03:26:02PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
> > For example, why does this not work?
> 
> It does for the write case for sure,

I mispoke, just for the record, the changes you mentioned actually don't
suffice for the test cases in question for test_sysfs, the deadlock
still occurs with those changes. At first I thought it did but I had failed
to remove my own fix first on fs/kernfs/dir.c. After removing that and
just trying the proposed changes I confirm it does not fix the deadlock.

  Luis
Ming Lei Oct. 13, 2021, 3:04 p.m. UTC | #10
On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
> On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> > On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > > Looks test_sysfs isn't in linus tree, where can I find it?
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > > 
> > > To reproduce the deadlock revert the patch in this thread and then run
> > > either of these two tests as root:
> > > 
> > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > > 
> > > You will need to enable the test_sysfs driver.
> > > > Can you share the code which waits for the sysfs / kernfs files to be
> > > > stop being used?
> > > 
> > > How about a call trace of the two tasks which deadlock, here is one of
> > > running test 0027:
> > > 
> > > kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> > > than 120 seconds.
> 
> <-- snip -->
> 
> > That doesn't show the deadlock is related with module_exit().
> 
> Not directly no.

Then the patch title of 'sysfs: fix deadlock race with module removal'
is wrong.

> 
> > It is clearly one AA deadlock, what I meant was that it isn't related with
> > module exit cause lock & device_del() isn't always done in module exit, so
> > I doubt your fix with grabbing module refcnt is good or generic enough.
> 
> A device_del() *can* happen in other areas other than module exit sure,
> but the issue is if a shared lock is used *before* device_del() and also
> used on a sysfs op. Typically this can happen on module exit, and the
> other common use case in my experience is on sysfs ops, such is the case
> with the zram driver. Both cases are covered then by this fix.

Again, can you share the related zram code about the issue? In
zram_drv.c of linus or next tree, I don't see any lock is held before
calling del_gendisk().

> 
> If there are other areas, that is still driver specific, but of the
> things we *can* generalize, definitely module exit is a common path.
> 
> > Except for your cooked test_sys module, how many real drivers do suffer the
> > problem? What are they?
> 
> I only really seriously considered trying to generalize this after it

IMO your generalization isn't good or correct because this kind of issue
is _not_ related with module exit at all. What matters is just that one lock is
held before calling device_del(), meantime the same lock is required
in the device's attribute show/store function().

There are many cases in which we call device_del() not from module_exit(),
such as scsi scan, scsi sysfs store(), or even handling event from
device side, nvme error handling, usb hotplug, ...

> was hinted to me live patching was also affected, and so clearly
> something generic was desirable.

It might be just the only two drivers(zram and live patch) with this bug, and
it is one simply AA bug in driver. Not mention I don't see such usage in
zram_drv.c.

> 
> There may be other drivers for sure, but a hunt for that with semantics
> would require a bit complex coccinelle patch with iteration support.
> 
> > Why can't we fix the exact driver?
> 
> You can try, the way the lock is used in zram is correct, specially

What is the lock in zram? Again can you share the related functions?

> after my other fix in this series which addresses another unrelated bug
> with cpu hotplug multistate support. So we then can proceed to either
> take the position to say: "Thou shalt not use a shared lock on module
> exit and a sysfs op" and try to fix all places, or we generalize a fix
> for this. A generic fix seems more desirable.

What matters is that the lock is held before calling device_del()
instead of being held in module_exit().



Thanks,
Ming
Luis Chamberlain Oct. 13, 2021, 9:16 p.m. UTC | #11
On Wed, Oct 13, 2021 at 11:04:07PM +0800, Ming Lei wrote:
> On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
> > On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> > > On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > > > Looks test_sysfs isn't in linus tree, where can I find it?
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > > > 
> > > > To reproduce the deadlock revert the patch in this thread and then run
> > > > either of these two tests as root:
> > > > 
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > > > 
> > > > You will need to enable the test_sysfs driver.
> > > > > Can you share the code which waits for the sysfs / kernfs files to be
> > > > > stop being used?
> > > > 
> > > > How about a call trace of the two tasks which deadlock, here is one of
> > > > running test 0027:
> > > > 
> > > > kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
> > > > than 120 seconds.
> > 
> > <-- snip -->
> > 
> > > That doesn't show the deadlock is related with module_exit().
> > 
> > Not directly no.
> 
> Then the patch title of 'sysfs: fix deadlock race with module removal'
> is wrong.

Well that is what it does though. The scope of the issue you are raising
is beyond module removal, but I do agree such races can exist outside of
module removal.

> > > It is clearly one AA deadlock, what I meant was that it isn't related with
> > > module exit cause lock & device_del() isn't always done in module exit, so
> > > I doubt your fix with grabbing module refcnt is good or generic enough.
> > 
> > A device_del() *can* happen in other areas other than module exit sure,
> > but the issue is if a shared lock is used *before* device_del() and also
> > used on a sysfs op. Typically this can happen on module exit, and the
> > other common use case in my experience is on sysfs ops, such is the case
> > with the zram driver. Both cases are covered then by this fix.
> 
> Again, can you share the related zram code about the issue? In
> zram_drv.c of linus or next tree, I don't see any lock is held before
> calling del_gendisk().

There is another bug with CPU hotplug multistate support in the zram
driver which a patch in this series fixes, refer to the patch titled
"zram: fix crashes with cpu hotplug multistate". In zram's case we need
to contend a generic lock on certain sysfs attributes due to the way CPU
hotplug is used.

If we tried to generalize this on the block layer the closest we get is
the disk->fops->owner, however zram is an example driver where the
disk->fops is actually be even changed *after* module load, and so the
original disk->fops->owner can be dynamic. In zram's case the
fops->owner is the same, however we have no semantics to ensure this is
the case for all block drivers.

In the case for live patching, refer to the use of klp_mutex. The way
that was solved there was a combination of completions and deferred
works to solve it, so that all kobject_put calls are outside of the
critical sections, refer to commit 3ec24776bfd0 ("livepatch:
allow removal of a disabled patch").

And so it was encouraged a generic solution be sought after.

> > If there are other areas, that is still driver specific, but of the
> > things we *can* generalize, definitely module exit is a common path.
> > 
> > > Except for your cooked test_sys module, how many real drivers do suffer the
> > > problem? What are they?
> > 
> > I only really seriously considered trying to generalize this after it
> 
> IMO your generalization isn't good or correct because this kind of issue
> is _not_ related with module exit at all. What matters is just that one lock is
> held before calling device_del(), meantime the same lock is required
> in the device's attribute show/store function().

Your point that a race for a deadlock still can exist beyond module
removal is valid but unfortunately there are no possible semantics I can
see to fix that generically at this time.

> There are many cases in which we call device_del() not from module_exit(),
> such as scsi scan, scsi sysfs store(), or even handling event from
> device side, nvme error handling, usb hotplug, ...

These are really good points.

> > was hinted to me live patching was also affected, and so clearly
> > something generic was desirable.
> 
> It might be just the only two drivers(zram and live patch) with this bug, and
> it is one simply AA bug in driver. Not mention I don't see such usage in
> zram_drv.c.

Well... given what you say above about other uses cases other than
module removal which can remove sysfs files and having them be used,
the possibilities of this deadlock existing elsewhere should increase,
not decrease.

> > There may be other drivers for sure, but a hunt for that with semantics
> > would require a bit complex coccinelle patch with iteration support.
> > 
> > > Why can't we fix the exact driver?
> > 
> > You can try, the way the lock is used in zram is correct, specially
> 
> What is the lock in zram? Again can you share the related functions?

If you git checked out the tree I mentioned try looking at the code
there with the fix for CPU hotplug multistate in mind.

> > after my other fix in this series which addresses another unrelated bug
> > with cpu hotplug multistate support. So we then can proceed to either
> > take the position to say: "Thou shalt not use a shared lock on module
> > exit and a sysfs op" and try to fix all places, or we generalize a fix
> > for this. A generic fix seems more desirable.
> 
> What matters is that the lock is held before calling device_del()
> instead of being held in module_exit().

I agree the possibilities can include more than just module exit.
Unfortunately I can't see a way to generalize this further. I tried,
see below, and this moves the ideas from a module to the kobject, but
even with that, it does not get us any closer to fixing this
generically. The reason a fix works for module removal is the
try_module_get() call when getting the kernfs active reference
will trump the module exit call completely, and so we *do* prevent
the context which will issue the lock in this case if a sysfs
operation is in progress.

Outside of that call sequence I am afraid we'd need separate solutions
or side with the 'though shall not use a shared lock on a sysfs op
and when issuing a device_del(), other than module exit'.

Below is an attempt to generalize this further, but it does not work,
let me know if you have further ideas.

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 
 	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				  0, rft->kf_ops, rft, NULL, NULL);
+				  0, rft->kf_ops, rft, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
-				  &kf_mondata_ops, priv, NULL, NULL);
+				  &kf_mondata_ops, priv, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 7758223f040c..38f07072ab44 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3507,6 +3507,7 @@ bool kill_device(struct device *dev)
 	if (dev->p->dead)
 		return false;
 	dev->p->dead = true;
+	kobject_set_being_removed(&dev->kobj);
 	return true;
 }
 EXPORT_SYMBOL_GPL(kill_device);
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..7d14f6b2c12d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/kobject.h>
 
 #include "kernfs-internal.h"
 
@@ -414,15 +415,38 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a kobject created the kernfs_node, the kobject cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. But we
+	 * need to inspect if its on its way out to ensure that we don't
+	 * deadlock in case a kernfs operation and the code responsible for
+	 * the kobject removal used a shared lock.
+	 */
+	if (kn->kobj) {
+		if (WARN_ON(!kobject_get_unless_zero(kn->kobj))) {
+			goto fail;
+		} else if (kobject_being_removed(kn->kobj)) {
+			kobject_put(kn->kobj);
+			goto fail;
+		}
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
+fail:
+	v = atomic_dec_return(&kn->active);
+	if (unlikely(v == KN_DEACTIVATED_BIAS))
+		wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+	return NULL;
 }
 
 /**
@@ -442,6 +466,7 @@ void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+	kobject_put(kn->kobj);
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
 					     kuid_t uid, kgid_t gid,
-					     unsigned flags)
+					     unsigned flags,
+					     struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 	u32 id_highbits;
@@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->name = name;
 	kn->mode = mode;
 	kn->flags = flags;
+	kn->kobj = kobj;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
 		struct iattr iattr = {
@@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags)
+				    unsigned flags,
+				    struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, uid, gid, flags, kobj);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-			       KERNFS_DIR);
+			       KERNFS_DIR, NULL);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
 		kfree(root);
@@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
  * @ns: optional namespace tag of the directory
+ * @kobj: if set, the kobject responsible for this directory
  *
  * Returns the created node on success, ERR_PTR() value on failure.
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns)
+					 void *priv, const void *ns,
+					 struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 	int rc;
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     uid, gid, KERNFS_DIR, kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1044,8 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR,
+			     parent->kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..1b02f3e69c81 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
  * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @kobj: if set, the kobject responsible for the file
  *
  * Returns the created node on success, ERR_PTR() value on error.
  */
@@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct kobject *kobj)
 {
 	struct kernfs_node *kn;
 	unsigned flags;
@@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     uid, gid, flags, kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..44983720d292 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags);
+				    unsigned flags,
+				    struct kobject *kobj);
 
 /*
  * file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..c877de06e53a 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+			     target->kobj);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9cc159e9fb55 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 	kobject_get_ownership(kobj, &uid, &gid);
 
 	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+				  kobj, ns, kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..e1a3315dba35 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,7 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+				  PAGE_SIZE, ops, (void *)attr, ns, key, kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -309,6 +309,7 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 	struct lock_class_key *key = NULL;
 	const struct kernfs_ops *ops;
 	struct kernfs_node *kn;
+	struct kobject *kobj = parent->priv;
 
 	if (battr->mmap)
 		ops = &sysfs_bin_kfops_mmap;
@@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+				  battr->size, ops, (void *)attr, ns, key,
+				  kobj);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..36022fe2b21d 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
-						  uid, gid, kobj, NULL);
+						  uid, gid, kobj, NULL,
+						  kobj);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
 					sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..38155414e6e5 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+	struct kobject		*kobj;
 };
 
 /*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns);
+					 void *priv, const void *ns,
+					 struct kobject *kobj);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key);
+					 struct lock_class_key *key,
+					 struct kobject *kobj);
 struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 				       const char *name,
 				       struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
-		     void *priv, const void *ns)
+		     void *priv, const void *ns, struct kobject *kobj)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
-		     void *priv, const void *ns, struct lock_class_key *key)
+		     void *priv, const void *ns, struct lock_class_key *key,
+		     struct kobject *kobj)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 {
 	return kernfs_create_dir_ns(parent, name, mode,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				    priv, NULL);
+				    priv, NULL, parent->kobj);
 }
 
 static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index efd56f990a46..cb26ebeb7cf1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -77,6 +77,7 @@ struct kobject {
 	unsigned int state_add_uevent_sent:1;
 	unsigned int state_remove_uevent_sent:1;
 	unsigned int uevent_suppress:1;
+	unsigned int being_removed:1;
 };
 
 extern __printf(2, 3)
@@ -117,6 +118,15 @@ extern void kobject_get_ownership(struct kobject *kobj,
 				  kuid_t *uid, kgid_t *gid);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+static inline bool kobject_being_removed(const struct kobject *kobj)
+{
+	if (!kobj)
+		return false;
+	return !!kobj->being_removed;
+}
+
+void kobject_set_being_removed(struct kobject *kobj);
+
 /**
  * kobject_has_children - Returns whether a kobject has children.
  * @kobj: the object to test
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
-				  NULL, key);
+				  NULL, key, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/lib/kobject.c b/lib/kobject.c
index 4a56f519139d..ef89bf2ac218 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -221,6 +221,12 @@ static void kobject_init_internal(struct kobject *kobj)
 	kobj->state_initialized = 1;
 }
 
+void kobject_set_being_removed(struct kobject *kobj)
+{
+	if (!kobj)
+		return;
+	kobj->being_removed = 1;
+}
 
 static int kobject_add_internal(struct kobject *kobj)
 {
diff mbox series

Patch

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@  static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
 
 	kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				  0, rft->kf_ops, rft, NULL, NULL);
+				  0, rft->kf_ops, rft, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
@@ -2482,7 +2482,7 @@  static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
 
 	kn = __kernfs_create_file(parent_kn, name, 0444,
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
-				  &kf_mondata_ops, priv, NULL, NULL);
+				  &kf_mondata_ops, priv, NULL, NULL, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);
 
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..e841201fd11b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@ 
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/hash.h>
+#include <linux/module.h>
 
 #include "kernfs-internal.h"
 
@@ -414,12 +415,29 @@  static bool kernfs_unlink_sibling(struct kernfs_node *kn)
  */
 struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
 {
+	int v;
+
 	if (unlikely(!kn))
 		return NULL;
 
 	if (!atomic_inc_unless_negative(&kn->active))
 		return NULL;
 
+	/*
+	 * If a module created the kernfs_node, the module cannot possibly be
+	 * removed if the above atomic_inc_unless_negative() succeeded. So the
+	 * try_module_get() below is not to protect the lifetime of the module
+	 * as that is already guaranteed. The try_module_get() below is used
+	 * to ensure that we don't deadlock in case a kernfs operation and
+	 * module removal used a shared lock.
+	 */
+	if (!try_module_get(kn->owner)) {
+		v = atomic_dec_return(&kn->active);
+		if (unlikely(v == KN_DEACTIVATED_BIAS))
+			wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+		return NULL;
+	}
+
 	if (kernfs_lockdep(kn))
 		rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
 	return kn;
@@ -442,6 +460,13 @@  void kernfs_put_active(struct kernfs_node *kn)
 	if (kernfs_lockdep(kn))
 		rwsem_release(&kn->dep_map, _RET_IP_);
 	v = atomic_dec_return(&kn->active);
+
+	/*
+	 * We prevent module exit *until* we know for sure all possible
+	 * kernfs ops are done.
+	 */
+	module_put(kn->owner);
+
 	if (likely(v != KN_DEACTIVATED_BIAS))
 		return;
 
@@ -572,7 +597,8 @@  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 					     struct kernfs_node *parent,
 					     const char *name, umode_t mode,
 					     kuid_t uid, kgid_t gid,
-					     unsigned flags)
+					     unsigned flags,
+					     struct module *owner)
 {
 	struct kernfs_node *kn;
 	u32 id_highbits;
@@ -607,6 +633,7 @@  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	kn->name = name;
 	kn->mode = mode;
 	kn->flags = flags;
+	kn->owner = owner;
 
 	if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
 		struct iattr iattr = {
@@ -640,12 +667,13 @@  static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags)
+				    unsigned flags,
+				    struct module *owner)
 {
 	struct kernfs_node *kn;
 
 	kn = __kernfs_new_node(kernfs_root(parent), parent,
-			       name, mode, uid, gid, flags);
+			       name, mode, uid, gid, flags, owner);
 	if (kn) {
 		kernfs_get(parent);
 		kn->parent = parent;
@@ -927,7 +955,7 @@  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 
 	kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
 			       GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-			       KERNFS_DIR);
+			       KERNFS_DIR, NULL);
 	if (!kn) {
 		idr_destroy(&root->ino_idr);
 		kfree(root);
@@ -969,20 +997,22 @@  void kernfs_destroy_root(struct kernfs_root *root)
  * @gid: gid of the new directory
  * @priv: opaque data associated with the new directory
  * @ns: optional namespace tag of the directory
+ * @owner: if set, the module owner of this directory
  *
  * Returns the created node on success, ERR_PTR() value on failure.
  */
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns)
+					 void *priv, const void *ns,
+					 struct module *owner)
 {
 	struct kernfs_node *kn;
 	int rc;
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, mode | S_IFDIR,
-			     uid, gid, KERNFS_DIR);
+			     uid, gid, KERNFS_DIR, owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
@@ -1014,7 +1044,7 @@  struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 
 	/* allocate */
 	kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
-			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+			     GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..0e125287e050 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@  const struct file_operations kernfs_file_fops = {
  * @priv: private data for the file
  * @ns: optional namespace tag of the file
  * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @owner: if set, the module owner of the file
  *
  * Returns the created node on success, ERR_PTR() value on error.
  */
@@ -987,7 +988,8 @@  struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct module *owner)
 {
 	struct kernfs_node *kn;
 	unsigned flags;
@@ -996,7 +998,7 @@  struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 	flags = KERNFS_FILE;
 
 	kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
-			     uid, gid, flags);
+			     uid, gid, flags, owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..6d275d661987 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@  int kernfs_add_one(struct kernfs_node *kn);
 struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
 				    const char *name, umode_t mode,
 				    kuid_t uid, kgid_t gid,
-				    unsigned flags);
+				    unsigned flags,
+				    struct module *owner);
 
 /*
  * file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..5a053eebee52 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@  struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 		gid = target->iattr->ia_gid;
 	}
 
-	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+	kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+			     target->owner);
 	if (!kn)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9763c2fde3c7 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@  int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
 	kobject_get_ownership(kobj, &uid, &gid);
 
 	kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
-				  kobj, ns);
+				  kobj, ns, NULL);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..af9e91fd1a92 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,8 @@  int sysfs_add_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  PAGE_SIZE, ops, (void *)attr, ns, key);
+				  PAGE_SIZE, ops, (void *)attr, ns, key,
+				  attr->owner);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
@@ -327,7 +328,8 @@  int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
 #endif
 
 	kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
-				  battr->size, ops, (void *)attr, ns, key);
+				  battr->size, ops, (void *)attr, ns, key,
+				  attr->owner);
 	if (IS_ERR(kn)) {
 		if (PTR_ERR(kn) == -EEXIST)
 			sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..372864d1cb54 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@  static int internal_create_group(struct kobject *kobj, int update,
 		} else {
 			kn = kernfs_create_dir_ns(kobj->sd, grp->name,
 						  S_IRWXU | S_IRUGO | S_IXUGO,
-						  uid, gid, kobj, NULL);
+						  uid, gid, kobj, NULL,
+						  grp->owner);
 			if (IS_ERR(kn)) {
 				if (PTR_ERR(kn) == -EEXIST)
 					sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..818b00ebd107 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@  struct kernfs_node {
 	unsigned short		flags;
 	umode_t			mode;
 	struct kernfs_iattrs	*iattr;
+	struct module           *owner;
 };
 
 /*
@@ -370,7 +371,8 @@  void kernfs_destroy_root(struct kernfs_root *root);
 struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
 					 const char *name, umode_t mode,
 					 kuid_t uid, kgid_t gid,
-					 void *priv, const void *ns);
+					 void *priv, const void *ns,
+					 struct module *owner);
 struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
 					    const char *name);
 struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@  struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
 					 loff_t size,
 					 const struct kernfs_ops *ops,
 					 void *priv, const void *ns,
-					 struct lock_class_key *key);
+					 struct lock_class_key *key,
+					 struct module *owner);
 struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
 				       const char *name,
 				       struct kernfs_node *target);
@@ -472,14 +475,15 @@  static inline void kernfs_destroy_root(struct kernfs_root *root) { }
 static inline struct kernfs_node *
 kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
-		     void *priv, const void *ns)
+		     void *priv, const void *ns, struct module *owner)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
 __kernfs_create_file(struct kernfs_node *parent, const char *name,
 		     umode_t mode, kuid_t uid, kgid_t gid,
 		     loff_t size, const struct kernfs_ops *ops,
-		     void *priv, const void *ns, struct lock_class_key *key)
+		     void *priv, const void *ns, struct lock_class_key *key,
+		     struct module *owner)
 { return ERR_PTR(-ENOSYS); }
 
 static inline struct kernfs_node *
@@ -566,7 +570,7 @@  kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
 {
 	return kernfs_create_dir_ns(parent, name, mode,
 				    GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
-				    priv, NULL);
+				    priv, NULL, parent->owner);
 }
 
 static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@  enum kobj_ns_type;
 struct attribute {
 	const char		*name;
 	umode_t			mode;
+	struct module           *owner;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	bool			ignore_lockdep:1;
 	struct lock_class_key	*key;
@@ -80,6 +81,7 @@  do {							\
  * @attrs:	Pointer to NULL terminated list of attributes.
  * @bin_attrs:	Pointer to NULL terminated list of binary attributes.
  *		Either attrs or bin_attrs or both must be provided.
+ * @module:	If set, module responsible for this attribute group
  */
 struct attribute_group {
 	const char		*name;
@@ -89,6 +91,7 @@  struct attribute_group {
 						  struct bin_attribute *, int);
 	struct attribute	**attrs;
 	struct bin_attribute	**bin_attrs;
+	struct module           *owner;
 };
 
 /*
@@ -100,38 +103,52 @@  struct attribute_group {
 
 #define __ATTR(_name, _mode, _show, _store) {				\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		 .mode = VERIFY_OCTAL_PERMISSIONS(_mode),               \
+		 .owner  = THIS_MODULE,                                 \
+	},                                                             \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_PREALLOC(_name, _mode, _show, _store) {			\
 	.attr = {.name = __stringify(_name),				\
-		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+		 .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+		 .owner  = THIS_MODULE,                                 \
+	},                                                              \
 	.show	= _show,						\
 	.store	= _store,						\
 }
 
 #define __ATTR_RO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner  = THIS_MODULE,				\
+		},                                                     \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RO_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.show	= _name##_show,						\
 }
 
 #define __ATTR_RW_MODE(_name, _mode) {					\
 	.attr	= { .name = __stringify(_name),				\
-		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },		\
+		    .mode = VERIFY_OCTAL_PERMISSIONS(_mode),            \
+		    .owner  = THIS_MODULE,                              \
+	},								\
 	.show	= _name##_show,						\
 	.store	= _name##_store,					\
 }
 
 #define __ATTR_WO(_name) {						\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner  = THIS_MODULE,				\
+	},                                                              \
 	.store	= _name##_store,					\
 }
 
@@ -141,8 +158,11 @@  struct attribute_group {
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) {	\
-	.attr = {.name = __stringify(_name), .mode = _mode,	\
-			.ignore_lockdep = true },		\
+	.attr = {.name = __stringify(_name),                    \
+		 .mode = _mode,					\
+		 .ignore_lockdep = true,                        \
+		 .owner  = THIS_MODULE,                         \
+	},							\
 	.show		= _show,				\
 	.store		= _store,				\
 }
@@ -159,6 +179,7 @@  static const struct attribute_group *_name##_groups[] = {	\
 #define ATTRIBUTE_GROUPS(_name)					\
 static const struct attribute_group _name##_group = {		\
 	.attrs = _name##_attrs,					\
+	.owner = THIS_MODULE,					\
 };								\
 __ATTRIBUTE_GROUPS(_name)
 
@@ -199,20 +220,29 @@  struct bin_attribute {
 
 /* macros to create static binary attributes easier */
 #define __BIN_ATTR(_name, _mode, _read, _write, _size) {		\
-	.attr = { .name = __stringify(_name), .mode = _mode },		\
+	.attr = { .name = __stringify(_name),                           \
+		   .mode = _mode,					\
+		   .owner = THIS_MODULE,				\
+	},								\
 	.read	= _read,						\
 	.write	= _write,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_RO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0444 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0444,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.read	= _name##_read,						\
 	.size	= _size,						\
 }
 
 #define __BIN_ATTR_WO(_name, _size) {					\
-	.attr	= { .name = __stringify(_name), .mode = 0200 },		\
+	.attr	= { .name = __stringify(_name),                         \
+		    .mode = 0200,					\
+		    .owner = THIS_MODULE,				\
+	},								\
 	.write	= _name##_write,					\
 	.size	= _size,						\
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@  static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
 				  cgroup_file_mode(cft),
 				  GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
 				  0, cft->kf_ops, cft,
-				  NULL, key);
+				  NULL, key, NULL);
 	if (IS_ERR(kn))
 		return PTR_ERR(kn);