diff mbox series

[v8,11/12] zram: fix crashes with cpu hotplug multistate

Message ID 20210927163805.808907-12-mcgrof@kernel.org (mailing list archive)
State New, archived
Headers show
Series syfs: generic deadlock fix with module removal | expand

Commit Message

Luis Chamberlain Sept. 27, 2021, 4:38 p.m. UTC
Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.

The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multstate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.

By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:

CPU 1                            CPU 2

                                disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();

idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);

The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.

Fix all this by providing a zram initialization boolean
protected the shared in the driver zram_index_mutex, which we
can use to annotate when sysfs attributes are safe to use or
not -- once the driver is properly initialized. When the driver
is going down we also are sure to not let userspace muck with
attributes which may affect each per cpu struct zcomp.

This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:

  cd testcases/kernel/device-drivers/zram
  while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done

You end up with a splat as follows:

kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0.  <etc>
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
	E     5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
	BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel:  do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae

The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.

[0] https://github.com/linux-test-project/ltp.git

Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 8 deletions(-)

Comments

Kees Cook Oct. 5, 2021, 8:55 p.m. UTC | #1
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
> 
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per
> CPU. Although a compiled zram driver only supports a fixed set of
> compression algorithms, each zram device gets a struct zcomp allocated
> per CPU. The "multi" in CPU hotplug multstate refers to these per
> cpu struct zcomp instances. Each of these will have the CPU hotplug
> callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> multistate keeps a linked list of these different structures so that
> it will iterate over them on CPU transitions.
> 
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure then set for the now default
> lzo-rle comrpession algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset /
> initialize existing zram devices, we can easily end up re-initializing
> a struct zcomp for a zram device before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's CPU
> hotplug will detect that at least one instance (struct zcomp for us)
> exists. This can happen in the following situation:
> 
> CPU 1                            CPU 2
> 
>                                 disksize_store(...);
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
> 
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);

So this is strictly separate from the sysfs/module unloading race?

-Kees

> 
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case, that a struct zcom still exists, the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> though a call on another CPU, and we are then later trying to remove the
> hotplug callback.
> 
> Fix all this by providing a zram initialization boolean
> protected the shared in the driver zram_index_mutex, which we
> can use to annotate when sysfs attributes are safe to use or
> not -- once the driver is properly initialized. When the driver
> is going down we also are sure to not let userspace muck with
> attributes which may affect each per cpu struct zcomp.
> 
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
> 
>   cd testcases/kernel/device-drivers/zram
>   while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
> 
> You end up with a splat as follows:
> 
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0.  <etc>
> kernel: zram0: detected capacitky change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at \
> 	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
> 	E     5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
> 	BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
> kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel:  do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struct zcomp instances left.
> 
> [0] https://github.com/linux-test-project/ltp.git
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
>  1 file changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index f61910c65f0f..b26abcb955cc 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
>  static int zram_major;
>  static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>  
> +static bool zram_up;
> +
>  /* Module params (documentation at end) */
>  static unsigned int num_devices = 1;
>  /*
> @@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
>  	comp = zram->comp;
>  	disksize = zram->disksize;
>  	zram->disksize = 0;
> +	zram->comp = NULL;
>  
>  	set_capacity_and_notify(zram->disk, 0);
>  	part_stat_set_all(zram->disk->part0, 0);
> @@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
>  	struct zram *zram = dev_to_zram(dev);
>  	int err;
>  
> +	mutex_lock(&zram_index_mutex);
> +
> +	if (!zram_up) {
> +		err = -ENODEV;
> +		goto out;
> +	}
> +
>  	disksize = memparse(buf, NULL);
> -	if (!disksize)
> -		return -EINVAL;
> +	if (!disksize) {
> +		err = -EINVAL;
> +		goto out;
> +	}
>  
>  	down_write(&zram->init_lock);
>  	if (init_done(zram)) {
> @@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
>  	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
>  	up_write(&zram->init_lock);
>  
> +	mutex_unlock(&zram_index_mutex);
> +
>  	return len;
>  
>  out_free_meta:
>  	zram_meta_free(zram, disksize);
>  out_unlock:
>  	up_write(&zram->init_lock);
> +out:
> +	mutex_unlock(&zram_index_mutex);
>  	return err;
>  }
>  
> @@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
>  	if (ret)
>  		return ret;
>  
> -	if (!do_reset)
> -		return -EINVAL;
> +	mutex_lock(&zram_index_mutex);
> +
> +	if (!zram_up) {
> +		len = -ENODEV;
> +		goto out;
> +	}
> +
> +	if (!do_reset) {
> +		len = -EINVAL;
> +		goto out;
> +	}
>  
>  	zram = dev_to_zram(dev);
>  	bdev = zram->disk->part0;
> @@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
>  	/* Do not reset an active device or claimed device */
>  	if (bdev->bd_openers || zram->claim) {
>  		mutex_unlock(&bdev->bd_disk->open_mutex);
> -		return -EBUSY;
> +		len = -EBUSY;
> +		goto out;
>  	}
>  
>  	/* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
>  	zram->claim = false;
>  	mutex_unlock(&bdev->bd_disk->open_mutex);
>  
> +out:
> +	mutex_unlock(&zram_index_mutex);
>  	return len;
>  }
>  
> @@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
>  	int ret;
>  
>  	mutex_lock(&zram_index_mutex);
> +	if (!zram_up) {
> +		mutex_unlock(&zram_index_mutex);
> +		return -ENODEV;
> +	}
>  	ret = zram_add();
>  	mutex_unlock(&zram_index_mutex);
>  
> @@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
>  
>  	mutex_lock(&zram_index_mutex);
>  
> +	if (!zram_up) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
>  	zram = idr_find(&zram_index_idr, dev_id);
>  	if (zram) {
>  		ret = zram_remove(zram);
> @@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
>  		ret = -ENODEV;
>  	}
>  
> +out:
>  	mutex_unlock(&zram_index_mutex);
>  	return ret ? ret : count;
>  }
> @@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>  
>  static void destroy_devices(void)
>  {
> +	mutex_lock(&zram_index_mutex);
> +	zram_up = false;
>  	class_unregister(&zram_control_class);
>  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
>  	zram_debugfs_destroy();
>  	idr_destroy(&zram_index_idr);
>  	unregister_blkdev(zram_major, "zram");
>  	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> +	mutex_unlock(&zram_index_mutex);
>  }
>  
>  static int __init zram_init(void)
> @@ -2105,15 +2146,21 @@ static int __init zram_init(void)
>  		return -EBUSY;
>  	}
>  
> +	mutex_lock(&zram_index_mutex);
> +
>  	while (num_devices != 0) {
> -		mutex_lock(&zram_index_mutex);
>  		ret = zram_add();
> -		mutex_unlock(&zram_index_mutex);
> -		if (ret < 0)
> +		if (ret < 0) {
> +			mutex_unlock(&zram_index_mutex);
>  			goto out_error;
> +		}
>  		num_devices--;
>  	}
>  
> +	zram_up = true;
> +
> +	mutex_unlock(&zram_index_mutex);
> +
>  	return 0;
>  
>  out_error:
> -- 
> 2.30.2
>
Luis Chamberlain Oct. 11, 2021, 6:27 p.m. UTC | #2
On Tue, Oct 05, 2021 at 01:55:35PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > Provide a simple state machine to fix races with driver exit where we
> > remove the CPU multistate callbacks and re-initialization / creation of
> > new per CPU instances which should be managed by these callbacks.
> > 
> > The zram driver makes use of cpu hotplug multistate support, whereby it
> > associates a struct zcomp per CPU. Each struct zcomp represents a
> > compression algorithm in charge of managing compression streams per
> > CPU. Although a compiled zram driver only supports a fixed set of
> > compression algorithms, each zram device gets a struct zcomp allocated
> > per CPU. The "multi" in CPU hotplug multstate refers to these per
> > cpu struct zcomp instances. Each of these will have the CPU hotplug
> > callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> > multistate keeps a linked list of these different structures so that
> > it will iterate over them on CPU transitions.
> > 
> > By default at driver initialization we will create just one zram device
> > (num_devices=1) and a zcomp structure then set for the now default
> > lzo-rle comrpession algorithm. At driver removal we first remove each
> > zram device, and so we destroy the associated struct zcomp per CPU. But
> > since we expose sysfs attributes to create new devices or reset /
> > initialize existing zram devices, we can easily end up re-initializing
> > a struct zcomp for a zram device before the exit routine of the module
> > removes the cpu hotplug callback. When this happens the kernel's CPU
> > hotplug will detect that at least one instance (struct zcomp for us)
> > exists. This can happen in the following situation:
> > 
> > CPU 1                            CPU 2
> > 
> >                                 disksize_store(...);
> > class_unregister(...);
> > idr_for_each(...);
> > zram_debugfs_destroy();
> > 
> > idr_destroy(...);
> > unregister_blkdev(...);
> > cpuhp_remove_multi_state(...);
> 
> So this is strictly separate from the sysfs/module unloading race?

It is only related in the sense that the sysfs/module unloading race
happened *after* this other issue, but addressing these through
separate threads created a break in conversation and focus. For
instance, a theoretical race was mentioned in one thread, which
I worked to prove/disprove and then I disproved it was not possible.

But at this point, yes, this is a purely separate issue, and this
patch *should* be picked up already.

Andrew, can you merge this? It already has the respective maintainer
Ack, and I can continue to work on the rest of the patches. The only
issue I can think of would be a conflict with the last patch but
that's a oneliner, I think chances are low that would create a conflict
if its all merged separately, and if so, it should be an easy fix for
a merge conflict.

  Luis
Ming Lei Oct. 14, 2021, 1:55 a.m. UTC | #3
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
> 
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per
> CPU. Although a compiled zram driver only supports a fixed set of
> compression algorithms, each zram device gets a struct zcomp allocated
> per CPU. The "multi" in CPU hotplug multstate refers to these per
> cpu struct zcomp instances. Each of these will have the CPU hotplug
> callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> multistate keeps a linked list of these different structures so that
> it will iterate over them on CPU transitions.
> 
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure then set for the now default
> lzo-rle comrpession algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset /
> initialize existing zram devices, we can easily end up re-initializing
> a struct zcomp for a zram device before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's CPU
> hotplug will detect that at least one instance (struct zcomp for us)
> exists. This can happen in the following situation:
> 
> CPU 1                            CPU 2
> 
>                                 disksize_store(...);
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
> 
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);
> 
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case, that a struct zcom still exists, the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> though a call on another CPU, and we are then later trying to remove the
> hotplug callback.
> 
> Fix all this by providing a zram initialization boolean
> protected the shared in the driver zram_index_mutex, which we
> can use to annotate when sysfs attributes are safe to use or
> not -- once the driver is properly initialized. When the driver
> is going down we also are sure to not let userspace muck with
> attributes which may affect each per cpu struct zcomp.
> 
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
> 
>   cd testcases/kernel/device-drivers/zram
>   while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
> 
> You end up with a splat as follows:
> 
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0.  <etc>
> kernel: zram0: detected capacitky change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at \
> 	kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G            \
> 	E     5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
> 	BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
> kernel: CS:  0010 DS: 0000 ES 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel:  do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struct zcomp instances left.
> 
> [0] https://github.com/linux-test-project/ltp.git
> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---

Hello Luis,

Can you test the following patch and see if the issue can be addressed?

Please see the idea from the inline comment.

Also zram_index_mutex isn't needed in zram disk's store() compared with
your patch, then the deadlock issue you are addressing in this series can
be avoided.


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);
 
 	pr_info("Removed device: %s\n", zram->disk->disk_name);
 
 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
 static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
+
+	/* hold the global lock so new device can't be added */
+	mutex_lock(&zram_index_mutex);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	mutex_unlock(&zram_index_mutex);
+
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");

Thanks,
Ming
Ming Lei Oct. 14, 2021, 2:11 a.m. UTC | #4
On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:

...

> 
> Hello Luis,
> 
> Can you test the following patch and see if the issue can be addressed?
> 
> Please see the idea from the inline comment.
> 
> Also zram_index_mutex isn't needed in zram disk's store() compared with
> your patch, then the deadlock issue you are addressing in this series can
> be avoided.
> 
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf2750f68f..3c17927d23a7 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
>  
>  	/* Make sure all the pending I/O are finished */
>  	fsync_bdev(bdev);
> -	zram_reset_device(zram);
>  
>  	pr_info("Removed device: %s\n", zram->disk->disk_name);
>  
>  	del_gendisk(zram->disk);
> +
> +	/*
> +	 * reset device after gendisk is removed, so any change from sysfs
> +	 * store won't come in, then we can really reset device here
> +	 */
> +	zram_reset_device(zram);
> +
>  	blk_cleanup_disk(zram->disk);
>  	kfree(zram);
>  	return 0;
> @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>  static void destroy_devices(void)
>  {
>  	class_unregister(&zram_control_class);
> +
> +	/* hold the global lock so new device can't be added */
> +	mutex_lock(&zram_index_mutex);
>  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> +	mutex_unlock(&zram_index_mutex);
> +

Actually zram_index_mutex isn't needed when calling zram_remove_cb()
since the zram-control sysfs interface has been removed, so userspace
can't add new device any more, then the issue is supposed to be fixed
by the following one line change, please test it:

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..96dd641de233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
 
 	/* Make sure all the pending I/O are finished */
 	fsync_bdev(bdev);
-	zram_reset_device(zram);
 
 	pr_info("Removed device: %s\n", zram->disk->disk_name);
 
 	del_gendisk(zram->disk);
+
+	/*
+	 * reset device after gendisk is removed, so any change from sysfs
+	 * store won't come in, then we can really reset device here
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;



Thanks, 
Ming
Luis Chamberlain Oct. 14, 2021, 8:24 p.m. UTC | #5
On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> 
> ...
> 
> > 
> > Hello Luis,
> > 
> > Can you test the following patch and see if the issue can be addressed?
> > 
> > Please see the idea from the inline comment.
> > 
> > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > your patch, then the deadlock issue you are addressing in this series can
> > be avoided.
> > 
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index fcaf2750f68f..3c17927d23a7 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> >  
> >  	/* Make sure all the pending I/O are finished */
> >  	fsync_bdev(bdev);
> > -	zram_reset_device(zram);
> >  
> >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> >  
> >  	del_gendisk(zram->disk);
> > +
> > +	/*
> > +	 * reset device after gendisk is removed, so any change from sysfs
> > +	 * store won't come in, then we can really reset device here
> > +	 */
> > +	zram_reset_device(zram);
> > +
> >  	blk_cleanup_disk(zram->disk);
> >  	kfree(zram);
> >  	return 0;
> > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> >  static void destroy_devices(void)
> >  {
> >  	class_unregister(&zram_control_class);
> > +
> > +	/* hold the global lock so new device can't be added */
> > +	mutex_lock(&zram_index_mutex);
> >  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > +	mutex_unlock(&zram_index_mutex);
> > +
> 
> Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> since the zram-control sysfs interface has been removed, so userspace
> can't add new device any more, then the issue is supposed to be fixed
> by the following one line change, please test it:
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf2750f68f..96dd641de233 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
>  
>  	/* Make sure all the pending I/O are finished */
>  	fsync_bdev(bdev);
> -	zram_reset_device(zram);
>  
>  	pr_info("Removed device: %s\n", zram->disk->disk_name);
>  
>  	del_gendisk(zram->disk);
> +
> +	/*
> +	 * reset device after gendisk is removed, so any change from sysfs
> +	 * store won't come in, then we can really reset device here
> +	 */
> +	zram_reset_device(zram);
> +
>  	blk_cleanup_disk(zram->disk);
>  	kfree(zram);
>  	return 0;

Sorry but nope, the cpu multistate issue is still present and we end up
eventually with page faults. I tried with both patches.

Oct 14 20:21:34 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:34 kdevops kernel: Error: Removing state 65 which has
instances left.
Oct 14 20:21:34 kdevops kernel: WARNING: CPU: 4 PID: 3358 at
kernel/cpu.c:2151 __cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Modules linked in: zram(E-) zstd(E)
zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:34 kdevops kernel: CPU: 4 PID: 3358 Comm: rmmod Tainted: G
E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:34 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:34 kdevops kernel: RIP:
0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Code: 21 00 48 c7 43 18 00 00 00 00 5b
5d 41 5c 41 5d 41 5e 41 5f e9 d8 17 84 00 0f 0b 44 89 e6 48 c7 c7 78 0c
8b ad e8 56 92 7f 00 <0f> 0b >
Oct 14 20:21:34 kdevops kernel: RSP: 0018:ffffaac980a1fe90 EFLAGS:
00010286
Oct 14 20:21:34 kdevops kernel: RAX: 0000000000000000 RBX:
ffffffffada3e208 RCX: 0000000000000000
Oct 14 20:21:34 kdevops kernel: RDX: 0000000000000001 RSI:
ffffffffad8efdb6 RDI: 00000000ffffffff
Oct 14 20:21:34 kdevops kernel: RBP: 0000000000000000 R08:
0000000000000000 R09: ffffaac980a1fcc0
Oct 14 20:21:34 kdevops kernel: R10: ffffaac980a1fcb8 R11:
ffffffffadac3c68 R12: 0000000000000041
Oct 14 20:21:34 kdevops kernel: R13: 0000000000000a28 R14:
0000000000000000 R15: 0000000000000000
Oct 14 20:21:34 kdevops kernel: FS:  00007fc0c2882580(0000)
GS:ffff9ed6f7d00000(0000) knlGS:0000000000000000
Oct 14 20:21:34 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:34 kdevops kernel: CR2: 00005621b0490b78 CR3:
000000011a538005 CR4: 0000000000370ee0
Oct 14 20:21:34 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:34 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:34 kdevops kernel: Call Trace:
Oct 14 20:21:34 kdevops kernel:  <TASK>
Oct 14 20:21:34 kdevops kernel:  __cpuhp_remove_state+0x4d/0xc0
Oct 14 20:21:34 kdevops kernel:  __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:34 kdevops kernel:  ?
fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:34 kdevops kernel:  ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:34 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:34 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:34 kdevops kernel: RIP: 0033:0x7fc0c29a84a7
<etc>
Oct 14 20:21:35 kdevops kernel: sysfs: cannot create duplicate filename
'/devices/virtual/block/zram0'
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted:
G        W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  dump_stack_lvl+0x48/0x5e
Oct 14 20:21:35 kdevops kernel:  sysfs_warn_dup.cold+0x17/0x24
Oct 14 20:21:35 kdevops kernel:  sysfs_create_dir_ns+0xbc/0xd0
Oct 14 20:21:35 kdevops kernel:  kobject_add_internal+0xbd/0x2b0
Oct 14 20:21:35 kdevops kernel:  kobject_add+0x7e/0xb0
Oct 14 20:21:35 kdevops kernel:  ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel:  ? preempt_count_add+0x68/0xa0
Oct 14 20:21:35 kdevops kernel:  device_add+0x11a/0x980
Oct 14 20:21:35 kdevops kernel:  ? dev_set_name+0x53/0x70
Oct 14 20:21:35 kdevops kernel:  device_add_disk+0x9d/0x3a0
Oct 14 20:21:35 kdevops kernel:  zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel:  ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel:  zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel:  do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel:  ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel:  ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel:  do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel:  __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
Oct 14 20:21:35 kdevops kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00
00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8
4c 8b 4c 24 08 0f 05 <48> 3d >
Oct 14 20:21:35 kdevops kernel: RSP: 002b:00007fff142417b8 EFLAGS:
00000246 ORIG_RAX: 0000000000000139
Oct 14 20:21:35 kdevops kernel: RAX: ffffffffffffffda RBX:
0000558ba9491bd0 RCX: 00007fca3aa555e9
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI:
0000558ba9491f60 RDI: 0000000000000003
Oct 14 20:21:35 kdevops kernel: RBP: 0000000000040000 R08:
0000000000000000 R09: 0000558ba9491db0
Oct 14 20:21:35 kdevops kernel: R10: 0000000000000003 R11:
0000000000000246 R12: 0000558ba9491f60
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14:
0000558ba9491d00 R15: 0000558ba9491bd0
Oct 14 20:21:35 kdevops kernel:  </TASK>
<etc>
Oct 14 20:21:35 kdevops kernel: kobject_add_internal failed for zram0
with -EEXIST, don't try to register things with the same name in the
same directory.
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 5 PID: 3388 at
block/genhd.c:537 device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Modules linked in: zram(E+) zstd(E)
zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted:
G        W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Code: 00 03 01 00 00 0f 85 32 ff ff ff
e9 1e ff ff ff 0f 0b 41 bc ea ff ff ff e9 29 ff ff ff 4c 89 ff e8 5c 45
1c 00 e9 ef fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980607d90 EFLAGS:
00010287
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX:
0000000000000000 RCX: 0000000000023005
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000022e05 RSI:
ffffffffacc4b710 RDI: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788a600 R08:
0000000000000000 R09: ffffaac980607a98
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5c795ef00 R11:
ffffffffadac3c68 R12: 00000000ffffffef
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5600000 R14:
ffffffffc0a52100 R15: ffff9ed5d5600040
Oct 14 20:21:35 kdevops kernel: FS:  00007fca3a935580(0000)
GS:ffff9ed6f7d40000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 00007fff1423e6d8 CR3:
0000000136752002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel:  ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel:  zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel:  do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel:  ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel:  ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel:  do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel:  __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
<etc>
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 2 PID: 3457 at
block/genhd.c:564 del_gendisk+0x1a2/0x1d0
Oct 14 20:21:35 kdevops kernel: Modules linked in: 842(E)
842_decompress(E) 842_compress(E) zram(E-) zstd(E) zsmalloc(E)
kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E>
Oct 14 20:21:35 kdevops kernel: CPU: 2 PID: 3457 Comm: rmmod Tainted: G
W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:del_gendisk+0x1a2/0x1d0
Oct 14 20:21:35 kdevops kernel: Code: 48 8d 78 40 e8 8f 87 1d 00 48 8b
7b 40 5b 5d 41 5c 48 83 c7 40 e9 4e 47 1c 00 48 8b 70 40 eb ce f6 43 61
04 0f 85 85 fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac9807cfe30 EFLAGS:
00010246
Oct 14 20:21:35 kdevops kernel: RAX: ffff9ed5d5600380 RBX:
ffff9ed5d788a600 RCX: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI:
ffffffffad8efdb6 RDI: ffff9ed5d788a600
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788b600 R08:
0000000000000000 R09: ffffaac9807cfc88
Oct 14 20:21:35 kdevops kernel: R10: ffffaac9807cfc80 R11:
ffffffffadac3c68 R12: ffff9ed5d5600000
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14:
ffffffffc0a52360 R15: ffff9ed5c4a87b78
Oct 14 20:21:35 kdevops kernel: FS:  00007f292a2bb580(0000)
GS:ffff9ed6f7c80000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 000056161b453b78 CR3:
000000013213e002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  zram_remove+0x96/0xc0 [zram]
Oct 14 20:21:35 kdevops kernel:  ? hot_remove_store+0xe0/0xe0 [zram]
Oct 14 20:21:35 kdevops kernel:  zram_remove_cb+0xd/0x10 [zram]
Oct 14 20:21:35 kdevops kernel:  idr_for_each+0x5b/0xd0
Oct 14 20:21:35 kdevops kernel:  destroy_devices+0x32/0x68 [zram]
Oct 14 20:21:35 kdevops kernel:  __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:35 kdevops kernel:  ?
fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:35 kdevops kernel:  ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:35 kdevops kernel:  do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7f292a3e14a7
<etc>
Oct 14 20:21:35 kdevops kernel: BUG: unable to handle page fault for
address: ffffffffc0a4e0ae
Oct 14 20:21:35 kdevops kernel: #PF: supervisor instruction fetch in
kernel mode
Oct 14 20:21:35 kdevops kernel: #PF: error_code(0x0010) - not-present
page
Oct 14 20:21:35 kdevops kernel: PGD 3ba0e067 P4D 3ba0e067 PUD 3ba10067
PMD 10526c067 PTE 0
Oct 14 20:21:35 kdevops kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Oct 14 20:21:35 kdevops kernel: CPU: 6 PID: 3655 Comm: zram02.sh
Tainted: G        W   E     5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:0xffffffffc0a4e0ae
Oct 14 20:21:35 kdevops kernel: Code: Unable to access opcode bytes at
RIP 0xffffffffc0a4e084.
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980687da8 EFLAGS:
00010286
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX:
ffff9ed5c40be400 RCX: 0000000080400035
Oct 14 20:21:35 kdevops kernel: RDX: 0000000080400036 RSI:
fffffa3544561080 RDI: 0000000040000000
Oct 14 20:21:35 kdevops kernel: RBP: 0000000001900000 R08:
ffff9ed5d5842cc0 R09: 0000000080400035
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5d5842c00 R11:
ffff9ed5f1341350 R12: 0000000001900000
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5666c00 R14:
ffff9ed5c40be420 R15: ffff9ed5dfa8c8c0
Oct 14 20:21:35 kdevops kernel: FS:  00007f978fe2d5c0(0000)
GS:ffff9ed6f7d80000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: ffffffffc0a4e084 CR3:
0000000133fd4006 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel:  <TASK>
Oct 14 20:21:35 kdevops kernel:  ? kernfs_fop_write_iter+0x177/0x220
Oct 14 20:21:35 kdevops kernel:  ? new_sync_write+0x11c/0x1b0
Oct 14 20:21:35 kdevops kernel:  ? vfs_write+0x20d/0x2a0
Oct 14 20:21:35 kdevops kernel:  ? ksys_write+0x5f/0xe0
Oct 14 20:21:35 kdevops kernel:  ? do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:  ?
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel:  </TASK>
<etc, etc, etc, this goes on and on>

  Luis
Ming Lei Oct. 14, 2021, 11:52 p.m. UTC | #6
On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > 
> > ...
> > 
> > > 
> > > Hello Luis,
> > > 
> > > Can you test the following patch and see if the issue can be addressed?
> > > 
> > > Please see the idea from the inline comment.
> > > 
> > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > your patch, then the deadlock issue you are addressing in this series can
> > > be avoided.
> > > 
> > > 
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..3c17927d23a7 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >  
> > >  	/* Make sure all the pending I/O are finished */
> > >  	fsync_bdev(bdev);
> > > -	zram_reset_device(zram);
> > >  
> > >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >  
> > >  	del_gendisk(zram->disk);
> > > +
> > > +	/*
> > > +	 * reset device after gendisk is removed, so any change from sysfs
> > > +	 * store won't come in, then we can really reset device here
> > > +	 */
> > > +	zram_reset_device(zram);
> > > +
> > >  	blk_cleanup_disk(zram->disk);
> > >  	kfree(zram);
> > >  	return 0;
> > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > >  static void destroy_devices(void)
> > >  {
> > >  	class_unregister(&zram_control_class);
> > > +
> > > +	/* hold the global lock so new device can't be added */
> > > +	mutex_lock(&zram_index_mutex);
> > >  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > +	mutex_unlock(&zram_index_mutex);
> > > +
> > 
> > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > since the zram-control sysfs interface has been removed, so userspace
> > can't add new device any more, then the issue is supposed to be fixed
> > by the following one line change, please test it:
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index fcaf2750f68f..96dd641de233 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> >  
> >  	/* Make sure all the pending I/O are finished */
> >  	fsync_bdev(bdev);
> > -	zram_reset_device(zram);
> >  
> >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> >  
> >  	del_gendisk(zram->disk);
> > +
> > +	/*
> > +	 * reset device after gendisk is removed, so any change from sysfs
> > +	 * store won't come in, then we can really reset device here
> > +	 */
> > +	zram_reset_device(zram);
> > +
> >  	blk_cleanup_disk(zram->disk);
> >  	kfree(zram);
> >  	return 0;
> 
> Sorry but nope, the cpu multistate issue is still present and we end up
> eventually with page faults. I tried with both patches.

In theory disksize_store() can't come in after del_gendisk() returns,
then zram_reset_device() should cleanup everything, that is the issue
you described in commit log.

We need to understand the exact reason why there is still cpuhp node
left, can you share us the exact steps for reproducing the issue?
Otherwise we may have to trace and narrow down the reason.



thanks,
Ming
Luis Chamberlain Oct. 15, 2021, 12:22 a.m. UTC | #7
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> > On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > > 
> > > ...
> > > 
> > > > 
> > > > Hello Luis,
> > > > 
> > > > Can you test the following patch and see if the issue can be addressed?
> > > > 
> > > > Please see the idea from the inline comment.
> > > > 
> > > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > > your patch, then the deadlock issue you are addressing in this series can
> > > > be avoided.
> > > > 
> > > > 
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index fcaf2750f68f..3c17927d23a7 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > > >  
> > > >  	/* Make sure all the pending I/O are finished */
> > > >  	fsync_bdev(bdev);
> > > > -	zram_reset_device(zram);
> > > >  
> > > >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> > > >  
> > > >  	del_gendisk(zram->disk);
> > > > +
> > > > +	/*
> > > > +	 * reset device after gendisk is removed, so any change from sysfs
> > > > +	 * store won't come in, then we can really reset device here
> > > > +	 */
> > > > +	zram_reset_device(zram);
> > > > +
> > > >  	blk_cleanup_disk(zram->disk);
> > > >  	kfree(zram);
> > > >  	return 0;
> > > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > > >  static void destroy_devices(void)
> > > >  {
> > > >  	class_unregister(&zram_control_class);
> > > > +
> > > > +	/* hold the global lock so new device can't be added */
> > > > +	mutex_lock(&zram_index_mutex);
> > > >  	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > > +	mutex_unlock(&zram_index_mutex);
> > > > +
> > > 
> > > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > > since the zram-control sysfs interface has been removed, so userspace
> > > can't add new device any more, then the issue is supposed to be fixed
> > > by the following one line change, please test it:
> > > 
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..96dd641de233 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >  
> > >  	/* Make sure all the pending I/O are finished */
> > >  	fsync_bdev(bdev);
> > > -	zram_reset_device(zram);
> > >  
> > >  	pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >  
> > >  	del_gendisk(zram->disk);
> > > +
> > > +	/*
> > > +	 * reset device after gendisk is removed, so any change from sysfs
> > > +	 * store won't come in, then we can really reset device here
> > > +	 */
> > > +	zram_reset_device(zram);
> > > +
> > >  	blk_cleanup_disk(zram->disk);
> > >  	kfree(zram);
> > >  	return 0;
> > 
> > Sorry but nope, the cpu multistate issue is still present and we end up
> > eventually with page faults. I tried with both patches.
> 
> In theory disksize_store() can't come in after del_gendisk() returns,
> then zram_reset_device() should cleanup everything, that is the issue
> you described in commit log.
> 
> We need to understand the exact reason why there is still cpuhp node
> left, can you share us the exact steps for reproducing the issue?
> Otherwise we may have to trace and narrow down the reason.

See my commit log for my own fix for this issue.

  Luis
Ming Lei Oct. 15, 2021, 8:36 a.m. UTC | #8
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
> > 
> > We need to understand the exact reason why there is still cpuhp node
> > left, can you share us the exact steps for reproducing the issue?
> > Otherwise we may have to trace and narrow down the reason.
> 
> See my commit log for my own fix for this issue.

OK, thanks!

I can reproduce the issue, and the reason is that reset_store fails
zram_remove() when unloading module, then the warning is caused.

The top 3 patches in the following tree can fix the issue:

https://github.com/ming1/linux/commits/my_v5.15-blk-dev


Thanks,
Ming
Greg KH Oct. 15, 2021, 8:52 a.m. UTC | #9
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> ...
> > > 
> > > We need to understand the exact reason why there is still cpuhp node
> > > left, can you share us the exact steps for reproducing the issue?
> > > Otherwise we may have to trace and narrow down the reason.
> > 
> > See my commit log for my own fix for this issue.
> 
> OK, thanks!
> 
> I can reproduce the issue, and the reason is that reset_store fails
> zram_remove() when unloading module, then the warning is caused.
> 
> The top 3 patches in the following tree can fix the issue:
> 
> https://github.com/ming1/linux/commits/my_v5.15-blk-dev

At a quick glance, those look sane to me, nice work.

greg k-h
Luis Chamberlain Oct. 15, 2021, 5:31 p.m. UTC | #10
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> ...
> > > 
> > > We need to understand the exact reason why there is still cpuhp node
> > > left, can you share us the exact steps for reproducing the issue?
> > > Otherwise we may have to trace and narrow down the reason.
> > 
> > See my commit log for my own fix for this issue.
> 
> OK, thanks!
> 
> I can reproduce the issue, and the reason is that reset_store fails
> zram_remove() when unloading module, then the warning is caused.
> 
> The top 3 patches in the following tree can fix the issue:
> 
> https://github.com/ming1/linux/commits/my_v5.15-blk-dev

Thanks for trying an alternative fix! A crash stops yes, however this
also ends up leaving the driver in an unrecoverable state after a few
tries. Ie, you CTRL-C the scripts and try again over and over again and
the driver ends up in a situation where it just says:

zram: Can't change algorithm for initialized device

And the zram module can't be removed at that point.

  Luis
Ming Lei Oct. 16, 2021, 11:28 a.m. UTC | #11
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > ...
> > > > 
> > > > We need to understand the exact reason why there is still cpuhp node
> > > > left, can you share us the exact steps for reproducing the issue?
> > > > Otherwise we may have to trace and narrow down the reason.
> > > 
> > > See my commit log for my own fix for this issue.
> > 
> > OK, thanks!
> > 
> > I can reproduce the issue, and the reason is that reset_store fails
> > zram_remove() when unloading module, then the warning is caused.
> > 
> > The top 3 patches in the following tree can fix the issue:
> > 
> > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> 
> Thanks for trying an alternative fix! A crash stops yes, however this

I doubt it is alternative since your patchset doesn't mention the exact
reason of 'Error: Removing state 63 which has instances left.', that is
simply caused by failing to remove zram because ->claim is set during
unloading module.

Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
however I don't think it is reproduced easily in the test because the race
window is pretty small, also it can be fixed easily in my 3rd path
without any complicated tricks.

Not dig into details of your patchset via grabbing module reference
count during show/store attribute of kernfs which is done in your patch
9, but IMO this way isn't necessary:

1) any driver module has to cleanup anything which may refer to symbols
or data defined in module_exit of this driver

2) device_del() is often done in module_exit(), once device_del()
returns, no any new show/store on the device's kobject attribute
is possible.

3) it is _not_ a must or pattern for fixing bugs to hold one lock before
calling device_del(), meantime the lock is required in the device's
attribute show()/store(), which causes AA deadlock easily. Your approach
just avoids the issue by not releasing module until all show/store are
done.

Also the model of using module refcount is usually that if anyone will
use the module, grab one extra ref, and once the use is done, release
it. For example of block device, the driver's module refcnt is grabbed
when the disk/part is opened, and released when the disk/part is closed.


> also ends up leaving the driver in an unrecoverable state after a few
> tries. Ie, you CTRL-C the scripts and try again over and over again and
> the driver ends up in a situation where it just says:
> 
> zram: Can't change algorithm for initialized device

It means the algorithm can't be changed for one initialized device
at the exact time. That is understandable because two zram02.sh are
running concurrently.

Your test script just runs two ./zram02.sh tasks concurrently forever,
so what is your expected result for the test? Of course, it can't be
over.

I can't reproduce the 'unrecoverable' state in my test, can you share the
stack trace log after that happens?

Is the zram02.sh still running or slept somewhere in the 'unrecoverable'
state? If it is still running, it means the current sleep point isn't
interruptable when running 'CTRL-C'. In my test, after several 'CTRL-C',
both the two zram02.sh started from two terminals can be terminated. If
it is slept somewhere forever, it can be one problem.

> 
> And the zram module can't be removed at that point.

It is just that systemd opens the zram or the disk is opened as swap
disk, and once systemd closes it or after you run swapoff, it can be
unloaded.


Thanks,
Ming
Luis Chamberlain Oct. 18, 2021, 7:32 p.m. UTC | #12
On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > ...
> > > > > 
> > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > Otherwise we may have to trace and narrow down the reason.
> > > > 
> > > > See my commit log for my own fix for this issue.
> > > 
> > > OK, thanks!
> > > 
> > > I can reproduce the issue, and the reason is that reset_store fails
> > > zram_remove() when unloading module, then the warning is caused.
> > > 
> > > The top 3 patches in the following tree can fix the issue:
> > > 
> > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > 
> > Thanks for trying an alternative fix! A crash stops yes, however this
> 
> I doubt it is alternative since your patchset doesn't mention the exact
> reason of 'Error: Removing state 63 which has instances left.', that is
> simply caused by failing to remove zram because ->claim is set during
> unloading module.

Well I disagree because it does explain how the race can happen, and it
also explains how since the sysfs interface is exposed until module
removal completes, it leaves exposed knobs to allow re-initializing of a
struct zcomp for a zram device before the exit.

> Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> however I don't think it is reproduced easily in the test because the race
> window is pretty small, also it can be fixed easily in my 3rd path
> without any complicated tricks.

Reproducing for me is... extremely easy.

> Not dig into details of your patchset via grabbing module reference
> count during show/store attribute of kernfs which is done in your patch
> 9, but IMO this way isn't necessary:

That's to address the deadlock only.

> 1) any driver module has to cleanup anything which may refer to symbols
> or data defined in module_exit of this driver

Yes, and as the cpu multistate hotplug documentation warns (although
such documentation is kind of hidden) that driver authors need to be
careful with module removal too, refer to the warning at the end of
__cpuhp_remove_state_cpuslocked() about module removal.

> 2) device_del() is often done in module_exit(), once device_del()
> returns, no any new show/store on the device's kobject attribute
> is possible.

Right and if a syfs knob is exposed before device_del() completely
and is allowed to do things, the driver should take care to prevent
races for CPU multistate support. The small state machine I added ensures
we don't run over any expectations from cpu hotplug multistate support.

I've *never* suggested there cannot be alternatives to my solution with
the small state machine, but for you to say it is incorrect is simply
not right either.

> 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> calling device_del(), meantime the lock is required in the device's
> attribute show()/store(), which causes AA deadlock easily. Your approach
> just avoids the issue by not releasing module until all show/store are
> done.

Right, there are two approaches here:

a) Your approach is to accept the deadlock as a requirement and so
you would prefer to implement an alternative to using a shared lock
on module exit and sysfs op.

b) While I address such a deadlock head on as I think this sort of locking
be allowed for two reasons:
   b1) as we never documented such requirement otherwise.
   b2) There is a possibility that other drivers already exist too
       which *do* use a shared lock on module removal and sysfs ops
       (and I just confirmed this to be true)

By you only addressing the deadlock as a requirement on approach a) you are
forgetting that there *may* already be present drivers which *do* implement
such patterns in the kernel. I worked on addressing the deadlock because
I was informed livepatching *did* have that issue as well and so very
likely a generic solution to the deadlock could be beneficial to other
random drivers.

So I *really* don't think it is wise for us to simply accept this new
found deadlock as a *new* requirement, specially if we can fix it easily.

A cursory review using Coccinelle potential issues with mutex lock
directly used on module exit (so this doesn't cover drivers like zram
which uses a routine and then grabs the lock through indirection) and a
sysfs op shows these drivers are also affected by this deadlock:

  * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
  * lib/test_firmware.c

Note that this cursory review does not cover spin_lock uses, and other
forms locks. Consider the case where a routine is used and then that
routine grabs a lock, so one level indirection. There are many levels
of indirections possible here. And likewise there are different types
of locks.

> > also ends up leaving the driver in an unrecoverable state after a few
> > tries. Ie, you CTRL-C the scripts and try again over and over again and
> > the driver ends up in a situation where it just says:
> > 
> > zram: Can't change algorithm for initialized device
> 
> It means the algorithm can't be changed for one initialized device
> at the exact time. That is understandable because two zram02.sh are
> running concurrently.

Indeed but with your patch it can get stuck and cannot be taken out of this
state.

> Your test script just runs two ./zram02.sh tasks concurrently forever,
> so what is your expected result for the test? Of course, it can't be
> over.
>
> I can't reproduce the 'unrecoverable' state in my test, can you share the
> stack trace log after that happens?

Try a bit harder, cancel the scripts after running for a while randomly
(CTRL C a few times until the script finishes) and have them race again.
Do this a few times.

> > And the zram module can't be removed at that point.
> 
> It is just that systemd opens the zram or the disk is opened as swap
> disk, and once systemd closes it or after you run swapoff, it can be
> unloaded.

With my patch this issues does not happen.

  Luis
Ming Lei Oct. 19, 2021, 2:34 a.m. UTC | #13
On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > ...
> > > > > > 
> > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > 
> > > > > See my commit log for my own fix for this issue.
> > > > 
> > > > OK, thanks!
> > > > 
> > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > zram_remove() when unloading module, then the warning is caused.
> > > > 
> > > > The top 3 patches in the following tree can fix the issue:
> > > > 
> > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > 
> > > Thanks for trying an alternative fix! A crash stops yes, however this
> > 
> > I doubt it is alternative since your patchset doesn't mention the exact
> > reason of 'Error: Removing state 63 which has instances left.', that is
> > simply caused by failing to remove zram because ->claim is set during
> > unloading module.
> 
> Well I disagree because it does explain how the race can happen, and it
> also explains how since the sysfs interface is exposed until module
> removal completes, it leaves exposed knobs to allow re-initializing of a
> struct zcomp for a zram device before the exit.
> 
> > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > however I don't think it is reproduced easily in the test because the race
> > window is pretty small, also it can be fixed easily in my 3rd path
> > without any complicated tricks.
> 
> Reproducing for me is... extremely easy.

In my observation, failing zram_remove() is extremely easy to trigger, which
is caused by reset_store() which sets ->reclaim as true, so
zram_remove() is failed and zram_reset_device() is bypassed , then the
failure of 'Error: Removing state 63 which has instances left.' is caused.

We are in same page?

> 
> > Not dig into details of your patchset via grabbing module reference
> > count during show/store attribute of kernfs which is done in your patch
> > 9, but IMO this way isn't necessary:
> 
> That's to address the deadlock only.
> 
> > 1) any driver module has to cleanup anything which may refer to symbols
> > or data defined in module_exit of this driver
> 
> Yes, and as the cpu multistate hotplug documentation warns (although
> such documentation is kind of hidden) that driver authors need to be
> careful with module removal too, refer to the warning at the end of
> __cpuhp_remove_state_cpuslocked() about module removal.

It is zram's bug. zram has to clean everything in module_exit(),
unfortunately zram_remove() can be failed when calling from
module_exit() because ->claim is set as true by reset_store(), then
zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
not happen when unloading module, should it?

> 
> > 2) device_del() is often done in module_exit(), once device_del()
> > returns, no any new show/store on the device's kobject attribute
> > is possible.
> 
> Right and if a syfs knob is exposed before device_del() completely
> and is allowed to do things, the driver should take care to prevent
> races for CPU multistate support. The small state machine I added ensures

What is the race for CPU multistate support? If you mean 'Error: Removing
state 63 which has instances left.', it is zram's bug since zram has to
cleanup everything in module_exit().

> we don't run over any expectations from cpu hotplug multistate support.
> 
> I've *never* suggested there cannot be alternatives to my solution with
> the small state machine, but for you to say it is incorrect is simply
> not right either.
> 
> > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > calling device_del(), meantime the lock is required in the device's
> > attribute show()/store(), which causes AA deadlock easily. Your approach
> > just avoids the issue by not releasing module until all show/store are
> > done.
> 
> Right, there are two approaches here:
> 
> a) Your approach is to accept the deadlock as a requirement and so
> you would prefer to implement an alternative to using a shared lock
> on module exit and sysfs op.

wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
applying my 3 patches. If you think there is, please share us the code
or lockdep warning.

> 
> b) While I address such a deadlock head on as I think this sort of locking
> be allowed for two reasons:
>    b1) as we never documented such requirement otherwise.
>    b2) There is a possibility that other drivers already exist too
>        which *do* use a shared lock on module removal and sysfs ops
>        (and I just confirmed this to be true)

The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
in destroy_devices().

We can fix this issue easily without needing the global lock, please see the
attached(pre-V2) patch.

> 
> By you only addressing the deadlock as a requirement on approach a) you are
> forgetting that there *may* already be present drivers which *do* implement
> such patterns in the kernel. I worked on addressing the deadlock because
> I was informed livepatching *did* have that issue as well and so very
> likely a generic solution to the deadlock could be beneficial to other
> random drivers.

In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
just fixed it, and seems it has been fixed by 3ec24776bfd0.

> 
> So I *really* don't think it is wise for us to simply accept this new
> found deadlock as a *new* requirement, specially if we can fix it easily.
> 
> A cursory review using Coccinelle potential issues with mutex lock
> directly used on module exit (so this doesn't cover drivers like zram
> which uses a routine and then grabs the lock through indirection) and a
> sysfs op shows these drivers are also affected by this deadlock:
> 
>   * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c

In fsl_wakeup_sys_exit(), device_remove_file() is called before
acquiring &sysfs_lock, so there shouldn't be such AA deadlock.

>   * lib/test_firmware.c

Yeah, there is the AA deadlock risk, but it should be fixed by moving
misc_deregister() out of &test_fw_mutex.

> 
> Note that this cursory review does not cover spin_lock uses, and other
> forms locks. Consider the case where a routine is used and then that
> routine grabs a lock, so one level indirection. There are many levels
> of indirections possible here. And likewise there are different types
> of locks.
> 
> > > also ends up leaving the driver in an unrecoverable state after a few
> > > tries. Ie, you CTRL-C the scripts and try again over and over again and
> > > the driver ends up in a situation where it just says:
> > > 
> > > zram: Can't change algorithm for initialized device
> > 
> > It means the algorithm can't be changed for one initialized device
> > at the exact time. That is understandable because two zram02.sh are
> > running concurrently.
> 
> Indeed but with your patch it can get stuck and cannot be taken out of this
> state.

OK, I can keep current behavior: fail open() in case of removing or
resetting, meantime not hold open_mutex when sync bdev and reset device,
see attached patch.

> 
> > Your test script just runs two ./zram02.sh tasks concurrently forever,
> > so what is your expected result for the test? Of course, it can't be
> > over.
> >
> > I can't reproduce the 'unrecoverable' state in my test, can you share the
> > stack trace log after that happens?
> 
> Try a bit harder, cancel the scripts after running for a while randomly
> (CTRL C a few times until the script finishes) and have them race again.
> Do this a few times.
> 
> > > And the zram module can't be removed at that point.
> > 
> > It is just that systemd opens the zram or the disk is opened as swap
> > disk, and once systemd closes it or after you run swapoff, it can be
> > unloaded.
> 
> With my patch this issues does not happen.

It is because the patch 2 holds ->open_mutex() for sync bdev and reset
zram, so several 'CTRL-C' is needed for terminating the test script, then
zram02.sh's cleanup handler can be interrupted too. We can keep current
behavior easily.

Please try the following patch against upstream(linus or next) tree(basically
fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
module_exit(), race between zram_remove() and disksize_store()), and see if
everything is fine for you:


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a68297fb51a2..320822a80b64 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1967,25 +1967,45 @@ static int zram_add(void)
 static int zram_remove(struct zram *zram)
 {
 	struct block_device *bdev = zram->disk->part0;
+	bool claimed;
 
 	mutex_lock(&bdev->bd_disk->open_mutex);
-	if (bdev->bd_openers || zram->claim) {
+	if (bdev->bd_openers) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
 		return -EBUSY;
 	}
 
-	zram->claim = true;
+	claimed = zram->claim;
+	if (!claimed)
+		zram->claim = true;
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
 	zram_debugfs_unregister(zram);
 
-	/* Make sure all the pending I/O are finished */
-	fsync_bdev(bdev);
-	zram_reset_device(zram);
+	if (claimed) {
+		/*
+		 * If we were claimed by reset_store(), del_gendisk() will
+		 * wait until sync & reset is completed, so do nothing here.
+		 */
+		;
+	} else {
+		/* Make sure all the pending I/O are finished */
+		sync_blockdev(bdev);
+		zram_reset_device(zram);
+	}
 
 	pr_info("Removed device: %s\n", zram->disk->disk_name);
 
 	del_gendisk(zram->disk);
+
+	WARN_ON_ONCE(claimed && zram->claim);
+
+	/*
+	 * disksize store may come after the above zram_reset_device
+	 * returns, so run the last reset to avoid the race
+	 */
+	zram_reset_device(zram);
+
 	blk_cleanup_disk(zram->disk);
 	kfree(zram);
 	return 0;


Thanks,
Ming
Miroslav Benes Oct. 19, 2021, 6:23 a.m. UTC | #14
> > By you only addressing the deadlock as a requirement on approach a) you are
> > forgetting that there *may* already be present drivers which *do* implement
> > such patterns in the kernel. I worked on addressing the deadlock because
> > I was informed livepatching *did* have that issue as well and so very
> > likely a generic solution to the deadlock could be beneficial to other
> > random drivers.
> 
> In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> just fixed it, and seems it has been fixed by 3ec24776bfd0.

I would not call it a fix. It is a kind of ugly workaround because the 
generic infrastructure lacked (lacks) the proper support in my opinion. 
Luis is trying to fix that.

Just my two cents.

Miroslav
Ming Lei Oct. 19, 2021, 9:23 a.m. UTC | #15
On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > By you only addressing the deadlock as a requirement on approach a) you are
> > > forgetting that there *may* already be present drivers which *do* implement
> > > such patterns in the kernel. I worked on addressing the deadlock because
> > > I was informed livepatching *did* have that issue as well and so very
> > > likely a generic solution to the deadlock could be beneficial to other
> > > random drivers.
> > 
> > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> 
> I would not call it a fix. It is a kind of ugly workaround because the 
> generic infrastructure lacked (lacks) the proper support in my opinion. 
> Luis is trying to fix that.

What is the proper support of the generic infrastructure? I am not
familiar with livepatching's model(especially with module unload), you mean
livepatching have to do the following way from sysfs:

1) during module exit:
	
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);
	
2) show()/store() method of attributes of lp_kobj
	
	mutex_lock(lp_lock)
	...
	mutex_unlock(lp_lock)

IMO, the above usage simply caused AA deadlock. Even in Luis's patch
'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
(hot_remove_store() vs. disksize_store() or reset_store()) is added
because hot_remove_store() isn't called from module_exit().

Luis tries to delay unloading module until all show()/store() are done. But
that can be obtained by the following way simply during module_exit():

	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
						  //no new store()/show() can come after
						  //kobject_del() returns	
	mutex_lock(lp_lock);
	kobject_put(lp_kobj);
	mutex_unlock(lp_lock);

Or can you explain your requirement on kobject/module unload in a bit
details?


Thanks,
Ming
Luis Chamberlain Oct. 19, 2021, 3:28 p.m. UTC | #16
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> Please try the following patch against upstream(linus or next) tree(basically
> fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> module_exit(), race between zram_remove() and disksize_store()), and see if
> everything is fine for you:

Page fault ...

[   18.284256] zram: Removed device: zram0
[   18.312974] BUG: unable to handle page fault for address:
ffffad86de903008
[   18.313707] #PF: supervisor read access in kernel mode
[   18.314248] #PF: error_code(0x0000) - not-present page
[   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
PTE 0
[   18.315538] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   18.316012] CPU: 3 PID: 1198 Comm: rmmod Tainted: G            E
5.15.0-rc3-next-20210927+ #89
[   18.316979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.14.0-2 04/01/2014
[   18.317876] RIP: 0010:zram_free_page+0x1b/0xf0 [zram]
[   18.318430] Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f
44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 fb 48 01
ea <48> 8b 42 08 a9 00 00 00 20 74 14 48 25 ff ff ff df 48 89 42 08 48
[   18.320412] RSP: 0018:ffffad86f8013df8 EFLAGS: 00010286
[   18.320978] RAX: 0000000000000001 RBX: ffff9b7b435c7800 RCX:
0000000000000200
[   18.321758] RDX: ffffad86de903000 RSI: 0000000000000000 RDI:
ffff9b7b435c7800
[   18.322524] RBP: 0000000000000000 R08: 0000000000000200 R09:
0000000000000000
[   18.323299] R10: 0000000000000200 R11: 0000000000000000 R12:
0000000000000000
[   18.324030] R13: ffff9b7b55191800 R14: ffff9b7b435c7820 R15:
ffff9b7b4677f960
[   18.324784] FS:  00007fc8e4c90580(0000) GS:ffff9b7c77cc0000(0000)
knlGS:0000000000000000
[   18.325651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.326272] CR2: ffffad86de903008 CR3: 000000014f1de003 CR4:
0000000000370ee0
[   18.327047] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   18.327818] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   18.328586] Call Trace:
[   18.328852]  <TASK>
[   18.329284]  zram_reset_device+0xd8/0x140 [zram]
[   18.329983]  zram_remove.cold+0xa/0x20 [zram]
[   18.330644]  ? hot_remove_store+0xe0/0xe0 [zram]
[   18.331367]  zram_remove_cb+0xd/0x10 [zram]
[   18.332010]  idr_for_each+0x5b/0xd0
[   18.332578]  destroy_devices+0x26/0x50 [zram]
[   18.333238]  __do_sys_delete_module+0x18d/0x2a0
[   18.333913]  ? fpregs_assert_state_consistent+0x1e/0x40
[   18.334665]  ? exit_to_user_mode_prepare+0x3a/0x180
[   18.335395]  do_syscall_64+0x38/0xc0
[   18.335966]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.336681] RIP: 0033:0x7fc8e4db64a7
Luis Chamberlain Oct. 19, 2021, 3:50 p.m. UTC | #17
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> > On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > > ...
> > > > > > > 
> > > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > > 
> > > > > > See my commit log for my own fix for this issue.
> > > > > 
> > > > > OK, thanks!
> > > > > 
> > > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > > zram_remove() when unloading module, then the warning is caused.
> > > > > 
> > > > > The top 3 patches in the following tree can fix the issue:
> > > > > 
> > > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > > 
> > > > Thanks for trying an alternative fix! A crash stops yes, however this
> > > 
> > > I doubt it is alternative since your patchset doesn't mention the exact
> > > reason of 'Error: Removing state 63 which has instances left.', that is
> > > simply caused by failing to remove zram because ->claim is set during
> > > unloading module.
> > 
> > Well I disagree because it does explain how the race can happen, and it
> > also explains how since the sysfs interface is exposed until module
> > removal completes, it leaves exposed knobs to allow re-initializing of a
> > struct zcomp for a zram device before the exit.
> > 
> > > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > > however I don't think it is reproduced easily in the test because the race
> > > window is pretty small, also it can be fixed easily in my 3rd path
> > > without any complicated tricks.
> > 
> > Reproducing for me is... extremely easy.
> 
> In my observation, failing zram_remove() is extremely easy to trigger, which
> is caused by reset_store() which sets ->reclaim as true, so
> zram_remove() is failed and zram_reset_device() is bypassed , then the
> failure of 'Error: Removing state 63 which has instances left.' is caused.
> 
> We are in same page?

The actual first issue is the CPU hotplug remove callback is long gone and
in the meantime we allow a race to add a new "instance", in the zram
driver's case a cpu struct zcomp instance though the sysfs interface.

Regardless of if zram_remove() can fail or not, the above race needs to
be addressed.

> > > Not dig into details of your patchset via grabbing module reference
> > > count during show/store attribute of kernfs which is done in your patch
> > > 9, but IMO this way isn't necessary:
> > 
> > That's to address the deadlock only.
> > 
> > > 1) any driver module has to cleanup anything which may refer to symbols
> > > or data defined in module_exit of this driver
> > 
> > Yes, and as the cpu multistate hotplug documentation warns (although
> > such documentation is kind of hidden) that driver authors need to be
> > careful with module removal too, refer to the warning at the end of
> > __cpuhp_remove_state_cpuslocked() about module removal.
> 
> It is zram's bug. zram has to clean everything in module_exit(),
> unfortunately zram_remove() can be failed when calling from
> module_exit() because ->claim is set as true by reset_store(), then
> zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
> not happen when unloading module, should it?

You're addressing a possible failig zram_remove() while I address not
allowing entry to muck with the zram driver at all once we're bailing
on module removal.

> > > 2) device_del() is often done in module_exit(), once device_del()
> > > returns, no any new show/store on the device's kobject attribute
> > > is possible.
> > 
> > Right and if a syfs knob is exposed before device_del() completely
> > and is allowed to do things, the driver should take care to prevent
> > races for CPU multistate support. The small state machine I added ensures
> 
> What is the race for CPU multistate support? If you mean 'Error: Removing
> state 63 which has instances left.', it is zram's bug since zram has to
> cleanup everything in module_exit().

Yes. And it is what my out of tree yet Acked patch, 'zram: fix     
crashes with cpu hotplug multistate' does.

> > we don't run over any expectations from cpu hotplug multistate support.
> > 
> > I've *never* suggested there cannot be alternatives to my solution with
> > the small state machine, but for you to say it is incorrect is simply
> > not right either.
> > 
> > > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > > calling device_del(), meantime the lock is required in the device's
> > > attribute show()/store(), which causes AA deadlock easily. Your approach
> > > just avoids the issue by not releasing module until all show/store are
> > > done.
> > 
> > Right, there are two approaches here:
> > 
> > a) Your approach is to accept the deadlock as a requirement and so
> > you would prefer to implement an alternative to using a shared lock
> > on module exit and sysfs op.
> 
> wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
> applying my 3 patches. If you think there is, please share us the code
> or lockdep warning.

Right, 'zram: fix crashes with cpu hotplug multistate' is not yet
merged, my approach to fixing that does add a lock use on module removal
which does introduce a possible deadlock with syfs, which is later addressed
generically between sysfs and module removal for all drivers.

> > b) While I address such a deadlock head on as I think this sort of locking
> > be allowed for two reasons:
> >    b1) as we never documented such requirement otherwise.
> >    b2) There is a possibility that other drivers already exist too
> >        which *do* use a shared lock on module removal and sysfs ops
> >        (and I just confirmed this to be true)
> 
> The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
> crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
> in destroy_devices().

Yes yes, but you are completely throwing out the window that other
possible deadlocks can exist in the kernel *and* that *new* cases of
the deadlock can easily also be added!

> We can fix this issue easily without needing the global lock, please see the
> attached(pre-V2) patch.

So far your patches do not fix the issues though...

> > So I *really* don't think it is wise for us to simply accept this new
> > found deadlock as a *new* requirement, specially if we can fix it easily.
> > 
> > A cursory review using Coccinelle potential issues with mutex lock
> > directly used on module exit (so this doesn't cover drivers like zram
> > which uses a routine and then grabs the lock through indirection) and a
> > sysfs op shows these drivers are also affected by this deadlock:
> > 
> >   * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
> 
> In fsl_wakeup_sys_exit(), device_remove_file() is called before
> acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
> 
> >   * lib/test_firmware.c
> 
> Yeah, there is the AA deadlock risk, but it should be fixed by moving
> misc_deregister() out of &test_fw_mutex.

And just like that you are ignoring other possible uses in the kernel
which might have similar deadlocks.

So do you want to take the position:

Hey driver authors: you cannot use any shared lock on module removal and
on sysfs ops?

  Luis
Greg KH Oct. 19, 2021, 4:25 p.m. UTC | #18
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> So do you want to take the position:
> 
> Hey driver authors: you cannot use any shared lock on module removal and
> on sysfs ops?

Yes, I would not recommend using such a lock at all.  sysfs operations
happen on a per-device basis, so you can lock the device structure.
Module removal happens on a driver basis, and I have no idea what you
want to lock there, but odds are it is NOT shared with your per-device
structures either, right?

If so, then yes, that is a bug, but a very rare one as drivers should do
almost nothing except register/unregister_driver() in their module
init/exit calls.

zram is not a "normal" driver at all here, so fixing this type of
problem up should be done in the zram code, it is not a generic
module/sysfs issue at all.

thanks,

greg k-h
Ming Lei Oct. 19, 2021, 4:29 p.m. UTC | #19
On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > Please try the following patch against upstream(linus or next) tree(basically
> > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > module_exit(), race between zram_remove() and disksize_store()), and see if
> > everything is fine for you:
> 
> Page fault ...
> 
> [   18.284256] zram: Removed device: zram0
> [   18.312974] BUG: unable to handle page fault for address:
> ffffad86de903008
> [   18.313707] #PF: supervisor read access in kernel mode
> [   18.314248] #PF: error_code(0x0000) - not-present page
> [   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067

That is another race between zram_reset_device() and disksize_store(),
which is supposed to be covered by ->init_lock, and follows the delta fix
against the last patch I posted, and the whole patch can be found in the
github link:

https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894


diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
 
-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }
 
 static ssize_t disksize_store(struct device *dev,
Luis Chamberlain Oct. 19, 2021, 4:30 p.m. UTC | #20
On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > So do you want to take the position:
> > 
> > Hey driver authors: you cannot use any shared lock on module removal and
> > on sysfs ops?
> 
> Yes, I would not recommend using such a lock at all.  sysfs operations
> happen on a per-device basis, so you can lock the device structure.

All devices are going to be removed on module removal and so cannot be locked.

  Luis
Ming Lei Oct. 19, 2021, 4:39 p.m. UTC | #21
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> > > On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > > > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > > > ...
> > > > > > > > 
> > > > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > > > 
> > > > > > > See my commit log for my own fix for this issue.
> > > > > > 
> > > > > > OK, thanks!
> > > > > > 
> > > > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > > > zram_remove() when unloading module, then the warning is caused.
> > > > > > 
> > > > > > The top 3 patches in the following tree can fix the issue:
> > > > > > 
> > > > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > > > 
> > > > > Thanks for trying an alternative fix! A crash stops yes, however this
> > > > 
> > > > I doubt it is alternative since your patchset doesn't mention the exact
> > > > reason of 'Error: Removing state 63 which has instances left.', that is
> > > > simply caused by failing to remove zram because ->claim is set during
> > > > unloading module.
> > > 
> > > Well I disagree because it does explain how the race can happen, and it
> > > also explains how since the sysfs interface is exposed until module
> > > removal completes, it leaves exposed knobs to allow re-initializing of a
> > > struct zcomp for a zram device before the exit.
> > > 
> > > > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > > > however I don't think it is reproduced easily in the test because the race
> > > > window is pretty small, also it can be fixed easily in my 3rd path
> > > > without any complicated tricks.
> > > 
> > > Reproducing for me is... extremely easy.
> > 
> > In my observation, failing zram_remove() is extremely easy to trigger, which
> > is caused by reset_store() which sets ->reclaim as true, so
> > zram_remove() is failed and zram_reset_device() is bypassed , then the
> > failure of 'Error: Removing state 63 which has instances left.' is caused.
> > 
> > We are in same page?
> 
> The actual first issue is the CPU hotplug remove callback is long gone and
> in the meantime we allow a race to add a new "instance", in the zram
> driver's case a cpu struct zcomp instance though the sysfs interface.
> 
> Regardless of if zram_remove() can fail or not, the above race needs to
> be addressed.
> 
> > > > Not dig into details of your patchset via grabbing module reference
> > > > count during show/store attribute of kernfs which is done in your patch
> > > > 9, but IMO this way isn't necessary:
> > > 
> > > That's to address the deadlock only.
> > > 
> > > > 1) any driver module has to cleanup anything which may refer to symbols
> > > > or data defined in module_exit of this driver
> > > 
> > > Yes, and as the cpu multistate hotplug documentation warns (although
> > > such documentation is kind of hidden) that driver authors need to be
> > > careful with module removal too, refer to the warning at the end of
> > > __cpuhp_remove_state_cpuslocked() about module removal.
> > 
> > It is zram's bug. zram has to clean everything in module_exit(),
> > unfortunately zram_remove() can be failed when calling from
> > module_exit() because ->claim is set as true by reset_store(), then
> > zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
> > not happen when unloading module, should it?
> 
> You're addressing a possible failig zram_remove() while I address not
> allowing entry to muck with the zram driver at all once we're bailing
> on module removal.
> 
> > > > 2) device_del() is often done in module_exit(), once device_del()
> > > > returns, no any new show/store on the device's kobject attribute
> > > > is possible.
> > > 
> > > Right and if a syfs knob is exposed before device_del() completely
> > > and is allowed to do things, the driver should take care to prevent
> > > races for CPU multistate support. The small state machine I added ensures
> > 
> > What is the race for CPU multistate support? If you mean 'Error: Removing
> > state 63 which has instances left.', it is zram's bug since zram has to
> > cleanup everything in module_exit().
> 
> Yes. And it is what my out of tree yet Acked patch, 'zram: fix     
> crashes with cpu hotplug multistate' does.

Unfortunately that patch adds new deadlock between hot_remove_store() and
disksize_store() & others, see my below comment.

> 
> > > we don't run over any expectations from cpu hotplug multistate support.
> > > 
> > > I've *never* suggested there cannot be alternatives to my solution with
> > > the small state machine, but for you to say it is incorrect is simply
> > > not right either.
> > > 
> > > > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > > > calling device_del(), meantime the lock is required in the device's
> > > > attribute show()/store(), which causes AA deadlock easily. Your approach
> > > > just avoids the issue by not releasing module until all show/store are
> > > > done.
> > > 
> > > Right, there are two approaches here:
> > > 
> > > a) Your approach is to accept the deadlock as a requirement and so
> > > you would prefer to implement an alternative to using a shared lock
> > > on module exit and sysfs op.
> > 
> > wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
> > applying my 3 patches. If you think there is, please share us the code
> > or lockdep warning.
> 
> Right, 'zram: fix crashes with cpu hotplug multistate' is not yet
> merged, my approach to fixing that does add a lock use on module removal
> which does introduce a possible deadlock with syfs, which is later addressed
> generically between sysfs and module removal for all drivers.
> 
> > > b) While I address such a deadlock head on as I think this sort of locking
> > > be allowed for two reasons:
> > >    b1) as we never documented such requirement otherwise.
> > >    b2) There is a possibility that other drivers already exist too
> > >        which *do* use a shared lock on module removal and sysfs ops
> > >        (and I just confirmed this to be true)
> > 
> > The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
> > crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
> > in destroy_devices().
> 
> Yes yes, but you are completely throwing out the window that other
> possible deadlocks can exist in the kernel *and* that *new* cases of
> the deadlock can easily also be added!
> 
> > We can fix this issue easily without needing the global lock, please see the
> > attached(pre-V2) patch.
> 
> So far your patches do not fix the issues though...
> 
> > > So I *really* don't think it is wise for us to simply accept this new
> > > found deadlock as a *new* requirement, specially if we can fix it easily.
> > > 
> > > A cursory review using Coccinelle potential issues with mutex lock
> > > directly used on module exit (so this doesn't cover drivers like zram
> > > which uses a routine and then grabs the lock through indirection) and a
> > > sysfs op shows these drivers are also affected by this deadlock:
> > > 
> > >   * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
> > 
> > In fsl_wakeup_sys_exit(), device_remove_file() is called before
> > acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
> > 
> > >   * lib/test_firmware.c
> > 
> > Yeah, there is the AA deadlock risk, but it should be fixed by moving
> > misc_deregister() out of &test_fw_mutex.
> 
> And just like that you are ignoring other possible uses in the kernel
> which might have similar deadlocks.
> 
> So do you want to take the position:
> 
> Hey driver authors: you cannot use any shared lock on module removal and
> on sysfs ops?

IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
when you added mutex_lock(zram_index_mutex) to disksize_store() and
other attribute show() or store() method. You have added new deadlock
between hot_remove_store() and disksize_store() & others, which can't be
addressed by your approach of holding module refcnt.

So far not see ltp tests covers hot add/remove interface yet.


Thanks, 
Ming
Greg KH Oct. 19, 2021, 5:28 p.m. UTC | #22
On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > So do you want to take the position:
> > > 
> > > Hey driver authors: you cannot use any shared lock on module removal and
> > > on sysfs ops?
> > 
> > Yes, I would not recommend using such a lock at all.  sysfs operations
> > happen on a per-device basis, so you can lock the device structure.
> 
> All devices are going to be removed on module removal and so cannot be locked.

devices are not normally created by a driver, that is up to the bus
controller logic.  A module will just disconnect itself from the device,
the device does not go away.

But yes, there are exceptions, and if you are doing something odd like
that, then you need to be aware of crazy things like this, so be
careful.  But for all normal drivers, they do not have to worry about
this.

thanks,

greg k-h
Luis Chamberlain Oct. 19, 2021, 7:36 p.m. UTC | #23
On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > > Please try the following patch against upstream(linus or next) tree(basically
> > > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > > module_exit(), race between zram_remove() and disksize_store()), and see if
> > > everything is fine for you:
> > 
> > Page fault ...
> > 
> > [   18.284256] zram: Removed device: zram0
> > [   18.312974] BUG: unable to handle page fault for address:
> > ffffad86de903008
> > [   18.313707] #PF: supervisor read access in kernel mode
> > [   18.314248] #PF: error_code(0x0000) - not-present page
> > [   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
> 
> That is another race between zram_reset_device() and disksize_store(),
> which is supposed to be covered by ->init_lock, and follows the delta fix
> against the last patch I posted, and the whole patch can be found in the
> github link:
> 
> https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
> 
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d0cae7a42f4d..a14ba3d350ea 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
>  	set_capacity_and_notify(zram->disk, 0);
>  	part_stat_set_all(zram->disk->part0, 0);
>  
> -	up_write(&zram->init_lock);
>  	/* I/O operation under all of CPU are done so let's free */
>  	zram_meta_free(zram, disksize);
>  	memset(&zram->stats, 0, sizeof(zram->stats));
>  	zcomp_destroy(comp);
>  	reset_bdev(zram);
> +	up_write(&zram->init_lock);
>  }
>  
>  static ssize_t disksize_store(struct device *dev,

With this, it still ends up in a state where we loop and can't get out of:

zram: Can't change algorithm for initialized device

  Luis
Luis Chamberlain Oct. 19, 2021, 7:38 p.m. UTC | #24
On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > So do you want to take the position:
> > 
> > Hey driver authors: you cannot use any shared lock on module removal and
> > on sysfs ops?
> 
> IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
> when you added mutex_lock(zram_index_mutex) to disksize_store() and
> other attribute show() or store() method. You have added new deadlock
> between hot_remove_store() and disksize_store() & others, which can't be
> addressed by your approach of holding module refcnt.
> 
> So far not see ltp tests covers hot add/remove interface yet.

Care to show what commands to use to cause this deadlock with my patches?

  Luis
Luis Chamberlain Oct. 19, 2021, 7:46 p.m. UTC | #25
On Tue, Oct 19, 2021 at 07:28:35PM +0200, Greg KH wrote:
> On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> > > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > > So do you want to take the position:
> > > > 
> > > > Hey driver authors: you cannot use any shared lock on module removal and
> > > > on sysfs ops?
> > > 
> > > Yes, I would not recommend using such a lock at all.  sysfs operations
> > > happen on a per-device basis, so you can lock the device structure.
> > 
> > All devices are going to be removed on module removal and so cannot be locked.
> 
> devices are not normally created by a driver, that is up to the bus
> controller logic.  A module will just disconnect itself from the device,
> the device does not go away.
> 
> But yes, there are exceptions, and if you are doing something odd like
> that, then you need to be aware of crazy things like this, so be
> careful.  But for all normal drivers, they do not have to worry about
> this.

"Recommend" is a weak position to take given a possible deadlock with sysfs.

Do we want to at the very least document this is not a supported scheme?

If so I can also add a simple 1 level indirrection coccinelle patch to
detect these schemes and complain about them as wel, if we are going to
take this position.

But to simply disregard this as "not an issue", or we won't do anything
seems pretty counter productive given we *do* had drivers with this
issue before *and* still have them upstream, and can end up with more
drivers like this later.

  Luis
Ming Lei Oct. 20, 2021, 12:55 a.m. UTC | #26
On Tue, Oct 19, 2021 at 12:38:42PM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > So do you want to take the position:
> > > 
> > > Hey driver authors: you cannot use any shared lock on module removal and
> > > on sysfs ops?
> > 
> > IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
> > when you added mutex_lock(zram_index_mutex) to disksize_store() and
> > other attribute show() or store() method. You have added new deadlock
> > between hot_remove_store() and disksize_store() & others, which can't be
> > addressed by your approach of holding module refcnt.
> > 
> > So far not see ltp tests covers hot add/remove interface yet.
> 
> Care to show what commands to use to cause this deadlock with my patches?

Build a kernel with your patch 4,7,8,9,11 and 12(all others are test module or
document change), with lockdep enabled, run the following command, then you
will see the warning, and it is one real deadlock, not false warning.

BTW, your patch 9 can't be applied cleanly against both linus and next
tree, so I edited it manually, but that can't make difference wrt. this issue.


[root@ktest-09 ~]# lsblk | grep zram
zram0   253:0    0    0B  0 disk 
cat /sys/class/zram-control/hot_add
[root@ktest-09 ~]# lsblk | grep zram
zram0   253:0    0    0B  0 disk 
zram1   253:1    0    0B  0 disk 
[root@ktest-09 ~]# echo 256M > /sys/block/zram1/disksize 
[root@ktest-09 ~]# echo 1 >  /sys/class/zram-control/hot_remove 
[root@ktest-09 ~]# dmesg
...
[   75.599882] ======================================================
[   75.601355] WARNING: possible circular locking dependency detected
[   75.602818] 5.15.0-rc3_zram_fix_luis+ #24 Not tainted
[   75.604038] ------------------------------------------------------
[   75.605512] bash/1154 is trying to acquire lock:
[   75.606634] ffff91ce026cd428 (kn->active#237){++++}-{0:0}, at: __kernfs_remove+0x1ab/0x1e0
[   75.608570]
               but task is already holding lock:
[   75.609955] ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[   75.611910]
               which lock already depends on the new lock.

[   75.613896]
               the existing dependency chain (in reverse order) is:
[   75.615830]
               -> #1 (zram_index_mutex){+.+.}-{3:3}:
[   75.617483]        __lock_acquire+0x4d2/0x930
[   75.618650]        lock_acquire+0xbb/0x2d0
[   75.619748]        __mutex_lock+0x8e/0x8a0
[   75.620854]        disksize_store+0x38/0x180
[   75.621996]        kernfs_fop_write_iter+0x134/0x1d0
[   75.623287]        new_sync_write+0x122/0x1b0
[   75.624442]        vfs_write+0x23e/0x350
[   75.625506]        ksys_write+0x68/0xe0
[   75.626550]        do_syscall_64+0x3b/0x90
[   75.627649]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.629070]
               -> #0 (kn->active#237){++++}-{0:0}:
[   75.630677]        check_prev_add+0x91/0xc10
[   75.631816]        validate_chain+0x474/0x500
[   75.632972]        __lock_acquire+0x4d2/0x930
[   75.634131]        lock_acquire+0xbb/0x2d0
[   75.635234]        kernfs_drain+0x139/0x190
[   75.636355]        __kernfs_remove+0x1ab/0x1e0
[   75.637532]        kernfs_remove_by_name_ns+0x3f/0x80
[   75.638843]        remove_files+0x2b/0x60
[   75.639926]        sysfs_remove_group+0x38/0x80
[   75.641120]        sysfs_remove_groups+0x29/0x40
[   75.642334]        device_remove_attrs+0x5b/0x90
[   75.643552]        device_del+0x184/0x400
[   75.644635]        zram_remove+0xac/0xc0
[   75.645700]        hot_remove_store+0xa3/0xf0
[   75.646856]        kernfs_fop_write_iter+0x134/0x1d0
[   75.648147]        new_sync_write+0x122/0x1b0
[   75.649311]        vfs_write+0x23e/0x350
[   75.650372]        ksys_write+0x68/0xe0
[   75.651412]        do_syscall_64+0x3b/0x90
[   75.652512]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.653929]
               other info that might help us debug this:

[   75.656054]  Possible unsafe locking scenario:

[   75.657637]        CPU0                    CPU1
[   75.658833]        ----                    ----
[   75.660020]   lock(zram_index_mutex);
[   75.661024]                                lock(kn->active#237);
[   75.662549]                                lock(zram_index_mutex);
[   75.664103]   lock(kn->active#237);
[   75.665072]
                *** DEADLOCK ***

[   75.666736] 4 locks held by bash/1154:
[   75.667767]  #0: ffff91ce06983470 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x68/0xe0
[   75.669802]  #1: ffff91ce4123d290 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x100/0x1d0
[   75.672050]  #2: ffff91ce05a7ac40 (kn->active#238){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x108/0x1d0
[   75.674383]  #3: ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[   75.676595]
               stack backtrace:
[   75.677835] CPU: 2 PID: 1154 Comm: bash Not tainted 5.15.0-rc3_zram_fix_luis+ #24
[   75.679768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[   75.681927] Call Trace:
[   75.682674]  dump_stack_lvl+0x57/0x7d
[   75.683680]  check_noncircular+0xff/0x110
[   75.684758]  ? stack_trace_save+0x4b/0x70
[   75.685843]  check_prev_add+0x91/0xc10
[   75.686867]  ? add_chain_cache+0x112/0x2d0
[   75.687965]  validate_chain+0x474/0x500
[   75.689005]  __lock_acquire+0x4d2/0x930
[   75.690054]  lock_acquire+0xbb/0x2d0
[   75.691038]  ? __kernfs_remove+0x1ab/0x1e0
[   75.692131]  ? __lock_release+0x179/0x2c0
[   75.693212]  ? kernfs_drain+0x5b/0x190
[   75.694239]  kernfs_drain+0x139/0x190
[   75.695240]  ? __kernfs_remove+0x1ab/0x1e0
[   75.696341]  __kernfs_remove+0x1ab/0x1e0
[   75.697408]  kernfs_remove_by_name_ns+0x3f/0x80
[   75.698607]  remove_files+0x2b/0x60
[   75.699576]  sysfs_remove_group+0x38/0x80
[   75.700661]  sysfs_remove_groups+0x29/0x40
[   75.701770]  device_remove_attrs+0x5b/0x90
[   75.702870]  device_del+0x184/0x400
[   75.703835]  zram_remove+0xac/0xc0
[   75.704785]  hot_remove_store+0xa3/0xf0
[   75.705831]  kernfs_fop_write_iter+0x134/0x1d0
[   75.707004]  new_sync_write+0x122/0x1b0
[   75.708048]  ? __do_fast_syscall_32+0xe0/0xf0
[   75.709214]  vfs_write+0x23e/0x350
[   75.710161]  ksys_write+0x68/0xe0
[   75.711088]  do_syscall_64+0x3b/0x90
[   75.712078]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   75.713389] RIP: 0033:0x7fcc1893f927
[   75.714381] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   75.718879] RSP: 002b:00007ffcd56d91a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   75.720832] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fcc1893f927
[   75.722592] RDX: 0000000000000002 RSI: 000055d7d33f78c0 RDI: 0000000000000001
[   75.724352] RBP: 000055d7d33f78c0 R08: 0000000000000000 R09: 00007fcc189f44e0
[   75.726123] R10: 00007fcc189f43e0 R11: 0000000000000246 R12: 0000000000000002
[   75.727884] R13: 00007fcc18a395a0 R14: 0000000000000002 R15: 00007fcc18a397a0



Thanks, 
Ming
Ming Lei Oct. 20, 2021, 1:15 a.m. UTC | #27
On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> > > On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > > > Please try the following patch against upstream(linus or next) tree(basically
> > > > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > > > module_exit(), race between zram_remove() and disksize_store()), and see if
> > > > everything is fine for you:
> > > 
> > > Page fault ...
> > > 
> > > [   18.284256] zram: Removed device: zram0
> > > [   18.312974] BUG: unable to handle page fault for address:
> > > ffffad86de903008
> > > [   18.313707] #PF: supervisor read access in kernel mode
> > > [   18.314248] #PF: error_code(0x0000) - not-present page
> > > [   18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
> > 
> > That is another race between zram_reset_device() and disksize_store(),
> > which is supposed to be covered by ->init_lock, and follows the delta fix
> > against the last patch I posted, and the whole patch can be found in the
> > github link:
> > 
> > https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
> > 
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index d0cae7a42f4d..a14ba3d350ea 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> >  	set_capacity_and_notify(zram->disk, 0);
> >  	part_stat_set_all(zram->disk->part0, 0);
> >  
> > -	up_write(&zram->init_lock);
> >  	/* I/O operation under all of CPU are done so let's free */
> >  	zram_meta_free(zram, disksize);
> >  	memset(&zram->stats, 0, sizeof(zram->stats));
> >  	zcomp_destroy(comp);
> >  	reset_bdev(zram);
> > +	up_write(&zram->init_lock);
> >  }
> >  
> >  static ssize_t disksize_store(struct device *dev,
> 
> With this, it still ends up in a state where we loop and can't get out of:
> 
> zram: Can't change algorithm for initialized device

Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
behavior. Here the difference is just timing. In my test VM,
this message shows a while on one task, then it may be switched to
another task.

Just run your patches a while, nothing real difference here, and the
following message can be dumped from one task for long time:

	can't set '107374182400' to /sys/block/zram0/disksize

Also you did not answer my question about your test expected result when
running the following script from two terminal concurrently:

	while true; do
		PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
	done




Thanks,
Ming
Miroslav Benes Oct. 20, 2021, 6:43 a.m. UTC | #28
On Tue, 19 Oct 2021, Ming Lei wrote:

> On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > forgetting that there *may* already be present drivers which *do* implement
> > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > I was informed livepatching *did* have that issue as well and so very
> > > > likely a generic solution to the deadlock could be beneficial to other
> > > > random drivers.
> > > 
> > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > 
> > I would not call it a fix. It is a kind of ugly workaround because the 
> > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > Luis is trying to fix that.
> 
> What is the proper support of the generic infrastructure? I am not
> familiar with livepatching's model(especially with module unload), you mean
> livepatching have to do the following way from sysfs:
> 
> 1) during module exit:
> 	
> 	mutex_lock(lp_lock);
> 	kobject_put(lp_kobj);
> 	mutex_unlock(lp_lock);
> 	
> 2) show()/store() method of attributes of lp_kobj
> 	
> 	mutex_lock(lp_lock)
> 	...
> 	mutex_unlock(lp_lock)

Yes, this was exactly the case. We then reworked it a lot (see 
958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
now the call sequence is different. kobject_put() is basically offloaded 
to a workqueue scheduled right from the store() method. Meaning that 
Luis's work would probably not help us currently, but on the other hand 
the issues with AA deadlock were one of the main drivers of the redesign 
(if I remember correctly). There were other reasons too as the changelog 
of the commit describes.

So, from my perspective, if there was a way to easily synchronize between 
a data cleanup from module_exit callback and sysfs/kernfs operations, it 
could spare people many headaches.
 
> IMO, the above usage simply caused AA deadlock. Even in Luis's patch
> 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
> (hot_remove_store() vs. disksize_store() or reset_store()) is added
> because hot_remove_store() isn't called from module_exit().
> 
> Luis tries to delay unloading module until all show()/store() are done. But
> that can be obtained by the following way simply during module_exit():
> 
> 	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
> 						  //no new store()/show() can come after
> 						  //kobject_del() returns	
> 	mutex_lock(lp_lock);
> 	kobject_put(lp_kobj);
> 	mutex_unlock(lp_lock);

kobject_del() already calls kobject_put(). Did you mean __kobject_del(). 
That one is internal though.
 
> Or can you explain your requirement on kobject/module unload in a bit
> details?

Does the above makes sense?

Thanks

Miroslav
Ming Lei Oct. 20, 2021, 7:49 a.m. UTC | #29
On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> On Tue, 19 Oct 2021, Ming Lei wrote:
> 
> > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > random drivers.
> > > > 
> > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > 
> > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > Luis is trying to fix that.
> > 
> > What is the proper support of the generic infrastructure? I am not
> > familiar with livepatching's model(especially with module unload), you mean
> > livepatching have to do the following way from sysfs:
> > 
> > 1) during module exit:
> > 	
> > 	mutex_lock(lp_lock);
> > 	kobject_put(lp_kobj);
> > 	mutex_unlock(lp_lock);
> > 	
> > 2) show()/store() method of attributes of lp_kobj
> > 	
> > 	mutex_lock(lp_lock)
> > 	...
> > 	mutex_unlock(lp_lock)
> 
> Yes, this was exactly the case. We then reworked it a lot (see 
> 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> now the call sequence is different. kobject_put() is basically offloaded 
> to a workqueue scheduled right from the store() method. Meaning that 
> Luis's work would probably not help us currently, but on the other hand 
> the issues with AA deadlock were one of the main drivers of the redesign 
> (if I remember correctly). There were other reasons too as the changelog 
> of the commit describes.
> 
> So, from my perspective, if there was a way to easily synchronize between 
> a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> could spare people many headaches.

kobject_del() is supposed to do so, but you can't hold a shared lock
which is required in show()/store() method. Once kobject_del() returns,
no pending show()/store() any more.

The question is that why one shared lock is required for livepatching to
delete the kobject. What are you protecting when you delete one kobject?

>  
> > IMO, the above usage simply caused AA deadlock. Even in Luis's patch
> > 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
> > (hot_remove_store() vs. disksize_store() or reset_store()) is added
> > because hot_remove_store() isn't called from module_exit().
> > 
> > Luis tries to delay unloading module until all show()/store() are done. But
> > that can be obtained by the following way simply during module_exit():
> > 
> > 	kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
> > 						  //no new store()/show() can come after
> > 						  //kobject_del() returns	
> > 	mutex_lock(lp_lock);
> > 	kobject_put(lp_kobj);
> > 	mutex_unlock(lp_lock);
> 
> kobject_del() already calls kobject_put(). Did you mean __kobject_del(). 
> That one is internal though.

kobject_del() is counter-part of kobject_add(), and kobject_put() will
call kobject_del() automatically() if it isn't deleted yet, but usually
kobject_put() is for releasing the object only. It is more often to
release kobject by calling kobject_del() and kobject_put().

>  
> > Or can you explain your requirement on kobject/module unload in a bit
> > details?
> 
> Does the above makes sense?

I think now focus is the shared lock between kobject_del() and
show()/store() of the kobject's attributes.


Thanks,
Ming
Miroslav Benes Oct. 20, 2021, 8:19 a.m. UTC | #30
On Wed, 20 Oct 2021, Ming Lei wrote:

> On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > On Tue, 19 Oct 2021, Ming Lei wrote:
> > 
> > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > random drivers.
> > > > > 
> > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > 
> > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > Luis is trying to fix that.
> > > 
> > > What is the proper support of the generic infrastructure? I am not
> > > familiar with livepatching's model(especially with module unload), you mean
> > > livepatching have to do the following way from sysfs:
> > > 
> > > 1) during module exit:
> > > 	
> > > 	mutex_lock(lp_lock);
> > > 	kobject_put(lp_kobj);
> > > 	mutex_unlock(lp_lock);
> > > 	
> > > 2) show()/store() method of attributes of lp_kobj
> > > 	
> > > 	mutex_lock(lp_lock)
> > > 	...
> > > 	mutex_unlock(lp_lock)
> > 
> > Yes, this was exactly the case. We then reworked it a lot (see 
> > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > now the call sequence is different. kobject_put() is basically offloaded 
> > to a workqueue scheduled right from the store() method. Meaning that 
> > Luis's work would probably not help us currently, but on the other hand 
> > the issues with AA deadlock were one of the main drivers of the redesign 
> > (if I remember correctly). There were other reasons too as the changelog 
> > of the commit describes.
> > 
> > So, from my perspective, if there was a way to easily synchronize between 
> > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > could spare people many headaches.
> 
> kobject_del() is supposed to do so, but you can't hold a shared lock
> which is required in show()/store() method. Once kobject_del() returns,
> no pending show()/store() any more.
> 
> The question is that why one shared lock is required for livepatching to
> delete the kobject. What are you protecting when you delete one kobject?

I think it boils down to the fact that we embed kobject statically to 
structures which livepatch uses to maintain data. That is discouraged 
generally, but all the attempts to implement it correctly were utter 
failures.

Miroslav
Greg KH Oct. 20, 2021, 8:28 a.m. UTC | #31
On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> On Wed, 20 Oct 2021, Ming Lei wrote:
> 
> > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > 
> > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > random drivers.
> > > > > > 
> > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > 
> > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > Luis is trying to fix that.
> > > > 
> > > > What is the proper support of the generic infrastructure? I am not
> > > > familiar with livepatching's model(especially with module unload), you mean
> > > > livepatching have to do the following way from sysfs:
> > > > 
> > > > 1) during module exit:
> > > > 	
> > > > 	mutex_lock(lp_lock);
> > > > 	kobject_put(lp_kobj);
> > > > 	mutex_unlock(lp_lock);
> > > > 	
> > > > 2) show()/store() method of attributes of lp_kobj
> > > > 	
> > > > 	mutex_lock(lp_lock)
> > > > 	...
> > > > 	mutex_unlock(lp_lock)
> > > 
> > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > now the call sequence is different. kobject_put() is basically offloaded 
> > > to a workqueue scheduled right from the store() method. Meaning that 
> > > Luis's work would probably not help us currently, but on the other hand 
> > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > (if I remember correctly). There were other reasons too as the changelog 
> > > of the commit describes.
> > > 
> > > So, from my perspective, if there was a way to easily synchronize between 
> > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > could spare people many headaches.
> > 
> > kobject_del() is supposed to do so, but you can't hold a shared lock
> > which is required in show()/store() method. Once kobject_del() returns,
> > no pending show()/store() any more.
> > 
> > The question is that why one shared lock is required for livepatching to
> > delete the kobject. What are you protecting when you delete one kobject?
> 
> I think it boils down to the fact that we embed kobject statically to 
> structures which livepatch uses to maintain data. That is discouraged 
> generally, but all the attempts to implement it correctly were utter 
> failures.

Sounds like this is the real problem that needs to be fixed.  kobjects
should always control the lifespan of the structure they are embedded
in.  If not, then that is a design flaw of the user of the kobject :(

Where in the kernel is this happening?  And where have been the attempts
to fix this up?

thanks,

greg k-h
Ming Lei Oct. 20, 2021, 10:09 a.m. UTC | #32
On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> On Wed, 20 Oct 2021, Ming Lei wrote:
> 
> > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > 
> > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > random drivers.
> > > > > > 
> > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > 
> > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > Luis is trying to fix that.
> > > > 
> > > > What is the proper support of the generic infrastructure? I am not
> > > > familiar with livepatching's model(especially with module unload), you mean
> > > > livepatching have to do the following way from sysfs:
> > > > 
> > > > 1) during module exit:
> > > > 	
> > > > 	mutex_lock(lp_lock);
> > > > 	kobject_put(lp_kobj);
> > > > 	mutex_unlock(lp_lock);
> > > > 	
> > > > 2) show()/store() method of attributes of lp_kobj
> > > > 	
> > > > 	mutex_lock(lp_lock)
> > > > 	...
> > > > 	mutex_unlock(lp_lock)
> > > 
> > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > now the call sequence is different. kobject_put() is basically offloaded 
> > > to a workqueue scheduled right from the store() method. Meaning that 
> > > Luis's work would probably not help us currently, but on the other hand 
> > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > (if I remember correctly). There were other reasons too as the changelog 
> > > of the commit describes.
> > > 
> > > So, from my perspective, if there was a way to easily synchronize between 
> > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > could spare people many headaches.
> > 
> > kobject_del() is supposed to do so, but you can't hold a shared lock
> > which is required in show()/store() method. Once kobject_del() returns,
> > no pending show()/store() any more.
> > 
> > The question is that why one shared lock is required for livepatching to
> > delete the kobject. What are you protecting when you delete one kobject?
> 
> I think it boils down to the fact that we embed kobject statically to 
> structures which livepatch uses to maintain data. That is discouraged 
> generally, but all the attempts to implement it correctly were utter 
> failures.

OK, then it isn't one common usage, in which kobject covers the release
of the external object. What is the exact kobject in livepatching?

But kobject_del() won't release the kobject, you shouldn't need the lock
to delete kobject first. After the kobject is deleted, no any show() and
store() any more, isn't such sync[1] you expected?


Thanks,
Ming
Luis Chamberlain Oct. 20, 2021, 3:48 p.m. UTC | #33
On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> > On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index d0cae7a42f4d..a14ba3d350ea 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > >  	set_capacity_and_notify(zram->disk, 0);
> > >  	part_stat_set_all(zram->disk->part0, 0);
> > >  
> > > -	up_write(&zram->init_lock);
> > >  	/* I/O operation under all of CPU are done so let's free */
> > >  	zram_meta_free(zram, disksize);
> > >  	memset(&zram->stats, 0, sizeof(zram->stats));
> > >  	zcomp_destroy(comp);
> > >  	reset_bdev(zram);
> > > +	up_write(&zram->init_lock);
> > >  }
> > >  
> > >  static ssize_t disksize_store(struct device *dev,
> > 
> > With this, it still ends up in a state where we loop and can't get out of:
> > 
> > zram: Can't change algorithm for initialized device
> 
> Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected

You mean that it is not expected? If so then yes, of course.

> behavior. Here the difference is just timing.

Right, but that is what helped reproduce a difficutl to re-produce customer
bug. Once you find an easy way to reproduce a reported issue you stick
with it and try to make the situation worse to ensure no more bugs are
present.

> Also you did not answer my question about your test expected result when
> running the following script from two terminal concurrently:
> 
> 	while true; do
> 		PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
> 	done

If you run this, you should see no failures.

Once you start a second script that one should cause odd issues on both
sides but never crash or stall the module.

A second series of tests is hitting CTRL-C on either randonly and
restarting testing once again randomly.

Again, neither should crash the kernel or stall the module.

In the end of these tests you should be able to run the script alone
just once and not see issues.

  Luis
Ming Lei Oct. 21, 2021, 12:39 a.m. UTC | #34
On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> > > On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index d0cae7a42f4d..a14ba3d350ea 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > > >  	set_capacity_and_notify(zram->disk, 0);
> > > >  	part_stat_set_all(zram->disk->part0, 0);
> > > >  
> > > > -	up_write(&zram->init_lock);
> > > >  	/* I/O operation under all of CPU are done so let's free */
> > > >  	zram_meta_free(zram, disksize);
> > > >  	memset(&zram->stats, 0, sizeof(zram->stats));
> > > >  	zcomp_destroy(comp);
> > > >  	reset_bdev(zram);
> > > > +	up_write(&zram->init_lock);
> > > >  }
> > > >  
> > > >  static ssize_t disksize_store(struct device *dev,
> > > 
> > > With this, it still ends up in a state where we loop and can't get out of:
> > > 
> > > zram: Can't change algorithm for initialized device
> > 
> > Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
> 
> You mean that it is not expected? If so then yes, of course.

My meaning is clear: it is not unexpected, so it is expected.

> 
> > behavior. Here the difference is just timing.
> 
> Right, but that is what helped reproduce a difficutl to re-produce customer
> bug. Once you find an easy way to reproduce a reported issue you stick
> with it and try to make the situation worse to ensure no more bugs are
> present.
> 
> > Also you did not answer my question about your test expected result when
> > running the following script from two terminal concurrently:
> > 
> > 	while true; do
> > 		PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
> > 	done
> 
> If you run this, you should see no failures.

OK, not see any failure when running single zram02.sh after applying my
patch V2.

> 
> Once you start a second script that one should cause odd issues on both
> sides but never crash or stall the module.

crash can't be observed with my patch V2, what do you mean 'stall'
the module? Is that 'zram' can't be unloaded after the test is
terminated via multiple 'ctrl-c'?

> 
> A second series of tests is hitting CTRL-C on either randonly and
> restarting testing once again randomly.

ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
rmmod), ctrl-c will terminate current forground task and cause shell to run the
cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
then the cleanup won't be done completely, such as zram disk is left as swap
device and zram can't be unloaded. The idea can be observed via the following
script:

	#!/bin/bash
	trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
	sleep 30

After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
is terminated, then the trap command is run, so you can see "enter trap"
dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
So 'swapoff' from zram02.sh's trap function can be terminated in this way.

zram disk being left as swap disk can be observed with your patch too
after terminating via multiple ctrl-c which has to be done this way because
the test is dead loop.

So it is hard to cleanup everything completely after multiple 'CTRL-C' is
involved, and it should be impossible. It needs violent multiple ctrl-c to
terminate the dealoop test.

So it isn't reasonable to expect that zram can be always unloaded successfully
after the test script is terminated via multiple ctrl-c.

But zram can be unloaded after running swapoff manually, from driver
viewpoint, nothing is wrong.

> 
> Again, neither should crash the kernel or stall the module.
> 
> In the end of these tests you should be able to run the script alone
> just once and not see issues.


Thanks,
Ming
Luis Chamberlain Oct. 21, 2021, 5:18 p.m. UTC | #35
On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> > A second series of tests is hitting CTRL-C on either randonly and
> > restarting testing once again randomly.
> 
> ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
> rmmod), ctrl-c will terminate current forground task and cause shell to run the
> cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
> then the cleanup won't be done completely, such as zram disk is left as swap
> device and zram can't be unloaded. The idea can be observed via the following
> script:
> 
> 	#!/bin/bash
> 	trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
> 	sleep 30
> 
> After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
> is terminated, then the trap command is run, so you can see "enter trap"
> dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
> So 'swapoff' from zram02.sh's trap function can be terminated in this way.
> 
> zram disk being left as swap disk can be observed with your patch too
> after terminating via multiple ctrl-c which has to be done this way because
> the test is dead loop.
> 
> So it is hard to cleanup everything completely after multiple 'CTRL-C' is
> involved, and it should be impossible. It needs violent multiple ctrl-c to
> terminate the dealoop test.
> 
> So it isn't reasonable to expect that zram can be always unloaded successfully
> after the test script is terminated via multiple ctrl-c.

For the life of me, I do not run into these issue with my patch. But
with yours I had.

To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave
CTRL-C pressed to issue multiple terminations until the script is done
on each terminal at a time, until I see both have completed.

I repeat the same test, noting always that when I start one one terminal
the test is succeeding. And also when I cancel completely one script the
test continue fine without issue.

> But zram can be unloaded after running swapoff manually, from driver
> viewpoint, nothing is wrong.

I had not run into that issue with my patch FWIW.

  Luis
Ming Lei Oct. 22, 2021, 12:05 a.m. UTC | #36
On Thu, Oct 21, 2021 at 10:18:47AM -0700, Luis Chamberlain wrote:
> On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
> > On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> > > A second series of tests is hitting CTRL-C on either randonly and
> > > restarting testing once again randomly.
> > 
> > ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
> > rmmod), ctrl-c will terminate current forground task and cause shell to run the
> > cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
> > then the cleanup won't be done completely, such as zram disk is left as swap
> > device and zram can't be unloaded. The idea can be observed via the following
> > script:
> > 
> > 	#!/bin/bash
> > 	trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
> > 	sleep 30
> > 
> > After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
> > is terminated, then the trap command is run, so you can see "enter trap"
> > dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
> > So 'swapoff' from zram02.sh's trap function can be terminated in this way.
> > 
> > zram disk being left as swap disk can be observed with your patch too
> > after terminating via multiple ctrl-c which has to be done this way because
> > the test is dead loop.
> > 
> > So it is hard to cleanup everything completely after multiple 'CTRL-C' is
> > involved, and it should be impossible. It needs violent multiple ctrl-c to
> > terminate the dealoop test.
> > 
> > So it isn't reasonable to expect that zram can be always unloaded successfully
> > after the test script is terminated via multiple ctrl-c.
> 
> For the life of me, I do not run into these issue with my patch. But
> with yours I had.
> 
> To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave
> CTRL-C pressed to issue multiple terminations until the script is done
> on each terminal at a time, until I see both have completed.
> 
> I repeat the same test, noting always that when I start one one terminal
> the test is succeeding. And also when I cancel completely one script the
> test continue fine without issue.

As I explained wrt. shell's trap, this issue won't be avoided from
userspace because trap function can be terminated by ctrl-c too,
otherwise one shell script may not be terminated at all.

The unclean shutdown can be observed in single 'while true; do zram02.sh; done'
too on both your patches and mine.

Also it is insane to write write test in a deadloop, and people seldom
do that, not see such way in either blktests/xfstests.

I you limit completion time of this test in long enough time(one or
several hours) or big enough loops, I believe it can be done cleanly,
such as:

cnt=0
MAX=10000
while [ $cnt -lt $MAX ]; do
	PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
done


Thanks,
Ming
Miroslav Benes Oct. 25, 2021, 9:58 a.m. UTC | #37
On Wed, 20 Oct 2021, Greg KH wrote:

> On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > On Wed, 20 Oct 2021, Ming Lei wrote:
> > 
> > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > 
> > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > random drivers.
> > > > > > > 
> > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > 
> > > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > > Luis is trying to fix that.
> > > > > 
> > > > > What is the proper support of the generic infrastructure? I am not
> > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > livepatching have to do the following way from sysfs:
> > > > > 
> > > > > 1) during module exit:
> > > > > 	
> > > > > 	mutex_lock(lp_lock);
> > > > > 	kobject_put(lp_kobj);
> > > > > 	mutex_unlock(lp_lock);
> > > > > 	
> > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > 	
> > > > > 	mutex_lock(lp_lock)
> > > > > 	...
> > > > > 	mutex_unlock(lp_lock)
> > > > 
> > > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > > now the call sequence is different. kobject_put() is basically offloaded 
> > > > to a workqueue scheduled right from the store() method. Meaning that 
> > > > Luis's work would probably not help us currently, but on the other hand 
> > > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > > (if I remember correctly). There were other reasons too as the changelog 
> > > > of the commit describes.
> > > > 
> > > > So, from my perspective, if there was a way to easily synchronize between 
> > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > > could spare people many headaches.
> > > 
> > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > which is required in show()/store() method. Once kobject_del() returns,
> > > no pending show()/store() any more.
> > > 
> > > The question is that why one shared lock is required for livepatching to
> > > delete the kobject. What are you protecting when you delete one kobject?
> > 
> > I think it boils down to the fact that we embed kobject statically to 
> > structures which livepatch uses to maintain data. That is discouraged 
> > generally, but all the attempts to implement it correctly were utter 
> > failures.
> 
> Sounds like this is the real problem that needs to be fixed.  kobjects
> should always control the lifespan of the structure they are embedded
> in.  If not, then that is a design flaw of the user of the kobject :(

Right, and you've already told us. A couple of times.

For example 
here https://lore.kernel.org/all/20190502074230.GA27847@kroah.com/

:)
 
> Where in the kernel is this happening?  And where have been the attempts
> to fix this up?

include/linux/livepatch.h and kernel/livepatch/core.c. See 
klp_{patch,object,func}.

It took some archeology, but I think 
https://lore.kernel.org/all/1464018848-4303-1-git-send-email-pmladek@suse.com/ 
is it. Petr might correct me.

It was long before we added some important features to the code, so it 
might be even more difficult today.

It resurfaced later when Tobin tried to fix some of kobject call sites in 
the kernel...

https://lore.kernel.org/all/20190430001534.26246-1-tobin@kernel.org/
https://lore.kernel.org/all/20190430233803.GB10777@eros.localdomain/
https://lore.kernel.org/all/20190502023142.20139-6-tobin@kernel.org/

There are probably more references.

Anyway, the current code works fine (well, one could argue about that). If 
someone wants to take a (another) stab at this, then why not, but it 
seemed like a rabbit hole without a substantial gain in the past. On the 
other hand, we currently misuse the API to some extent.

/me scratches head

Miroslav
Petr Mladek Oct. 26, 2021, 8:48 a.m. UTC | #38
On Wed 2021-10-20 18:09:51, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > On Wed, 20 Oct 2021, Ming Lei wrote:
> > 
> > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > 
> > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > random drivers.
> > > > > > > 
> > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > 
> > > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > > Luis is trying to fix that.
> > > > > 
> > > > > What is the proper support of the generic infrastructure? I am not
> > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > livepatching have to do the following way from sysfs:
> > > > > 
> > > > > 1) during module exit:
> > > > > 	
> > > > > 	mutex_lock(lp_lock);
> > > > > 	kobject_put(lp_kobj);
> > > > > 	mutex_unlock(lp_lock);
> > > > > 	
> > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > 	
> > > > > 	mutex_lock(lp_lock)
> > > > > 	...
> > > > > 	mutex_unlock(lp_lock)
> > > > 
> > > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > > now the call sequence is different. kobject_put() is basically offloaded 
> > > > to a workqueue scheduled right from the store() method. Meaning that 
> > > > Luis's work would probably not help us currently, but on the other hand 
> > > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > > (if I remember correctly). There were other reasons too as the changelog 
> > > > of the commit describes.
> > > > 
> > > > So, from my perspective, if there was a way to easily synchronize between 
> > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > > could spare people many headaches.
> > > 
> > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > which is required in show()/store() method. Once kobject_del() returns,
> > > no pending show()/store() any more.
> > > 
> > > The question is that why one shared lock is required for livepatching to
> > > delete the kobject. What are you protecting when you delete one kobject?
> > 
> > I think it boils down to the fact that we embed kobject statically to 
> > structures which livepatch uses to maintain data. That is discouraged 
> > generally, but all the attempts to implement it correctly were utter 
> > failures.
> 
> OK, then it isn't one common usage, in which kobject covers the release
> of the external object. What is the exact kobject in livepatching?

Below are more details about the livepatch code. I hope that it will
help you to see if zram has similar problems or not.

We have kobject in three structures: klp_func, klp_object, and
klp_patch, see include/linux/livepatch.h.

These structures have to be statically defined in the module sources
because they define what is livepatched, see
samples/livepatch/livepatch-sample.c

The kobject is used there to show information about the patch, patched
objects, and patched functions, in sysfs. And most importantly,
the sysfs interface can be used to disable the livepatch.

The problem with static structures is that the module must stay
in the memory as long as the sysfs interface exists. It can be
solved in module_exit() callback. It could wait until the sysfs
interface is destroyed.

kobject API does not support this scenario. The relase() callbacks
are called asynchronously. It expects that the structure is bundled
in a dynamically allocated structure.  As a result, the sysfs
interface can be removed even after the module removal.

The livepatching might create the dynamic structures by duplicating
the structures defined in the module statically. It might safe us
some headaches with kobject release. But it would also need an extra code
that would need to be maintained. The structure constrains strings
than need to be duplicated and later freed...


> But kobject_del() won't release the kobject, you shouldn't need the lock
> to delete kobject first. After the kobject is deleted, no any show() and
> store() any more, isn't such sync[1] you expected?

Livepatch code never called kobject_del() under a lock. It would cause
the obvious deadlock. The historic code only waited in the
module_exit() callback until the sysfs interface was removed.

It has changed in the commit 958ef1e39d24d6cb8bf2a740 ("livepatch:
Simplify API by removing registration step"). The livepatch could
never get enabled again after it was disabled now. The sysfs interface
is removed when the livepatch gets disabled. The module could
be removed only after the sysfs interface is destroyed, see
the module_put() in klp_free_patch_finish().

The livepatch code uses workqueue because the livepatch can be
disabled via sysfs interface. It obviously could not wait until
the sysfs interface is removed in the sysfs write() callback
that triggered the removal.

HTH,
Petr
Ming Lei Oct. 26, 2021, 3:37 p.m. UTC | #39
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> On Wed 2021-10-20 18:09:51, Ming Lei wrote:
> > On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > > On Wed, 20 Oct 2021, Ming Lei wrote:
> > > 
> > > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > > 
> > > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > > random drivers.
> > > > > > > > 
> > > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > > 
> > > > > > > I would not call it a fix. It is a kind of ugly workaround because the 
> > > > > > > generic infrastructure lacked (lacks) the proper support in my opinion. 
> > > > > > > Luis is trying to fix that.
> > > > > > 
> > > > > > What is the proper support of the generic infrastructure? I am not
> > > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > > livepatching have to do the following way from sysfs:
> > > > > > 
> > > > > > 1) during module exit:
> > > > > > 	
> > > > > > 	mutex_lock(lp_lock);
> > > > > > 	kobject_put(lp_kobj);
> > > > > > 	mutex_unlock(lp_lock);
> > > > > > 	
> > > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > > 	
> > > > > > 	mutex_lock(lp_lock)
> > > > > > 	...
> > > > > > 	mutex_unlock(lp_lock)
> > > > > 
> > > > > Yes, this was exactly the case. We then reworked it a lot (see 
> > > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so 
> > > > > now the call sequence is different. kobject_put() is basically offloaded 
> > > > > to a workqueue scheduled right from the store() method. Meaning that 
> > > > > Luis's work would probably not help us currently, but on the other hand 
> > > > > the issues with AA deadlock were one of the main drivers of the redesign 
> > > > > (if I remember correctly). There were other reasons too as the changelog 
> > > > > of the commit describes.
> > > > > 
> > > > > So, from my perspective, if there was a way to easily synchronize between 
> > > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it 
> > > > > could spare people many headaches.
> > > > 
> > > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > > which is required in show()/store() method. Once kobject_del() returns,
> > > > no pending show()/store() any more.
> > > > 
> > > > The question is that why one shared lock is required for livepatching to
> > > > delete the kobject. What are you protecting when you delete one kobject?
> > > 
> > > I think it boils down to the fact that we embed kobject statically to 
> > > structures which livepatch uses to maintain data. That is discouraged 
> > > generally, but all the attempts to implement it correctly were utter 
> > > failures.
> > 
> > OK, then it isn't one common usage, in which kobject covers the release
> > of the external object. What is the exact kobject in livepatching?
> 
> Below are more details about the livepatch code. I hope that it will
> help you to see if zram has similar problems or not.
> 
> We have kobject in three structures: klp_func, klp_object, and
> klp_patch, see include/linux/livepatch.h.
> 
> These structures have to be statically defined in the module sources
> because they define what is livepatched, see
> samples/livepatch/livepatch-sample.c
> 
> The kobject is used there to show information about the patch, patched
> objects, and patched functions, in sysfs. And most importantly,
> the sysfs interface can be used to disable the livepatch.
> 
> The problem with static structures is that the module must stay
> in the memory as long as the sysfs interface exists. It can be
> solved in module_exit() callback. It could wait until the sysfs
> interface is destroyed.
> 
> kobject API does not support this scenario. The relase() callbacks

kobject_delete() is for supporting this scenario, that is why we don't
need to grab module refcnt before calling show()/store() of the
kobject's attributes.

kobject_delete() can be called in module_exit(), then any show()/store()
will be done after kobject_delete() returns.

> are called asynchronously. It expects that the structure is bundled
> in a dynamically allocated structure.  As a result, the sysfs
> interface can be removed even after the module removal.

That should be one bug, otherwise store()/show() method could be called
into after the module is unloaded.

> 
> The livepatching might create the dynamic structures by duplicating
> the structures defined in the module statically. It might safe us
> some headaches with kobject release. But it would also need an extra code
> that would need to be maintained. The structure constrains strings
> than need to be duplicated and later freed...
> 
> 
> > But kobject_del() won't release the kobject, you shouldn't need the lock
> > to delete kobject first. After the kobject is deleted, no any show() and
> > store() any more, isn't such sync[1] you expected?
> 
> Livepatch code never called kobject_del() under a lock. It would cause
> the obvious deadlock. The historic code only waited in the
> module_exit() callback until the sysfs interface was removed.

OK, then Luis shouldn't consider livepatching as one such issue to solve
with one generic solution.

> 
> It has changed in the commit 958ef1e39d24d6cb8bf2a740 ("livepatch:
> Simplify API by removing registration step"). The livepatch could
> never get enabled again after it was disabled now. The sysfs interface
> is removed when the livepatch gets disabled. The module could
> be removed only after the sysfs interface is destroyed, see
> the module_put() in klp_free_patch_finish().

OK, that is livepatching's implementation: all the kobjects are deleted &
freed after disabling the livepatch module, that looks one kill-me
operation, instead of disabling, so this way isn't a normal usage,
scsi has similar sysfs interface of delete. Also kobjects can't be
removed in enable's store() directly, since deadlock could be
caused, looks wq has to be used here for avoiding deadlock.

BTW, what is the livepatching module use model? try_module_get() is
called in klp_init_patch_early()<-klp_enable_patch()<-module_init(),
module_put() is called in klp_free_patch_finish() which seems only be
called after 'echo 0 > /sys/kernel/livepatch/$lp_mod/enabled'.

Usually when the module isn't used, module_exit() gets chance to be called
by userspace rmmod, then all kobjects created in this module can be
deleted in module_exit().

> 
> The livepatch code uses workqueue because the livepatch can be
> disabled via sysfs interface. It obviously could not wait until
> the sysfs interface is removed in the sysfs write() callback
> that triggered the removal.

If klp_free_patch_* is moved into module_exit() and not let enable
store() to kill kobjects, all kobjects can be deleted in module_exit(),
then wait_for_completion(patch->finish) may be removed, also wq isn't
required for the async cleanup.



Thanks, 
Ming
Luis Chamberlain Oct. 26, 2021, 5:01 p.m. UTC | #40
On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > Livepatch code never called kobject_del() under a lock. It would cause
> > the obvious deadlock.

Never?

> > The historic code only waited in the
> > module_exit() callback until the sysfs interface was removed.
> 
> OK, then Luis shouldn't consider livepatching as one such issue to solve
> with one generic solution.

It's not what I was told when the deadlock was found with zram, so I was
informed quite the contrary.

I'm working on a generic coccinelle patch which hunts for actual cases
using iteration (a feature of coccinelle for complex searches). The
search is pretty involved, so I don't think I'll have an answer to this
soon.

Since the question of how generic this deadlock is remains questionable,
I think it makes sense to put the generic deadlock fix off the table for
now, and we address this once we have a more concrete search with
coccinelle.

But to say we *don't* have drivers which can cause this is obviously
wrong as well, from a cursory search so far. But let's wait and see how
big this list actually is.

I'll drop the deadlock generic fixes and move on with at least a starter
kernfs / sysfs tests.

  Luis
Miroslav Benes Oct. 27, 2021, 11:42 a.m. UTC | #41
> > 
> > The livepatch code uses workqueue because the livepatch can be
> > disabled via sysfs interface. It obviously could not wait until
> > the sysfs interface is removed in the sysfs write() callback
> > that triggered the removal.
> 
> If klp_free_patch_* is moved into module_exit() and not let enable
> store() to kill kobjects, all kobjects can be deleted in module_exit(),
> then wait_for_completion(patch->finish) may be removed, also wq isn't
> required for the async cleanup.

It sounds like a nice cleanup. If we combine kobject_del() to prevent any 
show()/store() accesses and free everything later in module_exit(), it 
could work. If I am not missing something around how we maintain internal 
lists of live patches and their modules.

Thanks

Miroslav
Miroslav Benes Oct. 27, 2021, 11:57 a.m. UTC | #42
On Tue, 26 Oct 2021, Luis Chamberlain wrote:

> On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Livepatch code never called kobject_del() under a lock. It would cause
> > > the obvious deadlock.
> 
> Never?

kobject_put() to be precise.

When I started working on the support for module/live patches removal, 
calling kobject_put() under our klp_mutex lock was the obvious first 
choice given how the code was structured, but I ran into problems with 
deadlocks immediately. So it was changed to async approach with the 
workqueue. Thus the mainline code has never suffered from this, but we 
knew about the issues.
 
> > > The historic code only waited in the
> > > module_exit() callback until the sysfs interface was removed.
> > 
> > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > with one generic solution.
> 
> It's not what I was told when the deadlock was found with zram, so I was
> informed quite the contrary.

From my perspective, it is quite easy to get it wrong due to either a lack 
of generic support, or missing rules/documentation. So if this thread 
leads to "do not share locks between a module removal and a sysfs 
operation" strict rule, it would be at least something. In the same 
manner as Luis proposed to document try_module_get() expectations.
 
> I'm working on a generic coccinelle patch which hunts for actual cases
> using iteration (a feature of coccinelle for complex searches). The
> search is pretty involved, so I don't think I'll have an answer to this
> soon.
> 
> Since the question of how generic this deadlock is remains questionable,
> I think it makes sense to put the generic deadlock fix off the table for
> now, and we address this once we have a more concrete search with
> coccinelle.
> 
> But to say we *don't* have drivers which can cause this is obviously
> wrong as well, from a cursory search so far. But let's wait and see how
> big this list actually is.
> 
> I'll drop the deadlock generic fixes and move on with at least a starter
> kernfs / sysfs tests.

It makes sense to me.

Thanks, Luis, for pursuing it.

Miroslav
Luis Chamberlain Oct. 27, 2021, 2:27 p.m. UTC | #43
On Wed, Oct 27, 2021 at 01:57:40PM +0200, Miroslav Benes wrote:
> On Tue, 26 Oct 2021, Luis Chamberlain wrote:
> 
> > On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > > with one generic solution.
> > 
> > It's not what I was told when the deadlock was found with zram, so I was
> > informed quite the contrary.
> 
> From my perspective, it is quite easy to get it wrong due to either a lack 
> of generic support, or missing rules/documentation.

Indeed. I agree some level of guidence is needed, even if subtle, rather
than tribal knowledge. I'll start off with the test_sysfs demo'ing what
not to do and documenting this there. I don't think it makes sense to
formalize yet documentation for "though shalt not do this" generically
until a full depth search is done with Coccinelle.

> So if this thread 
> leads to "do not share locks between a module removal and a sysfs 
> operation" strict rule, it would be at least something.

I think that's where we are at. I'll wait to complete my coccinelle
deadlock hunt patch to complete the full search, and that could be
useful to *warn* aboute new use cases, so to prevent this deadlock
in the future. Until then I agree that the complexity introduced is
not worth it given the evidence of users, but the full evidence of
actual users still remains to be determined. A perfect job left to
advances with Coccinelle.

> In the same 
> manner as Luis proposed to document try_module_get() expectations.

Right and so sysfs ops using try_module_get() *still* remains safe,
and so will keep that patch in my next iteration because there *are*
*many* uses cases for that.

  Luis
Petr Mladek Nov. 2, 2021, 2:15 p.m. UTC | #44
On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > Below are more details about the livepatch code. I hope that it will
> > help you to see if zram has similar problems or not.
> > 
> > We have kobject in three structures: klp_func, klp_object, and
> > klp_patch, see include/linux/livepatch.h.
> > 
> > These structures have to be statically defined in the module sources
> > because they define what is livepatched, see
> > samples/livepatch/livepatch-sample.c
> > 
> > The kobject is used there to show information about the patch, patched
> > objects, and patched functions, in sysfs. And most importantly,
> > the sysfs interface can be used to disable the livepatch.
> > 
> > The problem with static structures is that the module must stay
> > in the memory as long as the sysfs interface exists. It can be
> > solved in module_exit() callback. It could wait until the sysfs
> > interface is destroyed.
> > 
> > kobject API does not support this scenario. The relase() callbacks
> 
> kobject_delete() is for supporting this scenario, that is why we don't
> need to grab module refcnt before calling show()/store() of the
> kobject's attributes.
> 
> kobject_delete() can be called in module_exit(), then any show()/store()
> will be done after kobject_delete() returns.

I am a bit confused. I do not see kobject_delete() anywhere in kernel
sources.

I see only kobject_del() and kobject_put(). AFAIK, they do _not_
guarantee that either the sysfs interface was destroyed or
the release callbacks were called. For example, see
schedule_delayed_work(&kobj->release, delay) in kobject_release().

By other words, anyone could still be using either the sysfs interface
or the related structures after kobject_del() or kobject_put()
returns.

IMHO, kobject API does not support static structures and module
removal.

Best Regards,
Petr
Petr Mladek Nov. 2, 2021, 2:51 p.m. UTC | #45
On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
> On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Below are more details about the livepatch code. I hope that it will
> > > help you to see if zram has similar problems or not.
> > > 
> > > We have kobject in three structures: klp_func, klp_object, and
> > > klp_patch, see include/linux/livepatch.h.
> > > 
> > > These structures have to be statically defined in the module sources
> > > because they define what is livepatched, see
> > > samples/livepatch/livepatch-sample.c
> > > 
> > > The kobject is used there to show information about the patch, patched
> > > objects, and patched functions, in sysfs. And most importantly,
> > > the sysfs interface can be used to disable the livepatch.
> > > 
> > > The problem with static structures is that the module must stay
> > > in the memory as long as the sysfs interface exists. It can be
> > > solved in module_exit() callback. It could wait until the sysfs
> > > interface is destroyed.
> > > 
> > > kobject API does not support this scenario. The relase() callbacks
> > 
> > kobject_delete() is for supporting this scenario, that is why we don't
> > need to grab module refcnt before calling show()/store() of the
> > kobject's attributes.
> > 
> > kobject_delete() can be called in module_exit(), then any show()/store()
> > will be done after kobject_delete() returns.
> 
> I am a bit confused. I do not see kobject_delete() anywhere in kernel
> sources.
> 
> I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> guarantee that either the sysfs interface was destroyed or
> the release callbacks were called. For example, see
> schedule_delayed_work(&kobj->release, delay) in kobject_release().

Grr, I always get confused by the code. kobject_del() actually waits
until the sysfs interface gets destroyed. This is why there is
the deadlock.

But kobject_put() is _not_ synchronous. And the comment above
kobject_add() repeat 3 times that kobject_put() must be called
on success:

 * Return: If this function returns an error, kobject_put() must be
 *         called to properly clean up the memory associated with the
 *         object.  Under no instance should the kobject that is passed
 *         to this function be directly freed with a call to kfree(),
 *         that can leak memory.
 *
 *         If this function returns success, kobject_put() must also be called
 *         in order to properly clean up the memory associated with the object.
 *
 *         In short, once this function is called, kobject_put() MUST be called
 *         when the use of the object is finished in order to properly free
 *         everything.

and similar text in Documentation/core-api/kobject.rst

  After a kobject has been registered with the kobject core successfully, it
  must be cleaned up when the code is finished with it.  To do that, call
  kobject_put().


If I read the code correctly then kobject_put() calls kref_put()
that might call kobject_delayed_cleanup(). This function does a lot
of things and need to access struct kobject.

> IMHO, kobject API does not support static structures and module
> removal.

If kobject_put() has to be called also for static structures then
module_exit() must explicitly wait until the clean up is finished.

Best Regards,
Petr
Ming Lei Nov. 2, 2021, 2:56 p.m. UTC | #46
On Tue, Nov 02, 2021 at 03:15:15PM +0100, Petr Mladek wrote:
> On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Below are more details about the livepatch code. I hope that it will
> > > help you to see if zram has similar problems or not.
> > > 
> > > We have kobject in three structures: klp_func, klp_object, and
> > > klp_patch, see include/linux/livepatch.h.
> > > 
> > > These structures have to be statically defined in the module sources
> > > because they define what is livepatched, see
> > > samples/livepatch/livepatch-sample.c
> > > 
> > > The kobject is used there to show information about the patch, patched
> > > objects, and patched functions, in sysfs. And most importantly,
> > > the sysfs interface can be used to disable the livepatch.
> > > 
> > > The problem with static structures is that the module must stay
> > > in the memory as long as the sysfs interface exists. It can be
> > > solved in module_exit() callback. It could wait until the sysfs
> > > interface is destroyed.
> > > 
> > > kobject API does not support this scenario. The relase() callbacks
> > 
> > kobject_delete() is for supporting this scenario, that is why we don't
> > need to grab module refcnt before calling show()/store() of the
> > kobject's attributes.
> > 
> > kobject_delete() can be called in module_exit(), then any show()/store()
> > will be done after kobject_delete() returns.
> 
> I am a bit confused. I do not see kobject_delete() anywhere in kernel
> sources.
> 
> I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> guarantee that either the sysfs interface was destroyed or
> the release callbacks were called. For example, see
> schedule_delayed_work(&kobj->release, delay) in kobject_release().

After kobject_del() returns, no one can call run into show()/store(),
and all pending show()/store() are drained meantime. But yes, the release
handler may still be called later, and the kobject has to be freed
during or before module_exit().

https://lore.kernel.org/lkml/20211101112548.3364086-2-ming.lei@redhat.com/

> 
> By other words, anyone could still be using either the sysfs interface
> or the related structures after kobject_del() or kobject_put()
> returns.

No, no one can do that after kobject_del() returns.

> 
> IMHO, kobject API does not support static structures and module
> removal.

But so far klp_patch can only be defined as static instance, and it
depends on the implementation, especially the release handler.


Thanks,
Ming
Ming Lei Nov. 2, 2021, 3:17 p.m. UTC | #47
On Tue, Nov 02, 2021 at 03:51:33PM +0100, Petr Mladek wrote:
> On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
> > On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > > Below are more details about the livepatch code. I hope that it will
> > > > help you to see if zram has similar problems or not.
> > > > 
> > > > We have kobject in three structures: klp_func, klp_object, and
> > > > klp_patch, see include/linux/livepatch.h.
> > > > 
> > > > These structures have to be statically defined in the module sources
> > > > because they define what is livepatched, see
> > > > samples/livepatch/livepatch-sample.c
> > > > 
> > > > The kobject is used there to show information about the patch, patched
> > > > objects, and patched functions, in sysfs. And most importantly,
> > > > the sysfs interface can be used to disable the livepatch.
> > > > 
> > > > The problem with static structures is that the module must stay
> > > > in the memory as long as the sysfs interface exists. It can be
> > > > solved in module_exit() callback. It could wait until the sysfs
> > > > interface is destroyed.
> > > > 
> > > > kobject API does not support this scenario. The relase() callbacks
> > > 
> > > kobject_delete() is for supporting this scenario, that is why we don't
> > > need to grab module refcnt before calling show()/store() of the
> > > kobject's attributes.
> > > 
> > > kobject_delete() can be called in module_exit(), then any show()/store()
> > > will be done after kobject_delete() returns.
> > 
> > I am a bit confused. I do not see kobject_delete() anywhere in kernel
> > sources.
> > 
> > I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> > guarantee that either the sysfs interface was destroyed or
> > the release callbacks were called. For example, see
> > schedule_delayed_work(&kobj->release, delay) in kobject_release().
> 
> Grr, I always get confused by the code. kobject_del() actually waits
> until the sysfs interface gets destroyed. This is why there is
> the deadlock.

Right.

> 
> But kobject_put() is _not_ synchronous. And the comment above
> kobject_add() repeat 3 times that kobject_put() must be called
> on success:
> 
>  * Return: If this function returns an error, kobject_put() must be
>  *         called to properly clean up the memory associated with the
>  *         object.  Under no instance should the kobject that is passed
>  *         to this function be directly freed with a call to kfree(),
>  *         that can leak memory.
>  *
>  *         If this function returns success, kobject_put() must also be called
>  *         in order to properly clean up the memory associated with the object.
>  *
>  *         In short, once this function is called, kobject_put() MUST be called
>  *         when the use of the object is finished in order to properly free
>  *         everything.
> 
> and similar text in Documentation/core-api/kobject.rst
> 
>   After a kobject has been registered with the kobject core successfully, it
>   must be cleaned up when the code is finished with it.  To do that, call
>   kobject_put().
> 
> 
> If I read the code correctly then kobject_put() calls kref_put()
> that might call kobject_delayed_cleanup(). This function does a lot
> of things and need to access struct kobject.

Yes, then what is the problem here wrt. kobject_put() which may not be
synchronous?

> 
> > IMHO, kobject API does not support static structures and module
> > removal.
> 
> If kobject_put() has to be called also for static structures then
> module_exit() must explicitly wait until the clean up is finished.

Right, that is exactly how klp_patch kobject is implemented. klp_patch
kobject has to be disabled first, then module refcnt can be dropped after
the klp_patch kobject is released. Then module_exit() is possible.

Thanks,
Ming
Petr Mladek Nov. 2, 2021, 3:24 p.m. UTC | #48
On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> On Tue, 26 Oct 2021, Luis Chamberlain wrote:
> 
> > On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > > Livepatch code never called kobject_del() under a lock. It would cause
> > > > the obvious deadlock.

I have to correct myself. IMHO, the deadlock is far from obvious. I
always get lost in the code and the documentation is not clear.
I always get lost.

> >
> > Never?
> 
> kobject_put() to be precise.

IMHO, the problem is actually with kobject_del() that gets blocked
until the sysfs interface gets removed. kobject_put() will have
the same problem only when the clean up is not delayed.


> When I started working on the support for module/live patches removal, 
> calling kobject_put() under our klp_mutex lock was the obvious first 
> choice given how the code was structured, but I ran into problems with 
> deadlocks immediately. So it was changed to async approach with the 
> workqueue. Thus the mainline code has never suffered from this, but we 
> knew about the issues.
>  
> > > > The historic code only waited in the
> > > > module_exit() callback until the sysfs interface was removed.
> > > 
> > > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > > with one generic solution.
> > 
> > It's not what I was told when the deadlock was found with zram, so I was
> > informed quite the contrary.
> 
> >From my perspective, it is quite easy to get it wrong due to either a lack 
> of generic support, or missing rules/documentation. So if this thread 
> leads to "do not share locks between a module removal and a sysfs 
> operation" strict rule, it would be at least something. In the same 
> manner as Luis proposed to document try_module_get() expectations.

The rule "do not share locks between a module removal and a sysfs
operation" is not clear to me.

IMHO, there are the following rules:

1. rule: kobject_del() or kobject_put() must not be called under a lock that
	 is used by store()/show() callbacks.

   reason: kobject_del() waits until the sysfs interface is destroyed.
	 It has to wait until all store()/show() callbacks are finished.


2. rule: kobject_del()/kobject_put() must not be called from the
	related store() callbacks.

   reason: same as in 1st rule.


3. rule: module_exit() must wait until all release() callbacks are called
	 when kobject are static.

   reason: kobject_put() must be called to clean up internal
	dependencies. The clean up might be done asynchronously
	and need access to the kobject structure.


Best Regards,
Petr

PS: I am sorry if I am messing things. I want to be sure that we are
    all talking about the same and understand it the same way.
Luis Chamberlain Nov. 2, 2021, 4:25 p.m. UTC | #49
On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > >From my perspective, it is quite easy to get it wrong due to either a lack 
> > of generic support, or missing rules/documentation. So if this thread 
> > leads to "do not share locks between a module removal and a sysfs 
> > operation" strict rule, it would be at least something. In the same 
> > manner as Luis proposed to document try_module_get() expectations.
> 
> The rule "do not share locks between a module removal and a sysfs
> operation" is not clear to me.

That's exactly it. It *is* not. The test_sysfs selftest will hopefully
help with this. But I'll wait to take a final position on whether or not
a generic fix should be merged until the Coccinelle patch which looks
for all uses cases completes.

So I think that once that Coccinelle hunt is done for the deadlock, we
should also remind folks of the potential deadlock and some of the rules
you mentioned below so that if we take a position that we don't support
this, we at least inform developers why and what to avoid. If Coccinelle
finds quite a bit of cases, then perhaps evaluating the generic fix
might be worth evaluating.

> IMHO, there are the following rules:
> 
> 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> 	 is used by store()/show() callbacks.
> 
>    reason: kobject_del() waits until the sysfs interface is destroyed.
> 	 It has to wait until all store()/show() callbacks are finished.

Right, this is what actually started this entire conversation.

Note that as Ming pointed out, the generic kernfs fix I proposed would
only cover the case when kobject_del() ends up being called on module
exit, so it would not cover the cases where perhaps kobject_del() might
be called outside of module exit, and so the cope of the possible
deadlock then increases in scope.

Likewise, the Coccinelle hunt I'm trying would only cover the module
exit case. I'm a bit of afraid of the complexity of a generic hunt
as expresed in rule 1.

> 
> 2. rule: kobject_del()/kobject_put() must not be called from the
> 	related store() callbacks.
> 
>    reason: same as in 1st rule.

Sensible corollary.

Given tha the exact kobjet_del() / kobject_put() which must not be
called from the respective sysfs ops depends on which kobject is
underneath the device for which the sysfs ops is being created,
it would make this hunt in Coccinelle a bit tricky. My current iteration
of a coccinelle hunt cheats and looks at any sysfs looking op and
ensures a module exit exists.

> 3. rule: module_exit() must wait until all release() callbacks are called
> 	 when kobject are static.
> 
>    reason: kobject_put() must be called to clean up internal
> 	dependencies. The clean up might be done asynchronously
> 	and need access to the kobject structure.

This might be an easier rule to implement a respective Coccinelle rule
for.

  Luis
Ming Lei Nov. 3, 2021, 12:01 a.m. UTC | #50
On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
> On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> > On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > > >From my perspective, it is quite easy to get it wrong due to either a lack 
> > > of generic support, or missing rules/documentation. So if this thread 
> > > leads to "do not share locks between a module removal and a sysfs 
> > > operation" strict rule, it would be at least something. In the same 
> > > manner as Luis proposed to document try_module_get() expectations.
> > 
> > The rule "do not share locks between a module removal and a sysfs
> > operation" is not clear to me.
> 
> That's exactly it. It *is* not. The test_sysfs selftest will hopefully
> help with this. But I'll wait to take a final position on whether or not
> a generic fix should be merged until the Coccinelle patch which looks
> for all uses cases completes.
> 
> So I think that once that Coccinelle hunt is done for the deadlock, we
> should also remind folks of the potential deadlock and some of the rules
> you mentioned below so that if we take a position that we don't support
> this, we at least inform developers why and what to avoid. If Coccinelle
> finds quite a bit of cases, then perhaps evaluating the generic fix
> might be worth evaluating.
> 
> > IMHO, there are the following rules:
> > 
> > 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> > 	 is used by store()/show() callbacks.
> > 
> >    reason: kobject_del() waits until the sysfs interface is destroyed.
> > 	 It has to wait until all store()/show() callbacks are finished.
> 
> Right, this is what actually started this entire conversation.
> 
> Note that as Ming pointed out, the generic kernfs fix I proposed would
> only cover the case when kobject_del() ends up being called on module
> exit, so it would not cover the cases where perhaps kobject_del() might
> be called outside of module exit, and so the cope of the possible
> deadlock then increases in scope.
> 
> Likewise, the Coccinelle hunt I'm trying would only cover the module
> exit case. I'm a bit of afraid of the complexity of a generic hunt
> as expresed in rule 1.

Question is that why one shared lock is required between kobject_del()
and its show()/store(), both zram and livepatch needn't that. Is it
one common usage?

> 
> > 
> > 2. rule: kobject_del()/kobject_put() must not be called from the
> > 	related store() callbacks.
> > 
> >    reason: same as in 1st rule.
> 
> Sensible corollary.
> 
> Given tha the exact kobjet_del() / kobject_put() which must not be
> called from the respective sysfs ops depends on which kobject is
> underneath the device for which the sysfs ops is being created,
> it would make this hunt in Coccinelle a bit tricky. My current iteration
> of a coccinelle hunt cheats and looks at any sysfs looking op and
> ensures a module exit exists.

Actually kernfs/sysfs provides interface for supporting deleting
kobject/attr from the attr's show()/store(), see example of
sdev_store_delete(), and the livepatch example:

https://lore.kernel.org/lkml/20211102145932.3623108-4-ming.lei@redhat.com/

> 
> > 3. rule: module_exit() must wait until all release() callbacks are called
> > 	 when kobject are static.
> > 
> >    reason: kobject_put() must be called to clean up internal
> > 	dependencies. The clean up might be done asynchronously
> > 	and need access to the kobject structure.
> 
> This might be an easier rule to implement a respective Coccinelle rule
> for.

If kobject_del() is done in module_exit() or before module_exit(),
kobject should have been freed in module_exit() via kobject_put().

But yes, it can be asynchronously because of CONFIG_DEBUG_KOBJECT_RELEASE,
seems like one real issue.


Thanks,
Ming
Luis Chamberlain Nov. 3, 2021, 12:44 p.m. UTC | #51
On Wed, Nov 03, 2021 at 08:01:45AM +0800, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
> > On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> > > On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > > > >From my perspective, it is quite easy to get it wrong due to either a lack 
> > > > of generic support, or missing rules/documentation. So if this thread 
> > > > leads to "do not share locks between a module removal and a sysfs 
> > > > operation" strict rule, it would be at least something. In the same 
> > > > manner as Luis proposed to document try_module_get() expectations.
> > > 
> > > The rule "do not share locks between a module removal and a sysfs
> > > operation" is not clear to me.
> > 
> > That's exactly it. It *is* not. The test_sysfs selftest will hopefully
> > help with this. But I'll wait to take a final position on whether or not
> > a generic fix should be merged until the Coccinelle patch which looks
> > for all uses cases completes.
> > 
> > So I think that once that Coccinelle hunt is done for the deadlock, we
> > should also remind folks of the potential deadlock and some of the rules
> > you mentioned below so that if we take a position that we don't support
> > this, we at least inform developers why and what to avoid. If Coccinelle
> > finds quite a bit of cases, then perhaps evaluating the generic fix
> > might be worth evaluating.
> > 
> > > IMHO, there are the following rules:
> > > 
> > > 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> > > 	 is used by store()/show() callbacks.
> > > 
> > >    reason: kobject_del() waits until the sysfs interface is destroyed.
> > > 	 It has to wait until all store()/show() callbacks are finished.
> > 
> > Right, this is what actually started this entire conversation.
> > 
> > Note that as Ming pointed out, the generic kernfs fix I proposed would
> > only cover the case when kobject_del() ends up being called on module
> > exit, so it would not cover the cases where perhaps kobject_del() might
> > be called outside of module exit, and so the cope of the possible
> > deadlock then increases in scope.
> > 
> > Likewise, the Coccinelle hunt I'm trying would only cover the module
> > exit case. I'm a bit of afraid of the complexity of a generic hunt
> > as expresed in rule 1.
> 
> Question is that why one shared lock is required between kobject_del()
> and its show()/store(), both zram and livepatch needn't that. Is it
> one common usage?

That is the question the coccinelle hunt is aimed at finding. Answering
that in the context of module removal is easier than the generic case.

But also note that I had mentioned before that we have semantics to
check *when* we're in the module removal case, and as such can address
that case. For the other cases we have no possible semantics to be able to
address a generic fix. I tried though, refer to my reply in this
thread and refer to the new kobject_being_removed() I'm adding:

https://lkml.kernel.org/r/YWdMpv8lAFYtc18c@bombadil.infradead.org

So we have semantics for knowing when about to remove a module but,
my attempt with kobject_being_removed() isn't sufficient to address this
generically.

In either case, having a gauge of how common this is either on module
removal of generally would be wonderful. It is easier to answer the
question from a module removal perspective though.

> > > 2. rule: kobject_del()/kobject_put() must not be called from the
> > > 	related store() callbacks.
> > > 
> > >    reason: same as in 1st rule.
> > 
> > Sensible corollary.
> > 
> > Given tha the exact kobjet_del() / kobject_put() which must not be
> > called from the respective sysfs ops depends on which kobject is
> > underneath the device for which the sysfs ops is being created,
> > it would make this hunt in Coccinelle a bit tricky. My current iteration
> > of a coccinelle hunt cheats and looks at any sysfs looking op and
> > ensures a module exit exists.
> 
> Actually kernfs/sysfs provides interface for supporting deleting
> kobject/attr from the attr's show()/store(), see example of
> sdev_store_delete(), and the livepatch example:
> 
> https://lore.kernel.org/lkml/20211102145932.3623108-4-ming.lei@redhat.com/

Imagine that.. is that the suicidal thing?

> > > 3. rule: module_exit() must wait until all release() callbacks are called
> > > 	 when kobject are static.
> > > 
> > >    reason: kobject_put() must be called to clean up internal
> > > 	dependencies. The clean up might be done asynchronously
> > > 	and need access to the kobject structure.
> > 
> > This might be an easier rule to implement a respective Coccinelle rule
> > for.
> 
> If kobject_del() is done in module_exit() or before module_exit(),
> kobject should have been freed in module_exit() via kobject_put().
> 
> But yes, it can be asynchronously because of CONFIG_DEBUG_KOBJECT_RELEASE,
> seems like one real issue.

Alright thanks for confirming.

  Luis
diff mbox series

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@  static DEFINE_MUTEX(zram_index_mutex);
 static int zram_major;
 static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
 
+static bool zram_up;
+
 /* Module params (documentation at end) */
 static unsigned int num_devices = 1;
 /*
@@ -1704,6 +1706,7 @@  static void zram_reset_device(struct zram *zram)
 	comp = zram->comp;
 	disksize = zram->disksize;
 	zram->disksize = 0;
+	zram->comp = NULL;
 
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@  static ssize_t disksize_store(struct device *dev,
 	struct zram *zram = dev_to_zram(dev);
 	int err;
 
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		err = -ENODEV;
+		goto out;
+	}
+
 	disksize = memparse(buf, NULL);
-	if (!disksize)
-		return -EINVAL;
+	if (!disksize) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	down_write(&zram->init_lock);
 	if (init_done(zram)) {
@@ -1754,12 +1766,16 @@  static ssize_t disksize_store(struct device *dev,
 	set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
 	up_write(&zram->init_lock);
 
+	mutex_unlock(&zram_index_mutex);
+
 	return len;
 
 out_free_meta:
 	zram_meta_free(zram, disksize);
 out_unlock:
 	up_write(&zram->init_lock);
+out:
+	mutex_unlock(&zram_index_mutex);
 	return err;
 }
 
@@ -1775,8 +1791,17 @@  static ssize_t reset_store(struct device *dev,
 	if (ret)
 		return ret;
 
-	if (!do_reset)
-		return -EINVAL;
+	mutex_lock(&zram_index_mutex);
+
+	if (!zram_up) {
+		len = -ENODEV;
+		goto out;
+	}
+
+	if (!do_reset) {
+		len = -EINVAL;
+		goto out;
+	}
 
 	zram = dev_to_zram(dev);
 	bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@  static ssize_t reset_store(struct device *dev,
 	/* Do not reset an active device or claimed device */
 	if (bdev->bd_openers || zram->claim) {
 		mutex_unlock(&bdev->bd_disk->open_mutex);
-		return -EBUSY;
+		len = -EBUSY;
+		goto out;
 	}
 
 	/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@  static ssize_t reset_store(struct device *dev,
 	zram->claim = false;
 	mutex_unlock(&bdev->bd_disk->open_mutex);
 
+out:
+	mutex_unlock(&zram_index_mutex);
 	return len;
 }
 
@@ -2010,6 +2038,10 @@  static ssize_t hot_add_show(struct class *class,
 	int ret;
 
 	mutex_lock(&zram_index_mutex);
+	if (!zram_up) {
+		mutex_unlock(&zram_index_mutex);
+		return -ENODEV;
+	}
 	ret = zram_add();
 	mutex_unlock(&zram_index_mutex);
 
@@ -2037,6 +2069,11 @@  static ssize_t hot_remove_store(struct class *class,
 
 	mutex_lock(&zram_index_mutex);
 
+	if (!zram_up) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	zram = idr_find(&zram_index_idr, dev_id);
 	if (zram) {
 		ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@  static ssize_t hot_remove_store(struct class *class,
 		ret = -ENODEV;
 	}
 
+out:
 	mutex_unlock(&zram_index_mutex);
 	return ret ? ret : count;
 }
@@ -2072,12 +2110,15 @@  static int zram_remove_cb(int id, void *ptr, void *data)
 
 static void destroy_devices(void)
 {
+	mutex_lock(&zram_index_mutex);
+	zram_up = false;
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
 	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+	mutex_unlock(&zram_index_mutex);
 }
 
 static int __init zram_init(void)
@@ -2105,15 +2146,21 @@  static int __init zram_init(void)
 		return -EBUSY;
 	}
 
+	mutex_lock(&zram_index_mutex);
+
 	while (num_devices != 0) {
-		mutex_lock(&zram_index_mutex);
 		ret = zram_add();
-		mutex_unlock(&zram_index_mutex);
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&zram_index_mutex);
 			goto out_error;
+		}
 		num_devices--;
 	}
 
+	zram_up = true;
+
+	mutex_unlock(&zram_index_mutex);
+
 	return 0;
 
 out_error: