[v5,0/3] "Hotremove" persistent memory

Message ID: 20190502184337.20538-1-pasha.tatashin@soleen.com (mailing list archive)

Pasha Tatashin May 2, 2019, 6:43 p.m. UTC
Changelog:
v5
- Addressed comments from Dan Williams: made remove_memory() return
  an error code, and use this function from dax.

v4
- Addressed comments from Dave Hansen

v3
- Addressed comments from David Hildenbrand. Don't release
  lock_device_hotplug after checking memory status, and rename
  memblock_offlined_cb() to check_memblock_offlined_cb()

v2
- Dan Williams mentioned that drv->remove() return is ignored
  by unbind. Unbind always succeeds. Because we cannot guarantee
  that memory can be offlined from the driver, don't even
  attempt to do so. Simply check that every section is offlined
  beforehand and only then proceed with removing dax memory.

---

Support for adding persistent memory to be used like regular RAM was
recently merged into Linux. This work extends that functionality to also
allow hot removing persistent memory.

We (Microsoft) have an important use case for this functionality.

The requirement is for physical machines with a small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s). Yet, there is
userland state that is expensive to recreate (~2G).

The solution is to boot the machines with 2G reserved for persistent
memory.

Copy the state, and hotadd the persistent memory so the machine still has
all 8G available at runtime. Before reboot, offline and hotremove the 2G
device-dax region, copy the memory that needs to be preserved to the
pmem0 device, and reboot.

The series of operations looks like this:

1. After boot, restore /dev/pmem0 to a ramdisk to be consumed by apps,
   and free the ramdisk.
2. Convert raw pmem0 to devdax
   ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
   echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
   echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
   echo online_movable > /sys/devices/system/memory/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
   echo offline > /sys/devices/system/memory/memoryXXX/state
   echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
   ndctl create-namespace --mode raw  -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do a kexec reboot, or reboot through firmware if the firmware does not
   zero memory in the pmem0 region (these machines have only regular
   volatile memory). To have the pmem0 device after reboot, either the
   memmap kernel parameter is used, or device nodes are specified in the
   dtb.
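The hotadd/hotremove halves of the workflow above can be collected into a small script. This is a sketch only: the `run`, `hotadd_pmem`, and `hotremove_pmem` helpers and the `MEMBLOCKS`/`DRYRUN` variables are illustrative names, the namespace and dax device are assumed to be namespace0.0/dax0.0, and `DRYRUN=1` (the default here) only prints the commands, since the real ones need pmem hardware and root.

```shell
#!/bin/sh
# Sketch of workflow steps 2-5. MEMBLOCKS must list the sysfs memory block
# directories backed by dax0.0; adjust device names for your system.
DRYRUN=${DRYRUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
	if [ "$DRYRUN" = 1 ]; then echo "$@"; else "$@" || exit 1; fi
}

hotadd_pmem() {
	# Step 2: convert raw pmem0 to devdax
	run ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
	# Step 3: hand the device to the kmem driver and online the memory
	run sh -c 'echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind'
	run sh -c 'echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id'
	for mem in $MEMBLOCKS; do
		run sh -c "echo online_movable > $mem/state"
	done
}

hotremove_pmem() {
	# Step 4: offline the blocks and unbind from kmem before reboot
	for mem in $MEMBLOCKS; do
		run sh -c "echo offline > $mem/state"
	done
	run sh -c 'echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind'
	# Step 5: recreate the raw pmem0 device
	run ndctl create-namespace --mode raw -e namespace0.0 -f
}
```

With `DRYRUN=0` and `MEMBLOCKS` populated (for example from the device's target_node, as in the sysfs loop discussed later in this thread), `hotadd_pmem` runs after the boot-time restore and `hotremove_pmem` runs just before the state is copied back to pmem0 for reboot.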

Pavel Tatashin (3):
  device-dax: fix memory and resource leak if hotplug fails
  mm/hotplug: make remove_memory() interface useable
  device-dax: "Hotremove" persistent memory that is used like normal RAM

 drivers/dax/dax-private.h      |  2 ++
 drivers/dax/kmem.c             | 46 ++++++++++++++++++++++---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c            | 61 ++++++++++++++++++++++------------
 4 files changed, 89 insertions(+), 28 deletions(-)

Comments

Verma, Vishal L May 2, 2019, 8:50 p.m. UTC | #1
On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
> 
> 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
>    and free ramdisk.
> 2. Convert raw pmem0 to devdax
>    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
>    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>    echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot hotremove device-dax memory from System RAM
>    echo offline > /sys/devices/system/memoryXXX/state
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of workflow to a new daxctl command
(daxctl-reconfigure-device) - this will allow changing the 'mode' of a
dax device to kmem and onlining the resulting memory, and, with your
patches, also attempting to offline the memory and change back to
device-dax.

While running with these patches and testing the offlining part, I ran
into the lockdep report below.

This is with just these three patches on top of -rc7.


[  +0.004886] ======================================================
[  +0.001576] WARNING: possible circular locking dependency detected
[  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O     
[  +0.000929] ------------------------------------------------------
[  +0.000708] daxctl/22950 is trying to acquire lock:
[  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
[  +0.000922] 
              but task is already holding lock:
[  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.000960] 
              which lock already depends on the new lock.

[  +0.001001] 
              the existing dependency chain (in reverse order) is:
[  +0.000837] 
              -> #3 (mem_sysfs_mutex){+.+.}:
[  +0.000631]        __mutex_lock+0x82/0x9a0
[  +0.000477]        unregister_memory_section+0x22/0xa0
[  +0.000582]        __remove_pages+0xe9/0x520
[  +0.000489]        arch_remove_memory+0x81/0xc0
[  +0.000510]        devm_memremap_pages_release+0x180/0x270
[  +0.000633]        release_nodes+0x234/0x280
[  +0.000483]        device_release_driver_internal+0xf4/0x1d0
[  +0.000701]        bus_remove_device+0xfc/0x170
[  +0.000529]        device_del+0x16a/0x380
[  +0.000459]        unregister_dev_dax+0x23/0x50
[  +0.000526]        release_nodes+0x234/0x280
[  +0.000487]        device_release_driver_internal+0xf4/0x1d0
[  +0.000646]        unbind_store+0x9b/0x130
[  +0.000467]        kernfs_fop_write+0xf0/0x1a0
[  +0.000510]        vfs_write+0xba/0x1c0
[  +0.000438]        ksys_write+0x5a/0xe0
[  +0.000521]        do_syscall_64+0x60/0x210
[  +0.000489]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000637] 
              -> #2 (mem_hotplug_lock.rw_sem){++++}:
[  +0.000717]        get_online_mems+0x3e/0x80
[  +0.000491]        kmem_cache_create_usercopy+0x2e/0x270
[  +0.000609]        kmem_cache_create+0x12/0x20
[  +0.000507]        ptlock_cache_init+0x20/0x28
[  +0.000506]        start_kernel+0x240/0x4d0
[  +0.000480]        secondary_startup_64+0xa4/0xb0
[  +0.000539] 
              -> #1 (cpu_hotplug_lock.rw_sem){++++}:
[  +0.000784]        cpus_read_lock+0x3e/0x80
[  +0.000511]        online_pages+0x37/0x310
[  +0.000469]        memory_subsys_online+0x34/0x60
[  +0.000611]        device_online+0x60/0x80
[  +0.000611]        state_store+0x66/0xd0
[  +0.000552]        kernfs_fop_write+0xf0/0x1a0
[  +0.000649]        vfs_write+0xba/0x1c0
[  +0.000487]        ksys_write+0x5a/0xe0
[  +0.000459]        do_syscall_64+0x60/0x210
[  +0.000482]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000646] 
              -> #0 (kn->count#424){++++}:
[  +0.000669]        lock_acquire+0x9e/0x180
[  +0.000471]        __kernfs_remove+0x26a/0x310
[  +0.000518]        kernfs_remove_by_name_ns+0x40/0x80
[  +0.000583]        remove_files.isra.1+0x30/0x70
[  +0.000555]        sysfs_remove_group+0x3d/0x80
[  +0.000524]        sysfs_remove_groups+0x29/0x40
[  +0.000532]        device_remove_attrs+0x42/0x80
[  +0.000522]        device_del+0x162/0x380
[  +0.000464]        device_unregister+0x16/0x60
[  +0.000505]        unregister_memory_section+0x6e/0xa0
[  +0.000591]        __remove_pages+0xe9/0x520
[  +0.000492]        arch_remove_memory+0x81/0xc0
[  +0.000568]        try_remove_memory+0xba/0xd0
[  +0.000510]        remove_memory+0x23/0x40
[  +0.000483]        dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000608]        device_release_driver_internal+0xe4/0x1d0
[  +0.000637]        unbind_store+0x9b/0x130
[  +0.000464]        kernfs_fop_write+0xf0/0x1a0
[  +0.000685]        vfs_write+0xba/0x1c0
[  +0.000594]        ksys_write+0x5a/0xe0
[  +0.000449]        do_syscall_64+0x60/0x210
[  +0.000481]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000619] 
              other info that might help us debug this:

[  +0.000889] Chain exists of:
                kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex

[  +0.001269]  Possible unsafe locking scenario:

[  +0.000652]        CPU0                    CPU1
[  +0.000505]        ----                    ----
[  +0.000523]   lock(mem_sysfs_mutex);
[  +0.000422]                                lock(mem_hotplug_lock.rw_sem);
[  +0.000905]                                lock(mem_sysfs_mutex);
[  +0.000793]   lock(kn->count#424);
[  +0.000394] 
               *** DEADLOCK ***

[  +0.000665] 7 locks held by daxctl/22950:
[  +0.000458]  #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
[  +0.000943]  #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
[  +0.000895]  #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
[  +0.001019]  #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
[  +0.000942]  #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
[  +0.001019]  #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
[  +0.001118]  #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.001033] 
              stack backtrace:
[  +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G           O      5.1.0-rc7+ #13
[  +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[  +0.001360] Call Trace:
[  +0.000293]  dump_stack+0x85/0xc0
[  +0.000390]  print_circular_bug.isra.41.cold.60+0x15c/0x195
[  +0.000651]  check_prev_add.constprop.50+0x5fd/0xbe0
[  +0.000563]  ? call_rcu_zapped+0x80/0x80
[  +0.000449]  __lock_acquire+0xcee/0xfd0
[  +0.000437]  lock_acquire+0x9e/0x180
[  +0.000428]  ? kernfs_remove_by_name_ns+0x40/0x80
[  +0.000531]  __kernfs_remove+0x26a/0x310
[  +0.000451]  ? kernfs_remove_by_name_ns+0x40/0x80
[  +0.000529]  ? kernfs_name_hash+0x12/0x80
[  +0.000462]  kernfs_remove_by_name_ns+0x40/0x80
[  +0.000513]  remove_files.isra.1+0x30/0x70
[  +0.000483]  sysfs_remove_group+0x3d/0x80
[  +0.000458]  sysfs_remove_groups+0x29/0x40
[  +0.000477]  device_remove_attrs+0x42/0x80
[  +0.000461]  device_del+0x162/0x380
[  +0.000399]  device_unregister+0x16/0x60
[  +0.000442]  unregister_memory_section+0x6e/0xa0
[  +0.001232]  __remove_pages+0xe9/0x520
[  +0.000443]  arch_remove_memory+0x81/0xc0
[  +0.000459]  try_remove_memory+0xba/0xd0
[  +0.000460]  remove_memory+0x23/0x40
[  +0.000461]  dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000603]  device_release_driver_internal+0xe4/0x1d0
[  +0.000590]  unbind_store+0x9b/0x130
[  +0.000409]  kernfs_fop_write+0xf0/0x1a0
[  +0.000448]  vfs_write+0xba/0x1c0
[  +0.000395]  ksys_write+0x5a/0xe0
[  +0.000382]  do_syscall_64+0x60/0x210
[  +0.000418]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000573] RIP: 0033:0x7fd1f7442fa8
[  +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
[  +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
[  +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
[  +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[  +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70
Pasha Tatashin May 2, 2019, 9:44 p.m. UTC | #2
On Thu, May 2, 2019 at 4:50 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> > The series of operations look like this:
> >
> > 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
> >    and free ramdisk.
> > 2. Convert raw pmem0 to devdax
> >    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> > 3. Hotadd to System RAM
> >    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> >    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> >    echo online_movable > /sys/devices/system/memoryXXX/state
> > 4. Before reboot hotremove device-dax memory from System RAM
> >    echo offline > /sys/devices/system/memoryXXX/state
> >    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.

Hi Verma,

Thank you for testing. I wonder if there is a command sequence that I
could run to reproduce it?
Also, could you please send your config and qemu arguments.

Thank you,
Pasha

> [ full lockdep trace quoted above snipped ]
Verma, Vishal L May 2, 2019, 10:29 p.m. UTC | #3
On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:

> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> > 
> > This is with just these three patches on top of -rc7.
> 
> Hi Verma,
> 
> Thank you for testing. I wonder if there is a command sequence that I
> could run to reproduce it?
> Also, could you please send your config and qemu arguments.
> 
Yes, here is the qemu config:

qemu-system-x86_64
	-machine accel=kvm
	-machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm
	-cpu Haswell-noTSX
	-m 12G,slots=3,maxmem=44G
	-realtime mlock=off
	-smp 8,sockets=2,cores=4,threads=1
	-numa node,nodeid=0,cpus=0-3,mem=6G
	-numa node,nodeid=1,cpus=4-7,mem=6G
	-numa node,nodeid=2
	-numa node,nodeid=3
	-drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1
	-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1
	-object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M
	-device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2
	-object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M
	-device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3
	-serial stdio
	-display none

For the command list - I'm using WIP patches to ndctl/daxctl to add the
command I mentioned earlier. Using this command, I can reproduce the
lockdep issue. I thought I should be able to reproduce the issue by
onlining/offlining through sysfs directly too - something like:

   node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
   for mem in /sys/devices/system/node/node"$node"/memory*; do
     echo "offline" > $mem/state
   done

But with that I can't reproduce the problem.

I'll try to dig a bit deeper into what might be happening, the daxctl
modifications simply amount to doing the same thing as above in C, so
I'm not immediately sure what might be happening.

If you're interested, I can post the ndctl patches - maybe as an RFC -
to test with.

Thanks,
-Vishal
Pasha Tatashin May 2, 2019, 10:36 p.m. UTC | #4
On Thu, May 2, 2019 at 6:29 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> [ qemu config and sysfs reproduction attempt snipped ]
>
> I'll try to dig a bit deeper into what might be happening, the daxctl
> modifications simply amount to doing the same thing as above in C, so
> I'm not immediately sure what might be happening.
>
> If you're interested, I can post the ndctl patches - maybe as an RFC -
> to test with.

I could apply the patches and test with them. Also, could you please
send your kernel config.

Thank you,
Pasha

Verma, Vishal L May 3, 2019, 9:48 p.m. UTC | #5
On Thu, 2019-05-02 at 18:36 -0400, Pavel Tatashin wrote:
> > [ qemu config and earlier discussion snipped ]
> 
> I could apply the patches and test with them. Also, could you please
> send your kernel config.
> 
Hi Pavel,

I've CC'd you on the patches mentioned above, and also pushed them to a
'kmem-pending' branch on github:

https://github.com/pmem/ndctl/tree/kmem-pending

After building ndctl from the above, you will want to run:

# daxctl reconfigure-device --mode=system-ram dax0.0
(this will also have onlined the memory sections)

# daxctl reconfigure-device --mode=devdax --attempt-offline dax0.0
(this triggers the lockdep warnings)

I've attached the kernel config here too (gzipped).

Thanks,
	-Vishal
Pasha Tatashin May 15, 2019, 6:11 p.m. UTC | #6
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
>
>
> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
> [  +0.000929] ------------------------------------------------------
> [  +0.000708] daxctl/22950 is trying to acquire lock:
> [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000922]
>               but task is already holding lock:
> [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0

I have studied this issue and now have a clear understanding of why it
happens. I am not yet sure how to fix it, so suggestions are welcome. :)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
 vfs_write
  __vfs_write
   kernfs_fop_write
    kernfs_get_active
     lock_acquire                       kn->count#122 (lock for "memory8/state" kn)
    sysfs_kf_write
     dev_attr_store
      state_store
       device_offline
        memory_subsys_offline
         memory_block_action
          offline_pages
           __offline_pages
            percpu_down_write
             down_write
              lock_acquire              mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following stack:
# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
 unbind_store
  device_driver_detach
   device_release_driver_internal
    dev_dax_kmem_remove
     remove_memory                      device_hotplug_lock
      try_remove_memory                 mem_hotplug_lock.rw_sem
       arch_remove_memory
        __remove_pages
         __remove_section
          unregister_memory_section
           remove_memory_section        mem_sysfs_mutex
            unregister_memory
             device_unregister
              device_del
               device_remove_attrs
                sysfs_remove_groups
                 sysfs_remove_group
                  remove_files
                   kernfs_remove_by_name
                    kernfs_remove_by_name_ns
                     __kernfs_remove    kn->count#122

So, lockdep found the ordering issue with the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122
Dan Williams May 16, 2019, 12:42 a.m. UTC | #7
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [  +0.004886] ======================================================
> > [  +0.001576] WARNING: possible circular locking dependency detected
> > [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
> > [  +0.000929] ------------------------------------------------------
> > [  +0.000708] daxctl/22950 is trying to acquire lock:
> > [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
> > [  +0.000922]
> >               but task is already holding lock:
> > [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>
> I have studied this issue and now have a clear understanding of why it
> happens. I am not yet sure how to fix it, so suggestions are welcome
> :)

I would think that ACPI hotplug would have a similar problem, but it does this:

                acpi_unbind_memory_blocks(info);
                __remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.


>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>  vfs_write
>   __vfs_write
>    kernfs_fop_write
>     kernfs_get_active
>      lock_acquire                       kn->count#122 (lock for
> "memory8/state" kn)
>     sysfs_kf_write
>      dev_attr_store
>       state_store
>        device_offline
>         memory_subsys_offline
>          memory_block_action
>           offline_pages
>            __offline_pages
>             percpu_down_write
>              down_write
>               lock_acquire              mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following  stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>  unbind_store
>   device_driver_detach
>    device_release_driver_internal
>     dev_dax_kmem_remove
>      remove_memory                      device_hotplug_lock
>       try_remove_memory                 mem_hotplug_lock.rw_sem
>        arch_remove_memory
>         __remove_pages
>          __remove_section
>           unregister_memory_section
>            remove_memory_section        mem_sysfs_mutex
>             unregister_memory
>              device_unregister
>               device_del
>                device_remove_attrs
>                 sysfs_remove_groups
>                  sysfs_remove_group
>                   remove_files
>                    kernfs_remove_by_name
>                     kernfs_remove_by_name_ns
>                      __kernfs_remove    kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122
David Hildenbrand May 16, 2019, 7:10 a.m. UTC | #8
On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>>> Hi Pavel,
>>>
>>> I am working on adding this sort of a workflow into a new daxctl command
>>> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
>>> dax device to kmem, online the resulting memory, and with your patches,
>>> also attempt to offline the memory, and change back to device-dax.
>>>
>>> In running with these patches, and testing the offlining part, I ran
>>> into the following lockdep below.
>>>
>>> This is with just these three patches on top of -rc7.
>>>
>>>
>>> [  +0.004886] ======================================================
>>> [  +0.001576] WARNING: possible circular locking dependency detected
>>> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G           O
>>> [  +0.000929] ------------------------------------------------------
>>> [  +0.000708] daxctl/22950 is trying to acquire lock:
>>> [  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
>>> [  +0.000922]
>>>               but task is already holding lock:
>>> [  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
>>
>> I have studied this issue and now have a clear understanding of why it
>> happens. I am not yet sure how to fix it, so suggestions are welcome
>> :)
> 
> I would think that ACPI hotplug would have a similar problem, but it does this:
> 
>                 acpi_unbind_memory_blocks(info);
>                 __remove_memory(nid, info->start_addr, info->length);
> 
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
> 

If that doesn't help, after we have

[PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling

we could probably pull the memory device removal phase out from the
mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.
Pasha Tatashin May 17, 2019, 2:09 p.m. UTC | #9
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
>                 acpi_unbind_memory_blocks(info);
>                 __remove_memory(nid, info->start_addr, info->length);

ACPI has exactly the same problem, so this is not a bug in this
series. I will submit a new version of my series with the comments
addressed, but without a fix for this issue.

I was able to reproduce this issue on the current mainline kernel.
I have also been thinking more about how to fix it, and there is no
easy fix without a major hotplug redesign. Basically, we have to remove
the sysfs memory entries either before or after the memory is
hotplugged/hotremoved, but we also have to guarantee that the
hotplug/hotremove will succeed, or else reinstate the sysfs entries.

Qemu script:

qemu-system-x86_64                                                      \
        -enable-kvm                                                     \
        -cpu host                                                       \
        -parallel none                                                  \
        -echr 1                                                         \
        -serial none                                                    \
        -chardev stdio,id=console,signal=off,mux=on                     \
        -serial chardev:console                                         \
        -mon chardev=console                                            \
        -vga none                                                       \
        -display none                                                   \
        -kernel pmem/native/arch/x86/boot/bzImage                       \
        -m 8G,slots=1,maxmem=16G                                        \
        -smp 8                                                          \
        -fsdev local,id=virtfs1,path=/,security_model=none              \
        -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs            \
        -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]        process_one_work+0x2a0/0x650
[   32.032942]        worker_thread+0x34/0x3d0
[   32.033283]        kthread+0x118/0x130
[   32.033588]        ret_from_fork+0x3a/0x50
[   32.033919]
[   32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[   32.034450]        __mutex_lock+0x8b/0x900
[   32.034784]        acpi_get_first_physical_node+0x16/0x60
[   32.035217]        acpi_companion_match+0x3b/0x60
[   32.035594]        acpi_device_uevent_modalias+0x9/0x20
[   32.036012]        platform_uevent+0xd/0x40
[   32.036352]        dev_uevent+0x85/0x1c0
[   32.036674]        kobject_uevent_env+0x1e2/0x640
[   32.037044]        kobject_synth_uevent+0x2b7/0x324
[   32.037428]        uevent_store+0x17/0x30
[   32.037752]        kernfs_fop_write+0xeb/0x1a0
[   32.038112]        vfs_write+0xb2/0x1b0
[   32.038417]        ksys_write+0x57/0xd0
[   32.038721]        do_syscall_64+0x4b/0x1a0
[   32.039053]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   32.039491]
[   32.039491] -> #0 (kn->count#18){++++}:
[   32.039913]        lock_acquire+0xaa/0x180
[   32.040242]        __kernfs_remove+0x244/0x2d0
[   32.040593]        kernfs_remove_by_name_ns+0x3b/0x80
[   32.040991]        device_del+0x14a/0x370
[   32.041309]        device_unregister+0x9/0x20
[   32.041653]        unregister_memory_section+0x69/0xa0
[   32.042059]        __remove_pages+0x112/0x460
[   32.042402]        arch_remove_memory+0x6f/0xa0
[   32.042758]        __remove_memory+0xab/0x130
[   32.043103]        acpi_memory_device_remove+0x67/0xe0
[   32.043537]        acpi_bus_trim+0x50/0x90
[   32.043889]        acpi_device_hotplug+0x2fa/0x3e0
[   32.044300]        acpi_hotplug_work_fn+0x15/0x20
[   32.044686]        process_one_work+0x2a0/0x650
[   32.045044]        worker_thread+0x34/0x3d0
[   32.045381]        kthread+0x118/0x130
[   32.045679]        ret_from_fork+0x3a/0x50
[   32.046005]
[   32.046005] other info that might help us debug this:
[   32.046005]
[   32.046636] Chain exists of:
[   32.046636]   kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[   32.046636]
[   32.047514]  Possible unsafe locking scenario:
[   32.047514]
[   32.047976]        CPU0                    CPU1
[   32.048337]        ----                    ----
[   32.048697]   lock(mem_sysfs_mutex);
[   32.048983]                                lock(mem_hotplug_lock.rw_sem);
[   32.049519]                                lock(mem_sysfs_mutex);
[   32.050004]   lock(kn->count#18);
[   32.050270]
[   32.050270]  *** DEADLOCK ***
[   32.050270]
[   32.050736] 7 locks held by kworker/u16:4/380:
[   32.051087]  #0: 00000000a22fe78e
((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[   32.051830]  #1: 00000000944f2dca
((work_completion)(&hpw->work)){+.+.}, at:
process_one_work+0x21e/0x650
[   32.052577]  #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
acpi_device_hotplug+0x2e/0x3e0
[   32.053271]  #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
acpi_device_hotplug+0x3c/0x3e0
[   32.053916]  #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
at: __remove_memory+0x3b/0x130
[   32.054602]  #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
at: percpu_down_write+0x1d/0x110
[   32.055315]  #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.056016]
[   32.056016] stack backtrace:
[   32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[   32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[   32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   32.058144] Call Trace:
[   32.058344]  dump_stack+0x67/0x90
[   32.058604]  print_circular_bug.cold.60+0x15c/0x195
[   32.058989]  __lock_acquire+0x17de/0x1d30
[   32.059308]  ? find_held_lock+0x2d/0x90
[   32.059611]  ? __kernfs_remove+0x199/0x2d0
[   32.059937]  lock_acquire+0xaa/0x180
[   32.060223]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.060596]  __kernfs_remove+0x244/0x2d0
[   32.060908]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.061283]  ? kernfs_name_hash+0xd/0x80
[   32.061596]  ? kernfs_find_ns+0x68/0xf0
[   32.061907]  kernfs_remove_by_name_ns+0x3b/0x80
[   32.062266]  device_del+0x14a/0x370
[   32.062548]  ? unregister_mem_sect_under_nodes+0x4f/0xc0
[   32.062973]  device_unregister+0x9/0x20
[   32.063285]  unregister_memory_section+0x69/0xa0
[   32.063651]  __remove_pages+0x112/0x460
[   32.063949]  arch_remove_memory+0x6f/0xa0
[   32.064271]  __remove_memory+0xab/0x130
[   32.064579]  ? walk_memory_range+0xa1/0xe0
[   32.064907]  acpi_memory_device_remove+0x67/0xe0
[   32.065274]  acpi_bus_trim+0x50/0x90
[   32.065560]  acpi_device_hotplug+0x2fa/0x3e0
[   32.065900]  acpi_hotplug_work_fn+0x15/0x20
[   32.066249]  process_one_work+0x2a0/0x650
[   32.066591]  worker_thread+0x34/0x3d0
[   32.066925]  ? process_one_work+0x650/0x650
[   32.067275]  kthread+0x118/0x130
[   32.067542]  ? kthread_create_on_node+0x60/0x60
[   32.067909]  ret_from_fork+0x3a/0x50

David Hildenbrand May 20, 2019, 7:57 a.m. UTC | #10
On 17.05.19 16:09, Pavel Tatashin wrote:
>>
>> I would think that ACPI hotplug would have a similar problem, but it does this:
>>
>>                 acpi_unbind_memory_blocks(info);
>>                 __remove_memory(nid, info->start_addr, info->length);
> 
> ACPI has exactly the same problem, so this is not a bug in this
> series. I will submit a new version of my series with the comments
> addressed, but without a fix for this issue.
> 
> I was able to reproduce this issue on the current mainline kernel.
> I have also been thinking more about how to fix it, and there is no
> easy fix without a major hotplug redesign. Basically, we have to remove
> the sysfs memory entries either before or after the memory is
> hotplugged/hotremoved, but we also have to guarantee that the
> hotplug/hotremove will succeed, or else reinstate the sysfs entries.
> 
> [...]
>>
>> I wonder if that ordering prevents going too deep into the
>> device_unregister() call stack that you highlighted below.
>>
>>
>>>
>>> Here is the problem:
>>>
>>> When we offline pages we have the following call stack:
>>>
>>> # echo offline > /sys/devices/system/memory/memory8/state
>>> ksys_write
>>>  vfs_write
>>>   __vfs_write
>>>    kernfs_fop_write
>>>     kernfs_get_active
>>>      lock_acquire                       kn->count#122 (lock for
>>> "memory8/state" kn)
>>>     sysfs_kf_write
>>>      dev_attr_store
>>>       state_store
>>>        device_offline
>>>         memory_subsys_offline
>>>          memory_block_action
>>>           offline_pages
>>>            __offline_pages
>>>             percpu_down_write
>>>              down_write
>>>               lock_acquire              mem_hotplug_lock.rw_sem
>>>
>>> When we unbind dax0.0 we have the following  stack:
>>> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>>> drv_attr_store
>>>  unbind_store
>>>   device_driver_detach
>>>    device_release_driver_internal
>>>     dev_dax_kmem_remove
>>>      remove_memory                      device_hotplug_lock
>>>       try_remove_memory                 mem_hotplug_lock.rw_sem
>>>        arch_remove_memory
>>>         __remove_pages
>>>          __remove_section
>>>           unregister_memory_section
>>>            remove_memory_section        mem_sysfs_mutex
>>>             unregister_memory
>>>              device_unregister
>>>               device_del
>>>                device_remove_attrs
>>>                 sysfs_remove_groups
>>>                  sysfs_remove_group
>>>                   remove_files
>>>                    kernfs_remove_by_name
>>>                     kernfs_remove_by_name_ns
>>>                      __kernfs_remove    kn->count#122
>>>
>>> So, lockdep found the ordering issue with the above two stacks:
>>>
>>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
>>> 2. mem_hotplug_lock.rw_sem -> kn->count#122

I once documented the locking behavior in

Documentation/core-api/memory-hotplug.rst

Both device_online() and __remove_memory() always have to be called
while holding the device_hotplug_lock, to avoid such races.

# echo offline > /sys/devices/system/memory/memory8/state
and
# echo 0 > /sys/devices/system/memory/memory8/online

either end up in:

-> online_store()
-- lock_device_hotplug_sysfs();
-- device_offline(dev)

-> state_store()
-- lock_device_hotplug_sysfs();
--  device_online(&mem->dev);

So the device_hotplug_lock prohibits the race you describe.

BUT

There is a possible race between the device_hotplug_lock and the
kn->count#122. However, that race can never trigger, as userspace
properly backs off when it cannot get hold of the device_hotplug_lock.

(lock_device_hotplug_sysfs does a mutex_trylock(&device_hotplug_lock))