Message ID: 20190502184337.20538-1-pasha.tatashin@soleen.com
Series: "Hotremove" persistent memory
On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> The series of operations look like this:
>
> 1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps,
>    and free ramdisk.
> 2. Convert raw pmem0 to devdax
>    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> 3. Hotadd to System RAM
>    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>    echo online_movable > /sys/devices/system/memoryXXX/state
> 4. Before reboot hotremove device-dax memory from System RAM
>    echo offline > /sys/devices/system/memoryXXX/state
>    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Hi Pavel,

I am working on adding this sort of workflow to a new daxctl command
(daxctl-reconfigure-device). It will allow changing the 'mode' of a dax
device to kmem, onlining the resulting memory, and, with your patches,
also attempting to offline the memory and change back to device-dax.

While running with these patches and testing the offlining part, I ran
into the lockdep report below. This is with just these three patches on
top of -rc7.

[  +0.004886] ======================================================
[  +0.001576] WARNING: possible circular locking dependency detected
[  +0.001506] 5.1.0-rc7+ #13 Tainted: G O
[  +0.000929] ------------------------------------------------------
[  +0.000708] daxctl/22950 is trying to acquire lock:
[  +0.000548] 00000000f4d397f7 (kn->count#424){++++}, at: kernfs_remove_by_name_ns+0x40/0x80
[  +0.000922] but task is already holding lock:
[  +0.000657] 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.000960] which lock already depends on the new lock.
[  +0.001001] the existing dependency chain (in reverse order) is:
[  +0.000837] -> #3 (mem_sysfs_mutex){+.+.}:
[  +0.000631]        __mutex_lock+0x82/0x9a0
[  +0.000477]        unregister_memory_section+0x22/0xa0
[  +0.000582]        __remove_pages+0xe9/0x520
[  +0.000489]        arch_remove_memory+0x81/0xc0
[  +0.000510]        devm_memremap_pages_release+0x180/0x270
[  +0.000633]        release_nodes+0x234/0x280
[  +0.000483]        device_release_driver_internal+0xf4/0x1d0
[  +0.000701]        bus_remove_device+0xfc/0x170
[  +0.000529]        device_del+0x16a/0x380
[  +0.000459]        unregister_dev_dax+0x23/0x50
[  +0.000526]        release_nodes+0x234/0x280
[  +0.000487]        device_release_driver_internal+0xf4/0x1d0
[  +0.000646]        unbind_store+0x9b/0x130
[  +0.000467]        kernfs_fop_write+0xf0/0x1a0
[  +0.000510]        vfs_write+0xba/0x1c0
[  +0.000438]        ksys_write+0x5a/0xe0
[  +0.000521]        do_syscall_64+0x60/0x210
[  +0.000489]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000637] -> #2 (mem_hotplug_lock.rw_sem){++++}:
[  +0.000717]        get_online_mems+0x3e/0x80
[  +0.000491]        kmem_cache_create_usercopy+0x2e/0x270
[  +0.000609]        kmem_cache_create+0x12/0x20
[  +0.000507]        ptlock_cache_init+0x20/0x28
[  +0.000506]        start_kernel+0x240/0x4d0
[  +0.000480]        secondary_startup_64+0xa4/0xb0
[  +0.000539] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
[  +0.000784]        cpus_read_lock+0x3e/0x80
[  +0.000511]        online_pages+0x37/0x310
[  +0.000469]        memory_subsys_online+0x34/0x60
[  +0.000611]        device_online+0x60/0x80
[  +0.000611]        state_store+0x66/0xd0
[  +0.000552]        kernfs_fop_write+0xf0/0x1a0
[  +0.000649]        vfs_write+0xba/0x1c0
[  +0.000487]        ksys_write+0x5a/0xe0
[  +0.000459]        do_syscall_64+0x60/0x210
[  +0.000482]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000646] -> #0 (kn->count#424){++++}:
[  +0.000669]        lock_acquire+0x9e/0x180
[  +0.000471]        __kernfs_remove+0x26a/0x310
[  +0.000518]        kernfs_remove_by_name_ns+0x40/0x80
[  +0.000583]        remove_files.isra.1+0x30/0x70
[  +0.000555]        sysfs_remove_group+0x3d/0x80
[  +0.000524]        sysfs_remove_groups+0x29/0x40
[  +0.000532]        device_remove_attrs+0x42/0x80
[  +0.000522]        device_del+0x162/0x380
[  +0.000464]        device_unregister+0x16/0x60
[  +0.000505]        unregister_memory_section+0x6e/0xa0
[  +0.000591]        __remove_pages+0xe9/0x520
[  +0.000492]        arch_remove_memory+0x81/0xc0
[  +0.000568]        try_remove_memory+0xba/0xd0
[  +0.000510]        remove_memory+0x23/0x40
[  +0.000483]        dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000608]        device_release_driver_internal+0xe4/0x1d0
[  +0.000637]        unbind_store+0x9b/0x130
[  +0.000464]        kernfs_fop_write+0xf0/0x1a0
[  +0.000685]        vfs_write+0xba/0x1c0
[  +0.000594]        ksys_write+0x5a/0xe0
[  +0.000449]        do_syscall_64+0x60/0x210
[  +0.000481]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000619] other info that might help us debug this:
[  +0.000889] Chain exists of:
                kn->count#424 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[  +0.001269] Possible unsafe locking scenario:
[  +0.000652]        CPU0                    CPU1
[  +0.000505]        ----                    ----
[  +0.000523]   lock(mem_sysfs_mutex);
[  +0.000422]                                lock(mem_hotplug_lock.rw_sem);
[  +0.000905]                                lock(mem_sysfs_mutex);
[  +0.000793]   lock(kn->count#424);
[  +0.000394]  *** DEADLOCK ***
[  +0.000665] 7 locks held by daxctl/22950:
[  +0.000458]  #0: 000000005f6d3c13 (sb_writers#4){.+.+}, at: vfs_write+0x159/0x1c0
[  +0.000943]  #1: 00000000e468825d (&of->mutex){+.+.}, at: kernfs_fop_write+0xbd/0x1a0
[  +0.000895]  #2: 00000000caa17dbb (&dev->mutex){....}, at: device_release_driver_internal+0x1a/0x1d0
[  +0.001019]  #3: 000000002119b22c (device_hotplug_lock){+.+.}, at: remove_memory+0x16/0x40
[  +0.000942]  #4: 00000000150c8efe (cpu_hotplug_lock.rw_sem){++++}, at: try_remove_memory+0x2e/0xd0
[  +0.001019]  #5: 000000003d6b2a0f (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x25/0x120
[  +0.001118]  #6: 000000002aa52a9f (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x22/0xa0
[  +0.001033] stack backtrace:
[  +0.000507] CPU: 5 PID: 22950 Comm: daxctl Tainted: G O 5.1.0-rc7+ #13
[  +0.000896] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[  +0.001360] Call Trace:
[  +0.000293]  dump_stack+0x85/0xc0
[  +0.000390]  print_circular_bug.isra.41.cold.60+0x15c/0x195
[  +0.000651]  check_prev_add.constprop.50+0x5fd/0xbe0
[  +0.000563]  ? call_rcu_zapped+0x80/0x80
[  +0.000449]  __lock_acquire+0xcee/0xfd0
[  +0.000437]  lock_acquire+0x9e/0x180
[  +0.000428]  ? kernfs_remove_by_name_ns+0x40/0x80
[  +0.000531]  __kernfs_remove+0x26a/0x310
[  +0.000451]  ? kernfs_remove_by_name_ns+0x40/0x80
[  +0.000529]  ? kernfs_name_hash+0x12/0x80
[  +0.000462]  kernfs_remove_by_name_ns+0x40/0x80
[  +0.000513]  remove_files.isra.1+0x30/0x70
[  +0.000483]  sysfs_remove_group+0x3d/0x80
[  +0.000458]  sysfs_remove_groups+0x29/0x40
[  +0.000477]  device_remove_attrs+0x42/0x80
[  +0.000461]  device_del+0x162/0x380
[  +0.000399]  device_unregister+0x16/0x60
[  +0.000442]  unregister_memory_section+0x6e/0xa0
[  +0.001232]  __remove_pages+0xe9/0x520
[  +0.000443]  arch_remove_memory+0x81/0xc0
[  +0.000459]  try_remove_memory+0xba/0xd0
[  +0.000460]  remove_memory+0x23/0x40
[  +0.000461]  dev_dax_kmem_remove+0x29/0x57 [kmem]
[  +0.000603]  device_release_driver_internal+0xe4/0x1d0
[  +0.000590]  unbind_store+0x9b/0x130
[  +0.000409]  kernfs_fop_write+0xf0/0x1a0
[  +0.000448]  vfs_write+0xba/0x1c0
[  +0.000395]  ksys_write+0x5a/0xe0
[  +0.000382]  do_syscall_64+0x60/0x210
[  +0.000418]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  +0.000573] RIP: 0033:0x7fd1f7442fa8
[  +0.000407] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 75 77 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  +0.002119] RSP: 002b:00007ffd48f58e28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000833] RAX: ffffffffffffffda RBX: 000000000210c817 RCX: 00007fd1f7442fa8
[  +0.000795] RDX: 0000000000000007 RSI: 000000000210c817 RDI: 0000000000000003
[  +0.000816] RBP: 0000000000000007 R08: 000000000210c7d0 R09: 00007fd1f74d4e80
[  +0.000808] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[  +0.000819] R13: 00007fd1f72b9ce8 R14: 0000000000000000 R15: 00007ffd48f58e70
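To spell out the sequence being exercised here, assembled from the steps
quoted above (a sketch only: dax0.0, namespace0.0, and the memory block
numbers all vary per system; the node lookup replaces the memoryXXX
placeholder):

    # Convert the raw namespace to devdax (step 2):
    ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f

    # Hand the device to the kmem driver (step 3):
    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

    # Online the new memory blocks via the device's target node:
    node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
    for mem in /sys/devices/system/node/node"$node"/memory*; do
        echo online_movable > "$mem/state"
    done

    # Reverse the operation (step 4) -- the part under test here:
    for mem in /sys/devices/system/node/node"$node"/memory*; do
        echo offline > "$mem/state"
    done
    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind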
On Thu, May 2, 2019 at 4:50 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> On Thu, 2019-05-02 at 14:43 -0400, Pavel Tatashin wrote:
> > The series of operations look like this:
> > [..]
>
> Hi Pavel,
>
> I am working on adding this sort of workflow to a new daxctl command
> (daxctl-reconfigure-device). It will allow changing the 'mode' of a dax
> device to kmem, onlining the resulting memory, and, with your patches,
> also attempting to offline the memory and change back to device-dax.
>
> While running with these patches and testing the offlining part, I ran
> into the lockdep report below.
>
> This is with just these three patches on top of -rc7.

Hi Vishal,

Thank you for testing. Is there a command sequence I could run to
reproduce it? Also, could you please send your config and qemu
arguments?

Thank you,
Pasha

> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [..]
On Thu, 2019-05-02 at 17:44 -0400, Pavel Tatashin wrote:
> > While running with these patches and testing the offlining part, I
> > ran into the lockdep report below.
> >
> > This is with just these three patches on top of -rc7.
>
> Hi Vishal,
>
> Thank you for testing. Is there a command sequence I could run to
> reproduce it? Also, could you please send your config and qemu
> arguments?

Yes, here is the qemu config:

qemu-system-x86_64 \
    -machine accel=kvm \
    -machine pc-i440fx-2.6,accel=kvm,usb=off,vmport=off,dump-guest-core=off,nvdimm \
    -cpu Haswell-noTSX \
    -m 12G,slots=3,maxmem=44G \
    -realtime mlock=off \
    -smp 8,sockets=2,cores=4,threads=1 \
    -numa node,nodeid=0,cpus=0-3,mem=6G \
    -numa node,nodeid=1,cpus=4-7,mem=6G \
    -numa node,nodeid=2 \
    -numa node,nodeid=3 \
    -drive file=/virt/fedora-test.qcow2,format=qcow2,if=none,id=drive-virtio-disk1 \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x9,drive=drive-virtio-disk1,id=virtio-disk1,bootindex=1 \
    -object memory-backend-file,id=mem1,share,mem-path=/virt/nvdimm1,size=16G,align=128M \
    -device nvdimm,memdev=mem1,id=nv1,label-size=2M,node=2 \
    -object memory-backend-file,id=mem2,share,mem-path=/virt/nvdimm2,size=16G,align=128M \
    -device nvdimm,memdev=mem2,id=nv2,label-size=2M,node=3 \
    -serial stdio \
    -display none

For the command list - I'm using WIP patches to ndctl/daxctl that add
the command I mentioned earlier. With that command I can reproduce the
lockdep issue. I thought I should be able to reproduce it by
onlining/offlining through sysfs directly too - something like:

    node="$(cat /sys/bus/dax/devices/dax0.0/target_node)"
    for mem in /sys/devices/system/node/node"$node"/memory*; do
        echo "offline" > "$mem/state"
    done

But with that I can't reproduce the problem.

I'll try to dig a bit deeper into what might be happening. The daxctl
modifications simply amount to doing the same thing as above in C, so
I'm not immediately sure what the difference is.

If you're interested, I can post the ndctl patches - maybe as an RFC -
to test with.

Thanks,
-Vishal
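For what it's worth, the piece the daxctl path performs beyond that loop
is the final unbind, as in step 4 of the workflow quoted earlier - and
the unbind write is what appears in the -> #0 chain of the trace (a
sketch):

    # After offlining all of dax0.0's blocks, return it to device-dax;
    # this is the write handled by unbind_store() in the lockdep report:
    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind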
On Thu, May 2, 2019 at 6:29 PM Verma, Vishal L <vishal.l.verma@intel.com> wrote:
>
> Yes, here is the qemu config:
> [..]
>
> For the command list - I'm using WIP patches to ndctl/daxctl that add
> the command I mentioned earlier. With that command I can reproduce the
> lockdep issue.
> [..]
>
> If you're interested, I can post the ndctl patches - maybe as an RFC -
> to test with.

I could apply the patches and test with them. Also, could you please
send your kernel config?

Thank you,
Pasha
On Thu, 2019-05-02 at 18:36 -0400, Pavel Tatashin wrote:
> > If you're interested, I can post the ndctl patches - maybe as an RFC -
> > to test with.
>
> I could apply the patches and test with them. Also, could you please
> send your kernel config?

Hi Pavel,

I've CC'd you on the patches mentioned above, and also pushed them to a
'kmem-pending' branch on github:

    https://github.com/pmem/ndctl/tree/kmem-pending

After building ndctl from the above, you will want to run:

    # daxctl reconfigure-device --mode=system-ram dax0.0
    (this will also have onlined the memory sections)

    # daxctl reconfigure-device --mode=devdax --attempt-offline dax0.0
    (this triggers the lockdep warnings)

I've attached the kernel config here too (gzipped).

Thanks,
-Vishal
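Roughly, assuming the usual ndctl autotools flow of the time (the
configure flags and build-tree paths below are illustrative, not exact):

    git clone https://github.com/pmem/ndctl.git
    cd ndctl && git checkout kmem-pending
    ./autogen.sh
    ./configure CFLAGS='-g -O2' --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib64
    make
    # Then exercise the new command against the dax device:
    daxctl reconfigure-device --mode=system-ram dax0.0
    daxctl reconfigure-device --mode=devdax --attempt-offline dax0.0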
> Hi Pavel,
>
> I am working on adding this sort of workflow to a new daxctl command
> (daxctl-reconfigure-device). It will allow changing the 'mode' of a dax
> device to kmem, onlining the resulting memory, and, with your patches,
> also attempting to offline the memory and change back to device-dax.
>
> While running with these patches and testing the offlining part, I ran
> into the lockdep report below.
>
> This is with just these three patches on top of -rc7.
>
> [  +0.004886] ======================================================
> [  +0.001576] WARNING: possible circular locking dependency detected
> [..]

I have studied this issue and now have a clear understanding of why it
happens. I am not yet sure how to fix it, so suggestions are welcome :)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
 vfs_write
  __vfs_write
   kernfs_fop_write
    kernfs_get_active
     lock_acquire              kn->count#122 (lock for "memory8/state" kn)
    sysfs_kf_write
     dev_attr_store
      state_store
       device_offline
        memory_subsys_offline
         memory_block_action
          offline_pages
           __offline_pages
            percpu_down_write
             down_write
              lock_acquire     mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following stack:

# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
 unbind_store
  device_driver_detach
   device_release_driver_internal
    dev_dax_kmem_remove
     remove_memory             device_hotplug_lock
      try_remove_memory        mem_hotplug_lock.rw_sem
       arch_remove_memory
        __remove_pages
         __remove_section
          unregister_memory_section
           remove_memory_section    mem_sysfs_mutex
            unregister_memory
             device_unregister
              device_del
               device_remove_attrs
                sysfs_remove_groups
                 sysfs_remove_group
                  remove_files
                   kernfs_remove_by_name
                    kernfs_remove_by_name_ns
                     __kernfs_remove    kn->count#122

So, lockdep found the ordering issue between the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122
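Condensed, the two writes take the two locks in opposite orders (the
comments summarize the stacks above):

    # Order 1: the write holds the kernfs active reference (kn->count)
    # for "memory8/state", taken by kernfs_get_active(), and then takes
    # mem_hotplug_lock.rw_sem inside __offline_pages():
    echo offline > /sys/devices/system/memory/memory8/state

    # Order 2: the unbind takes device_hotplug_lock,
    # mem_hotplug_lock.rw_sem, and mem_sysfs_mutex first, then waits on
    # kn->count while deleting the memory block's sysfs files:
    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind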
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> I have studied this issue and now have a clear understanding of why it
> happens. I am not yet sure how to fix it, so suggestions are welcome :)

I would think that ACPI hotplug would have a similar problem, but it does this:

    acpi_unbind_memory_blocks(info);
    __remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.

> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
> [..]
>      lock_acquire              kn->count#122 (lock for "memory8/state" kn)
> [..]
>               lock_acquire     mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following stack:
>
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
> [..]
>      remove_memory             device_hotplug_lock
>       try_remove_memory        mem_hotplug_lock.rw_sem
> [..]
>            remove_memory_section    mem_sysfs_mutex
> [..]
>                      __kernfs_remove    kn->count#122
>
> So, lockdep found the ordering issue between the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122
On 16.05.19 02:42, Dan Williams wrote:
> On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>> I have studied this issue and now have a clear understanding of why it
>> happens. I am not yet sure how to fix it, so suggestions are welcome :)
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
>     acpi_unbind_memory_blocks(info);
>     __remove_memory(nid, info->start_addr, info->length);
>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.

If that doesn't help: once we have

    [PATCH v2 0/8] mm/memory_hotplug: Factor out memory block device handling

we could probably pull the memory block device removal phase out from
under the mem_hotplug_lock protection and let it be protected by the
device_hotplug_lock only. Might require some more work, though.
> > I would think that ACPI hotplug would have a similar problem, but it does this:
> >
> >     acpi_unbind_memory_blocks(info);
> >     __remove_memory(nid, info->start_addr, info->length);

ACPI has exactly the same problem, so this is not a bug in this series.
I will submit a new version of my series with comments addressed, but
without a fix for this issue.

I was able to reproduce this issue on the current mainline kernel. I
have also been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove the
sysfs memory entries before or after memory is hotplugged/hotremoved,
but we also have to guarantee that the hotplug/hotremove will succeed,
or else reinstate the sysfs entries.

Qemu script:

qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -parallel none \
    -echr 1 \
    -serial none \
    -chardev stdio,id=console,signal=off,mux=on \
    -serial chardev:console \
    -mon chardev=console \
    -vga none \
    -display none \
    -kernel pmem/native/arch/x86/boot/bzImage \
    -m 8G,slots=1,maxmem=16G \
    -smp 8 \
    -fsdev local,id=virtfs1,path=/,security_model=none \
    -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
    -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:

# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal

(qemu) device_del dimm1
[   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 00000000675cc7e1 (kn->count#18){++++}, at: kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]        process_one_work+0x2a0/0x650
[   32.032942]        worker_thread+0x34/0x3d0
[   32.033283]        kthread+0x118/0x130
[   32.033588]        ret_from_fork+0x3a/0x50
[   32.033919]
[   32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[   32.034450]        __mutex_lock+0x8b/0x900
[   32.034784]        acpi_get_first_physical_node+0x16/0x60
[   32.035217]        acpi_companion_match+0x3b/0x60
[   32.035594]        acpi_device_uevent_modalias+0x9/0x20
[   32.036012]        platform_uevent+0xd/0x40
[   32.036352]        dev_uevent+0x85/0x1c0
[   32.036674]        kobject_uevent_env+0x1e2/0x640
[   32.037044]        kobject_synth_uevent+0x2b7/0x324
[   32.037428]        uevent_store+0x17/0x30
[   32.037752]        kernfs_fop_write+0xeb/0x1a0
[   32.038112]        vfs_write+0xb2/0x1b0
[   32.038417]        ksys_write+0x57/0xd0
[   32.038721]        do_syscall_64+0x4b/0x1a0
[   32.039053]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   32.039491]
[   32.039491] -> #0 (kn->count#18){++++}:
[   32.039913]        lock_acquire+0xaa/0x180
[   32.040242]        __kernfs_remove+0x244/0x2d0
[   32.040593]        kernfs_remove_by_name_ns+0x3b/0x80
[   32.040991]        device_del+0x14a/0x370
[   32.041309]        device_unregister+0x9/0x20
[   32.041653]        unregister_memory_section+0x69/0xa0
[   32.042059]        __remove_pages+0x112/0x460
[   32.042402]        arch_remove_memory+0x6f/0xa0
[   32.042758]        __remove_memory+0xab/0x130
[   32.043103]        acpi_memory_device_remove+0x67/0xe0
[   32.043537]        acpi_bus_trim+0x50/0x90
[   32.043889]        acpi_device_hotplug+0x2fa/0x3e0
[   32.044300]        acpi_hotplug_work_fn+0x15/0x20
[   32.044686]        process_one_work+0x2a0/0x650
[   32.045044]        worker_thread+0x34/0x3d0
[   32.045381]        kthread+0x118/0x130
[   32.045679]        ret_from_fork+0x3a/0x50
[   32.046005]
[   32.046005] other info that might help us debug this:
[   32.046005]
[   32.046636] Chain exists of:
[   32.046636]   kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[   32.046636]
[   32.047514] Possible unsafe locking scenario:
[   32.047514]
[   32.047976]        CPU0                    CPU1
[   32.048337]        ----                    ----
[   32.048697]   lock(mem_sysfs_mutex);
[   32.048983]                                lock(mem_hotplug_lock.rw_sem);
[   32.049519]                                lock(mem_sysfs_mutex);
[   32.050004]   lock(kn->count#18);
[   32.050270]
[   32.050270]  *** DEADLOCK ***
[   32.050270]
[   32.050736] 7 locks held by kworker/u16:4/380:
[   32.051087]  #0: 00000000a22fe78e ((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[   32.051830]  #1: 00000000944f2dca ((work_completion)(&hpw->work)){+.+.}, at: process_one_work+0x21e/0x650
[   32.052577]  #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at: acpi_device_hotplug+0x2e/0x3e0
[   32.053271]  #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at: acpi_device_hotplug+0x3c/0x3e0
[   32.053916]  #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++}, at: __remove_memory+0x3b/0x130
[   32.054602]  #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x1d/0x110
[   32.055315]  #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x1d/0xa0
[   32.056016]
[   32.056016] stack backtrace:
[   32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[   32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[   32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   32.058144] Call Trace:
[   32.058344]  dump_stack+0x67/0x90
[   32.058604]  print_circular_bug.cold.60+0x15c/0x195
[   32.058989]  __lock_acquire+0x17de/0x1d30
[   32.059308]  ? find_held_lock+0x2d/0x90
[   32.059611]  ? __kernfs_remove+0x199/0x2d0
[   32.059937]  lock_acquire+0xaa/0x180
[   32.060223]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.060596]  __kernfs_remove+0x244/0x2d0
[   32.060908]  ? kernfs_remove_by_name_ns+0x3b/0x80
[   32.061283]  ? kernfs_name_hash+0xd/0x80
[   32.061596]  ? kernfs_find_ns+0x68/0xf0
[   32.061907]  kernfs_remove_by_name_ns+0x3b/0x80
[   32.062266]  device_del+0x14a/0x370
[   32.062548]  ? unregister_mem_sect_under_nodes+0x4f/0xc0
[   32.062973]  device_unregister+0x9/0x20
[   32.063285]  unregister_memory_section+0x69/0xa0
[   32.063651]  __remove_pages+0x112/0x460
[   32.063949]  arch_remove_memory+0x6f/0xa0
[   32.064271]  __remove_memory+0xab/0x130
[   32.064579]  ? walk_memory_range+0xa1/0xe0
[   32.064907]  acpi_memory_device_remove+0x67/0xe0
[   32.065274]  acpi_bus_trim+0x50/0x90
[   32.065560]  acpi_device_hotplug+0x2fa/0x3e0
[   32.065900]  acpi_hotplug_work_fn+0x15/0x20
[   32.066249]  process_one_work+0x2a0/0x650
[   32.066591]  worker_thread+0x34/0x3d0
[   32.066925]  ? process_one_work+0x650/0x650
[   32.067275]  kthread+0x118/0x130
[   32.067542]  ? kthread_create_on_node+0x60/0x60
[   32.067909]  ret_from_fork+0x3a/0x50

> > I wonder if that ordering prevents going too deep into the
> > device_unregister() call stack that you highlighted below.
> > [..]
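(To replay this elsewhere: the block number - memory79 here - depends on
the guest's memory layout; one way to online whatever the hot-add
created, sketched:)

    # Online every currently-offline memory block; right after the
    # device_add only the new DIMM's blocks should be in this state:
    for state in /sys/devices/system/memory/memory*/state; do
        grep -q '^offline$' "$state" && echo online_movable > "$state"
    done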
On 17.05.19 16:09, Pavel Tatashin wrote:
> ACPI has exactly the same problem, so this is not a bug in this
> series. I will submit a new version of my series with comments
> addressed, but without a fix for this issue.
>
> I was able to reproduce this issue on the current mainline kernel.
> [..]
>
> Steps to reproduce:
>
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
>
> # echo online_movable > /sys/devices/system/memory/memory79/state
>
> (qemu) device_del dimm1
> [..]
>
>>> So, lockdep found the ordering issue between the above two stacks:
>>>
>>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
>>> 2. mem_hotplug_lock.rw_sem -> kn->count#122

I once documented the locking behavior in
Documentation/core-api/memory-hotplug.rst.

Both device_online() and __remove_memory() always have to be called
holding the device_hotplug_lock, to avoid such races.

# echo offline > /sys/devices/system/memory/memory8/state
and
# echo 0 > /sys/devices/system/memory/memory8/online

either end up in:

-> online_store()
   -- lock_device_hotplug_sysfs();
   -- device_offline(dev)

-> state_store()
   -- lock_device_hotplug_sysfs();
   -- device_online(&mem->dev);

So the device_hotplug_lock prohibits the race you describe.

BUT

there is a possible race between the device_hotplug_lock and
kn->count#122. However, that race can never trigger, as userspace
properly backs off in case it cannot get hold of the
device_hotplug_lock (lock_device_hotplug_sysfs() does a
mutex_trylock(&device_hotplug_lock)).
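That is, both user-visible ways of offlining a block funnel through
lock_device_hotplug_sysfs() (to my understanding, the trylock failure
makes the write back off and retry rather than block):

    # Two equivalent writes; both take device_hotplug_lock via
    # lock_device_hotplug_sysfs() before touching the block:
    echo offline > /sys/devices/system/memory/memory8/state   # state_store()
    echo 0 > /sys/devices/system/memory/memory8/online        # online_store()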