| Message ID | 20250209122035.1327325-8-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | New |
| Series | block: remove all debugfs dir & don't grab debugfs_mutex |
Hi Ming,
kernel test robot noticed the following build warnings:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.14-rc1 next-20250207]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/block-remove-hctx-debugfs_dir/20250209-202320
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20250209122035.1327325-8-ming.lei%40redhat.com
patch subject: [PATCH 7/7] block: don't grab q->debugfs_mutex
config: arc-randconfig-001-20250209 (https://download.01.org/0day-ci/archive/20250210/202502100010.Kok1H3KF-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250210/202502100010.Kok1H3KF-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502100010.Kok1H3KF-lkp@intel.com/
All warnings (new ones prefixed by >>):
block/blk-sysfs.c: In function 'blk_debugfs_remove':
>> block/blk-sysfs.c:745:31: warning: unused variable 'q' [-Wunused-variable]
  745 |         struct request_queue *q = disk->queue;
      |                               ^
vim +/q +745 block/blk-sysfs.c
8324aa91d1e11a Jens Axboe 2008-01-29 742
6fc75f309d291d Christoph Hellwig 2022-11-14 743 static void blk_debugfs_remove(struct gendisk *disk)
6fc75f309d291d Christoph Hellwig 2022-11-14 744 {
6fc75f309d291d Christoph Hellwig 2022-11-14 @745 struct request_queue *q = disk->queue;
6fc75f309d291d Christoph Hellwig 2022-11-14 746
6fc75f309d291d Christoph Hellwig 2022-11-14 747 blk_trace_shutdown(q);
abe3d073fa86b7 Ming Lei 2025-02-09 748 debugfs_lookup_and_remove(disk->disk_name, blk_debugfs_root);
6fc75f309d291d Christoph Hellwig 2022-11-14 749 }
6fc75f309d291d Christoph Hellwig 2022-11-14 750
On Feb 09, 2025 / 20:20, Ming Lei wrote:
> All block-internal state for adding/removing debugfs entries has been
> removed, and debugfs can synchronize everything for us at the fs level,
> so don't grab q->debugfs_mutex for adding/removing block-internal
> debugfs entries.
>
> Now q->debugfs_mutex is only used for blktrace; meanwhile, move the code
> that creates the queue debugfs dir out of q->sysfs_lock. Both locks are
> connected with the queue freeze I/O lock, so this cuts the lock chain
> between the queue freeze I/O lock and the debugfs lock.
>
> The following lockdep report can be fixed:
>
> https://lore.kernel.org/linux-block/ougniadskhks7uyxguxihgeuh2pv4yaqv4q3emo4gwuolgzdt6@brotly74p6bs/
>
> These are the contexts which add/remove debugfs entries:
>
> - update nr_hw_queues
>
> - add/remove disks
>
> - elevator switch
>
> - blktrace
>
> blktrace only adds entries under the disk top directory, so we can
> ignore it, because it can only work once the disk has been added. Also,
> nothing overlaps with the other two contexts, so the blktrace context
> is fine.
>
> Elevator switch is only allowed after the disk is added, so there is no
> race with add/remove disk. blk_mq_update_nr_hw_queues() always restores
> the previous elevator, so there is no race between these two either.
> The elevator switch context is fine.
>
> So far blk_mq_update_nr_hw_queues() doesn't hold the debugfs lock for
> adding/removing hctx entries, so there might be a race with add/remove
> disk, which is just fine in reality:
>
> - blk_mq_update_nr_hw_queues() is usually for error recovery, and the
>   disk won't be added/removed at the same time
>
> - even if there is a race between the two contexts, it is just fine,
>   since the hctx won't be freed until the queue is dead
>
> - we have never seen reports in this area despite not holding the
>   debugfs lock in blk_mq_update_nr_hw_queues()
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Ming, thank you for this quick action. I applied this series on top of the
v6.14-rc1 kernel and ran the block/002 test case. Unfortunately, it still
fails occasionally with the lockdep "WARNING: possible circular locking
dependency detected" below. Now debugfs_mutex is not reported as one of the
dependent locks, so I think this fix is working as expected. Instead,
eq->sysfs_lock creates a similar dependency. My guess is that this patch
avoids one dependency, but another dependency remains.

[ 115.085704] [ T1023] run blktests block/002 at 2025-02-10 09:22:22
[ 115.383653] [ T1054] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[ 115.641933] [ T1055] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[ 115.642961] [ T1055] scsi host9: scsi_debug: version 0191 [20210520] dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[ 115.646207] [ T1055] scsi 9:0:0:0: Direct-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
[ 115.648225] [ C0] scsi 9:0:0:0: Power-on or device reset occurred
[ 115.654243] [ T1055] sd 9:0:0:0: Attached scsi generic sg3 type 0
[ 115.656248] [ T100] sd 9:0:0:0: [sdd] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
[ 115.658403] [ T100] sd 9:0:0:0: [sdd] Write Protect is off
[ 115.659125] [ T100] sd 9:0:0:0: [sdd] Mode Sense: 73 00 10 08
[ 115.661621] [ T100] sd 9:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 115.669276] [ T100] sd 9:0:0:0: [sdd] permanent stream count = 5
[ 115.673375] [ T100] sd 9:0:0:0: [sdd] Preferred minimum I/O size 512 bytes
[ 115.673974] [ T100] sd 9:0:0:0: [sdd] Optimal transfer size 524288 bytes
[ 115.710112] [ T100] sd 9:0:0:0: [sdd] Attached SCSI disk

[ 116.464802] [ T1079] ======================================================
[ 116.465540] [ T1079] WARNING: possible circular locking dependency detected
[ 116.466107] [ T1079] 6.14.0-rc1+ #253 Not tainted
[ 116.466581] [ T1079] ------------------------------------------------------
[ 116.467141] [ T1079] blktrace/1079 is trying to acquire lock:
[ 116.467708] [ T1079] ffff88810539d1e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
[ 116.468439] [ T1079] but task is already holding lock:
[ 116.469052] [ T1079] ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[ 116.469901] [ T1079] which lock already depends on the new lock.

[ 116.470762] [ T1079] the existing dependency chain (in reverse order) is:
[ 116.473187] [ T1079] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[ 116.475670] [ T1079]        down_read+0x9b/0x470
[ 116.477001] [ T1079]        lookup_one_unlocked+0xe9/0x120
[ 116.478333] [ T1079]        lookup_positive_unlocked+0x1d/0x90
[ 116.479648] [ T1079]        debugfs_lookup+0x47/0xa0
[ 116.480833] [ T1079]        blk_mq_debugfs_unregister_sched_hctx+0x23/0x50
[ 116.482215] [ T1079]        blk_mq_exit_sched+0xb6/0x2b0
[ 116.483466] [ T1079]        elevator_switch+0x12a/0x4b0
[ 116.484676] [ T1079]        elv_iosched_store+0x29f/0x380
[ 116.485841] [ T1079]        queue_attr_store+0x313/0x480
[ 116.487078] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
[ 116.488358] [ T1079]        vfs_write+0x5f9/0xe90
[ 116.489460] [ T1079]        ksys_write+0xf6/0x1c0
[ 116.490582] [ T1079]        do_syscall_64+0x93/0x180
[ 116.491694] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 116.492996] [ T1079] -> #4 (&eq->sysfs_lock){+.+.}-{4:4}:
[ 116.495135] [ T1079]        __mutex_lock+0x1aa/0x1360
[ 116.496363] [ T1079]        elevator_switch+0x11f/0x4b0
[ 116.497499] [ T1079]        elv_iosched_store+0x29f/0x380
[ 116.498660] [ T1079]        queue_attr_store+0x313/0x480
[ 116.499752] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
[ 116.500884] [ T1079]        vfs_write+0x5f9/0xe90
[ 116.501964] [ T1079]        ksys_write+0xf6/0x1c0
[ 116.503056] [ T1079]        do_syscall_64+0x93/0x180
[ 116.504194] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 116.505356] [ T1079] -> #3 (&q->q_usage_counter(io)#2){++++}-{0:0}:
[ 116.507272] [ T1079]        blk_mq_submit_bio+0x19b8/0x2070
[ 116.508341] [ T1079]        __submit_bio+0x36b/0x6d0
[ 116.509351] [ T1079]        submit_bio_noacct_nocheck+0x542/0xca0
[ 116.510447] [ T1079]        iomap_readahead+0x4c4/0x7e0
[ 116.511485] [ T1079]        read_pages+0x198/0xaf0
[ 116.512468] [ T1079]        page_cache_ra_order+0x4d3/0xb50
[ 116.513497] [ T1079]        filemap_get_pages+0x2c7/0x1850
[ 116.514511] [ T1079]        filemap_read+0x31d/0xc30
[ 116.515492] [ T1079]        xfs_file_buffered_read+0x1e9/0x2a0 [xfs]
[ 116.517374] [ T1079]        xfs_file_read_iter+0x25f/0x4a0 [xfs]
[ 116.519428] [ T1079]        vfs_read+0x6bc/0xa20
[ 116.520904] [ T1079]        ksys_read+0xf6/0x1c0
[ 116.522378] [ T1079]        do_syscall_64+0x93/0x180
[ 116.523840] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 116.525467] [ T1079] -> #2 (mapping.invalidate_lock#2){++++}-{4:4}:
[ 116.528191] [ T1079]        down_read+0x9b/0x470
[ 116.529609] [ T1079]        filemap_fault+0x231/0x2ac0
[ 116.531033] [ T1079]        __do_fault+0xf4/0x5d0
[ 116.532461] [ T1079]        do_fault+0x965/0x11d0
[ 116.533859] [ T1079]        __handle_mm_fault+0x1060/0x1f40
[ 116.535343] [ T1079]        handle_mm_fault+0x21a/0x6b0
[ 116.536915] [ T1079]        do_user_addr_fault+0x322/0xaa0
[ 116.538409] [ T1079]        exc_page_fault+0x7a/0x110
[ 116.539811] [ T1079]        asm_exc_page_fault+0x22/0x30
[ 116.541247] [ T1079] -> #1 (&vma->vm_lock->lock){++++}-{4:4}:
[ 116.543820] [ T1079]        down_write+0x8d/0x200
[ 116.545202] [ T1079]        vma_link+0x1ff/0x590
[ 116.546588] [ T1079]        insert_vm_struct+0x15a/0x340
[ 116.548000] [ T1079]        alloc_bprm+0x626/0xbf0
[ 116.549396] [ T1079]        kernel_execve+0x85/0x2f0
[ 116.550784] [ T1079]        call_usermodehelper_exec_async+0x21b/0x430
[ 116.552352] [ T1079]        ret_from_fork+0x30/0x70
[ 116.553751] [ T1079]        ret_from_fork_asm+0x1a/0x30
[ 116.555164] [ T1079] -> #0 (&mm->mmap_lock){++++}-{4:4}:
[ 116.557848] [ T1079]        __lock_acquire+0x2f52/0x5ea0
[ 116.559293] [ T1079]        lock_acquire+0x1b1/0x540
[ 116.560678] [ T1079]        __might_fault+0xb9/0x120
[ 116.562051] [ T1079]        _copy_to_user+0x1e/0x80
[ 116.563446] [ T1079]        relay_file_read+0x149/0x8a0
[ 116.564851] [ T1079]        full_proxy_read+0x110/0x1d0
[ 116.566262] [ T1079]        vfs_read+0x1bb/0xa20
[ 116.567591] [ T1079]        ksys_read+0xf6/0x1c0
[ 116.568909] [ T1079]        do_syscall_64+0x93/0x180
[ 116.570283] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 116.571781] [ T1079] other info that might help us debug this:

[ 116.575369] [ T1079] Chain exists of: &mm->mmap_lock --> &eq->sysfs_lock --> &sb->s_type->i_mutex_key#3

[ 116.579323] [ T1079] Possible unsafe locking scenario:

[ 116.580961] [ T1079]        CPU0                    CPU1
[ 116.581866] [ T1079]        ----                    ----
[ 116.582737] [ T1079]   lock(&sb->s_type->i_mutex_key#3);
[ 116.583611] [ T1079]                                lock(&eq->sysfs_lock);
[ 116.584596] [ T1079]                                lock(&sb->s_type->i_mutex_key#3);
[ 116.585646] [ T1079]   rlock(&mm->mmap_lock);

[ 116.586453] [ T1079] *** DEADLOCK ***

[ 116.588484] [ T1079] 2 locks held by blktrace/1079:
[ 116.589318] [ T1079] #0: ffff888101390af8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x233/0x2f0
[ 116.590461] [ T1079] #1: ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0

[ 116.591720] [ T1079] stack backtrace:
[ 116.593166] [ T1079] CPU: 0 UID: 0 PID: 1079 Comm: blktrace Not tainted 6.14.0-rc1+ #253
[ 116.593169] [ T1079] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 116.593172] [ T1079] Call Trace:
[ 116.593176] [ T1079]  <TASK>
[ 116.593178] [ T1079]  dump_stack_lvl+0x6a/0x90
[ 116.593186] [ T1079]  print_circular_bug.cold+0x1e0/0x274
[ 116.593191] [ T1079]  check_noncircular+0x306/0x3f0
[ 116.593195] [ T1079]  ? __pfx_check_noncircular+0x10/0x10
[ 116.593200] [ T1079]  ? lockdep_lock+0xca/0x1c0
[ 116.593202] [ T1079]  ? __pfx_lockdep_lock+0x10/0x10
[ 116.593206] [ T1079]  __lock_acquire+0x2f52/0x5ea0
[ 116.593211] [ T1079]  ? lockdep_unlock+0xf1/0x250
[ 116.593213] [ T1079]  ? __pfx___lock_acquire+0x10/0x10
[ 116.593216] [ T1079]  ? lock_acquire+0x1b1/0x540
[ 116.593220] [ T1079]  lock_acquire+0x1b1/0x540
[ 116.593222] [ T1079]  ? __might_fault+0x99/0x120
[ 116.593226] [ T1079]  ? __pfx_lock_acquire+0x10/0x10
[ 116.593228] [ T1079]  ? lock_is_held_type+0xd5/0x130
[ 116.593234] [ T1079]  ? __pfx___might_resched+0x10/0x10
[ 116.593239] [ T1079]  ? _raw_spin_unlock+0x29/0x50
[ 116.593243] [ T1079]  ? __might_fault+0x99/0x120
[ 116.593245] [ T1079]  __might_fault+0xb9/0x120
[ 116.593247] [ T1079]  ? __might_fault+0x99/0x120
[ 116.593249] [ T1079]  ? __check_object_size+0x3f3/0x540
[ 116.593253] [ T1079]  _copy_to_user+0x1e/0x80
[ 116.593256] [ T1079]  relay_file_read+0x149/0x8a0
[ 116.593261] [ T1079]  ? selinux_file_permission+0x36d/0x420
[ 116.593270] [ T1079]  full_proxy_read+0x110/0x1d0
[ 116.593272] [ T1079]  ? rw_verify_area+0x2f7/0x520
[ 116.593275] [ T1079]  vfs_read+0x1bb/0xa20
[ 116.593278] [ T1079]  ? __pfx___mutex_lock+0x10/0x10
[ 116.593280] [ T1079]  ? __pfx_lock_release+0x10/0x10
[ 116.593283] [ T1079]  ? lock_release+0x45b/0x7a0
[ 116.593285] [ T1079]  ? __pfx_vfs_read+0x10/0x10
[ 116.593289] [ T1079]  ? __fget_files+0x1ae/0x2e0
[ 116.593294] [ T1079]  ksys_read+0xf6/0x1c0
[ 116.593296] [ T1079]  ? __pfx_ksys_read+0x10/0x10
[ 116.593300] [ T1079]  do_syscall_64+0x93/0x180
[ 116.593303] [ T1079]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 116.593306] [ T1079]  ? do_syscall_64+0x9f/0x180
[ 116.593308] [ T1079]  ? lockdep_hardirqs_on+0x78/0x100
[ 116.593310] [ T1079]  ? do_syscall_64+0x9f/0x180
[ 116.593313] [ T1079]  ? __pfx_vm_mmap_pgoff+0x10/0x10
[ 116.593317] [ T1079]  ? __fget_files+0x1ae/0x2e0
[ 116.593321] [ T1079]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 116.593324] [ T1079]  ? do_syscall_64+0x9f/0x180
[ 116.593326] [ T1079]  ? lockdep_hardirqs_on+0x78/0x100
[ 116.593328] [ T1079]  ? do_syscall_64+0x9f/0x180
[ 116.593331] [ T1079]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 116.593334] [ T1079] RIP: 0033:0x7f9586b50e4a
[ 116.593338] [ T1079] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 28 58 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 82 58 f8 ff 48 8b
[ 116.593341] [ T1079] RSP: 002b:00007f9586a3ed80 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 116.593348] [ T1079] RAX: ffffffffffffffda RBX: 00007f9580000bb0 RCX: 00007f9586b50e4a
[ 116.593349] [ T1079] RDX: 0000000000080000 RSI: 00007f9584800000 RDI: 0000000000000004
[ 116.593351] [ T1079] RBP: 00007f9586a3eda0 R08: 0000000000000000 R09: 0000000000000000
[ 116.593352] [ T1079] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 116.593353] [ T1079] R13: 000056218e3f97e0 R14: 0000000000000000 R15: 00007f9580002c90
[ 116.593359] [ T1079]  </TASK>
[ 116.687238] [ T1023] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
On Mon, Feb 10, 2025 at 8:52 AM Shinichiro Kawasaki
<shinichiro.kawasaki@wdc.com> wrote:
>
> On Feb 09, 2025 / 20:20, Ming Lei wrote:
> > All block-internal state for adding/removing debugfs entries has been
> > removed, and debugfs can synchronize everything for us at the fs level,
> > so don't grab q->debugfs_mutex for adding/removing block-internal
> > debugfs entries.
> > [...]
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Ming, thank you for this quick action. I applied this series on top of the
> v6.14-rc1 kernel and ran the block/002 test case. Unfortunately, it still
> fails occasionally with the lockdep "WARNING: possible circular locking
> dependency detected" below. Now debugfs_mutex is not reported as one of the
> dependent locks, so I think this fix is working as expected. Instead,
> eq->sysfs_lock creates a similar dependency. My guess is that this patch
> avoids one dependency, but another dependency remains.

Indeed, this patch cuts the dependency on both q->sysfs_lock and
q->debugfs_lock, but the elevator ->sysfs_lock isn't covered, :-(

Thanks,
Ming
On 2/10/25 7:12 AM, Ming Lei wrote:
> On Mon, Feb 10, 2025 at 8:52 AM Shinichiro Kawasaki
> <shinichiro.kawasaki@wdc.com> wrote:
>>
>> [...]
>> Ming, thank you for this quick action. I applied this series on top of the
>> v6.14-rc1 kernel and ran the block/002 test case. Unfortunately, it still
>> fails occasionally with the lockdep "WARNING: possible circular locking
>> dependency detected" below. Now debugfs_mutex is not reported as one of the
>> dependent locks, so I think this fix is working as expected. Instead,
>> eq->sysfs_lock creates a similar dependency. My guess is that this patch
>> avoids one dependency, but another dependency remains.
>
> Indeed, this patch cuts the dependency on both q->sysfs_lock and
> q->debugfs_lock,

Glad to see that with this patch we're able to cut the dependency between
q->sysfs_lock and q->debugfs_lock.

> but the elevator ->sysfs_lock isn't covered, :-(
>

I believe that shall be fixed by the effort currently under way here:
https://lore.kernel.org/all/20250205144506.663819-1-nilay@linux.ibm.com/

Thanks,
--Nilay
On Mon, Feb 10, 2025 at 12:31:19PM +0530, Nilay Shroff wrote:
>
> On 2/10/25 7:12 AM, Ming Lei wrote:
> > [...]
> > Indeed, this patch cuts the dependency on both q->sysfs_lock and
> > q->debugfs_lock,
> Glad to see that with this patch we're able to cut the dependency between
> q->sysfs_lock and q->debugfs_lock.
>
> > but the elevator ->sysfs_lock isn't covered, :-(
> >
> I believe that shall be fixed by the effort currently under way here:
> https://lore.kernel.org/all/20250205144506.663819-1-nilay@linux.ibm.com/

I guess it isn't; your patches don't cover elevator_queue->sysfs_lock...

Thanks,
Ming
On Feb 10, 2025 / 12:31, Nilay Shroff wrote:
>
> On 2/10/25 7:12 AM, Ming Lei wrote:
> > [...]
> > Indeed, this patch cuts the dependency on both q->sysfs_lock and
> > q->debugfs_lock,
> Glad to see that with this patch we're able to cut the dependency between
> q->sysfs_lock and q->debugfs_lock.
>
> > but the elevator ->sysfs_lock isn't covered, :-(
> >
> I believe that shall be fixed by the effort currently under way here:
> https://lore.kernel.org/all/20250205144506.663819-1-nilay@linux.ibm.com/

Thanks Nilay for the information. I applied your patch series on top of
v6.14-rc1 and Ming's, then ran block/002. Alas, the test case still fails
with the "WARNING: possible circular locking dependency detected".
The lockdep splat contents have changed as below:

[ 222.549099] [ T1041] run blktests block/002 at 2025-02-10 17:13:18
[ 222.887999] [ T1072] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[ 223.172864] [ T1073] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[ 223.173872] [ T1073] scsi host9: scsi_debug: version 0191 [20210520] dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[ 223.177251] [ T1073] scsi 9:0:0:0: Direct-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
[ 223.179393] [ C0] scsi 9:0:0:0: Power-on or device reset occurred
[ 223.185351] [ T1073] sd 9:0:0:0: Attached scsi generic sg3 type 0
[ 223.187514] [ T101] sd 9:0:0:0: [sdd] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
[ 223.189675] [ T101] sd 9:0:0:0: [sdd] Write Protect is off
[ 223.190146] [ T101] sd 9:0:0:0: [sdd] Mode Sense: 73 00 10 08
[ 223.192369] [ T101] sd 9:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 223.199999] [ T101] sd 9:0:0:0: [sdd] permanent stream count = 5
[ 223.203950] [ T101] sd 9:0:0:0: [sdd] Preferred minimum I/O size 512 bytes
[ 223.204568] [ T101] sd 9:0:0:0: [sdd] Optimal transfer size 524288 bytes
[ 223.245371] [ T101] sd 9:0:0:0: [sdd] Attached SCSI disk

[ 223.999340] [ T1097] ======================================================
[ 224.000044] [ T1097] WARNING: possible circular locking dependency detected
[ 224.000651] [ T1097] 6.14.0-rc1+ #255 Not tainted
[ 224.001109] [ T1097] ------------------------------------------------------
[ 224.001791] [ T1097] blktrace/1097 is trying to acquire lock:
[ 224.002220] [ T1097] ffff88814c53bde0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
[ 224.002887] [ T1097] but task is already holding lock:
[ 224.003423] [ T1097] ffff88810079dc68 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[ 224.004173] [ T1097] which lock already depends on the new lock.

[ 224.004934] [ T1097] the existing dependency chain (in reverse order) is:
[ 224.007356] [ T1097] -> #6 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[ 224.009859] [ T1097]        down_write+0x8d/0x200
[ 224.011109] [ T1097]        start_creating.part.0+0x82/0x230
[ 224.012465] [ T1097]        debugfs_create_dir+0x3a/0x4c0
[ 224.013793] [ T1097]        blk_mq_debugfs_register_rqos+0x1a5/0x3a0
[ 224.015101] [ T1097]        rq_qos_add+0x2f6/0x450
[ 224.016312] [ T1097]        wbt_init+0x359/0x490
[ 224.017507] [ T1097]        blk_register_queue+0x1c0/0x330
[ 224.018800] [ T1097]        add_disk_fwnode+0x6b1/0x1010
[ 224.020047] [ T1097]        sd_probe+0x94e/0xf30
[ 224.021235] [ T1097]        really_probe+0x1e3/0x8a0
[ 224.022402] [ T1097]        __driver_probe_device+0x18c/0x370
[ 224.023652] [ T1097]        driver_probe_device+0x4a/0x120
[ 224.024816] [ T1097]        __device_attach_driver+0x15e/0x270
[ 224.026038] [ T1097]        bus_for_each_drv+0x114/0x1a0
[ 224.027212] [ T1097]        __device_attach_async_helper+0x19c/0x240
[ 224.028476] [ T1097]        async_run_entry_fn+0x96/0x4f0
[ 224.029664] [ T1097]        process_one_work+0x85a/0x1460
[ 224.030798] [ T1097]        worker_thread+0x5e2/0xfc0
[ 224.032037] [ T1097]        kthread+0x39d/0x750
[ 224.033136] [ T1097]        ret_from_fork+0x30/0x70
[ 224.034232] [ T1097]        ret_from_fork_asm+0x1a/0x30
[ 224.035360] [ T1097] -> #5 (&q->rq_qos_mutex){+.+.}-{4:4}:
[ 224.037338] [ T1097]        __mutex_lock+0x1aa/0x1360
[ 224.038397] [ T1097]        wbt_init+0x343/0x490
[ 224.039497] [ T1097]        blk_register_queue+0x1c0/0x330
[ 224.040612] [ T1097]        add_disk_fwnode+0x6b1/0x1010
[ 224.041675] [ T1097]        sd_probe+0x94e/0xf30
[ 224.042662] [ T1097]        really_probe+0x1e3/0x8a0
[ 224.043672] [ T1097]        __driver_probe_device+0x18c/0x370
[ 224.044728] [ T1097]        driver_probe_device+0x4a/0x120
[ 224.045761] [ T1097]        __device_attach_driver+0x15e/0x270
[ 224.046818] [ T1097]        bus_for_each_drv+0x114/0x1a0
[ 224.047842] [ T1097]        __device_attach_async_helper+0x19c/0x240
[ 224.048876] [ T1097]        async_run_entry_fn+0x96/0x4f0
[ 224.049840] [ T1097]        process_one_work+0x85a/0x1460
[ 224.050794] [ T1097]        worker_thread+0x5e2/0xfc0
[ 224.051718] [ T1097]        kthread+0x39d/0x750
[ 224.052653] [ T1097]        ret_from_fork+0x30/0x70
[ 224.053548] [ T1097]        ret_from_fork_asm+0x1a/0x30
[ 224.054492] [ T1097] -> #4 (&q->sysfs_lock){+.+.}-{4:4}:
[ 224.056073] [ T1097]        __mutex_lock+0x1aa/0x1360
[ 224.056934] [ T1097]        elv_iosched_store+0xf2/0x500
[ 224.057809] [ T1097]        queue_attr_store+0x35a/0x480
[ 224.058680] [ T1097]        kernfs_fop_write_iter+0x39e/0x5a0
[ 224.059593] [ T1097]        vfs_write+0x5f9/0xe90
[ 224.060449] [ T1097]        ksys_write+0xf6/0x1c0
[ 224.061294] [ T1097]        do_syscall_64+0x93/0x180
[ 224.062153] [ T1097]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.063087] [ T1097] -> #3 (&q->q_usage_counter(io)#2){++++}-{0:0}:
[ 224.064675] [ T1097]        blk_mq_submit_bio+0x19b8/0x2070
[ 224.065562] [ T1097]        __submit_bio+0x36b/0x6d0
[ 224.066428] [ T1097]        submit_bio_noacct_nocheck+0x542/0xca0
[ 224.067371] [ T1097]        iomap_readahead+0x4c4/0x7e0
[ 224.068254] [ T1097]        read_pages+0x198/0xaf0
[ 224.069085] [ T1097]        page_cache_ra_order+0x4d3/0xb50
[ 224.069976] [ T1097]        filemap_get_pages+0x2c7/0x1850
[ 224.070862] [ T1097]        filemap_read+0x31d/0xc30
[ 224.071707] [ T1097]        xfs_file_buffered_read+0x1e9/0x2a0 [xfs]
[ 224.072927] [ T1097]        xfs_file_read_iter+0x25f/0x4a0 [xfs]
[ 224.074093] [ T1097]        vfs_read+0x6bc/0xa20
[ 224.074918] [ T1097]        ksys_read+0xf6/0x1c0
[ 224.075735] [ T1097]        do_syscall_64+0x93/0x180
[ 224.076576] [ T1097]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.077534] [ T1097] -> #2 (mapping.invalidate_lock#2){++++}-{4:4}:
[ 224.079118] [ T1097]        down_read+0x9b/0x470
[ 224.079950] [ T1097]        filemap_fault+0x231/0x2ac0
[ 224.080818] [ T1097]        __do_fault+0xf4/0x5d0
[ 224.081658] [ T1097]        do_fault+0x965/0x11d0
[ 224.082495] [ T1097]        __handle_mm_fault+0x1060/0x1f40
[ 224.083408] [ T1097]        handle_mm_fault+0x21a/0x6b0
[ 224.084297] [ T1097]        do_user_addr_fault+0x322/0xaa0
[ 224.085205] [ T1097]        exc_page_fault+0x7a/0x110
[ 224.086061] [ T1097]        asm_exc_page_fault+0x22/0x30
[ 224.086941] [ T1097] -> #1 (&vma->vm_lock->lock){++++}-{4:4}:
[ 224.088495] [ T1097]        down_write+0x8d/0x200
[ 224.089353] [ T1097]        vma_link+0x1ff/0x590
[ 224.090200] [ T1097]        insert_vm_struct+0x15a/0x340
[ 224.091072] [ T1097]        alloc_bprm+0x626/0xbf0
[ 224.091909] [ T1097]        kernel_execve+0x85/0x2f0
[ 224.092754] [ T1097]        call_usermodehelper_exec_async+0x21b/0x430
[ 224.093738] [ T1097]        ret_from_fork+0x30/0x70
[ 224.094587] [ T1097]        ret_from_fork_asm+0x1a/0x30
[ 224.095541] [ T1097] -> #0 (&mm->mmap_lock){++++}-{4:4}:
[ 224.097077] [ T1097]        __lock_acquire+0x2f52/0x5ea0
[ 224.097963] [ T1097]        lock_acquire+0x1b1/0x540
[ 224.098818] [ T1097]        __might_fault+0xb9/0x120
[ 224.099672] [ T1097]        _copy_to_user+0x1e/0x80
[ 224.100519] [ T1097]        relay_file_read+0x149/0x8a0
[ 224.101408] [ T1097]        full_proxy_read+0x110/0x1d0
[ 224.102297] [ T1097]        vfs_read+0x1bb/0xa20
[ 224.103115] [ T1097]        ksys_read+0xf6/0x1c0
[ 224.103937] [ T1097]        do_syscall_64+0x93/0x180
[ 224.104782] [ T1097]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.105726] [ T1097] other info that might help us debug this:

[ 224.107884] [ T1097] Chain exists of: &mm->mmap_lock --> &q->rq_qos_mutex --> &sb->s_type->i_mutex_key#3

[ 224.110335] [ T1097] Possible unsafe locking scenario:

[ 224.111856] [ T1097]        CPU0                    CPU1
[ 224.112723] [ T1097]        ----                    ----
[ 224.113588] [ T1097]   lock(&sb->s_type->i_mutex_key#3);
[ 224.114492] [ T1097]                                lock(&q->rq_qos_mutex);
[ 224.115499] [ T1097]                                lock(&sb->s_type->i_mutex_key#3);
[ 224.116609] [ T1097]   rlock(&mm->mmap_lock);

[ 224.117440] [ T1097] *** DEADLOCK ***

[ 224.119479] [ T1097] 2 locks held by blktrace/1097:
[ 224.120325] [ T1097] #0: ffff88812e5b80f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x233/0x2f0
[ 224.121474] [ T1097] #1: ffff88810079dc68 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0

[ 224.122741] [ T1097] stack backtrace:
[ 224.124186] [ T1097] CPU: 0 UID: 0 PID: 1097 Comm: blktrace Not tainted 6.14.0-rc1+ #255
[ 224.124189] [ T1097] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 224.124193] [ T1097] Call Trace:
[ 224.124195] [ T1097]  <TASK>
[ 224.124197] [ T1097]  dump_stack_lvl+0x6a/0x90
[ 224.124201] [ T1097]  print_circular_bug.cold+0x1e0/0x274
[ 224.124205] [ T1097]  check_noncircular+0x306/0x3f0
[ 224.124209] [ T1097]  ? __pfx_check_noncircular+0x10/0x10
[ 224.124213] [ T1097]  ? lockdep_lock+0xca/0x1c0
[ 224.124215] [ T1097]  ? __pfx_lockdep_lock+0x10/0x10
[ 224.124219] [ T1097]  __lock_acquire+0x2f52/0x5ea0
[ 224.124224] [ T1097]  ? find_held_lock+0x2d/0x110
[ 224.124226] [ T1097]  ? __pfx___lock_acquire+0x10/0x10
[ 224.124230] [ T1097]  ? lock_acquire+0x1b1/0x540
[ 224.124233] [ T1097]  lock_acquire+0x1b1/0x540
[ 224.124236] [ T1097]  ? __might_fault+0x99/0x120
[ 224.124239] [ T1097]  ? __pfx_lock_acquire+0x10/0x10
[ 224.124242] [ T1097]  ? lock_is_held_type+0xd5/0x130
[ 224.124245] [ T1097]  ? __pfx___might_resched+0x10/0x10
[ 224.124248] [ T1097]  ? _raw_spin_unlock+0x29/0x50
[ 224.124250] [ T1097]  ? __might_fault+0x99/0x120
[ 224.124252] [ T1097]  __might_fault+0xb9/0x120
[ 224.124254] [ T1097]  ? __might_fault+0x99/0x120
[ 224.124256] [ T1097]  ? __check_object_size+0x3f3/0x540
[ 224.124258] [ T1097]  _copy_to_user+0x1e/0x80
[ 224.124261] [ T1097]  relay_file_read+0x149/0x8a0
[ 224.124266] [ T1097]  ? selinux_file_permission+0x36d/0x420
[ 224.124270] [ T1097]  full_proxy_read+0x110/0x1d0
[ 224.124272] [ T1097]  ? rw_verify_area+0x2f7/0x520
[ 224.124274] [ T1097]  vfs_read+0x1bb/0xa20
[ 224.124277] [ T1097]  ? __pfx___mutex_lock+0x10/0x10
[ 224.124279] [ T1097]  ? __pfx_lock_release+0x10/0x10
[ 224.124282] [ T1097]  ? __pfx_vfs_read+0x10/0x10
[ 224.124286] [ T1097]  ? __fget_files+0x1ae/0x2e0
[ 224.124290] [ T1097]  ksys_read+0xf6/0x1c0
[ 224.124293] [ T1097]  ? __pfx_ksys_read+0x10/0x10
[ 224.124297] [ T1097]  do_syscall_64+0x93/0x180
[ 224.124299] [ T1097]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
[ 224.124301] [ T1097]  ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[ 224.124305] [ T1097]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
[ 224.124306] [ T1097]  ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[ 224.124311] [ T1097]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 224.124313] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124315] [ T1097]  ? lockdep_hardirqs_on+0x78/0x100
[ 224.124318] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124319] [ T1097]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 224.124322] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124324] [ T1097]  ? lockdep_hardirqs_on+0x78/0x100
[ 224.124326] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124328] [ T1097]  ? lockdep_hardirqs_on+0x78/0x100
[ 224.124330] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124333] [ T1097]  ? handle_mm_fault+0x39d/0x6b0
[ 224.124337] [ T1097]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 224.124341] [ T1097]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.124344] [ T1097] RIP: 0033:0x7fdbd297be4a
[ 224.124348] [ T1097] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 28 58 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 82 58 f8 ff 48 8b
[ 224.124350] [ T1097] RSP: 002b:00007fdbd2869d80 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 224.124354] [ T1097] RAX: ffffffffffffffda RBX: 00007fdbcc000bb0 RCX: 00007fdbd297be4a
[ 224.124356] [ T1097] RDX: 0000000000080000 RSI: 00007fdbc3000000 RDI: 0000000000000004
[ 224.124357] [ T1097] RBP: 00007fdbd2869da0 R08: 0000000000000000 R09: 0000000000000000
[ 224.124358] [ T1097] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 224.124360] [ T1097] R13: 000055e34ca547e0 R14: 0000000000000000 R15: 00007fdbcc002c90
[ 224.124364] [ T1097]  </TASK>
[ 224.220675] [ T1041] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
On 2/10/25 1:55 PM, Shinichiro Kawasaki wrote:
> On Feb 10, 2025 / 12:31, Nilay Shroff wrote:
>> [...]
>> I believe that shall be fixed by the effort currently under way here:
>> https://lore.kernel.org/all/20250205144506.663819-1-nilay@linux.ibm.com/
>
> Thanks Nilay for the information. I applied your patch series on top of
> v6.14-rc1 and Ming's, then ran block/002. Alas, the test case still fails
> with the "WARNING: possible circular locking dependency detected". The
> lockdep splat contents have changed as below:
>

Thanks Shinichiro for trying out the patch! However, this patchset is not
complete yet. I have to work on a few suggestions from Christoph and Ming,
and then I will re-post the patchset. We're planning to replace q->sysfs_lock
with a new dedicated lock for the elevator switch case, and that should help
cut the dependency between q->q_usage_counter(io) and q->sysfs_lock. So I
think you need to wait for some time before the patchset is ready.

Thanks,
--Nilay
On 2/10/25 6:22 AM, Shinichiro Kawasaki wrote:
> On Feb 09, 2025 / 20:20, Ming Lei wrote:
>> All block-internal state for adding/removing debugfs entries has been
>> removed, and debugfs can synchronize everything for us at the fs level,
>> so don't grab q->debugfs_mutex for adding/removing block-internal
>> debugfs entries.
>> [...]
>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [...]
>
> [ 115.085704] [ T1023] run blktests block/002 at 2025-02-10 09:22:22
> [...]
>
> [ 116.464802] [ T1079] ======================================================
> [ 116.465540] [ T1079] WARNING: possible circular locking dependency detected
> [ 116.466107] [ T1079] 6.14.0-rc1+ #253 Not tainted
> [ 116.466581] [ T1079] ------------------------------------------------------
> [ 116.467141] [ T1079] blktrace/1079 is trying to acquire lock:
> [ 116.467708] [ T1079] ffff88810539d1e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
> [ 116.468439] [ T1079] but task is already holding lock:
> [ 116.469052] [ T1079] ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
> [ 116.469901] [ T1079] which lock already depends on the new lock.
>
> [ 116.470762] [ T1079] the existing dependency chain (in reverse order) is:
> [ 116.473187] [ T1079] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
> [ 116.475670] [ T1079]        down_read+0x9b/0x470
> [ 116.477001] [ T1079]        lookup_one_unlocked+0xe9/0x120
> [ 116.478333] [ T1079]        lookup_positive_unlocked+0x1d/0x90
> [ 116.479648] [ T1079]        debugfs_lookup+0x47/0xa0
> [ 116.480833] [ T1079]        blk_mq_debugfs_unregister_sched_hctx+0x23/0x50
> [ 116.482215] [ T1079]        blk_mq_exit_sched+0xb6/0x2b0
> [ 116.483466] [ T1079]        elevator_switch+0x12a/0x4b0
> [ 116.484676] [ T1079]        elv_iosched_store+0x29f/0x380
> [ 116.485841] [ T1079]        queue_attr_store+0x313/0x480
> [ 116.487078] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
> [ 116.488358] [ T1079]        vfs_write+0x5f9/0xe90
> [ 116.489460] [ T1079]        ksys_write+0xf6/0x1c0
> [ 116.490582] [ T1079]        do_syscall_64+0x93/0x180
> [ 116.491694] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 116.492996] [ T1079] -> #4 (&eq->sysfs_lock){+.+.}-{4:4}:
> [ 116.495135] [ T1079]        __mutex_lock+0x1aa/0x1360
> [ 116.496363] [ T1079]        elevator_switch+0x11f/0x4b0
> [ 116.497499] [ T1079]        elv_iosched_store+0x29f/0x380
> [ 116.498660] [ T1079]        queue_attr_store+0x313/0x480
> [ 116.499752] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
> [ 116.500884] [ T1079]        vfs_write+0x5f9/0xe90
> [ 116.501964] [ T1079]        ksys_write+0xf6/0x1c0
> [ 116.503056] [ T1079]        do_syscall_64+0x93/0x180
> [ 116.504194] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 116.505356] [ T1079]

In the above dependency chain, I see that thread #5 is waiting on
&sb->s_type->i_mutex_key after acquiring q->sysfs_lock and eq->sysfs_lock,
and thread #4 would have acquired q->sysfs_lock and is now pending on
eq->sysfs_lock. But then how is it possible that two threads are able to
acquire q->sysfs_lock at the same time (assuming this is for the same
request_queue)? Is this a false-positive report from lockdep? Or am I
missing something?

Thanks,
--Nilay
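One way to read such reports (an editorial aside, using a userspace analogy
rather than kernel code): a lockdep-style validator records the ordering of
lock classes over the whole run, not simultaneous holders. Chains #5 and #4
can therefore come from the same task at different times; a cycle in the
recorded class ordering is enough to trigger the warning, even if no two
threads ever hold the same lock concurrently:

#include <pthread.h>

/*
 * Userspace analogy only. If path_one() ever runs A -> B and path_two()
 * ever runs B -> A -- even at different times, never concurrently --
 * the recorded class ordering already contains a cycle, and a
 * lockdep-style checker reports a possible deadlock.
 */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void path_one(void)		/* records the ordering A -> B */
{
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
}

void path_two(void)		/* records B -> A: cycle detected */
{
	pthread_mutex_lock(&lock_b);
	pthread_mutex_lock(&lock_a);
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
}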
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6d98c2a6e7c6..9601823730e2 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -770,8 +770,6 @@ void blk_mq_debugfs_register_sched(struct request_queue *q)
 	struct elevator_type *e = q->elevator->type;
 	struct dentry *sched_dir;
 
-	lockdep_assert_held(&q->debugfs_mutex);
-
 	/*
 	 * If the parent directory has not been created yet, return, we will be
 	 * called again later on and the directory/files will be created then.
@@ -793,8 +791,6 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
 {
 	struct dentry *queue_dir = blk_mq_get_queue_entry(q);
 
-	lockdep_assert_held(&q->debugfs_mutex);
-
 	if (IS_ERR_OR_NULL(queue_dir))
 		return;
 	debugfs_lookup_and_remove("sched", queue_dir);
@@ -832,8 +828,6 @@ void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 	struct dentry *queue_dir = blk_mq_get_queue_entry(q);
 	struct dentry *rqos_top;
 
-	lockdep_assert_held(&q->debugfs_mutex);
-
 	if (IS_ERR_OR_NULL(queue_dir))
 		return;
 
@@ -853,8 +847,6 @@ void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
 	struct dentry *rqos_top;
 	struct dentry *rqos_dir;
 
-	lockdep_assert_held(&q->debugfs_mutex);
-
 	if (!rqos->ops->debugfs_attrs)
 		return;
 
@@ -874,8 +866,6 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
 	struct elevator_type *e = q->elevator->type;
 	struct dentry *sched_dir;
 
-	lockdep_assert_held(&q->debugfs_mutex);
-
 	/*
 	 * If the parent debugfs directory has not been created yet, return;
 	 * We will be called again later on with appropriate parent debugfs
@@ -897,8 +887,6 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	struct dentry *sched_dir;
 
-	lockdep_assert_held(&hctx->queue->debugfs_mutex);
-
 	sched_dir = blk_mq_get_hctx_sched_entry(hctx);
 	if (sched_dir) {
 		debugfs_remove_recursive(sched_dir);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 7442ca27c2bf..1ca127297b2c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -469,9 +469,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err_free_map_and_rqs;
 
-	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_register_sched(q);
-	mutex_unlock(&q->debugfs_mutex);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
@@ -484,9 +482,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 				return ret;
 			}
 		}
-		mutex_lock(&q->debugfs_mutex);
 		blk_mq_debugfs_register_sched_hctx(q, hctx);
-		mutex_unlock(&q->debugfs_mutex);
 	}
 
 	return 0;
@@ -527,9 +523,7 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	unsigned int flags = 0;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		mutex_lock(&q->debugfs_mutex);
 		blk_mq_debugfs_unregister_sched_hctx(hctx);
-		mutex_unlock(&q->debugfs_mutex);
 
 		if (e->type->ops.exit_hctx && hctx->sched_data) {
 			e->type->ops.exit_hctx(hctx, i);
@@ -538,9 +532,7 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 		flags = hctx->flags;
 	}
 
-	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_unregister_sched(q);
-	mutex_unlock(&q->debugfs_mutex);
 
 	if (e->type->ops.exit_sched)
 		e->type->ops.exit_sched(e);
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index d4d4f4dc0e23..529640ed2ff5 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -320,11 +320,8 @@ int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk, enum rq_qos_id id,
 
 	blk_mq_unfreeze_queue(q, memflags);
 
-	if (rqos->ops->debugfs_attrs) {
-		mutex_lock(&q->debugfs_mutex);
+	if (rqos->ops->debugfs_attrs)
 		blk_mq_debugfs_register_rqos(rqos);
-		mutex_unlock(&q->debugfs_mutex);
-	}
 
 	return 0;
 ebusy:
@@ -349,7 +346,5 @@ void rq_qos_del(struct rq_qos *rqos)
 	}
 	blk_mq_unfreeze_queue(q, memflags);
 
-	mutex_lock(&q->debugfs_mutex);
 	blk_mq_debugfs_unregister_rqos(rqos);
-	mutex_unlock(&q->debugfs_mutex);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 68bd84e06aac..b0bfb4c82e0e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -744,10 +744,8 @@ static void blk_debugfs_remove(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 
-	mutex_lock(&q->debugfs_mutex);
 	blk_trace_shutdown(q);
 	debugfs_lookup_and_remove(disk->disk_name, blk_debugfs_root);
-	mutex_unlock(&q->debugfs_mutex);
 }
 
 /**
@@ -769,14 +767,12 @@ int blk_register_queue(struct gendisk *disk)
 		if (ret)
 			goto out_put_queue_kobj;
 	}
-	mutex_lock(&q->sysfs_lock);
 
-	mutex_lock(&q->debugfs_mutex);
 	debugfs_create_dir(disk->disk_name, blk_debugfs_root);
 	if (queue_is_mq(q))
 		blk_mq_debugfs_register(q);
-	mutex_unlock(&q->debugfs_mutex);
+	mutex_lock(&q->sysfs_lock);
 
 	ret = disk_register_independent_access_ranges(disk);
 	if (ret)
 		goto out_debugfs_remove;
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 32cda8f5d008..13efd48adcc3 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -775,9 +775,11 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
  **/
 void blk_trace_shutdown(struct request_queue *q)
 {
+	mutex_lock(&q->debugfs_mutex);
 	if (rcu_dereference_protected(q->blk_trace,
 				      lockdep_is_held(&q->debugfs_mutex)))
 		__blk_trace_remove(q);
+	mutex_unlock(&q->debugfs_mutex);
 }
 
 #ifdef CONFIG_BLK_CGROUP
All block-internal state for adding/removing debugfs entries has been
removed, and debugfs can synchronize everything for us at the fs level,
so don't grab q->debugfs_mutex for adding/removing block-internal
debugfs entries.

Now q->debugfs_mutex is only used for blktrace; meanwhile, move the code
that creates the queue debugfs dir out of q->sysfs_lock. Both locks are
connected with the queue freeze I/O lock, so this cuts the lock chain
between the queue freeze I/O lock and the debugfs lock.

The following lockdep report can be fixed:

https://lore.kernel.org/linux-block/ougniadskhks7uyxguxihgeuh2pv4yaqv4q3emo4gwuolgzdt6@brotly74p6bs/

These are the contexts which add/remove debugfs entries:

- update nr_hw_queues

- add/remove disks

- elevator switch

- blktrace

blktrace only adds entries under the disk top directory, so we can
ignore it, because it can only work once the disk has been added. Also,
nothing overlaps with the other two contexts, so the blktrace context
is fine.

Elevator switch is only allowed after the disk is added, so there is no
race with add/remove disk. blk_mq_update_nr_hw_queues() always restores
the previous elevator, so there is no race between these two either.
The elevator switch context is fine.

So far blk_mq_update_nr_hw_queues() doesn't hold the debugfs lock for
adding/removing hctx entries, so there might be a race with add/remove
disk, which is just fine in reality:

- blk_mq_update_nr_hw_queues() is usually for error recovery, and the
  disk won't be added/removed at the same time

- even if there is a race between the two contexts, it is just fine,
  since the hctx won't be freed until the queue is dead

- we have never seen reports in this area despite not holding the
  debugfs lock in blk_mq_update_nr_hw_queues()

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c  | 12 ------------
 block/blk-mq-sched.c    |  8 --------
 block/blk-rq-qos.c      |  7 +------
 block/blk-sysfs.c       |  6 +-----
 kernel/trace/blktrace.c |  2 ++
 5 files changed, 4 insertions(+), 31 deletions(-)
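As a small illustration of the "debugfs can synchronize everything for us
at the fs level" point, here is a minimal sketch of the create/remove
pairing the series relies on; the entry name and parent dentry are
placeholders:

#include <linux/debugfs.h>

/*
 * Sketch only: creation and removal of the same debugfs entry serialize
 * on the parent directory's VFS inode locks inside these helpers, so no
 * extra block-layer mutex is needed purely to keep create and remove of
 * the same entry from racing.
 */
static void example_register(struct dentry *parent)
{
	/* creates "example" under parent */
	debugfs_create_dir("example", parent);
}

static void example_unregister(struct dentry *parent)
{
	/* looks up "example" under the parent's inode lock and removes it */
	debugfs_lookup_and_remove("example", parent);
}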