Message ID | 20220120105248.117025-3-dwagner@suse.de (mailing list archive)
---|---
State | New, archived
Series | fix crash by reading /proc/diskstats
On 1/20/22 11:52 AM, Daniel Wagner wrote:
> The request queues can be freed while we operate on them. Make sure we
> hold a reference when using blk_mq_queue_tag_busy_iter.
>
> RIP: blk_mq_queue_tag_busy_iter
> Call Trace:
>   ? blk_mq_hctx_mark_pending
>   ? diskstats_show
>   ? blk_mq_hctx_mark_pending
>   blk_mq_in_flight
>   diskstats_show
>   ? klist_next
>   seq_read
>   proc_reg_read
>   vfs_read
>   ksys_read
>   do_syscall_64
>   entry_SYSCALL_64_after_hwframe
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  block/genhd.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index c9d4386dd177..0e163055a4e6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
>  		if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
>  			continue;
>  		part_stat_read_all(hd, &stat);
> +		if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
> +			continue;
>  		if (queue_is_mq(gp->queue))
>  			inflight = blk_mq_in_flight(gp->queue, hd);
>  		else
>  			inflight = part_in_flight(hd);
> +		blk_queue_exit(gp->queue);
>
>  		seq_printf(seqf, "%4d %7d %pg "
>  			   "%lu %lu %lu %u "

While this looks good, it seems to have been cut against an earlier
branch; the code in question looks different in my copy.

Cheers,

Hannes
On Thu, Jan 20, 2022 at 12:18:24PM +0100, Hannes Reinecke wrote:
> On 1/20/22 11:52 AM, Daniel Wagner wrote:
> While this looks good, it seems to have been cut against an earlier
> branch; the code in question looks different in my copy.

This was cut against linux-block/for-next. Let me know which tree/branch
I should use.
On Thu, Jan 20, 2022 at 11:52:48AM +0100, Daniel Wagner wrote:
> The request queues can be freed while we operate on them. Make sure we

It shouldn't be freed here.

> hold a reference when using blk_mq_queue_tag_busy_iter.
>
> RIP: blk_mq_queue_tag_busy_iter

Any chance to post the whole stack trace?

> Call Trace:
>   ? blk_mq_hctx_mark_pending
>   ? diskstats_show
>   ? blk_mq_hctx_mark_pending
>   blk_mq_in_flight
>   diskstats_show
>   ? klist_next
>   seq_read
>   proc_reg_read
>   vfs_read
>   ksys_read
>   do_syscall_64
>   entry_SYSCALL_64_after_hwframe
>
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  block/genhd.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index c9d4386dd177..0e163055a4e6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
>  		if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
>  			continue;
>  		part_stat_read_all(hd, &stat);
> +		if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
> +			continue;

The disk device's refcnt is supposed to be grabbed by the class device
iterator code, so the queue shouldn't be released in diskstats_show().
Not to mention blk_queue_enter() won't prevent the disk from being
released...

thanks,
Ming
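For context: the "class device iterator code" Ming refers to is the
seq_file machinery behind /proc/diskstats. Below is a minimal sketch,
loosely paraphrased from block/genhd.c of that era — not verbatim kernel
source, and details differ between versions:

/*
 * class_dev_iter_next() takes a reference on each device it returns,
 * which is what pins the gendisk -- and hence keeps gp->queue's memory
 * valid -- while diskstats_show() runs for that disk.
 */
static void *disk_seqf_next(struct seq_file *seqf, void *v, loff_t *pos)
{
	struct class_dev_iter *iter = seqf->private;
	struct device *dev;

	(*pos)++;
	dev = class_dev_iter_next(iter);	/* grabs a ref on the device */
	if (dev)
		return dev_to_disk(dev);

	return NULL;
}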
Hi Ming,

On Thu, Jan 20, 2022 at 08:56:03PM +0800, Ming Lei wrote:
> On Thu, Jan 20, 2022 at 11:52:48AM +0100, Daniel Wagner wrote:
> > The request queues can be freed while we operate on them. Make sure we
>
> It shouldn't be freed here.

No, that is not what I tried to say. We somehow end up here operating on
pointers which are invalid.

> > hold a reference when using blk_mq_queue_tag_busy_iter.
> >
> > RIP: blk_mq_queue_tag_busy_iter
>
> Any chance to post the whole stack trace?

[323465.820735] nvme nvme3: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 100.102.253.10:8009
[323465.821654] nvme nvme3: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[323465.850291] nvme nvme3: creating 80 I/O queues.
[323465.974196] nvme nvme3: mapped 80/0/0 default/read/poll queues.
[323466.031107] nvme nvme3: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.251.11:4420
[323466.093520] nvme nvme4: creating 80 I/O queues.
[323466.131740] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[323466.194742] nvme nvme4: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.252.12:4420
[323466.207338] nvme nvme5: creating 80 I/O queues.
[323466.219473] nvme nvme5: mapped 80/0/0 default/read/poll queues.
[323466.256877] nvme nvme5: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.253.12:4420
[323466.295990] nvme nvme6: creating 80 I/O queues.
[323466.308171] nvme nvme6: mapped 80/0/0 default/read/poll queues.
[323466.359055] nvme nvme6: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.251.12:4420
[323466.366912] nvme nvme7: creating 80 I/O queues.
[323466.379132] nvme nvme7: mapped 80/0/0 default/read/poll queues.
[323466.432957] nvme nvme7: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.252.11:4420
[323466.458548] nvme nvme8: creating 80 I/O queues.
[323466.474414] nvme nvme8: mapped 80/0/0 default/read/poll queues.
[323466.548254] nvme nvme8: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.253.11:4420
[323467.255527] BUG: kernel NULL pointer dereference, address: 0000000000000000
[323467.255538] #PF: supervisor read access in kernel mode
[323467.255541] #PF: error_code(0x0000) - not-present page
[323467.255544] PGD 0 P4D 0
[323467.255550] Oops: 0000 [#1] SMP PTI
[323467.255555] CPU: 13 PID: 17640 Comm: iostat Kdump: loaded Tainted: G IOE X 5.3.18-59.27-default #1 SLE15-SP3
[323467.255559] Hardware name: Dell Inc. PowerEdge R940xa/08XR9M, BIOS 2.12.2 07/12/2021
[323467.255569] RIP: 0010:blk_mq_queue_tag_busy_iter+0x1e4/0x2e0
[323467.255573] Code: 89 44 24 48 41 8b 1f 85 db 0f 84 d1 fe ff ff 4c 89 6c 24 20 4c 89 74 24 28 89 e8 44 29 e3 48 8d 04 40 48 c1 e0 06 49 03 47 10 <48> 8b 10 39 d3 0f 43 da 48 8b 90 80 00 00 00 41 01 dc 48 f7 d2 48
[323467.255583] RSP: 0018:ffffb701aefb7c20 EFLAGS: 00010246
[323467.255587] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[323467.255592] RDX: ffff954f43c14c00 RSI: ffffffffbaa442c0 RDI: ffff954fb3749010
[323467.255597] RBP: 0000000000000000 R08: 0000000800000000 R09: 00000008ffffffff
[323467.255602] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[323467.255607] R13: ffff9567b2200000 R14: ffff954798c09bc0 R15: ffff954798c09c20
[323467.255612] FS: 00007f6032444580(0000) GS:ffff956fbf6c0000(0000) knlGS:0000000000000000
[323467.255616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[323467.255618] CR2: 0000000000000000 CR3: 00000037bc7c0001 CR4: 00000000007726e0
[323467.255621] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[323467.255624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[323467.255626] PKRU: 55555554
[323467.255628] Call Trace:
[323467.255641]  ? blk_mq_hctx_mark_pending+0x70/0x70
[323467.255647]  ? diskstats_show+0x207/0x2a0
[323467.255651]  ? blk_mq_hctx_mark_pending+0x70/0x70
[323467.255654]  blk_mq_in_flight+0x35/0x60
[323467.255659]  diskstats_show+0x6d/0x2a0
[323467.255669]  ? klist_next+0x88/0x110
[323467.255676]  seq_read+0x2a6/0x3f0
[323467.255682]  proc_reg_read+0x39/0x60
[323467.255690]  vfs_read+0x89/0x140
[323467.255693]  ksys_read+0xa1/0xe0
[323467.255702]  do_syscall_64+0x5b/0x1e0
[323467.255709]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

> The disk device's refcnt is supposed to be grabbed by the class device
> iterator code, so the queue shouldn't be released in diskstats_show().

Okay, but it's not the disk device which is invalid as far as I
understand. It's the request_queue which is bogus.

> Not to mention blk_queue_enter() won't prevent the disk from being
> released...

Understood.

Daniel
On Thu, Jan 20, 2022 at 02:19:36PM +0100, Daniel Wagner wrote:
> Hi Ming,
>
> On Thu, Jan 20, 2022 at 08:56:03PM +0800, Ming Lei wrote:
> [323467.255527] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [323467.255538] #PF: supervisor read access in kernel mode
> [323467.255541] #PF: error_code(0x0000) - not-present page
> [323467.255544] PGD 0 P4D 0
> [323467.255550] Oops: 0000 [#1] SMP PTI
> [323467.255555] CPU: 13 PID: 17640 Comm: iostat Kdump: loaded Tainted: G IOE X 5.3.18-59.27-default #1 SLE15-SP3
> [323467.255559] Hardware name: Dell Inc. PowerEdge R940xa/08XR9M, BIOS 2.12.2 07/12/2021
> [323467.255569] RIP: 0010:blk_mq_queue_tag_busy_iter+0x1e4/0x2e0

Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
in the source code? And what is the NULL pointer?

With this kind of info, we may know the root cause.

Thanks,
Ming
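(One way to answer this kind of question, assuming a vmlinux with
matching debug info is at hand, is the kernel's scripts/faddr2line
helper, e.g. `./scripts/faddr2line vmlinux
blk_mq_queue_tag_busy_iter+0x1e4/0x2e0`; a crash session's `dis -l` on
the faulting address, which appears to be what produced the listing
below, works as well.)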
On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> in the source code? And what is the NULL pointer?

Here we go:

/usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>:	mov    (%rax),%rdx

234 static inline void __sbitmap_for_each_set(struct sbitmap *sb,
235                                           unsigned int start,
236                                           sb_for_each_fn fn, void *data)
237 {
238         unsigned int index;
239         unsigned int nr;
240         unsigned int scanned = 0;
241
242         if (start >= sb->depth)
243                 start = 0;
244         index = SB_NR_TO_INDEX(sb, start);
245         nr = SB_NR_TO_BIT(sb, start);
246
247         while (scanned < sb->depth) {
248                 unsigned long word;
249                 unsigned int depth = min_t(unsigned int,
250                                            sb->map[index].depth - nr,
251                                            sb->depth - scanned);
On Thu, Jan 20, 2022 at 03:01:27PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?

Yes, %rax is NULL.

> Here we go:
>
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
> 0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>:	mov    (%rax),%rdx
>
> 234 static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 235                                           unsigned int start,
> 236                                           sb_for_each_fn fn, void *data)
> 237 {
> 238         unsigned int index;
> 239         unsigned int nr;
> 240         unsigned int scanned = 0;
> 241
> 242         if (start >= sb->depth)
> 243                 start = 0;
> 244         index = SB_NR_TO_INDEX(sb, start);
> 245         nr = SB_NR_TO_BIT(sb, start);
> 246
> 247         while (scanned < sb->depth) {
> 248                 unsigned long word;
> 249                 unsigned int depth = min_t(unsigned int,
> 250                                            sb->map[index].depth - nr,
> 251                                            sb->depth - scanned);

Forgot to add this:

crash> bt
PID: 17640  TASK: ffff956f4a468000  CPU: 13  COMMAND: "iostat"
 #0 [ffffb701aefb7980] machine_kexec at ffffffffba66fb91
 #1 [ffffb701aefb79d8] __crash_kexec at ffffffffba75927d
 #2 [ffffb701aefb7aa0] crash_kexec at ffffffffba75a13d
 #3 [ffffb701aefb7ab8] oops_end at ffffffffba636cdf
 #4 [ffffb701aefb7ad8] no_context at ffffffffba682baf
 #5 [ffffb701aefb7b40] do_page_fault at ffffffffba683e30
 #6 [ffffb701aefb7b70] page_fault at ffffffffbb0012fe
    [exception RIP: blk_mq_queue_tag_busy_iter+484]
    RIP: ffffffffbaa4a254  RSP: ffffb701aefb7c20  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0000000000000001  RCX: 0000000000000000
    RDX: ffff954f43c14c00  RSI: ffffffffbaa442c0  RDI: ffff954fb3749010
    RBP: 0000000000000000  R8:  0000000800000000  R9:  00000008ffffffff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff9567b2200000  R14: ffff954798c09bc0  R15: ffff954798c09c20
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb701aefb7cb0] blk_mq_in_flight at ffffffffbaa461e5
 #8 [ffffb701aefb7cd0] diskstats_show at ffffffffbaa4f00d
 #9 [ffffb701aefb7e50] seq_read at ffffffffba90df16
#10 [ffffb701aefb7eb0] proc_reg_read at ffffffffba96d789
#11 [ffffb701aefb7ec8] vfs_read at ffffffffba8e4c39
#12 [ffffb701aefb7ef8] ksys_read at ffffffffba8e4fc1
#13 [ffffb701aefb7f38] do_syscall_64 at ffffffffba60538b
#14 [ffffb701aefb7f50] entry_SYSCALL_64_after_hwframe at ffffffffbb00008c
    RIP: 00007f6031f4eb5e  RSP: 00007ffd187a7e88  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00005577a698c2c0  RCX: 00007f6031f4eb5e
    RDX: 0000000000000400  RSI: 00005577a698f480  RDI: 0000000000000003
    RBP: 00007f603222e300  R8:  0000000000000000  R9:  0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000000000000a
    R13: 0000000000000d68  R14: 00007f603222d700  R15: 0000000000000d68
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

I've tried to figure out the request_queue pointer from the registers
and I think the pointer is still in %rdi:

struct request_queue {
  last_merge = 0x0,
  elevator = 0x0,
  stats = 0xffff956f45a9bec0,
  rq_qos = 0xffff954f54c57558,
  mq_ops = 0xffffffffc0c27140 <nvme_tcp_mq_ops>,
  queue_ctx = 0x4151cf2266c0,
  queue_depth = 0x0,
  queue_hw_ctx = 0xffff954f43c14c00,
  nr_hw_queues = 0x50,
  backing_dev_info = 0xffff953fae3ae800,
  queuedata = 0xffff953622282800,
  queue_flags = 0x5041d0,
  pm_only = { counter = 0x0 },
  id = 0x61,
  bounce_gfp = 0x0,
  queue_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } },
  kobj = { name = 0xffffffffbb7846be "queue", entry = { next = 0xffff954fb3749088, prev = 0xffff954fb3749088 }, parent = 0xffff95af629a58a0, kset = 0x0, ktype = 0xffffffffbbb56540, sd = 0xffff956563ecd580, kref = { refcount = { refs = { counter = 0x2 } } }, state_initialized = 0x1, state_in_sysfs = 0x1, state_add_uevent_sent = 0x0, state_remove_uevent_sent = 0x0, uevent_suppress = 0x0 },
  mq_kobj = 0xffff95720ca05660,
  integrity = { profile = 0x0, flags = 0x0, tuple_size = 0x0, interval_exp = 0x0, tag_size = 0x0 },
  dev = 0x0,
  rpm_status = RPM_ACTIVE,
  nr_pending = 0x0,
  nr_requests = 0x80,
  dma_pad_mask = 0x0,
  dma_alignment = 0x7,
  ksm = 0x0,
  rq_timeout = 0x1d4c,
  poll_nsec = 0xffffffff,
  poll_cb = 0xffff953443cafa00,
  poll_stat = {{ mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 },
               { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }, { mean = 0x0, min = 0x0, max = 0x0, nr_samples = 0x0, batch = 0x0 }},
  timeout = { entry = { next = 0x0, pprev = 0x0 }, expires = 0x0, function = 0xffffffffbaa39440, flags = 0x2c },
  timeout_work = { data = { counter = 0xfffffffe0 }, entry = { next = 0xffff954fb37493c0, prev = 0xffff954fb37493c0 }, func = 0xffffffffbaa454f0 },
  icq_list = { next = 0xffff954fb37493d8, prev = 0xffff954fb37493d8 },
  blkcg_pols = {0x3},
  root_blkg = 0xffff953622281c00,
  blkg_list = { next = 0xffff953622281c08, prev = 0xffff953622281c08 },
  limits = { bounce_pfn = 0xfffffffffffff, seg_boundary_mask = 0xffffffff, virt_boundary_mask = 0xfff, max_hw_sectors = 0x800, max_dev_sectors = 0x0, chunk_sectors = 0x0, max_sectors = 0x800, max_segment_size = 0xffffffff, physical_block_size = 0x200, logical_block_size = 0x200, alignment_offset = 0x0, io_min = 0x200, io_opt = 0x0, max_discard_sectors = 0xffffffff, max_hw_discard_sectors = 0xffffffff, max_write_same_sectors = 0x0, max_write_zeroes_sectors = 0x801, max_zone_append_sectors = 0x0, discard_granularity = 0x200, discard_alignment = 0x0, max_segments = 0x101, max_integrity_segments = 0x0, max_discard_segments = 0x100, misaligned = 0x0, discard_misaligned = 0x0, raid_partial_stripes_expensive = 0x0, zoned = BLK_ZONED_NONE },
  required_elevator_features = 0x0,
  nr_zones = 0x0,
  conv_zones_bitmap = 0x0,
  seq_zones_wlock = 0x0,
  sg_timeout = 0x0,
  sg_reserved_size = 0x7fffffff,
  node = 0xffffffff,
  blk_trace = 0x0,
  blk_trace_mutex = { owner = { counter = 0x0 }, wait_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, osq = { tail = { counter = 0x0 } }, wait_list = { next = 0xffff954fb37494b8, prev = 0xffff954fb37494b8 } },
  fq = 0x0,
  requeue_list = { next = 0xffff954fb37494d0, prev = 0xffff954fb37494d0 },
  requeue_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } },
  requeue_work = { work = { data = { counter = 0xfffffffe0 }, entry = { next = 0xffff954fb37494f0, prev = 0xffff954fb37494f0 }, func = 0xffffffffbaa474d0 }, timer = { entry = { next = 0x0, pprev = 0x0 }, expires = 0x0, function = 0xffffffffba6bb050, flags = 0x200040 }, wq = 0x0, cpu = 0x0 },
  sysfs_lock = { owner = { counter = 0x0 }, wait_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, osq = { tail = { counter = 0x0 } }, wait_list = { next = 0xffff954fb3749550, prev = 0xffff954fb3749550 } },
  sysfs_dir_lock = { owner = { counter = 0x0 }, wait_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, osq = { tail = { counter = 0x0 } }, wait_list = { next = 0xffff954fb3749570, prev = 0xffff954fb3749570 } },
  unused_hctx_list = { next = 0xffff954fb3749580, prev = 0xffff954fb3749580 },
  unused_hctx_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } },
  mq_freeze_depth = 0x0,
  bsg_dev = { class_dev = 0x0, minor = 0x0, queue = 0x0, ops = 0x0 },
  td = 0xffff954f43c12800,
  callback_head = { next = 0x0, func = 0x0 },
  mq_freeze_wq = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, head = { next = 0xffff954fb37495d8, prev = 0xffff954fb37495d8 } },
  mq_freeze_lock = { owner = { counter = 0x0 }, wait_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, osq = { tail = { counter = 0x0 } }, wait_list = { next = 0xffff954fb37495f8, prev = 0xffff954fb37495f8 } },
  q_usage_counter = { count = { counter = 0x8000000000000001 }, percpu_count_ptr = 0x4151cda18fe8, release = 0xffffffffbaa38da0, confirm_switch = 0x0, force_atomic = 0x0, allow_reinit = 0x1, rcu = { next = 0x0, func = 0x0 } },
  tag_set = 0xffff958f7fe56008,
  tag_set_list = { next = 0xffff958f7fe56118, prev = 0xffff956f479e6ea0 },
  bio_split = {
    bio_slab = 0xffff9530c7c20e00,
    front_pad = 0x0,
    bio_pool = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, min_nr = 0x2, curr_nr = 0x2, elements = 0xffff954fbdb45180, pool_data = 0xffff9530c7c20e00, alloc = 0xffffffffba832460, free = 0xffffffffba832480, wait = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, head = { next = 0xffff954fb37496a0, prev = 0xffff954fb37496a0 } } },
    bvec_pool = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, min_nr = 0x2, curr_nr = 0x2, elements = 0xffff954fbdb453f0, pool_data = 0xffff956fbe9a9180, alloc = 0xffffffffba832460, free = 0xffffffffba832480, wait = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, head = { next = 0xffff954fb37496e8, prev = 0xffff954fb37496e8 } } },
    bio_integrity_pool = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, min_nr = 0x0, curr_nr = 0x0, elements = 0x0, pool_data = 0x0, alloc = 0x0, free = 0x0, wait = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, head = { next = 0x0, prev = 0x0 } } },
    bvec_integrity_pool = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, min_nr = 0x0, curr_nr = 0x0, elements = 0x0, pool_data = 0x0, alloc = 0x0, free = 0x0, wait = { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, head = { next = 0x0, prev = 0x0 } } },
    rescue_lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } },
    rescue_list = { head = 0x0, tail = 0x0 },
    rescue_work = { data = { counter = 0xfffffffe0 }, entry = { next = 0xffff954fb37497a8, prev = 0xffff954fb37497a8 }, func = 0xffffffffbaa33ec0 },
    rescue_workqueue = 0x0 },
  debugfs_dir = 0xffff956f64d2efc0,
  sched_debugfs_dir = 0x0,
  rqos_debugfs_dir = 0xffff953fae3a6d80,
  mq_sysfs_init_done = 0x1,
  cmd_size = 0x0,
  write_hints = {0x0, 0x0, 0x0, 0x0, 0x0}
}
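For what it's worth, the disassembly and registers line up with
sbitmap.h:249: the faulting `mov (%rax),%rdx` is the load of
sb->map[index].depth, with %rax holding the computed &sb->map[index].
With %rax == 0 (and index starting at 0), this suggests sb->map itself
was NULL — i.e. the sbitmap behind the tag map had already been freed or
reinitialized underneath the iterator.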
On Thu, Jan 20, 2022 at 03:01:26PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?
>
> Here we go:
>
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249

That looks quite old. How does this look on mainline?
On Thu, Jan 20, 2022 at 06:21:26AM -0800, Christoph Hellwig wrote:
> That looks quite old. How does this look on mainline?
I am still trying to reproduce it locally (on our SUSE kernels and on
mainline). So far no luck using nvme-fc. I'll try now with nvme-tcp;
maybe it's a transport thing.
On Thu, Jan 20, 2022 at 03:01:26PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?
>
> Here we go:
>
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
> 0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>:	mov    (%rax),%rdx
>
> 234 static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 235                                           unsigned int start,
> 236                                           sb_for_each_fn fn, void *data)
> 237 {
> 238         unsigned int index;
> 239         unsigned int nr;
> 240         unsigned int scanned = 0;
> 241
> 242         if (start >= sb->depth)
> 243                 start = 0;
> 244         index = SB_NR_TO_INDEX(sb, start);
> 245         nr = SB_NR_TO_BIT(sb, start);
> 246
> 247         while (scanned < sb->depth) {
> 248                 unsigned long word;
> 249                 unsigned int depth = min_t(unsigned int,
> 250                                            sb->map[index].depth - nr,
> 251                                            sb->depth - scanned);

Seems more likely an nvme recovery issue: tags are changed even though
->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
added in v4.19.

You may check the dmesg log and see if there is any nvme error recovery
info.

Thanks,
Ming
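To illustrate the point, here is a hedged sketch of
blk_mq_queue_tag_busy_iter(), loosely paraphrased from
block/blk-mq-tag.c of that era (not the verbatim source): the tryget on
->q_usage_counter keeps the request_queue itself from being torn down,
but nothing here stabilizes the tag maps dereferenced afterwards.

void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
				void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/* Prevents the queue from being freed while we iterate... */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		/*
		 * ...but hctx->tags and its sbitmaps can still be freed or
		 * reallocated concurrently, e.g. by nvme error recovery
		 * updating nr_hw_queues -- leaving sb->map == NULL, which
		 * matches the NULL dereference in the oops above.
		 */
		struct blk_mq_tags *tags = hctx->tags;

		if (tags->nr_reserved_tags)
			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
	}
	blk_queue_exit(q);
}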
On Thu, Jan 20, 2022 at 11:37:00PM +0800, Ming Lei wrote:
> Seems more likely an nvme recovery issue: tags are changed even though
> ->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
> added in v4.19.
>
> You may check the dmesg log and see if there is any nvme error recovery
> info.

Thanks for the tip. It's likely that a recovery is running although
there is no entry in the logs. At least I have an idea what could be
the problem here.
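(A suggested, untested reproducer, assuming an nvme-tcp controller at,
say, /sys/class/nvme/nvme3: run `while :; do cat /proc/diskstats >
/dev/null; done` in one shell while repeatedly forcing error recovery
with `echo 1 > /sys/class/nvme/nvme3/reset_controller` in another, so
that tag maps are reallocated while the iterator runs.)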
On Thu, Jan 20, 2022 at 05:29:50PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 11:37:00PM +0800, Ming Lei wrote:
> > Seems more likely an nvme recovery issue: tags are changed even though
> > ->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
> > added in v4.19.
> >
> > You may check the dmesg log and see if there is any nvme error recovery
> > info.
>
> Thanks for the tip. It's likely that a recovery is running although
> there is no entry in the logs. At least I have an idea what could be
> the problem here.

FWIW, a846a8e6c9a5 ("blk-mq: don't free tags if the tag_set is used by
other device in queue initialztion") was missing from our tree.
diff --git a/block/genhd.c b/block/genhd.c
index c9d4386dd177..0e163055a4e6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 		if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
 			continue;
 		part_stat_read_all(hd, &stat);
+		if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
+			continue;
 		if (queue_is_mq(gp->queue))
 			inflight = blk_mq_in_flight(gp->queue, hd);
 		else
 			inflight = part_in_flight(hd);
+		blk_queue_exit(gp->queue);
 
 		seq_printf(seqf, "%4d %7d %pg "
 			   "%lu %lu %lu %u "
The request queues can be freed while we operate on them. Make sure we
hold a reference when using blk_mq_queue_tag_busy_iter.

RIP: blk_mq_queue_tag_busy_iter
Call Trace:
  ? blk_mq_hctx_mark_pending
  ? diskstats_show
  ? blk_mq_hctx_mark_pending
  blk_mq_in_flight
  diskstats_show
  ? klist_next
  seq_read
  proc_reg_read
  vfs_read
  ksys_read
  do_syscall_64
  entry_SYSCALL_64_after_hwframe

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)