
[2/2] block: hold queue lock while iterating in diskstats_show

Message ID 20220120105248.117025-3-dwagner@suse.de (mailing list archive)
State New, archived
Series fix crash by reading /proc/diskstats

Commit Message

Daniel Wagner Jan. 20, 2022, 10:52 a.m. UTC
The request queues can be freed while we operate on them. Make sure we
hold a reference when using blk_mq_queue_tag_busy_iter.

RIP: blk_mq_queue_tag_busy_iter
Call Trace:
 ? blk_mq_hctx_mark_pending
 ? diskstats_show
 ? blk_mq_hctx_mark_pending
 blk_mq_in_flight
 diskstats_show
 ? klist_next
 seq_read
 proc_reg_read
 vfs_read
 ksys_read
 do_syscall_64
 entry_SYSCALL_64_after_hwframe

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)
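
For context: with BLK_MQ_REQ_NOWAIT, blk_queue_enter() does not sleep
waiting for a frozen queue; it either takes a reference through
q->q_usage_counter or fails with -EBUSY, which is why the hunk can simply
skip the disk. A simplified sketch of that gate (block/blk-core.c; details
vary by kernel version):

int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	while (true) {
		/* fast path: queue is alive and not frozen */
		if (percpu_ref_tryget_live(&q->q_usage_counter))
			return 0;	/* paired with blk_queue_exit() */

		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EBUSY;	/* caller asked not to sleep */

		/* otherwise wait until the queue is unfrozen or dies */
		wait_event(q->mq_freeze_wq,
			   !q->mq_freeze_depth || blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}
}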

Comments

Hannes Reinecke Jan. 20, 2022, 11:18 a.m. UTC | #1
On 1/20/22 11:52 AM, Daniel Wagner wrote:
> The request queues can be freed while we operate on them. Make sure we
> hold a reference when using blk_mq_queue_tag_busy_iter.
> 
> RIP: blk_mq_queue_tag_busy_iter
> Call Trace:
>  ? blk_mq_hctx_mark_pending
>  ? diskstats_show
>  ? blk_mq_hctx_mark_pending
>  blk_mq_in_flight
>  diskstats_show
>  ? klist_next
>  seq_read
>  proc_reg_read
>  vfs_read
>  ksys_read
>  do_syscall_64
>  entry_SYSCALL_64_after_hwframe
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  block/genhd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index c9d4386dd177..0e163055a4e6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
>  		if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
>  			continue;
>  		part_stat_read_all(hd, &stat);
> +		if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
> +			continue;
>  		if (queue_is_mq(gp->queue))
>  			inflight = blk_mq_in_flight(gp->queue, hd);
>  		else
>  			inflight = part_in_flight(hd);
> +		blk_queue_exit(gp->queue);
>  
>  		seq_printf(seqf, "%4d %7d %pg "
>  			   "%lu %lu %lu %u "
> 
While this looks good, it seems to have been cut against an earlier
branch; the code in question looks different in my copy.

Cheers,

Hannes
Daniel Wagner Jan. 20, 2022, 12:41 p.m. UTC | #2
On Thu, Jan 20, 2022 at 12:18:24PM +0100, Hannes Reinecke wrote:
> On 1/20/22 11:52 AM, Daniel Wagner wrote:
> While this looks good, it seems to have been cut against an earlier
> branch; the code in question looks different in my copy.

This was cut against linux-block/for-next. Let me know which tree/branch
I should use.
Ming Lei Jan. 20, 2022, 12:56 p.m. UTC | #3
On Thu, Jan 20, 2022 at 11:52:48AM +0100, Daniel Wagner wrote:
> The request queues can be freed while we operate on them. Make sure we

It shouldn't be freed here.

> hold a reference when using blk_mq_queue_tag_busy_iter.
> 
> RIP: blk_mq_queue_tag_busy_iter

Any chance to post the whole stack trace?

> Call Trace:
>  ? blk_mq_hctx_mark_pending
>  ? diskstats_show
>  ? blk_mq_hctx_mark_pending
>  blk_mq_in_flight
>  diskstats_show
>  ? klist_next
>  seq_read
>  proc_reg_read
>  vfs_read
>  ksys_read
>  do_syscall_64
>  entry_SYSCALL_64_after_hwframe
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  block/genhd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index c9d4386dd177..0e163055a4e6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
>  		if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
>  			continue;
>  		part_stat_read_all(hd, &stat);
> +		if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
> +			continue;

The disk device's refcnt is supposed to be grabbed by the class device
iterator code, so the queue shouldn't be released in diskstats_show().

Not to mention blk_queue_enter() won't prevent the disk from being released...
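
For reference, the seq_file iterator behind /proc/diskstats pins each disk
via the class device iterator; roughly (block/genhd.c, trimmed, details
vary by version):

static void *disk_seqf_start(struct seq_file *seqf, loff_t *pos)
{
	loff_t skip = *pos;
	struct class_dev_iter *iter;
	struct device *dev;

	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
	if (!iter)
		return ERR_PTR(-ENOMEM);

	seqf->private = iter;
	class_dev_iter_init(iter, &block_class, NULL, &disk_type);
	do {
		/* class_dev_iter_next() holds a reference on the device */
		dev = class_dev_iter_next(iter);
		if (!dev)
			return NULL;
	} while (skip--);

	return dev_to_disk(dev);
}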

thanks,
Ming
Daniel Wagner Jan. 20, 2022, 1:19 p.m. UTC | #4
Hi Ming,

On Thu, Jan 20, 2022 at 08:56:03PM +0800, Ming Lei wrote:
> On Thu, Jan 20, 2022 at 11:52:48AM +0100, Daniel Wagner wrote:
> > The request queues can be freed while we operate on them. Make sure we
> 
> It shouldn't be freed here.

No, that is not what I tried to say. We somehow end up here operating on
pointers which are invalid.

> > hold a reference when using blk_mq_queue_tag_busy_iter.
> > 
> > RIP: blk_mq_queue_tag_busy_iter
> 
> Any chance to pose the whole stack trace?

[323465.820735] nvme nvme3: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 100.102.253.10:8009
[323465.821654] nvme nvme3: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[323465.850291] nvme nvme3: creating 80 I/O queues.
[323465.974196] nvme nvme3: mapped 80/0/0 default/read/poll queues.
[323466.031107] nvme nvme3: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.251.11:4420
[323466.093520] nvme nvme4: creating 80 I/O queues.
[323466.131740] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[323466.194742] nvme nvme4: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.252.12:4420
[323466.207338] nvme nvme5: creating 80 I/O queues.
[323466.219473] nvme nvme5: mapped 80/0/0 default/read/poll queues.
[323466.256877] nvme nvme5: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.253.12:4420
[323466.295990] nvme nvme6: creating 80 I/O queues.
[323466.308171] nvme nvme6: mapped 80/0/0 default/read/poll queues.
[323466.359055] nvme nvme6: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.251.12:4420
[323466.366912] nvme nvme7: creating 80 I/O queues.
[323466.379132] nvme nvme7: mapped 80/0/0 default/read/poll queues.
[323466.432957] nvme nvme7: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.252.11:4420
[323466.458548] nvme nvme8: creating 80 I/O queues.
[323466.474414] nvme nvme8: mapped 80/0/0 default/read/poll queues.
[323466.548254] nvme nvme8: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:a2ee06317d1845EE5227", addr 100.102.253.11:4420
[323467.255527] BUG: kernel NULL pointer dereference, address: 0000000000000000
[323467.255538] #PF: supervisor read access in kernel mode
[323467.255541] #PF: error_code(0x0000) - not-present page
[323467.255544] PGD 0 P4D 0 
[323467.255550] Oops: 0000 [#1] SMP PTI
[323467.255555] CPU: 13 PID: 17640 Comm: iostat Kdump: loaded Tainted: G          IOE  X    5.3.18-59.27-default #1 SLE15-SP3
[323467.255559] Hardware name: Dell Inc. PowerEdge R940xa/08XR9M, BIOS 2.12.2 07/12/2021
[323467.255569] RIP: 0010:blk_mq_queue_tag_busy_iter+0x1e4/0x2e0
[323467.255573] Code: 89 44 24 48 41 8b 1f 85 db 0f 84 d1 fe ff ff 4c 89 6c 24 20 4c 89 74 24 28 89 e8 44 29 e3 48 8d 04 40 48 c1 e0 06 49 03 47 10 <48> 8b 10 39 d3 0f 43 da 48 8b 90 80 00 00 00 41 01 dc 48 f7 d2 48
[323467.255583] RSP: 0018:ffffb701aefb7c20 EFLAGS: 00010246
[323467.255587] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[323467.255592] RDX: ffff954f43c14c00 RSI: ffffffffbaa442c0 RDI: ffff954fb3749010
[323467.255597] RBP: 0000000000000000 R08: 0000000800000000 R09: 00000008ffffffff
[323467.255602] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[323467.255607] R13: ffff9567b2200000 R14: ffff954798c09bc0 R15: ffff954798c09c20
[323467.255612] FS:  00007f6032444580(0000) GS:ffff956fbf6c0000(0000) knlGS:0000000000000000
[323467.255616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[323467.255618] CR2: 0000000000000000 CR3: 00000037bc7c0001 CR4: 00000000007726e0
[323467.255621] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[323467.255624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[323467.255626] PKRU: 55555554
[323467.255628] Call Trace:
[323467.255641]  ? blk_mq_hctx_mark_pending+0x70/0x70
[323467.255647]  ? diskstats_show+0x207/0x2a0
[323467.255651]  ? blk_mq_hctx_mark_pending+0x70/0x70
[323467.255654]  blk_mq_in_flight+0x35/0x60
[323467.255659]  diskstats_show+0x6d/0x2a0
[323467.255669]  ? klist_next+0x88/0x110
[323467.255676]  seq_read+0x2a6/0x3f0
[323467.255682]  proc_reg_read+0x39/0x60
[323467.255690]  vfs_read+0x89/0x140
[323467.255693]  ksys_read+0xa1/0xe0
[323467.255702]  do_syscall_64+0x5b/0x1e0
[323467.255709]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

> The disk device's refcnt is supposed to be grabbed by the class device
> iterator code, so the queue shouldn't be released in diskstats_show().

Okay, but it's not the disk device which is invalid, as far as I
understand. It's the request_queue which is bogus.

> Not to mention blk_queue_enter() won't prevent the disk from being released...

Understood.

Daniel
Ming Lei Jan. 20, 2022, 1:51 p.m. UTC | #5
On Thu, Jan 20, 2022 at 02:19:36PM +0100, Daniel Wagner wrote:
> Hi Ming,
> 
> On Thu, Jan 20, 2022 at 08:56:03PM +0800, Ming Lei wrote:
> [323467.255527] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [323467.255538] #PF: supervisor read access in kernel mode
> [323467.255541] #PF: error_code(0x0000) - not-present page
> [323467.255544] PGD 0 P4D 0 
> [323467.255550] Oops: 0000 [#1] SMP PTI
> [323467.255555] CPU: 13 PID: 17640 Comm: iostat Kdump: loaded Tainted: G          IOE  X    5.3.18-59.27-default #1 SLE15-SP3
> [323467.255559] Hardware name: Dell Inc. PowerEdge R940xa/08XR9M, BIOS 2.12.2 07/12/2021
> [323467.255569] RIP: 0010:blk_mq_queue_tag_busy_iter+0x1e4/0x2e0

Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
in the source code? And what is the NULL pointer?

With this kind of info, we may know the root cause.
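
For example, assuming a vmlinux with debug info, scripts/faddr2line from
the kernel tree can resolve the offset:

  ./scripts/faddr2line vmlinux blk_mq_queue_tag_busy_iter+0x1e4/0x2e0

or, inside crash:

  crash> dis -l blk_mq_queue_tag_busy_iter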

Thanks,
Ming
Daniel Wagner Jan. 20, 2022, 2:01 p.m. UTC | #6
On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> in the source code? And what is the NULL pointer?

Here we go:

/usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>:    mov    (%rax),%rdx

234static inline void __sbitmap_for_each_set(struct sbitmap *sb,
235                                          unsigned int start,
236                                          sb_for_each_fn fn, void *data)
237{
238        unsigned int index;
239        unsigned int nr;
240        unsigned int scanned = 0;
241
242        if (start >= sb->depth)
243                start = 0;
244        index = SB_NR_TO_INDEX(sb, start);
245        nr = SB_NR_TO_BIT(sb, start);
246
247        while (scanned < sb->depth) {
248                unsigned long word;
249                unsigned int depth = min_t(unsigned int,
250                                           sb->map[index].depth - nr,
251                                           sb->depth - scanned);
Daniel Wagner Jan. 20, 2022, 2:06 p.m. UTC | #7
On Thu, Jan 20, 2022 at 03:01:27PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?

Yes, %rax is NULL.

> Here we go:
> 
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
> 0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>:    mov    (%rax),%rdx
> 
> 234static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 235                                          unsigned int start,
> 236                                          sb_for_each_fn fn, void *data)
> 237{
> 238        unsigned int index;
> 239        unsigned int nr;
> 240        unsigned int scanned = 0;
> 241
> 242        if (start >= sb->depth)
> 243                start = 0;
> 244        index = SB_NR_TO_INDEX(sb, start);
> 245        nr = SB_NR_TO_BIT(sb, start);
> 246
> 247        while (scanned < sb->depth) {
> 248                unsigned long word;
> 249                unsigned int depth = min_t(unsigned int,
> 250                                           sb->map[index].depth - nr,
> 251                                           sb->depth - scanned);


Forgot to add this:

crash> bt
PID: 17640  TASK: ffff956f4a468000  CPU: 13  COMMAND: "iostat"
 #0 [ffffb701aefb7980] machine_kexec at ffffffffba66fb91
 #1 [ffffb701aefb79d8] __crash_kexec at ffffffffba75927d
 #2 [ffffb701aefb7aa0] crash_kexec at ffffffffba75a13d
 #3 [ffffb701aefb7ab8] oops_end at ffffffffba636cdf
 #4 [ffffb701aefb7ad8] no_context at ffffffffba682baf
 #5 [ffffb701aefb7b40] do_page_fault at ffffffffba683e30
 #6 [ffffb701aefb7b70] page_fault at ffffffffbb0012fe
    [exception RIP: blk_mq_queue_tag_busy_iter+484]
    RIP: ffffffffbaa4a254  RSP: ffffb701aefb7c20  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 0000000000000001  RCX: 0000000000000000
    RDX: ffff954f43c14c00  RSI: ffffffffbaa442c0  RDI: ffff954fb3749010
    RBP: 0000000000000000   R8: 0000000800000000   R9: 00000008ffffffff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff9567b2200000  R14: ffff954798c09bc0  R15: ffff954798c09c20
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb701aefb7cb0] blk_mq_in_flight at ffffffffbaa461e5
 #8 [ffffb701aefb7cd0] diskstats_show at ffffffffbaa4f00d
 #9 [ffffb701aefb7e50] seq_read at ffffffffba90df16
#10 [ffffb701aefb7eb0] proc_reg_read at ffffffffba96d789
#11 [ffffb701aefb7ec8] vfs_read at ffffffffba8e4c39
#12 [ffffb701aefb7ef8] ksys_read at ffffffffba8e4fc1
#13 [ffffb701aefb7f38] do_syscall_64 at ffffffffba60538b
#14 [ffffb701aefb7f50] entry_SYSCALL_64_after_hwframe at ffffffffbb00008c
    RIP: 00007f6031f4eb5e  RSP: 00007ffd187a7e88  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00005577a698c2c0  RCX: 00007f6031f4eb5e
    RDX: 0000000000000400  RSI: 00005577a698f480  RDI: 0000000000000003
    RBP: 00007f603222e300   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000000000000a
    R13: 0000000000000d68  R14: 00007f603222d700  R15: 0000000000000d68
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b


I've tried to figure out the request_queue pointer from the registers and
I think the pointer is still in %rdi.
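
In crash, the structure at that address can be dumped with:

  crash> struct request_queue ffff954fb3749010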

struct request_queue {
  last_merge = 0x0, 
  elevator = 0x0, 
  stats = 0xffff956f45a9bec0, 
  rq_qos = 0xffff954f54c57558, 
  mq_ops = 0xffffffffc0c27140 <nvme_tcp_mq_ops>, 
  queue_ctx = 0x4151cf2266c0, 
  queue_depth = 0x0, 
  queue_hw_ctx = 0xffff954f43c14c00, 
  nr_hw_queues = 0x50, 
  backing_dev_info = 0xffff953fae3ae800, 
  queuedata = 0xffff953622282800, 
  queue_flags = 0x5041d0, 
  pm_only = {
    counter = 0x0
  }, 
  id = 0x61, 
  bounce_gfp = 0x0, 
  queue_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0x0
            }, 
            {
              locked = 0x0, 
              pending = 0x0
            }, 
            {
              locked_pending = 0x0, 
              tail = 0x0
            }
          }
        }
      }
    }
  }, 
  kobj = {
    name = 0xffffffffbb7846be "queue", 
    entry = {
      next = 0xffff954fb3749088, 
      prev = 0xffff954fb3749088
    }, 
    parent = 0xffff95af629a58a0, 
    kset = 0x0, 
    ktype = 0xffffffffbbb56540, 
    sd = 0xffff956563ecd580, 
    kref = {
      refcount = {
        refs = {
          counter = 0x2
        }
      }
    }, 
    state_initialized = 0x1, 
    state_in_sysfs = 0x1, 
    state_add_uevent_sent = 0x0, 
    state_remove_uevent_sent = 0x0, 
    uevent_suppress = 0x0
  }, 
  mq_kobj = 0xffff95720ca05660, 
  integrity = {
    profile = 0x0, 
    flags = 0x0, 
    tuple_size = 0x0, 
    interval_exp = 0x0, 
    tag_size = 0x0
  }, 
  dev = 0x0, 
  rpm_status = RPM_ACTIVE, 
  nr_pending = 0x0, 
  nr_requests = 0x80, 
  dma_pad_mask = 0x0, 
  dma_alignment = 0x7, 
  ksm = 0x0, 
  rq_timeout = 0x1d4c, 
  poll_nsec = 0xffffffff, 
  poll_cb = 0xffff953443cafa00, 
  poll_stat = {{
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }, {
      mean = 0x0, 
      min = 0x0, 
      max = 0x0, 
      nr_samples = 0x0, 
      batch = 0x0
    }}, 
  timeout = {
    entry = {
      next = 0x0, 
      pprev = 0x0
    }, 
    expires = 0x0, 
    function = 0xffffffffbaa39440, 
    flags = 0x2c
  }, 
  timeout_work = {
    data = {
      counter = 0xfffffffe0
    }, 
    entry = {
      next = 0xffff954fb37493c0, 
      prev = 0xffff954fb37493c0
    }, 
    func = 0xffffffffbaa454f0
  }, 
  icq_list = {
    next = 0xffff954fb37493d8, 
    prev = 0xffff954fb37493d8
  }, 
  blkcg_pols = {0x3}, 
  root_blkg = 0xffff953622281c00, 
  blkg_list = {
    next = 0xffff953622281c08, 
    prev = 0xffff953622281c08
  }, 
  limits = {
    bounce_pfn = 0xfffffffffffff, 
    seg_boundary_mask = 0xffffffff, 
    virt_boundary_mask = 0xfff, 
    max_hw_sectors = 0x800, 
    max_dev_sectors = 0x0, 
    chunk_sectors = 0x0, 
    max_sectors = 0x800, 
    max_segment_size = 0xffffffff, 
    physical_block_size = 0x200, 
    logical_block_size = 0x200, 
    alignment_offset = 0x0, 
    io_min = 0x200, 
    io_opt = 0x0, 
    max_discard_sectors = 0xffffffff, 
    max_hw_discard_sectors = 0xffffffff, 
    max_write_same_sectors = 0x0, 
    max_write_zeroes_sectors = 0x801, 
    max_zone_append_sectors = 0x0, 
    discard_granularity = 0x200, 
    discard_alignment = 0x0, 
    max_segments = 0x101, 
    max_integrity_segments = 0x0, 
    max_discard_segments = 0x100, 
    misaligned = 0x0, 
    discard_misaligned = 0x0, 
    raid_partial_stripes_expensive = 0x0, 
    zoned = BLK_ZONED_NONE
  }, 
  required_elevator_features = 0x0, 
  nr_zones = 0x0, 
  conv_zones_bitmap = 0x0, 
  seq_zones_wlock = 0x0, 
  sg_timeout = 0x0, 
  sg_reserved_size = 0x7fffffff, 
  node = 0xffffffff, 
  blk_trace = 0x0, 
  blk_trace_mutex = {
    owner = {
      counter = 0x0
    }, 
    wait_lock = {
      {
        rlock = {
          raw_lock = {
            {
              val = {
                counter = 0x0
              }, 
              {
                locked = 0x0, 
                pending = 0x0
              }, 
              {
                locked_pending = 0x0, 
                tail = 0x0
              }
            }
          }
        }
      }
    }, 
    osq = {
      tail = {
        counter = 0x0
      }
    }, 
    wait_list = {
      next = 0xffff954fb37494b8, 
      prev = 0xffff954fb37494b8
    }
  }, 
  fq = 0x0, 
  requeue_list = {
    next = 0xffff954fb37494d0, 
    prev = 0xffff954fb37494d0
  }, 
  requeue_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0x0
            }, 
            {
              locked = 0x0, 
              pending = 0x0
            }, 
            {
              locked_pending = 0x0, 
              tail = 0x0
            }
          }
        }
      }
    }
  }, 
  requeue_work = {
    work = {
      data = {
        counter = 0xfffffffe0
      }, 
      entry = {
        next = 0xffff954fb37494f0, 
        prev = 0xffff954fb37494f0
      }, 
      func = 0xffffffffbaa474d0
    }, 
    timer = {
      entry = {
        next = 0x0, 
        pprev = 0x0
      }, 
      expires = 0x0, 
      function = 0xffffffffba6bb050, 
      flags = 0x200040
    }, 
    wq = 0x0, 
    cpu = 0x0
  }, 
  sysfs_lock = {
    owner = {
      counter = 0x0
    }, 
    wait_lock = {
      {
        rlock = {
          raw_lock = {
            {
              val = {
                counter = 0x0
              }, 
              {
                locked = 0x0, 
                pending = 0x0
              }, 
              {
                locked_pending = 0x0, 
                tail = 0x0
              }
            }
          }
        }
      }
    }, 
    osq = {
      tail = {
        counter = 0x0
      }
    }, 
    wait_list = {
      next = 0xffff954fb3749550, 
      prev = 0xffff954fb3749550
    }
  }, 
  sysfs_dir_lock = {
    owner = {
      counter = 0x0
    }, 
    wait_lock = {
      {
        rlock = {
          raw_lock = {
            {
              val = {
                counter = 0x0
              }, 
              {
                locked = 0x0, 
                pending = 0x0
              }, 
              {
                locked_pending = 0x0, 
                tail = 0x0
              }
            }
          }
        }
      }
    }, 
    osq = {
      tail = {
        counter = 0x0
      }
    }, 
    wait_list = {
      next = 0xffff954fb3749570, 
      prev = 0xffff954fb3749570
    }
  }, 
  unused_hctx_list = {
    next = 0xffff954fb3749580, 
    prev = 0xffff954fb3749580
  }, 
  unused_hctx_lock = {
    {
      rlock = {
        raw_lock = {
          {
            val = {
              counter = 0x0
            }, 
            {
              locked = 0x0, 
              pending = 0x0
            }, 
            {
              locked_pending = 0x0, 
              tail = 0x0
            }
          }
        }
      }
    }
  }, 
  mq_freeze_depth = 0x0, 
  bsg_dev = {
    class_dev = 0x0, 
    minor = 0x0, 
    queue = 0x0, 
    ops = 0x0
  }, 
  td = 0xffff954f43c12800, 
  callback_head = {
    next = 0x0, 
    func = 0x0
  }, 
  mq_freeze_wq = {
    lock = {
      {
        rlock = {
          raw_lock = {
            {
              val = {
                counter = 0x0
              }, 
              {
                locked = 0x0, 
                pending = 0x0
              }, 
              {
                locked_pending = 0x0, 
                tail = 0x0
              }
            }
          }
        }
      }
    }, 
    head = {
      next = 0xffff954fb37495d8, 
      prev = 0xffff954fb37495d8
    }
  }, 
  mq_freeze_lock = {
    owner = {
      counter = 0x0
    }, 
    wait_lock = {
      {
        rlock = {
          raw_lock = {
            {
              val = {
                counter = 0x0
              }, 
              {
                locked = 0x0, 
                pending = 0x0
              }, 
              {
                locked_pending = 0x0, 
                tail = 0x0
              }
            }
          }
        }
      }
    }, 
    osq = {
      tail = {
        counter = 0x0
      }
    }, 
    wait_list = {
      next = 0xffff954fb37495f8, 
      prev = 0xffff954fb37495f8
    }
  }, 
  q_usage_counter = {
    count = {
      counter = 0x8000000000000001
    }, 
    percpu_count_ptr = 0x4151cda18fe8, 
    release = 0xffffffffbaa38da0, 
    confirm_switch = 0x0, 
    force_atomic = 0x0, 
    allow_reinit = 0x1, 
    rcu = {
      next = 0x0, 
      func = 0x0
    }
  }, 
  tag_set = 0xffff958f7fe56008, 
  tag_set_list = {
    next = 0xffff958f7fe56118, 
    prev = 0xffff956f479e6ea0
  }, 
  bio_split = {
    bio_slab = 0xffff9530c7c20e00, 
    front_pad = 0x0, 
    bio_pool = {
      lock = {
        {
          rlock = {
            raw_lock = {
              {
                val = {
                  counter = 0x0
                }, 
                {
                  locked = 0x0, 
                  pending = 0x0
                }, 
                {
                  locked_pending = 0x0, 
                  tail = 0x0
                }
              }
            }
          }
        }
      }, 
      min_nr = 0x2, 
      curr_nr = 0x2, 
      elements = 0xffff954fbdb45180, 
      pool_data = 0xffff9530c7c20e00, 
      alloc = 0xffffffffba832460, 
      free = 0xffffffffba832480, 
      wait = {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  }, 
                  {
                    locked = 0x0, 
                    pending = 0x0
                  }, 
                  {
                    locked_pending = 0x0, 
                    tail = 0x0
                  }
                }
              }
            }
          }
        }, 
        head = {
          next = 0xffff954fb37496a0, 
          prev = 0xffff954fb37496a0
        }
      }
    }, 
    bvec_pool = {
      lock = {
        {
          rlock = {
            raw_lock = {
              {
                val = {
                  counter = 0x0
                }, 
                {
                  locked = 0x0, 
                  pending = 0x0
                }, 
                {
                  locked_pending = 0x0, 
                  tail = 0x0
                }
              }
            }
          }
        }
      }, 
      min_nr = 0x2, 
      curr_nr = 0x2, 
      elements = 0xffff954fbdb453f0, 
      pool_data = 0xffff956fbe9a9180, 
      alloc = 0xffffffffba832460, 
      free = 0xffffffffba832480, 
      wait = {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  }, 
                  {
                    locked = 0x0, 
                    pending = 0x0
                  }, 
                  {
                    locked_pending = 0x0, 
                    tail = 0x0
                  }
                }
              }
            }
          }
        }, 
        head = {
          next = 0xffff954fb37496e8, 
          prev = 0xffff954fb37496e8
        }
      }
    }, 
    bio_integrity_pool = {
      lock = {
        {
          rlock = {
            raw_lock = {
              {
                val = {
                  counter = 0x0
                }, 
                {
                  locked = 0x0, 
                  pending = 0x0
                }, 
                {
                  locked_pending = 0x0, 
                  tail = 0x0
                }
              }
            }
          }
        }
      }, 
      min_nr = 0x0, 
      curr_nr = 0x0, 
      elements = 0x0, 
      pool_data = 0x0, 
      alloc = 0x0, 
      free = 0x0, 
      wait = {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  }, 
                  {
                    locked = 0x0, 
                    pending = 0x0
                  }, 
                  {
                    locked_pending = 0x0, 
                    tail = 0x0
                  }
                }
              }
            }
          }
        }, 
        head = {
          next = 0x0, 
          prev = 0x0
        }
      }
    }, 
    bvec_integrity_pool = {
      lock = {
        {
          rlock = {
            raw_lock = {
              {
                val = {
                  counter = 0x0
                }, 
                {
                  locked = 0x0, 
                  pending = 0x0
                }, 
                {
                  locked_pending = 0x0, 
                  tail = 0x0
                }
              }
            }
          }
        }
      }, 
      min_nr = 0x0, 
      curr_nr = 0x0, 
      elements = 0x0, 
      pool_data = 0x0, 
      alloc = 0x0, 
      free = 0x0, 
      wait = {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  }, 
                  {
                    locked = 0x0, 
                    pending = 0x0
                  }, 
                  {
                    locked_pending = 0x0, 
                    tail = 0x0
                  }
                }
              }
            }
          }
        }, 
        head = {
          next = 0x0, 
          prev = 0x0
        }
      }
    }, 
    rescue_lock = {
      {
        rlock = {
          raw_lock = {
            {
              val = {
                counter = 0x0
              }, 
              {
                locked = 0x0, 
                pending = 0x0
              }, 
              {
                locked_pending = 0x0, 
                tail = 0x0
              }
            }
          }
        }
      }
    }, 
    rescue_list = {
      head = 0x0, 
      tail = 0x0
    }, 
    rescue_work = {
      data = {
        counter = 0xfffffffe0
      }, 
      entry = {
        next = 0xffff954fb37497a8, 
        prev = 0xffff954fb37497a8
      }, 
      func = 0xffffffffbaa33ec0
    }, 
    rescue_workqueue = 0x0
  }, 
  debugfs_dir = 0xffff956f64d2efc0, 
  sched_debugfs_dir = 0x0, 
  rqos_debugfs_dir = 0xffff953fae3a6d80, 
  mq_sysfs_init_done = 0x1, 
  cmd_size = 0x0, 
  write_hints = {0x0, 0x0, 0x0, 0x0, 0x0}
}
Christoph Hellwig Jan. 20, 2022, 2:21 p.m. UTC | #8
On Thu, Jan 20, 2022 at 03:01:26PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?
> 
> Here we go:
> 
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249

That looks quite old.  How does this look on mainline?
Daniel Wagner Jan. 20, 2022, 2:34 p.m. UTC | #9
On Thu, Jan 20, 2022 at 06:21:26AM -0800, Christoph Hellwig wrote:
> That looks quite old.  How does this look on mainline?

I am still trying to reproduce it locally (on our SUSE kernels and
mainline). So far 'no luck' using nvme-fc. I'll try now with nvme-tcp;
maybe it's a transport thing.
Ming Lei Jan. 20, 2022, 3:37 p.m. UTC | #10
On Thu, Jan 20, 2022 at 03:01:26PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 09:51:18PM +0800, Ming Lei wrote:
> > Then can you figure out where blk_mq_queue_tag_busy_iter+0x1e4 points to
> > in the source code? And what is the NULL pointer?
> 
> Here we go:
> 
> /usr/src/debug/kernel-default-5.3.18-59.27.1.x86_64/linux-5.3/linux-obj/../include/linux/sbitmap.h: 249
> 0xffffffffb244a254 <blk_mq_queue_tag_busy_iter+484>:    mov    (%rax),%rdx
> 
> 234static inline void __sbitmap_for_each_set(struct sbitmap *sb,
> 235                                          unsigned int start,
> 236                                          sb_for_each_fn fn, void *data)
> 237{
> 238        unsigned int index;
> 239        unsigned int nr;
> 240        unsigned int scanned = 0;
> 241
> 242        if (start >= sb->depth)
> 243                start = 0;
> 244        index = SB_NR_TO_INDEX(sb, start);
> 245        nr = SB_NR_TO_BIT(sb, start);
> 246
> 247        while (scanned < sb->depth) {
> 248                unsigned long word;
> 249                unsigned int depth = min_t(unsigned int,
> 250                                           sb->map[index].depth - nr,
> 251                                           sb->depth - scanned);
> 

Seems more likely an nvme recovery issue: the tags are changed even though
->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
added in v4.19.

You may check the dmesg log and see if there is any nvme error recovery info.
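
The guard in blk_mq_queue_tag_busy_iter() only pins the queue itself;
roughly (block/blk-mq-tag.c of that era, simplified):

void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
		void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/* keeps the queue from being frozen/updated under us ... */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		struct blk_mq_tags *tags = hctx->tags;

		if (!blk_mq_hw_queue_mapped(hctx))
			continue;

		/* ... but hctx->tags can still be reallocated by error
		 * recovery, leaving a stale or NULL sbitmap behind */
		if (tags->nr_reserved_tags)
			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
	}
	blk_queue_exit(q);
}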

Thanks,
Ming
Daniel Wagner Jan. 20, 2022, 4:29 p.m. UTC | #11
On Thu, Jan 20, 2022 at 11:37:00PM +0800, Ming Lei wrote:
> Seems more likely an nvme recovery issue: the tags are changed even though
> ->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
> added in v4.19.
> 
> You may check the dmesg log and see if there is any nvme error recovery info.

Thanks for the tip. It's likely that a recovery is running although
there is no entry in the logs. At least I have an idea what could be the
problem here.
Daniel Wagner March 4, 2022, 2:53 p.m. UTC | #12
On Thu, Jan 20, 2022 at 05:29:50PM +0100, Daniel Wagner wrote:
> On Thu, Jan 20, 2022 at 11:37:00PM +0800, Ming Lei wrote:
> > Seems more likely an nvme recovery issue: the tags are changed even though
> > ->q_usage_counter is grabbed in blk_mq_queue_tag_busy_iter(), which was
> > added in v4.19.
> > 
> > You may check the dmesg log and see if there is any nvme error recovery info.
> 
> Thanks for the tip. It's likely that a recovery is running although
> there is no entry in the logs. At least I have an idea what could be the
> problem here.

FWIW, a846a8e6c9a5 ("blk-mq: don't free tags if the tag_set is used by
other device in queue initialztion") was missing from our tree.

Patch

diff --git a/block/genhd.c b/block/genhd.c
index c9d4386dd177..0e163055a4e6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1167,10 +1167,13 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 		if (bdev_is_partition(hd) && !bdev_nr_sectors(hd))
 			continue;
 		part_stat_read_all(hd, &stat);
+		if (blk_queue_enter(gp->queue, BLK_MQ_REQ_NOWAIT))
+			continue;
 		if (queue_is_mq(gp->queue))
 			inflight = blk_mq_in_flight(gp->queue, hd);
 		else
 			inflight = part_in_flight(hd);
+		blk_queue_exit(gp->queue);
 
 		seq_printf(seqf, "%4d %7d %pg "
 			   "%lu %lu %lu %u "