diff mbox

[0/2] sd: Fix a deadlock between event checking and device removal

Message ID 1510938082.2846.23.camel@wdc.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Bart Van Assche Nov. 17, 2017, 5:01 p.m. UTC
On Fri, 2017-11-17 at 16:14 +0100, Jack Wang wrote:
> I suspect could be missing run queue or lost IO, IMHO it's unlikely

> below disk probing fix the bug.


If the system is still in this state or if you can reproduce this issue,
please collect and analyze the information under /sys/kernel/debug/block.
That's the only way I know of to verify whether or not a lockup has been
caused by a missing queue run. If the following command resolves the lockup
then the root cause is definitely a missing queue run:

for f in /sys/kernel/debug/block/*; do echo kick >$f/state; done

When analyzing queue lockups it's important to also have information about
requests that have been queued but that have not yet been started. I'm using
the following patch locally (will split this patch and submit it properly when
I have the time):

Comments

Jack Wang Nov. 17, 2017, 5:10 p.m. UTC | #1
2017-11-17 18:01 GMT+01:00 Bart Van Assche <Bart.VanAssche@wdc.com>:
> On Fri, 2017-11-17 at 16:14 +0100, Jack Wang wrote:
>> I suspect could be missing run queue or lost IO, IMHO it's unlikely
>> below disk probing fix the bug.
>
> If the system is still in this state or if you can reproduce this issue,
> please collect and analyze the information under /sys/kernel/debug/block.
> That's the only way I know of to verify whether or not a lockup has been
> caused by a missing queue run. If the following command resolves the lockup
> then the root cause is definitely a missing queue run:

It's production server, I was too late to gather more information.
kernel is 4.4.36/4.4.50
Request mode for both multipath and scsi, no multiqueue involvement.

I found thread back to 2012, you also report this problem in 3.2..
https://lkml.org/lkml/2012/1/3/163

I might be a very old bug.

I will try harder to reproduce.

Thanks,
Jack


>
> for f in /sys/kernel/debug/block/*; do echo kick >$f/state; done
>
> When analyzing queue lockups it's important to also have information about
> requests that have been queued but that have not yet been started. I'm using
> the following patch locally (will split this patch and submit it properly when
> I have the time):
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 29e8451931ff..3c9d64793865 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -408,8 +408,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
>  {
>         const struct show_busy_params *params = data;
>
> -       if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> -           test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> +       if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx)
>                 __blk_mq_debugfs_rq_show(params->m,
>                                          list_entry_rq(&rq->queuelist));
>  }
> diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
> index 01f08c03f2c1..41d1e3a01786 100644
> --- a/drivers/scsi/scsi_debugfs.c
> +++ b/drivers/scsi/scsi_debugfs.c
> @@ -7,10 +7,14 @@
>  void scsi_show_rq(struct seq_file *m, struct request *rq)
>  {
>         struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
> -       int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
> -       char buf[80];
> +       int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
> +       int timeout_ms = jiffies_to_msecs(rq->timeout);
> +       const u8 *const cdb = READ_ONCE(cmd->cmnd);
> +       char buf[80] = "(?)";
>
> -       __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
> -       seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
> -                  cmd->retries, msecs / 1000, msecs % 1000);
> +       if ((cmd->flags & SCMD_INITIALIZED) && cdb)
> +               __scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
> +       seq_printf(m, ", .cmd=%s, .retries=%d, .timeout=%d.%03d, allocated %d.%03d s ago",
> +                  buf, cmd->retries, timeout_ms / 1000, timeout_ms % 1000,
> +                  alloc_ms / 1000, alloc_ms % 1000);
>  }
Bart Van Assche Nov. 17, 2017, 5:28 p.m. UTC | #2
On Fri, 2017-11-17 at 18:10 +0100, Jack Wang wrote:
> It's production server, I was too late to gather more information.

> kernel is 4.4.36/4.4.50

> Request mode for both multipath and scsi, no multiqueue involvement.


Hello Jack,

I haven't seen any lockups with the multipath + SRP and the legacy block
and SCSI layers in a long time. So either something went wrong with
backporting upstream fixes, an upstream fix has not been backported to kernel
4.4.36 or the lockup is caused by the SCSI LLD in your setup, e.g. a missing
->scsi_done() call. Since the srp-test software passes on your setup the
latter may be the most likely.

Bart.
diff mbox

Patch

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 29e8451931ff..3c9d64793865 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -408,8 +408,7 @@  static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
 {
 	const struct show_busy_params *params = data;
 
-	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx)
 		__blk_mq_debugfs_rq_show(params->m,
 					 list_entry_rq(&rq->queuelist));
 }
diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
index 01f08c03f2c1..41d1e3a01786 100644
--- a/drivers/scsi/scsi_debugfs.c
+++ b/drivers/scsi/scsi_debugfs.c
@@ -7,10 +7,14 @@ 
 void scsi_show_rq(struct seq_file *m, struct request *rq)
 {
 	struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
-	int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
-	char buf[80];
+	int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
+	int timeout_ms = jiffies_to_msecs(rq->timeout);
+	const u8 *const cdb = READ_ONCE(cmd->cmnd);
+	char buf[80] = "(?)";
 
-	__scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
-	seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
-		   cmd->retries, msecs / 1000, msecs % 1000);
+	if ((cmd->flags & SCMD_INITIALIZED) && cdb)
+		__scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
+	seq_printf(m, ", .cmd=%s, .retries=%d, .timeout=%d.%03d, allocated %d.%03d s ago",
+		   buf, cmd->retries, timeout_ms / 1000, timeout_ms % 1000,
+		   alloc_ms / 1000, alloc_ms % 1000);
 }