[PATCHv4,0/3] scsi timeout handling updates

Message ID 20181126165430.4519-1-keith.busch@intel.com (mailing list archive)

Message

Keith Busch Nov. 26, 2018, 4:54 p.m. UTC
This is an iterative update to the previous version, taking review
comments into account.

Background:

The main objective is to remove the generic block layer's lock prefix
currently required to transition a request to its completed state by
shifting that expense to lower level drivers that actually need it,
and removing the software layering violation that was required to use
that mechanism.

Changes since v3:

  The complete state is moved to its own field separate from the
  non-atomic scsi_cmnd "flags" field.

  Code comments added to describe the more obscure handling for fake
  timeout injection.

Keith Busch (3):
  blk-mq: Return true if request was completed
  scsi: Do not rely on blk-mq for double completions
  blk-mq: Simplify request completion state

 block/blk-mq.c            |  9 ++++-----
 drivers/scsi/scsi_error.c | 22 +++++++++++-----------
 drivers/scsi/scsi_lib.c   | 13 ++++++++++++-
 include/linux/blk-mq.h    | 16 +---------------
 include/scsi/scsi_cmnd.h  |  4 ++++
 5 files changed, 32 insertions(+), 32 deletions(-)

Comments

Jens Axboe Nov. 26, 2018, 5:33 p.m. UTC | #1
On 11/26/18 9:54 AM, Keith Busch wrote:
> The iterative update to the previous version taking into account review
> comments.
> 
> Background:
> 
> The main objective is to remove the generic block layer's lock prefix
> currently required to transition a request to its completed state by
> shifting that expense to lower level drivers that actually need it,
> and removing the software layering violation that was required to use
> that mechanism.

Thanks Keith, added for 4.21.
Ming Lei Nov. 28, 2018, 2:20 a.m. UTC | #2
On Mon, Nov 26, 2018 at 10:33:32AM -0700, Jens Axboe wrote:
> On 11/26/18 9:54 AM, Keith Busch wrote:
> > The iterative update to the previous version taking into account review
> > comments.
> > 
> > Background:
> > 
> > The main objective is to remove the generic block layer's lock prefix
> > currently required to transition a request to its completed state by
> > shifting that expense to lower level drivers that actually need it,
> > and removing the software layering violation that was required to use
> > that mechanism.
> 
> Thanks Keith, added for 4.21.

Hi Jens & Keith,

This patchset seems to cause the following kernel panic; I don't see
this issue with commit 4ab32bf3305e:

[ 1478.366776] nvmet: adding nsid 1 to subsystem testnqn
[ 1478.385104] nvmet: creating controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:8786b40e-ed54-4f9e-b1b0-6507377ffa97.
[ 1478.386975] nvme nvme1: creating 4 I/O queues.
[ 1478.388784] nvme nvme1: new ctrl: "testnqn"
[ 1479.396523] nvme io: runtime 40
[ 1575.259358] nvme reset: runtime 40
[ 1575.263920] nvme nvme0: resetting controller
[ 1575.288473] nvme nvme0: 4/0/0 read/write/poll queues
[ 1575.291531] nvme nvme1: resetting controller
[ 1575.298338] print_req_error: I/O error, dev nvme1n1, sector 524287272
[ 1575.299169] print_req_error: I/O error, dev nvme1n1, sector 524287328
[ 1575.299950] BUG: unable to handle kernel NULL pointer dereference at 0000000000000198
[ 1575.299951] PGD 0 P4D 0 
[ 1575.299954] Oops: 0002 [#1] PREEMPT SMP PTI
[ 1575.299972] CPU: 0 PID: 12014 Comm: kworker/u8:0 Not tainted 4.20.0-rc3_5f0ed774ed29_for-next+ #1
[ 1575.299973] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 1575.299978] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
[ 1575.299985] RIP: 0010:blk_mq_free_request+0x7e/0xe4
[ 1575.299987] Code: 00 00 00 00 00 8b 53 18 b8 01 00 00 00 84 d2 74 0b 31 c0 81 e2 00 08 06 00 0f 95 c0 49 ff 84 c5 80 00 00 00 f6 43 1c 40 74 09 <f0> 41 ff 8c 24 98 01 00 00 83 3d 0b c6 52 01 00 74 15 0f b6 43 18
[ 1575.299988] RSP: 0018:ffff888277a03e48 EFLAGS: 00010002
[ 1575.299989] RAX: 0000000000000001 RBX: ffff88823aa04e00 RCX: ffff888265e71dc0
[ 1575.299990] RDX: 0000000000080700 RSI: ffff888265e71c00 RDI: ffff88821d822808
[ 1575.299991] RBP: ffff888267a3d750 R08: 00000000ffffffff R09: ffff888265e71dd0
[ 1575.299992] R10: ffff888265e71dd0 R11: 0000000000000020 R12: 0000000000000000
[ 1575.299993] R13: ffffe8ffffc0be40 R14: 0000000000000000 R15: 0000000000006000
[ 1575.299994] FS:  0000000000000000(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
[ 1575.299995] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1575.299996] CR2: 0000000000000198 CR3: 0000000236de8005 CR4: 0000000000760ef0
[ 1575.300000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1575.300001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1575.300001] PKRU: 55555554
[ 1575.300002] Call Trace:
[ 1575.300012]  <IRQ>
[ 1575.300016]  blk_mq_complete_request+0xda/0xfd
[ 1575.300021]  nvmet_req_complete+0x11/0x5e [nvmet]
[ 1575.300025]  nvmet_bio_done+0x2b/0x3d [nvmet]
[ 1575.300027]  blk_update_request+0x177/0x27b
[ 1575.300032]  ? null_complete_rq+0x11/0x11 [null_blk]
[ 1575.300034]  blk_mq_end_request+0x1a/0xc8
[ 1575.300036]  null_cmd_timer_expired+0xe/0x11 [null_blk]
[ 1575.300040]  __hrtimer_run_queues+0x176/0x238
[ 1575.300043]  ? kvm_clock_read+0x14/0x23
[ 1575.300045]  hrtimer_interrupt+0x103/0x21c
[ 1575.300049]  smp_apic_timer_interrupt+0xd3/0x141
[ 1575.300051]  apic_timer_interrupt+0xf/0x20
[ 1575.300053]  </IRQ>
[ 1575.300055] RIP: 0010:console_unlock+0x3a8/0x45f
[ 1575.300057] Code: 59 00 8a 1d bc 0e 5d 01 48 c7 c7 f0 0c 6a 82 4c 89 2d b6 0e 5d 01 e8 fe c0 59 00 84 db 75 15 e8 9f 1c 00 00 48 8b 3c 24 57 9d <0f> 1f 44 00 00 e9 aa fc ff ff c6 05 89 0e 5d 01 00 e8 83 1c 00 00
[ 1575.300057] RSP: 0018:ffffc90003a3fc20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 1575.300059] RAX: 0000000080000002 RBX: 0000000000000000 RCX: 0000000000000000
[ 1575.300060] RDX: 0000000000000001 RSI: 0000000000000046 RDI: 0000000000000246
[ 1575.300060] RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
[ 1575.300061] R10: 0000000005f5e100 R11: ffffffff826a00e7 R12: ffffffff821e7460
[ 1575.300062] R13: 0000000000000000 R14: ffffffff826a00e0 R15: 0000000000000000
[ 1575.300067]  vprintk_emit+0x1ff/0x236
[ 1575.300069]  printk+0x52/0x6e
[ 1575.300071]  blk_update_request+0x10a/0x27b
[ 1575.300074]  blk_mq_end_request+0x1a/0xc8
[ 1575.300076]  blk_mq_complete_request+0xda/0xfd
[ 1575.300080]  nvme_cancel_request+0x5b/0x60
[ 1575.300082]  bt_tags_for_each+0xe2/0x10c
[ 1575.300085]  ? nvme_complete_rq+0x162/0x162
[ 1575.300087]  ? nvme_complete_rq+0x162/0x162
[ 1575.300089]  blk_mq_tagset_busy_iter+0x68/0x75
[ 1575.300091]  nvme_loop_shutdown_ctrl+0x38/0x86 [nvme_loop]
[ 1575.300094]  nvme_loop_reset_ctrl_work+0x2a/0xcf [nvme_loop]
[ 1575.300098]  process_one_work+0x1da/0x313
[ 1575.300101]  ? rescuer_thread+0x282/0x282
[ 1575.300103]  process_scheduled_works+0x27/0x2c
[ 1575.300105]  worker_thread+0x1e7/0x295
[ 1575.300107]  kthread+0x115/0x11d
[ 1575.300109]  ? kthread_park+0x76/0x76
[ 1575.300111]  ret_from_fork+0x35/0x40
[ 1575.300113] Modules linked in: nvme_loop nvmet nvme_fabrics null_blk scsi_debug xfs libcrc32c isofs iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core lpc_ich mfd_core ip_tables sr_mod cdrom usb_storage sd_mod ahci libahci libata crc32c_intel virtio_scsi qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod [last unloaded: null_blk]
[ 1575.300127] Dumping ftrace buffer:
[ 1575.300129]    (ftrace buffer empty)
[ 1575.300130] CR2: 0000000000000198
[ 1575.300133] ---[ end trace f941476562612d99 ]---

thanks,
Ming
Christoph Hellwig Nov. 28, 2018, 7 a.m. UTC | #3
On Wed, Nov 28, 2018 at 10:20:01AM +0800, Ming Lei wrote:
> On Mon, Nov 26, 2018 at 10:33:32AM -0700, Jens Axboe wrote:
> > On 11/26/18 9:54 AM, Keith Busch wrote:
> > > The iterative update to the previous version taking into account review
> > > comments.
> > > 
> > > Background:
> > > 
> > > The main objective is to remove the generic block layer's lock prefix
> > > currently required to transition a request to its completed state by
> > > shifting that expense to lower level drivers that actually need it,
> > > and removing the software layering violation that was required to use
> > > that mechanism.
> > 
> > Thanks Keith, added for 4.21.
> 
> Hi Jens & Keith,
> 
> This patchset seems to cause the following kernel panic; I don't see
> this issue with commit 4ab32bf3305e.

Is this the nvme target on top of null_blk?
Ming Lei Nov. 28, 2018, 10:07 a.m. UTC | #4
On Wed, Nov 28, 2018 at 08:00:10AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 10:20:01AM +0800, Ming Lei wrote:
> > On Mon, Nov 26, 2018 at 10:33:32AM -0700, Jens Axboe wrote:
> > > On 11/26/18 9:54 AM, Keith Busch wrote:
> > > > The iterative update to the previous version taking into account review
> > > > comments.
> > > > 
> > > > Background:
> > > > 
> > > > The main objective is to remove the generic block layer's lock prefix
> > > > currently required to transition a request to its completed state by
> > > > shifting that expense to lower level drivers that actually need it,
> > > > and removing the software layering violation that was required to use
> > > > that mechanism.
> > > 
> > > Thanks Keith, added for 4.21.
> > 
> > Hi Jens & Keith,
> > 
> > This patchset seems to cause the following kernel panic; I don't see
> > this issue with commit 4ab32bf3305e.
> 
> Is this the nvme target on top of null_blk?

Yes.

Thanks,
Ming
Christoph Hellwig Nov. 28, 2018, 10:08 a.m. UTC | #5
On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > Is this the nvme target on top of null_blk?
> 
> Yes.

And it goes away if you revert just the last patch?
Keith Busch Nov. 28, 2018, 3:49 p.m. UTC | #6
On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > > Is this the nvme target on top of null_blk?
> > 
> > Yes.
> 
> And it goes away if you revert just the last patch?

It looks like a problem existed before that last patch. Reverting it
helps only if the request happened to have not been reallocated. If it
had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
request out-of-order. If this were a real device, that'd probably result
in data corruption.
Jens Axboe Nov. 28, 2018, 3:58 p.m. UTC | #7
On 11/28/18 8:49 AM, Keith Busch wrote:
> On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
>> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
>>>> Is this the nvme target on top of null_blk?
>>>
>>> Yes.
>>
>> And it goes away if you revert just the last patch?
> 
> It looks like a problem existed before that last patch. Reverting it
> helps only if the request happened to have not been reallocated. If it
> had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
> request out-of-order. If this were a real device, that'd probably result
> in data corruption.

null_blk just needs updating for this.
Keith Busch Nov. 28, 2018, 4:26 p.m. UTC | #8
On Wed, Nov 28, 2018 at 08:58:00AM -0700, Jens Axboe wrote:
> On 11/28/18 8:49 AM, Keith Busch wrote:
> > On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> >> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> >>>> Is this the nvme target on top of null_blk?
> >>>
> >>> Yes.
> >>
> >> And it goes away if you revert just the last patch?
> > 
> > It looks like a problem existed before that last patch. Reverting it
> > helps only if the request happened to have not been reallocated. If it
> > had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
> > request out-of-order. If this were a real device, that'd probably result
> > in data corruption.
> 
> null_blk just needs updating for this.

Isn't this the nvme target's problem? It shouldn't complete requests
dispatched to its backing device, so I'm thinking something like the
following is what should happen. Untested at the moment, but will try
it out shortly.

---
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..116398b240e5 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
 	if (ctrl->ctrl.queue_count > 1) {
-		nvme_stop_queues(&ctrl->ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
-					nvme_cancel_request, &ctrl->ctrl);
+		/*
+		 * The back end device driver is responsible for completing all
+		 * entered requests
+		 */
+		nvme_start_freeze(&ctrl->ctrl);
+		nvme_wait_freeze(&ctrl->ctrl);
 		nvme_loop_destroy_io_queues(ctrl);
+		nvme_unfreeze(&ctrl->ctrl);
 	}
 
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
---
Jens Axboe Nov. 28, 2018, 4:31 p.m. UTC | #9
On 11/28/18 9:26 AM, Keith Busch wrote:
> On Wed, Nov 28, 2018 at 08:58:00AM -0700, Jens Axboe wrote:
>> On 11/28/18 8:49 AM, Keith Busch wrote:
>>> On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
>>>> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
>>>>>> Is this the nvme target on top of null_blk?
>>>>>
>>>>> Yes.
>>>>
>>>> And it goes away if you revert just the last patch?
>>>
>>> It looks like a problem existed before that last patch. Reverting it
>>> helps only if the request happened to have not been reallocated. If it
>>> had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
>>> request out-of-order. If this were a real device, that'd probably result
>>> in data corruption.
>>
>> null_blk just needs updating for this.
> 
> Isn't this the nvme target's problem? It shouldn't complete requests
> dispatched to its backing device, so I'm thinking something like the
> following is what should happen. Untested at the moment, but will try
> it out shortly.

I looked at null_blk after the fact, and my recollection of how
that timeout stuff worked was wrong. I think you are right.
Keith Busch Nov. 28, 2018, 5:56 p.m. UTC | #10
On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> ---
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 9908082b32c4..116398b240e5 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
>  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
>  {
>  	if (ctrl->ctrl.queue_count > 1) {
> -		nvme_stop_queues(&ctrl->ctrl);
> -		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> -					nvme_cancel_request, &ctrl->ctrl);
> +		/*
> +		 * The back end device driver is responsible for completing all
> +		 * entered requests
> +		 */
> +		nvme_start_freeze(&ctrl->ctrl);
> +		nvme_wait_freeze(&ctrl->ctrl);
>  		nvme_loop_destroy_io_queues(ctrl);
> +		nvme_unfreeze(&ctrl->ctrl);
>  	}
>  
>  	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> ---

The above tests fine with io and nvme resets on a target nvme loop
backed by null_blk, but I also couldn't recreate the original report
without the patch.

Ming,

Could you tell me a little more how you set it up? I'm just configuring
null_blk with queue_mode=2 irqmode=2. Anything else to recreate?

Thanks,
Keith
Keith Busch Nov. 28, 2018, 10:31 p.m. UTC | #11
On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> ---
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 9908082b32c4..116398b240e5 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
>  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
>  {
>  	if (ctrl->ctrl.queue_count > 1) {
> -		nvme_stop_queues(&ctrl->ctrl);
> -		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> -					nvme_cancel_request, &ctrl->ctrl);
> +		/*
> +		 * The back end device driver is responsible for completing all
> +		 * entered requests
> +		 */
> +		nvme_start_freeze(&ctrl->ctrl);
> +		nvme_wait_freeze(&ctrl->ctrl);
>  		nvme_loop_destroy_io_queues(ctrl);
> +		nvme_unfreeze(&ctrl->ctrl);
>  	}
>  
>  	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> ---

Grr, the above doesn't work so well when not using NVMe mpath because
nvmf_check_ready() will requeue it, leaving entered requests that won't
be started.

Waiting for a freeze isn't really the criterion we need anyway: we don't
care if there are entered requests in MQ_RQ_IDLE. We just want to wait
for dispatched ones to return, and we currently don't have a good way
to sync with that condition.
Keith Busch Nov. 28, 2018, 11:36 p.m. UTC | #12
On Wed, Nov 28, 2018 at 03:31:46PM -0700, Keith Busch wrote:
> Waiting for a freeze isn't really the criterion we need anyway: we don't
> care if there are entered requests in MQ_RQ_IDLE. We just want to wait
> for dispatched ones to return, and we currently don't have a good way
> to sync with that condition.

One thing making this weird is that blk_mq_request_started() returns true
for COMPLETED requests, and that makes no sense to me. Completed is
the opposite of started, so I'm not sure why we would return true for
such states. Is anyone actually depending on that?

If we can return true only for started commands, the following
implements the desired wait criteria:

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a82830f39933..d0ef540711c7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
+	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..ae50b6ed95fb 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/scatterlist.h>
 #include <linux/blk-mq.h>
+#include <linux/delay.h>
 #include <linux/nvme.h>
 #include <linux/module.h>
 #include <linux/parser.h>
@@ -425,12 +426,37 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 	return error;
 }
 
+bool nvme_count_active(struct request *req, void *data, bool reserved)
+{
+	unsigned int *active = data;
+
+	(*active)++;
+	return true;
+}
+
+/*
+ * It is the backing device driver's responsibility to ensure all dispatched
+ * requests are eventually completed.
+ */
+static void nvme_wait_for_stopped(struct nvme_loop_ctrl *ctrl)
+{
+	unsigned int active;
+
+	do {
+		active = 0;
+		blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_count_active,
+					&active);
+		if (!active)
+			return;
+		msleep(100);
+	} while (true);
+}
+
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
-					nvme_cancel_request, &ctrl->ctrl);
+		nvme_wait_for_stopped(ctrl);
 		nvme_loop_destroy_io_queues(ctrl);
 	}
 
--
Ming Lei Nov. 29, 2018, 1:18 a.m. UTC | #13
On Wed, Nov 28, 2018 at 10:56:25AM -0700, Keith Busch wrote:
> On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> > ---
> > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> > index 9908082b32c4..116398b240e5 100644
> > --- a/drivers/nvme/target/loop.c
> > +++ b/drivers/nvme/target/loop.c
> > @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
> >  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
> >  {
> >  	if (ctrl->ctrl.queue_count > 1) {
> > -		nvme_stop_queues(&ctrl->ctrl);
> > -		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> > -					nvme_cancel_request, &ctrl->ctrl);
> > +		/*
> > +		 * The back end device driver is responsible for completing all
> > +		 * entered requests
> > +		 */
> > +		nvme_start_freeze(&ctrl->ctrl);
> > +		nvme_wait_freeze(&ctrl->ctrl);
> >  		nvme_loop_destroy_io_queues(ctrl);
> > +		nvme_unfreeze(&ctrl->ctrl);
> >  	}
> >  
> >  	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> > ---
> 
> The above tests fine with io and nvme resets on a target nvme loop
> backed by null_blk, but I also couldn't recreate the original report
> without the patch.
> 
> Ming,
> 
> Could you tell me a little more how you set it up? I'm just configuring
> null_blk with queue_mode=2 irqmode=2. Anything else to recreate?

No parameters are needed for null_blk.

You may find all the scripts and .json here:

http://people.redhat.com/minlei/tests/tools/nvme/

And the test entry is nvme_sanity.

thanks,
Ming
Ming Lei Nov. 29, 2018, 2:15 a.m. UTC | #14
On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > > Is this the nvme target on top of null_blk?
> > 
> > Yes.
> 
> And it goes away if you revert just the last patch?

Today I ran this test again and can't reproduce it any more.

So maybe it isn't a new issue, and it was just triggered in yesterday's
for-4.21/block. :-)

Thanks,
Ming
Martin K. Petersen Nov. 29, 2018, 2:39 a.m. UTC | #15
Ming,

> On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
>> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
>> > > Is this the nvme target on top of null_blk?
>> > 
>> > Yes.
>> 
>> And it goes away if you revert just the last patch?
>
> Today I ran this test again and can't reproduce it any more.
>
> So maybe it isn't a new issue, and it was just triggered in yesterday's
> for-4.21/block. :-)

Can you reproduce with yesterday's tree then?

I'm obviously a bit concerned about merging something without knowing
the cause of the problems you were seeing.
Ming Lei Nov. 29, 2018, 8:12 a.m. UTC | #16
On Wed, Nov 28, 2018 at 09:39:44PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> >> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> >> > > Is this the nvme target on top of null_blk?
> >> > 
> >> > Yes.
> >> 
> >> And it goes away if you revert just the last patch?
> >
> > Today I ran this test again and can't reproduce it any more.
> >
> > So maybe it isn't a new issue, and it was just triggered in yesterday's
> > for-4.21/block. :-)
> 
> Can you reproduce with yesterday's tree then?
> 
> I'm obviously a bit concerned about merging something without knowing
> the cause of the problems you were seeing.

Hi Martin,

The panic was triggered by my automated daily test on yesterday's
for-4.21/block.

It isn't triggered in today's for-4.21/block, and it also isn't triggered
any more when I run the same test again on yesterday's
for-4.21/block (5f0ed774ed29).


Thanks,
Ming
Christoph Hellwig Nov. 29, 2018, 5:11 p.m. UTC | #17
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a82830f39933..d0ef540711c7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
>  
>  int blk_mq_request_started(struct request *rq)
>  {
> -	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
> +	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_request_started);

Independent of this series, this change looks like the right thing to do.
But this whole area is a mine field, so separate testing would be
very helpful.

I also wonder why we even bother with the above helper, a direct
state comparison seems a lot more obvious to the reader.

Last but not least the blk_mq_request_started check in nbd
should probably be lifted into blk_mq_tag_to_rq while we're at it..

As for the nvme issue - it seems to me like we need to decouple the
nvme loop frontend request from the target backend request.  In case
of a timeout/reset we'll complete struct request like all other nvme
transport drivers, but we leave the backend target state around, which
will be freed when it completes (or leaks when the completion is lost).
Keith Busch Nov. 29, 2018, 5:20 p.m. UTC | #18
On Thu, Nov 29, 2018 at 06:11:59PM +0100, Christoph Hellwig wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index a82830f39933..d0ef540711c7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
> >  
> >  int blk_mq_request_started(struct request *rq)
> >  {
> > -	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
> > +	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
> >  }
> >  EXPORT_SYMBOL_GPL(blk_mq_request_started);
> 
> Independent of this series, this change looks like the right thing to do.
> But this whole area is a mine field, so separate testing would be
> very helpful.
>
> I also wonder why we even bother with the above helper, a direct
> state comparison seems a lot more obvious to the reader.

I think it's just because blk_mq_rq_state() is a private interface. The
enum mq_rq_state is defined under include/linux/, so it looks okay to
make getting the state public too.
 
> Last but not least the blk_mq_request_started check in nbd
> should probably be lifted into blk_mq_tag_to_rq while we're at it..
> 
> As for the nvme issue - it seems to me like we need to decouple the
> nvme loop frontend request from the target backend request.  In case
> of a timeout/reset we'll complete struct request like all other nvme
> transport drivers, but we leave the backend target state around, which
> will be freed when it completes (or leaks when the completion is lost).

I don't think nvme's loop target should do anything to help a command
complete. It shouldn't even implement a timeout for the same reason
no stacking block driver implements these. If a request is stuck, the
lowest level is the only driver that should have the responsibility to
make it unstuck.
Hannes Reinecke Dec. 28, 2018, 12:47 p.m. UTC | #19
On 11/29/18 6:20 PM, Keith Busch wrote:
> On Thu, Nov 29, 2018 at 06:11:59PM +0100, Christoph Hellwig wrote:
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index a82830f39933..d0ef540711c7 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
>>>   
>>>   int blk_mq_request_started(struct request *rq)
>>>   {
>>> -	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
>>> +	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
>>>   }
>>>   EXPORT_SYMBOL_GPL(blk_mq_request_started);
>>
>> Independent of this series, this change looks like the right thing to do.
>> But this whole area is a mine field, so separate testing would be
>> very helpful.
>>
>> I also wonder why we even bother with the above helper, a direct
>> state comparison seems a lot more obvious to the reader.
> 
> I think it's just because blk_mq_rq_state() is a private interface. The
> enum mq_rq_state is defined under include/linux/, so it looks okay to
> make getting the state public too.
>   
>> Last but not least the blk_mq_request_started check in nbd
>> should probably be lifted into blk_mq_tag_to_rq while we're at it..
>>
>> As for the nvme issue - it seems to me like we need to decouple the
>> nvme loop frontend request from the target backend request.  In case
>> of a timeout/reset we'll complete struct request like all other nvme
>> transport drivers, but we leave the backend target state around, which
>> will be freed when it completes (or leaks when the completion is lost).
> 
> I don't think nvme's loop target should do anything to help a command
> complete. It shouldn't even implement a timeout for the same reason
> no stacking block driver implements these. If a request is stuck, the
> lowest level is the only driver that should have the responsibility to
> make it unstuck.
> 
Not quite.
You still need to be able to reset the controller, which you can't do if
you have to wait for the lowest level.

Cheers,

Hannes