Message ID | 20181126165430.4519-1-keith.busch@intel.com (mailing list archive) |
---|---|
Series | scsi timeout handling updates |
On 11/26/18 9:54 AM, Keith Busch wrote:
> The iterative update to the previous version taking into account review
> comments.
>
> Background:
>
> The main objective is to remove the generic block layer's lock prefix
> currently required to transition a request to its completed state by
> shifting that expense to lower level drivers that actually need it,
> and removing the software layering violation that was required to use
> that mechanism.

Thanks Keith, added for 4.21.
On Mon, Nov 26, 2018 at 10:33:32AM -0700, Jens Axboe wrote:
> On 11/26/18 9:54 AM, Keith Busch wrote:
> > The iterative update to the previous version taking into account review
> > comments.
> >
> > Background:
> >
> > The main objective is to remove the generic block layer's lock prefix
> > currently required to transition a request to its completed state by
> > shifting that expense to lower level drivers that actually need it,
> > and removing the software layering violation that was required to use
> > that mechanism.
>
> Thanks Keith, added for 4.21.

Hi Jens & Keith,

This patchset seems to cause the following kernel panic; I did not see
this issue with commit 4ab32bf3305e:

[ 1478.366776] nvmet: adding nsid 1 to subsystem testnqn
[ 1478.385104] nvmet: creating controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:8786b40e-ed54-4f9e-b1b0-6507377ffa97.
[ 1478.386975] nvme nvme1: creating 4 I/O queues.
[ 1478.388784] nvme nvme1: new ctrl: "testnqn"
[ 1479.396523] nvme io: runtime 40
[ 1575.259358] nvme reset: runtime 40
[ 1575.263920] nvme nvme0: resetting controller
[ 1575.288473] nvme nvme0: 4/0/0 read/write/poll queues
[ 1575.291531] nvme nvme1: resetting controller
[ 1575.298338] print_req_error: I/O error, dev nvme1n1, sector 524287272
[ 1575.299169] print_req_error: I/O error, dev nvme1n1, sector 524287328
[ 1575.299950] BUG: unable to handle kernel NULL pointer dereference at 0000000000000198
[ 1575.299951] PGD 0 P4D 0
[ 1575.299954] Oops: 0002 [#1] PREEMPT SMP PTI
[ 1575.299972] CPU: 0 PID: 12014 Comm: kworker/u8:0 Not tainted 4.20.0-rc3_5f0ed774ed29_for-next+ #1
[ 1575.299973] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 1575.299978] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
[ 1575.299985] RIP: 0010:blk_mq_free_request+0x7e/0xe4
[ 1575.299987] Code: 00 00 00 00 00 8b 53 18 b8 01 00 00 00 84 d2 74 0b 31 c0 81 e2 00 08 06 00 0f 95 c0 49 ff 84 c5 80 00 00 00 f6 43 1c 40 74 09 <f0> 41 ff 8c 24 98 01 00 00 83 3d 0b c6 52 01 00 74 15 0f b6 43 18
[ 1575.299988] RSP: 0018:ffff888277a03e48 EFLAGS: 00010002
[ 1575.299989] RAX: 0000000000000001 RBX: ffff88823aa04e00 RCX: ffff888265e71dc0
[ 1575.299990] RDX: 0000000000080700 RSI: ffff888265e71c00 RDI: ffff88821d822808
[ 1575.299991] RBP: ffff888267a3d750 R08: 00000000ffffffff R09: ffff888265e71dd0
[ 1575.299992] R10: ffff888265e71dd0 R11: 0000000000000020 R12: 0000000000000000
[ 1575.299993] R13: ffffe8ffffc0be40 R14: 0000000000000000 R15: 0000000000006000
[ 1575.299994] FS: 0000000000000000(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
[ 1575.299995] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1575.299996] CR2: 0000000000000198 CR3: 0000000236de8005 CR4: 0000000000760ef0
[ 1575.300000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1575.300001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1575.300001] PKRU: 55555554
[ 1575.300002] Call Trace:
[ 1575.300012]  <IRQ>
[ 1575.300016]  blk_mq_complete_request+0xda/0xfd
[ 1575.300021]  nvmet_req_complete+0x11/0x5e [nvmet]
[ 1575.300025]  nvmet_bio_done+0x2b/0x3d [nvmet]
[ 1575.300027]  blk_update_request+0x177/0x27b
[ 1575.300032]  ? null_complete_rq+0x11/0x11 [null_blk]
[ 1575.300034]  blk_mq_end_request+0x1a/0xc8
[ 1575.300036]  null_cmd_timer_expired+0xe/0x11 [null_blk]
[ 1575.300040]  __hrtimer_run_queues+0x176/0x238
[ 1575.300043]  ? kvm_clock_read+0x14/0x23
[ 1575.300045]  hrtimer_interrupt+0x103/0x21c
[ 1575.300049]  smp_apic_timer_interrupt+0xd3/0x141
[ 1575.300051]  apic_timer_interrupt+0xf/0x20
[ 1575.300053]  </IRQ>
[ 1575.300055] RIP: 0010:console_unlock+0x3a8/0x45f
[ 1575.300057] Code: 59 00 8a 1d bc 0e 5d 01 48 c7 c7 f0 0c 6a 82 4c 89 2d b6 0e 5d 01 e8 fe c0 59 00 84 db 75 15 e8 9f 1c 00 00 48 8b 3c 24 57 9d <0f> 1f 44 00 00 e9 aa fc ff ff c6 05 89 0e 5d 01 00 e8 83 1c 00 00
[ 1575.300057] RSP: 0018:ffffc90003a3fc20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 1575.300059] RAX: 0000000080000002 RBX: 0000000000000000 RCX: 0000000000000000
[ 1575.300060] RDX: 0000000000000001 RSI: 0000000000000046 RDI: 0000000000000246
[ 1575.300060] RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
[ 1575.300061] R10: 0000000005f5e100 R11: ffffffff826a00e7 R12: ffffffff821e7460
[ 1575.300062] R13: 0000000000000000 R14: ffffffff826a00e0 R15: 0000000000000000
[ 1575.300067]  vprintk_emit+0x1ff/0x236
[ 1575.300069]  printk+0x52/0x6e
[ 1575.300071]  blk_update_request+0x10a/0x27b
[ 1575.300074]  blk_mq_end_request+0x1a/0xc8
[ 1575.300076]  blk_mq_complete_request+0xda/0xfd
[ 1575.300080]  nvme_cancel_request+0x5b/0x60
[ 1575.300082]  bt_tags_for_each+0xe2/0x10c
[ 1575.300085]  ? nvme_complete_rq+0x162/0x162
[ 1575.300087]  ? nvme_complete_rq+0x162/0x162
[ 1575.300089]  blk_mq_tagset_busy_iter+0x68/0x75
[ 1575.300091]  nvme_loop_shutdown_ctrl+0x38/0x86 [nvme_loop]
[ 1575.300094]  nvme_loop_reset_ctrl_work+0x2a/0xcf [nvme_loop]
[ 1575.300098]  process_one_work+0x1da/0x313
[ 1575.300101]  ? rescuer_thread+0x282/0x282
[ 1575.300103]  process_scheduled_works+0x27/0x2c
[ 1575.300105]  worker_thread+0x1e7/0x295
[ 1575.300107]  kthread+0x115/0x11d
[ 1575.300109]  ? kthread_park+0x76/0x76
[ 1575.300111]  ret_from_fork+0x35/0x40
[ 1575.300113] Modules linked in: nvme_loop nvmet nvme_fabrics null_blk scsi_debug xfs libcrc32c isofs iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core lpc_ich mfd_core ip_tables sr_mod cdrom usb_storage sd_mod ahci libahci libata crc32c_intel virtio_scsi qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod [last unloaded: null_blk]
[ 1575.300127] Dumping ftrace buffer:
[ 1575.300129]    (ftrace buffer empty)
[ 1575.300130] CR2: 0000000000000198
[ 1575.300133] ---[ end trace f941476562612d99 ]---

thanks,
Ming
On Wed, Nov 28, 2018 at 10:20:01AM +0800, Ming Lei wrote:
> On Mon, Nov 26, 2018 at 10:33:32AM -0700, Jens Axboe wrote:
> > On 11/26/18 9:54 AM, Keith Busch wrote:
> > > The iterative update to the previous version taking into account review
> > > comments.
> > >
> > > Background:
> > >
> > > The main objective is to remove the generic block layer's lock prefix
> > > currently required to transition a request to its completed state by
> > > shifting that expense to lower level drivers that actually need it,
> > > and removing the software layering violation that was required to use
> > > that mechanism.
> >
> > Thanks Keith, added for 4.21.
>
> Hi Jens & Keith,
>
> This patchset seems to cause the following kernel panic; I did not see
> this issue with commit 4ab32bf3305e:

Is this the nvme target on top of null_blk?
On Wed, Nov 28, 2018 at 08:00:10AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 10:20:01AM +0800, Ming Lei wrote:
> > On Mon, Nov 26, 2018 at 10:33:32AM -0700, Jens Axboe wrote:
> > > On 11/26/18 9:54 AM, Keith Busch wrote:
> > > > The iterative update to the previous version taking into account review
> > > > comments.
> > > >
> > > > Background:
> > > >
> > > > The main objective is to remove the generic block layer's lock prefix
> > > > currently required to transition a request to its completed state by
> > > > shifting that expense to lower level drivers that actually need it,
> > > > and removing the software layering violation that was required to use
> > > > that mechanism.
> > >
> > > Thanks Keith, added for 4.21.
> >
> > Hi Jens & Keith,
> >
> > This patchset seems to cause the following kernel panic; I did not see
> > this issue with commit 4ab32bf3305e:
>
> Is this the nvme target on top of null_blk?

Yes.

Thanks,
Ming
On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > Is this the nvme target on top of null_blk?
>
> Yes.

And it goes away if you revert just the last patch?
On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > > Is this the nvme target on top of null_blk?
> >
> > Yes.
>
> And it goes away if you revert just the last patch?

It looks like a problem existed before that last patch. Reverting it
helps only if the request happened to have not been reallocated. If it
had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
request out-of-order. If this were a real device, that'd probably result
in data corruption.
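To make that failure mode concrete, here is a hypothetical sketch of a
timer-driven completion path; struct fake_cmd and fake_irq_expired() are
invented for illustration and this is not the actual null_blk code:

---
#include <linux/blk-mq.h>
#include <linux/hrtimer.h>

struct fake_cmd {
	struct request *rq;	/* pointer captured when the request was dispatched */
	struct hrtimer timer;	/* emulated completion interrupt */
};

static enum hrtimer_restart fake_irq_expired(struct hrtimer *t)
{
	struct fake_cmd *cmd = container_of(t, struct fake_cmd, timer);

	/*
	 * Nothing here checks whether cmd->rq is still the request this
	 * timer was armed for.  If the upper layer already completed it
	 * (e.g. nvme_cancel_request() during a controller reset) and the
	 * tag was reallocated, this call now completes a different, live
	 * request out of order.
	 */
	blk_mq_end_request(cmd->rq, BLK_STS_OK);
	return HRTIMER_NORESTART;
}
---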
On 11/28/18 8:49 AM, Keith Busch wrote:
> On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
>> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
>>>> Is this the nvme target on top of null_blk?
>>>
>>> Yes.
>>
>> And it goes away if you revert just the last patch?
>
> It looks like a problem existed before that last patch. Reverting it
> helps only if the request happened to have not been reallocated. If it
> had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
> request out-of-order. If this were a real device, that'd probably result
> in data corruption.

null_blk just needs updating for this.
On Wed, Nov 28, 2018 at 08:58:00AM -0700, Jens Axboe wrote:
> On 11/28/18 8:49 AM, Keith Busch wrote:
> > On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> >> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> >>>> Is this the nvme target on top of null_blk?
> >>>
> >>> Yes.
> >>
> >> And it goes away if you revert just the last patch?
> >
> > It looks like a problem existed before that last patch. Reverting it
> > helps only if the request happened to have not been reallocated. If it
> > had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
> > request out-of-order. If this were a real device, that'd probably result
> > in data corruption.
>
> null_blk just needs updating for this.

Isn't this the nvme target's problem? It shouldn't complete requests
dispatched to its backing device, so I'm thinking something like the
following is what should happen. Untested at the moment, but will try
it out shortly.

---
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..116398b240e5 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
 	if (ctrl->ctrl.queue_count > 1) {
-		nvme_stop_queues(&ctrl->ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
-					nvme_cancel_request, &ctrl->ctrl);
+		/*
+		 * The back end device driver is responsible for completing all
+		 * entered requests
+		 */
+		nvme_start_freeze(&ctrl->ctrl);
+		nvme_wait_freeze(&ctrl->ctrl);
 		nvme_loop_destroy_io_queues(ctrl);
+		nvme_unfreeze(&ctrl->ctrl);
 	}

 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
---
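For readers unfamiliar with the freeze interface used above, here is a rough
sketch of what the two calls rely on, simplified from drivers/nvme/host/core.c
and block/blk-mq.c of this era (locking of the namespaces list omitted):
freezing stops new requests from entering a queue, and the wait blocks until
every request that already entered it has completed, not only the ones the
driver has dispatched.

---
#include <linux/blk-mq.h>
#include "nvme.h"	/* struct nvme_ctrl / struct nvme_ns (driver-internal) */

/* Simplified sketch, not the literal upstream implementation. */
static void sketch_start_freeze(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	/* Stop new requests from entering each namespace queue. */
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_freeze_queue_start(ns->queue);
}

static void sketch_wait_freeze(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	/*
	 * Block until each queue's q_usage_counter drains, i.e. until every
	 * entered request has completed, whether or not it was ever handed
	 * to the driver.
	 */
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_freeze_queue_wait(ns->queue);
}
---

That distinction between entered and dispatched requests is exactly what the
follow-up message below runs into.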
On 11/28/18 9:26 AM, Keith Busch wrote:
> On Wed, Nov 28, 2018 at 08:58:00AM -0700, Jens Axboe wrote:
>> On 11/28/18 8:49 AM, Keith Busch wrote:
>>> On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
>>>> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
>>>>>> Is this the nvme target on top of null_blk?
>>>>>
>>>>> Yes.
>>>>
>>>> And it goes away if you revert just the last patch?
>>>
>>> It looks like a problem existed before that last patch. Reverting it
>>> helps only if the request happened to have not been reallocated. If it
>>> had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
>>> request out-of-order. If this were a real device, that'd probably result
>>> in data corruption.
>>
>> null_blk just needs updating for this.
>
> Isn't this the nvme target's problem? It shouldn't complete requests
> dispatched to its backing device, so I'm thinking something like the
> following is what should happen. Untested at the moment, but will try
> it out shortly.

I looked at null_blk after the fact, and my recollection of how that
timeout stuff worked was wrong. I think you are right.
On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> ---
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 9908082b32c4..116398b240e5 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
>  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
>  {
>  	if (ctrl->ctrl.queue_count > 1) {
> -		nvme_stop_queues(&ctrl->ctrl);
> -		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> -					nvme_cancel_request, &ctrl->ctrl);
> +		/*
> +		 * The back end device driver is responsible for completing all
> +		 * entered requests
> +		 */
> +		nvme_start_freeze(&ctrl->ctrl);
> +		nvme_wait_freeze(&ctrl->ctrl);
>  		nvme_loop_destroy_io_queues(ctrl);
> +		nvme_unfreeze(&ctrl->ctrl);
>  	}
>
>  	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> ---

The above tests fine with io and nvme resets on a target nvme loop
backed by null_blk, but I also couldn't recreate the original report
without the patch.

Ming,

Could you tell me a little more how you set it up? I'm just configuring
null_blk with queue_mode=2 irqmode=2. Anything else to recreate?

Thanks,
Keith
On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> ---
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 9908082b32c4..116398b240e5 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
>  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
>  {
>  	if (ctrl->ctrl.queue_count > 1) {
> -		nvme_stop_queues(&ctrl->ctrl);
> -		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> -					nvme_cancel_request, &ctrl->ctrl);
> +		/*
> +		 * The back end device driver is responsible for completing all
> +		 * entered requests
> +		 */
> +		nvme_start_freeze(&ctrl->ctrl);
> +		nvme_wait_freeze(&ctrl->ctrl);
>  		nvme_loop_destroy_io_queues(ctrl);
> +		nvme_unfreeze(&ctrl->ctrl);
>  	}
>
>  	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> ---

Grr, the above doesn't work so well when not using NVMe mpath because
nvmf_check_ready() will requeue it, leaving entered requests that won't
be started. Waiting for a freeze isn't really the criterion we need
anyway: we don't care if there are entered requests in MQ_RQ_IDLE. We
just want to wait for dispatched ones to return, and we currently don't
have a good way to sync with that condition.
On Wed, Nov 28, 2018 at 03:31:46PM -0700, Keith Busch wrote:
> Waiting for a freeze isn't really the criterion we need anyway: we don't
> care if there are entered requests in MQ_RQ_IDLE. We just want to wait
> for dispatched ones to return, and we currently don't have a good way
> to sync with that condition.

One thing making this weird is that blk_mq_request_started() returns true
for COMPLETED requests, and that makes no sense to me. Completed is the
opposite of started, so I'm not sure why we would return true for such
states. Is anyone actually depending on that?

If we can return true only for started commands, the following implements
the desired wait criterion:

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a82830f39933..d0ef540711c7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);

 int blk_mq_request_started(struct request *rq)
 {
-	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
+	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..ae50b6ed95fb 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/scatterlist.h>
 #include <linux/blk-mq.h>
+#include <linux/delay.h>
 #include <linux/nvme.h>
 #include <linux/module.h>
 #include <linux/parser.h>
@@ -425,12 +426,37 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 	return error;
 }

+bool nvme_count_active(struct request *req, void *data, bool reserved)
+{
+	unsigned int *active = data;
+
+	(*active)++;
+	return true;
+}
+
+/*
+ * It is the backing device driver's responsibility to ensure all dispatched
+ * requests are eventually completed.
+ */
+static void nvme_wait_for_stopped(struct nvme_loop_ctrl *ctrl)
+{
+	unsigned int active;
+
+	do {
+		active = 0;
+		blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_count_active,
+					&active);
+		if (!active)
+			return;
+		msleep(100);
+	} while (true);
+}
+
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
-					nvme_cancel_request, &ctrl->ctrl);
+		nvme_wait_for_stopped(ctrl);
 		nvme_loop_destroy_io_queues(ctrl);
 	}

--
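A detail that ties the two hunks together (my reading of block/blk-mq-tag.c at
the time, not something the patch spells out): blk_mq_tagset_busy_iter() only
invokes its callback for requests that blk_mq_request_started() reports as
started, so narrowing that helper to MQ_RQ_IN_FLIGHT is what makes
nvme_wait_for_stopped() count only dispatched-and-not-yet-completed requests.
Roughly:

---
/*
 * Simplified sketch of the per-request filter applied by the tag iterator;
 * the real logic lives in bt_tags_iter() in block/blk-mq-tag.c.
 */
static bool sketch_tags_iter_one(struct request *rq, busy_tag_iter_fn *fn,
				 void *data, bool reserved)
{
	/*
	 * With the blk-mq hunk above, "started" now means MQ_RQ_IN_FLIGHT
	 * only, so idle and already-completed requests are skipped and the
	 * polling loop exits once the backing driver has completed
	 * everything it was given.
	 */
	if (rq && blk_mq_request_started(rq))
		return fn(rq, data, reserved);
	return true;
}
---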
On Wed, Nov 28, 2018 at 10:56:25AM -0700, Keith Busch wrote:
> On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> > ---
> > diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> > index 9908082b32c4..116398b240e5 100644
> > --- a/drivers/nvme/target/loop.c
> > +++ b/drivers/nvme/target/loop.c
> > @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
> >  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
> >  {
> >  	if (ctrl->ctrl.queue_count > 1) {
> > -		nvme_stop_queues(&ctrl->ctrl);
> > -		blk_mq_tagset_busy_iter(&ctrl->tag_set,
> > -					nvme_cancel_request, &ctrl->ctrl);
> > +		/*
> > +		 * The back end device driver is responsible for completing all
> > +		 * entered requests
> > +		 */
> > +		nvme_start_freeze(&ctrl->ctrl);
> > +		nvme_wait_freeze(&ctrl->ctrl);
> >  		nvme_loop_destroy_io_queues(ctrl);
> > +		nvme_unfreeze(&ctrl->ctrl);
> >  	}
> >
> >  	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> > ---
>
> The above tests fine with io and nvme resets on a target nvme loop
> backed by null_blk, but I also couldn't recreate the original report
> without the patch.
>
> Ming,
>
> Could you tell me a little more how you set it up? I'm just configuring
> null_blk with queue_mode=2 irqmode=2. Anything else to recreate?

No parameters are needed for null_blk. You may find all the scripts and
.json files here:

http://people.redhat.com/minlei/tests/tools/nvme/

The test entry is nvme_sanity.

thanks,
Ming
On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > > Is this the nvme target on top of null_blk?
> >
> > Yes.
>
> And it goes away if you revert just the last patch?

Today I ran this test again and can't reproduce it any more.

So maybe it is not a new issue, and it just happened to trigger on
yesterday's for-4.21/block, :-)

Thanks,
Ming
Ming,

> On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
>> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
>> > > Is this the nvme target on top of null_blk?
>> >
>> > Yes.
>>
>> And it goes away if you revert just the last patch?
>
> Today I ran this test again and can't reproduce it any more.
>
> So maybe it is not a new issue, and it just happened to trigger on
> yesterday's for-4.21/block, :-)

Can you reproduce with yesterday's tree then?

I'm obviously a bit concerned about merging something without knowing
the cause of the problems you were seeing.
On Wed, Nov 28, 2018 at 09:39:44PM -0500, Martin K. Petersen wrote:
>
> Ming,
>
> > On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> >> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> >> > > Is this the nvme target on top of null_blk?
> >> >
> >> > Yes.
> >>
> >> And it goes away if you revert just the last patch?
> >
> > Today I ran this test again and can't reproduce it any more.
> >
> > So maybe it is not a new issue, and it just happened to trigger on
> > yesterday's for-4.21/block, :-)
>
> Can you reproduce with yesterday's tree then?
>
> I'm obviously a bit concerned about merging something without knowing
> the cause of the problems you were seeing.

Hi Martin,

The panic was triggered by my automated daily test on yesterday's
for-4.21/block. It isn't triggered in today's for-4.21/block, and it
also isn't triggered any more when I run the same test again on
yesterday's for-4.21/block (5f0ed774ed29).

Thanks,
Ming
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a82830f39933..d0ef540711c7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
>
>  int blk_mq_request_started(struct request *rq)
>  {
> -	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
> +	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_request_started);

Independent of this series, this change looks like the right thing to do.
But this whole area is a mine field, so separate testing would be
very helpful.

I also wonder why we even bother with the above helper; a direct
state comparison seems a lot more obvious to the reader.

Last but not least, the blk_mq_request_started check in nbd
should probably be lifted into blk_mq_tag_to_rq while we're at it.

As for the nvme issue - it seems to me like we need to decouple the
nvme loop frontend request from the target backend request. In case
of a timeout/reset we'll complete struct request like all other nvme
transport drivers, but we leave the backend target state around, which
will be freed when it completes (or leaks when the completion is lost).
On Thu, Nov 29, 2018 at 06:11:59PM +0100, Christoph Hellwig wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index a82830f39933..d0ef540711c7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
> >
> >  int blk_mq_request_started(struct request *rq)
> >  {
> > -	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
> > +	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
> >  }
> >  EXPORT_SYMBOL_GPL(blk_mq_request_started);
>
> Independent of this series, this change looks like the right thing to do.
> But this whole area is a mine field, so separate testing would be
> very helpful.
>
> I also wonder why we even bother with the above helper; a direct
> state comparison seems a lot more obvious to the reader.

I think it's just because blk_mq_rq_state() is a private interface. The
enum mq_rq_state is defined under include/linux/, so it looks okay to
make getting the state public too.

> Last but not least, the blk_mq_request_started check in nbd
> should probably be lifted into blk_mq_tag_to_rq while we're at it.
>
> As for the nvme issue - it seems to me like we need to decouple the
> nvme loop frontend request from the target backend request. In case
> of a timeout/reset we'll complete struct request like all other nvme
> transport drivers, but we leave the backend target state around, which
> will be freed when it completes (or leaks when the completion is lost).

I don't think nvme's loop target should do anything to help a command
complete. It shouldn't even implement a timeout for the same reason no
stacking block driver implements these. If a request is stuck, the
lowest level is the only driver that should have the responsibility to
make it unstuck.
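For reference, the states in question and what a public accessor might look
like. The enum below matches include/linux/blk-mq.h at the time of this
thread; the helper mirrors the private one in block/blk-mq.h, so exporting it
(or an equivalent) is the hypothetical change being discussed here, not an
existing interface:

---
/* Already visible to drivers via include/linux/blk-mq.h. */
enum mq_rq_state {
	MQ_RQ_IDLE		= 0,
	MQ_RQ_IN_FLIGHT		= 1,
	MQ_RQ_COMPLETE		= 2,
};

/*
 * Hypothetical public accessor; today the equivalent helper lives in the
 * private block/blk-mq.h header.
 */
static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
{
	return READ_ONCE(rq->state);
}
---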
On 11/29/18 6:20 PM, Keith Busch wrote:
> On Thu, Nov 29, 2018 at 06:11:59PM +0100, Christoph Hellwig wrote:
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index a82830f39933..d0ef540711c7 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
>>>
>>>  int blk_mq_request_started(struct request *rq)
>>>  {
>>> -	return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
>>> +	return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
>>>  }
>>>  EXPORT_SYMBOL_GPL(blk_mq_request_started);
>>
>> Independent of this series, this change looks like the right thing to do.
>> But this whole area is a mine field, so separate testing would be
>> very helpful.
>>
>> I also wonder why we even bother with the above helper; a direct
>> state comparison seems a lot more obvious to the reader.
>
> I think it's just because blk_mq_rq_state() is a private interface. The
> enum mq_rq_state is defined under include/linux/, so it looks okay to
> make getting the state public too.
>
>> Last but not least, the blk_mq_request_started check in nbd
>> should probably be lifted into blk_mq_tag_to_rq while we're at it.
>>
>> As for the nvme issue - it seems to me like we need to decouple the
>> nvme loop frontend request from the target backend request. In case
>> of a timeout/reset we'll complete struct request like all other nvme
>> transport drivers, but we leave the backend target state around, which
>> will be freed when it completes (or leaks when the completion is lost).
>
> I don't think nvme's loop target should do anything to help a command
> complete. It shouldn't even implement a timeout for the same reason no
> stacking block driver implements these. If a request is stuck, the
> lowest level is the only driver that should have the responsibility to
> make it unstuck.
>
Not quite. You still need to be able to reset the controller, which you
can't do if you have to wait for the lowest level.

Cheers,
Hannes