Message ID | 20160720142727.GA57399@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
On 07/20/2016 07:27 AM, Mike Snitzer wrote:
> On Wed, Jul 20 2016 at 10:08am -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> [ ... ]
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 7a96618..347ff25 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -414,7 +414,7 @@ static void dm_complete_request(struct request *rq, int error)
>  	if (!rq->q->mq_ops)
>  		blk_complete_request(rq);
>  	else
> -		blk_mq_complete_request(rq, error);
> +		blk_mq_complete_request(rq, 0);
>  }
>
>  /*

Hello Mike,

Thanks for the patch. But unfortunately even with this patch applied I
see fio reporting I/O errors when run on top of dm-mq after a
(simulated) cable pull. I think these errors occur because
dm_softirq_done() fails requests if clone == NULL and tio->error != 0.
Adding WARN_ON_ONCE(error && !tio->clone) in dm_complete_request()
produced the following call stack:

Workqueue: kblockd blk_mq_run_work_fn
Call Trace:
 [<ffffffff8132aa27>] dump_stack+0x68/0xa1
 [<ffffffff81061ed6>] __warn+0xc6/0xe0
 [<ffffffff81061fa8>] warn_slowpath_null+0x18/0x20
 [<ffffffffa0253fda>] dm_complete_request+0x8a/0xb0 [dm_mod]
 [<ffffffffa0255200>] map_request+0x70/0x2e0 [dm_mod]
 [<ffffffffa025771d>] dm_mq_queue_rq+0x7d/0x120 [dm_mod]
 [<ffffffff8131195a>] __blk_mq_run_hw_queue+0x1da/0x350
 [<ffffffff813120a0>] blk_mq_run_work_fn+0x10/0x20
 [<ffffffff8107f279>] process_one_work+0x1f9/0x6a0
 [<ffffffff8107f769>] worker_thread+0x49/0x490
 [<ffffffff81085f7a>] kthread+0xea/0x100
 [<ffffffff8162faff>] ret_from_fork+0x1f/0x40

Please let me know if you need more information.

Thanks,

Bart.
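For context, a minimal sketch of where the WARN_ON_ONCE() described above
would sit in dm_complete_request() with the patch applied. Only the final
if/else comes from the quoted hunk; the tio lookup and error assignment are
assumptions about the surrounding dm-rq.c code, not an exact copy:

/*
 * Sketch only: dm_complete_request() with the debug check Bart describes.
 * The tio lookup and tio->error assignment are assumed context; only the
 * final if/else is taken from the quoted hunk (with the patch applied).
 */
static void dm_complete_request(struct request *rq, int error)
{
	struct dm_rq_target_io *tio = tio_from_request(rq);	/* assumed helper */

	/*
	 * Fires when a request that was never cloned (tio->clone == NULL)
	 * is completed with an error -- the case reported above.
	 */
	WARN_ON_ONCE(error && !tio->clone);

	tio->error = error;
	if (!rq->q->mq_ops)
		blk_complete_request(rq);
	else
		blk_mq_complete_request(rq, 0);
}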
On Wed, Jul 20 2016 at  1:37pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 07/20/2016 07:27 AM, Mike Snitzer wrote:
> >On Wed, Jul 20 2016 at 10:08am -0400,
> >Mike Snitzer <snitzer@redhat.com> wrote:
> >[ ... ]
> >diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> >index 7a96618..347ff25 100644
> >--- a/drivers/md/dm-rq.c
> >+++ b/drivers/md/dm-rq.c
> >@@ -414,7 +414,7 @@ static void dm_complete_request(struct request *rq, int error)
> > 	if (!rq->q->mq_ops)
> > 		blk_complete_request(rq);
> > 	else
> >-		blk_mq_complete_request(rq, error);
> >+		blk_mq_complete_request(rq, 0);
> > }
> >
> > /*
>
> Hello Mike,
>
> Thanks for the patch. But unfortunately even with this patch applied I
> see fio reporting I/O errors when run on top of dm-mq after a
> (simulated) cable pull. I think these errors occur because
> dm_softirq_done() fails requests if clone == NULL and tio->error != 0.
> Adding WARN_ON_ONCE(error && !tio->clone) in dm_complete_request()
> produced the following call stack:
>
> Workqueue: kblockd blk_mq_run_work_fn
> Call Trace:
>  [<ffffffff8132aa27>] dump_stack+0x68/0xa1
>  [<ffffffff81061ed6>] __warn+0xc6/0xe0
>  [<ffffffff81061fa8>] warn_slowpath_null+0x18/0x20
>  [<ffffffffa0253fda>] dm_complete_request+0x8a/0xb0 [dm_mod]
>  [<ffffffffa0255200>] map_request+0x70/0x2e0 [dm_mod]
>  [<ffffffffa025771d>] dm_mq_queue_rq+0x7d/0x120 [dm_mod]
>  [<ffffffff8131195a>] __blk_mq_run_hw_queue+0x1da/0x350
>  [<ffffffff813120a0>] blk_mq_run_work_fn+0x10/0x20
>  [<ffffffff8107f279>] process_one_work+0x1f9/0x6a0
>  [<ffffffff8107f769>] worker_thread+0x49/0x490
>  [<ffffffff81085f7a>] kthread+0xea/0x100
>  [<ffffffff8162faff>] ret_from_fork+0x1f/0x40
>
> Please let me know if you need more information.

Cable pull testing isn't new. I'd have thought it was covered (albeit
simulated) by mptest. Does mptest pass for you?
https://github.com/snitm/mptest

Last I knew your tests were all passing with dm-mq. Would love to
understand how/when they have regressed.

If clone is NULL then the request wasn't actually issued to (or even
allocated from the tag space of) the underlying blk-mq request_queue
(via clone_and_map_rq's call to __multipath_map).

The original request could've been previously cloned, failed due to the
cable pull, and begun a retry via requeue, but on retry
blk_mq_alloc_request() could've failed in __multipath_map() (maybe the
request_queue was "dying" by then?). Or __multipath_map() failed for
some other reason.

Would be interesting to know the error returned from map_request()'s
ti->type->clone_and_map_rq(). Really should just be DM_MAPIO_REQUEUE.
But the stack you've provided shows map_request calling
dm_complete_request(), which implies dm_kill_unmapped_request() is being
called due to ti->type->clone_and_map_rq() returning < 0.

Could be that __multipath_map() is returning an error even before it
would've otherwise called blk_mq_alloc_request(). You said you are
using queue_if_no_path? Would be interesting to see where
__multipath_map() is returning with an error.

Could be that __multipath_map() should respond differently if
blk_mq_alloc_request() fails and the queue is dying -- at a minimum the
path should be failed. But before we explore that we need to understand
why clone_and_map_rq() is returning < 0.
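To make the code path Mike describes easier to follow, here is a condensed
sketch of the map_request() dispatch logic in question. The function and
constant names follow dm-rq.c of that era, but the sketch is simplified
(DM_MAPIO_SUBMITTED handling omitted) and is not verbatim kernel code:

/*
 * Condensed sketch (not verbatim dm-rq.c): how a negative return from
 * clone_and_map_rq() bypasses the requeue path and completes the original
 * request with that error via dm_kill_unmapped_request().
 */
static void map_request_sketch(struct dm_rq_target_io *tio, struct request *rq)
{
	struct dm_target *ti = tio->ti;
	struct request *clone = NULL;
	int r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);

	switch (r) {
	case DM_MAPIO_REMAPPED:
		/* A clone was allocated from the underlying queue; issue it. */
		dm_dispatch_clone_request(clone, rq);
		break;
	case DM_MAPIO_REQUEUE:
		/* No clone; push the original request back for a retry. */
		dm_requeue_original_request(tio->md, rq);
		break;
	default:
		/*
		 * r < 0: no clone either.  Kill the unmapped request, which
		 * ends in dm_complete_request(rq, r) -- matching the stack
		 * trace in Bart's report above.
		 */
		dm_kill_unmapped_request(rq, r);
		break;
	}
}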
On Wed, Jul 20 2016 at  1:37pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 07/20/2016 07:27 AM, Mike Snitzer wrote:
> >On Wed, Jul 20 2016 at 10:08am -0400,
> >Mike Snitzer <snitzer@redhat.com> wrote:
> >[ ... ]
> >diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> >index 7a96618..347ff25 100644
> >--- a/drivers/md/dm-rq.c
> >+++ b/drivers/md/dm-rq.c
> >@@ -414,7 +414,7 @@ static void dm_complete_request(struct request *rq, int error)
> > 	if (!rq->q->mq_ops)
> > 		blk_complete_request(rq);
> > 	else
> >-		blk_mq_complete_request(rq, error);
> >+		blk_mq_complete_request(rq, 0);
> > }
> >
> > /*
>
> Hello Mike,
>
> Thanks for the patch. But unfortunately even with this patch applied I
> see fio reporting I/O errors when run on top of dm-mq after a
> (simulated) cable pull.

BTW, curious, how is it that you're simulating the cable pull? I'd like
to see if doing the same, via a new test in mptest, exposes this issue
you're hitting.

Mike
On 07/21/2016 01:32 PM, Mike Snitzer wrote:
> BTW, curious, how is it that you're simulating the cable pull? I'd like
> to see if doing the same, via a new test in mptest, exposes this issue
> you're hitting.

Hello Mike,

Writing into the /sys/class/srp_remote_ports/port-*/delete sysfs
attribute causes the SRP initiator driver to call scsi_remove_host(),
which is also what happens when the dev_loss_tmo timer expires.

Bart.
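A minimal user-space sketch of the trigger Bart describes. The port name
"port-1:0" is a placeholder; real names depend on the system and can be
listed under /sys/class/srp_remote_ports/:

/*
 * Sketch only: write to an SRP remote port's "delete" attribute, which
 * makes the SRP initiator call scsi_remove_host() -- the "simulated cable
 * pull" described above.  "port-1:0" is a hypothetical port name.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *attr = "/sys/class/srp_remote_ports/port-1:0/delete";
	int fd = open(attr, O_WRONLY);

	if (fd < 0) {
		perror(attr);
		return 1;
	}
	/* A write (e.g. "1") triggers the removal. */
	if (write(fd, "1", 1) < 0)
		perror("write");
	close(fd);
	return 0;
}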
On 07/20/2016 11:33 AM, Mike Snitzer wrote:
> Would be interesting to know the error returned from map_request()'s
> ti->type->clone_and_map_rq(). Really should just be DM_MAPIO_REQUEUE.
> But the stack you've provided shows map_request calling
> dm_complete_request(), which implies dm_kill_unmapped_request() is being
> called due to ti->type->clone_and_map_rq() returning < 0.

Hello Mike,

Apparently certain requests fail with -EIO because DM_DEV_SUSPEND ioctls
are being submitted to the same multipath target. As you know,
DM_DEV_SUSPEND changes QUEUE_IF_NO_PATH from 1 to 0. A WARN_ON()
statement that I added in the dm-mpath driver showed me that multipathd
is submitting these DM_DEV_SUSPEND ioctls. In the output of
strace -fp$(pidof multipathd) I found the following:

[pid 13927] ioctl(5, DM_TABLE_STATUS, 0x7fa1000483f0) = 0
[pid 13927] write(1, "mpathbe: failed to setup multipa"..., 35) = 35
[pid 13927] write(1, "dm-0: uev_add_map failed\n", 25) = 25
[pid 13927] write(1, "uevent trigger error\n", 21) = 21
[pid 13927] write(1, "sdh: remove path (uevent)\n", 26) = 26
[pid 13927] ioctl(5, DM_TABLE_LOAD, 0x7fa1000483f0) = 0
[pid 13927] ioctl(5, DM_DEV_SUSPEND, 0x7fa1000483f0) = 0

I'm still analyzing these and other messages.

Bart.
On Thu, Jul 21 2016 at  4:58pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 07/20/2016 11:33 AM, Mike Snitzer wrote:
> >Would be interesting to know the error returned from map_request()'s
> >ti->type->clone_and_map_rq(). Really should just be DM_MAPIO_REQUEUE.
> >But the stack you've provided shows map_request calling
> >dm_complete_request(), which implies dm_kill_unmapped_request() is being
> >called due to ti->type->clone_and_map_rq() returning < 0.
>
> Hello Mike,
>
> Apparently certain requests fail with -EIO because DM_DEV_SUSPEND ioctls
> are being submitted to the same multipath target. As you know,
> DM_DEV_SUSPEND changes QUEUE_IF_NO_PATH from 1 to 0. A WARN_ON()
> statement that I added in the dm-mpath driver showed me that multipathd
> is submitting these DM_DEV_SUSPEND ioctls. In the output of
> strace -fp$(pidof multipathd) I found the following:
>
> [pid 13927] ioctl(5, DM_TABLE_STATUS, 0x7fa1000483f0) = 0
> [pid 13927] write(1, "mpathbe: failed to setup multipa"..., 35) = 35
> [pid 13927] write(1, "dm-0: uev_add_map failed\n", 25) = 25
> [pid 13927] write(1, "uevent trigger error\n", 21) = 21
> [pid 13927] write(1, "sdh: remove path (uevent)\n", 26) = 26
> [pid 13927] ioctl(5, DM_TABLE_LOAD, 0x7fa1000483f0) = 0
> [pid 13927] ioctl(5, DM_DEV_SUSPEND, 0x7fa1000483f0) = 0
>
> I'm still analyzing these and other messages.

The various ioctls you're seeing are just multipathd responding to the
failures. Part of reloading a table (with revised path info, etc.) is to
suspend and then resume the device that is being updated.

But I'm not actually sure of the historic reasoning for why
queue_if_no_path is disabled (and the active setting saved) on suspend.

I'll think about this further, but maybe others recall why?
On Mon, Jul 25 2016 at  1:53pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Jul 21 2016 at  4:58pm -0400,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>
> > On 07/20/2016 11:33 AM, Mike Snitzer wrote:
> > >Would be interesting to know the error returned from map_request()'s
> > >ti->type->clone_and_map_rq(). Really should just be DM_MAPIO_REQUEUE.
> > >But the stack you've provided shows map_request calling
> > >dm_complete_request(), which implies dm_kill_unmapped_request() is being
> > >called due to ti->type->clone_and_map_rq() returning < 0.
> >
> > Hello Mike,
> >
> > Apparently certain requests fail with -EIO because DM_DEV_SUSPEND ioctls
> > are being submitted to the same multipath target. As you know,
> > DM_DEV_SUSPEND changes QUEUE_IF_NO_PATH from 1 to 0. A WARN_ON()
> > statement that I added in the dm-mpath driver showed me that multipathd
> > is submitting these DM_DEV_SUSPEND ioctls. In the output of
> > strace -fp$(pidof multipathd) I found the following:
> >
> > [pid 13927] ioctl(5, DM_TABLE_STATUS, 0x7fa1000483f0) = 0
> > [pid 13927] write(1, "mpathbe: failed to setup multipa"..., 35) = 35
> > [pid 13927] write(1, "dm-0: uev_add_map failed\n", 25) = 25
> > [pid 13927] write(1, "uevent trigger error\n", 21) = 21
> > [pid 13927] write(1, "sdh: remove path (uevent)\n", 26) = 26
> > [pid 13927] ioctl(5, DM_TABLE_LOAD, 0x7fa1000483f0) = 0
> > [pid 13927] ioctl(5, DM_DEV_SUSPEND, 0x7fa1000483f0) = 0
> >
> > I'm still analyzing these and other messages.
>
> The various ioctls you're seeing are just multipathd responding to the
> failures. Part of reloading a table (with revised path info, etc.) is to
> suspend and then resume the device that is being updated.
>
> But I'm not actually sure of the historic reasoning for why
> queue_if_no_path is disabled (and the active setting saved) on suspend.
>
> I'll think about this further, but maybe others recall why?

I think it dates back to when we queued IO within the multipath target.
Commit e809917735ebf ("dm mpath: push back requests instead of
queueing") obviously changed how we handle the retry.

But regardless, __must_push_back() should catch the case where
queue_if_no_path is cleared during suspend (by checking if current !=
saved).

So I'd be curious to know if your debugging has enabled you to identify
exactly where in the dm-mpath.c code the -EIO return is being
established. do_end_io() is the likely candidate -- but again the
__must_push_back() check should prevent it and DM_ENDIO_REQUEUE should
be returned.

Mike
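For reference, a simplified sketch of the push-back check Mike refers to.
The flag and helper names are assumptions based on the dm-mpath code of that
era, not a verbatim quote:

/*
 * Simplified sketch of the check discussed above (not verbatim dm-mpath.c):
 * push back -- i.e. requeue instead of returning -EIO -- when
 * queue_if_no_path is currently set, or when the current value differs from
 * the saved one while a noflush suspend is in progress (the
 * "current != saved" case Mike mentions).  MPATHF_SAVED_QUEUE_IF_NO_PATH is
 * an assumed flag name.
 */
static bool must_push_back_sketch(struct multipath *m)
{
	return test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) ||
	       (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags) !=
		test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags) &&
		dm_noflush_suspending(m->ti));
}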
On 07/25/2016 02:23 PM, Mike Snitzer wrote:
> So I'd be curious to know if your debugging has enabled you to identify
> exactly where in the dm-mpath.c code the -EIO return is being
> established. do_end_io() is the likely candidate -- but again the
> __must_push_back() check should prevent it and DM_ENDIO_REQUEUE should
> be returned.

Hello Mike,

Thanks for looking further into this. The pr_info() statement that I had
added to the following code block in __multipath_map() fired, which told
me that this is the code block that triggered the -EIO return:

	if (!pgpath) {
		if (!must_push_back(m))
			r = -EIO;	/* Failed */
		pr_info("%s(): (a) returning %d\n", __func__, r);
		return r;
	}

From the system log:

kernel: mpath 254:0: queue_if_no_path 1 -> 0
kernel: __multipath_map(): (a) returning -5

The code that I had added in queue_if_no_path() is as follows:

	old = test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
	[ ... ]
	pr_info("mpath %s: queue_if_no_path %d -> %d\n",
		dm_device_name(dm_table_get_md(m->ti->table)), old,
		queue_if_no_path);

Bart.
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 7a96618..347ff25 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -414,7 +414,7 @@ static void dm_complete_request(struct request *rq, int error)
 	if (!rq->q->mq_ops)
 		blk_complete_request(rq);
 	else
-		blk_mq_complete_request(rq, error);
+		blk_mq_complete_request(rq, 0);
 }
 
 /*