Message ID | 20200215032140.4093-1-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | blk-mq: insert passthrough request into hctx->dispatch directly | expand |
On Sat, Feb 15, 2020 at 11:21:40AM +0800, Ming Lei wrote: > For some reason, device may be in one situation which can't handle > FS request, so STS_RESOURCE is always returned and the FS request > will be added to hctx->dispatch. However passthrough request may > be required at that time for fixing the problem. If passthrough > request is added to scheduler queue, there isn't any chance for > blk-mq to dispatch it given we prioritize requests in hctx->dispatch. > Then the FS IO request may never be completed, and IO hang is caused. > > So passthrough request has to be added to hctx->dispatch directly. > > Fix this issue by inserting passthrough request into hctx->dispatch > directly. Then it becomes consistent with original legacy IO request > path, in which passthrough request is always added to q->queue_head. Do you have a description of an actual problem this fixes? Maybe even a reproducer for blktests?
On Wed, Feb 19, 2020 at 08:36:15AM -0800, Christoph Hellwig wrote: > On Sat, Feb 15, 2020 at 11:21:40AM +0800, Ming Lei wrote: > > For some reason, device may be in one situation which can't handle > > FS request, so STS_RESOURCE is always returned and the FS request > > will be added to hctx->dispatch. However passthrough request may > > be required at that time for fixing the problem. If passthrough > > request is added to scheduler queue, there isn't any chance for > > blk-mq to dispatch it given we prioritize requests in hctx->dispatch. > > Then the FS IO request may never be completed, and IO hang is caused. > > > > So passthrough request has to be added to hctx->dispatch directly. > > > > Fix this issue by inserting passthrough request into hctx->dispatch > > directly. Then it becomes consistent with original legacy IO request > > path, in which passthrough request is always added to q->queue_head. > > Do you have a description of an actual problem this fixes? Maybe even > a reproducer for blktests? > It is reported by one RH customer in the following test case: 1) Start IO on Emulex FC host 2) Fail one controller, wait 5 minutes 3) Bring controller back online When we trace the problem, it is found that FS request started in device_add_disk() from scsi disk probe context stuck because scsi_queue_rq() always return STS_BUSY via scsi_setup_fs_cmnd() -> alua_prep_fn(). The kernel ALUA state is TRANSITIONING at that time, so it is reasonable to see BLK_TYPE_FS requests won't go anywhere because of the check in alua_prep_fn(). However, the passthrough request(TEST UNIT READY) is submitted from alua_rtpg_work when the FS request can't be dispatched to LLD. And SCSI stack should have been allowed to handle this passthrough rquest. But it can't reach SCSI stack via .queue_rq() because blk-mq won't dispatch it until hctx->dispatch is empty. The legacy IO request code always added passthrough request into head of q->queue_head directly instead of scheduler queue or sw queue, so no such issue. So far not figured out one blktests test case, but the problem is real. BTW, I just found we need the extra following change: @@ -1301,7 +1301,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, q->mq_ops->commit_rqs(hctx); spin_lock(&hctx->lock); - list_splice_init(list, &hctx->dispatch); + list_splice_tail_init(list, &hctx->dispatch); spin_unlock(&hctx->lock); Will post V2. Thanks, Ming
On 2/19/20 2:10 PM, Ming Lei wrote: > On Wed, Feb 19, 2020 at 08:36:15AM -0800, Christoph Hellwig wrote: >> On Sat, Feb 15, 2020 at 11:21:40AM +0800, Ming Lei wrote: >>> For some reason, device may be in one situation which can't handle >>> FS request, so STS_RESOURCE is always returned and the FS request >>> will be added to hctx->dispatch. However passthrough request may >>> be required at that time for fixing the problem. If passthrough >>> request is added to scheduler queue, there isn't any chance for >>> blk-mq to dispatch it given we prioritize requests in hctx->dispatch. >>> Then the FS IO request may never be completed, and IO hang is caused. >>> >>> So passthrough request has to be added to hctx->dispatch directly. >>> >>> Fix this issue by inserting passthrough request into hctx->dispatch >>> directly. Then it becomes consistent with original legacy IO request >>> path, in which passthrough request is always added to q->queue_head. >> >> Do you have a description of an actual problem this fixes? Maybe even >> a reproducer for blktests? >> > > It is reported by one RH customer in the following test case: > > 1) Start IO on Emulex FC host > 2) Fail one controller, wait 5 minutes > 3) Bring controller back online > > When we trace the problem, it is found that FS request started in device_add_disk() > from scsi disk probe context stuck because scsi_queue_rq() always return > STS_BUSY via scsi_setup_fs_cmnd() -> alua_prep_fn(). > > The kernel ALUA state is TRANSITIONING at that time, so it is reasonable to see > BLK_TYPE_FS requests won't go anywhere because of the check in alua_prep_fn(). > > However, the passthrough request(TEST UNIT READY) is submitted from alua_rtpg_work > when the FS request can't be dispatched to LLD. And SCSI stack should > have been allowed to handle this passthrough rquest. But it can't reach SCSI stack > via .queue_rq() because blk-mq won't dispatch it until hctx->dispatch is > empty. > > The legacy IO request code always added passthrough request into head of q->queue_head > directly instead of scheduler queue or sw queue, so no such issue. > > So far not figured out one blktests test case, but the problem is real. > > BTW, I just found we need the extra following change: > > @@ -1301,7 +1301,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, > q->mq_ops->commit_rqs(hctx); > > spin_lock(&hctx->lock); > - list_splice_init(list, &hctx->dispatch); > + list_splice_tail_init(list, &hctx->dispatch); > spin_unlock(&hctx->lock); > Is it fine to add to tail as the requests on dispatch would be reordered? A, B, C and D are on the list. Suppose A is failed and the new list would become B, C D, A? Dongli Zhang
On Wed, Feb 19, 2020 at 03:47:50PM -0800, dongli.zhang@oracle.com wrote: > > > On 2/19/20 2:10 PM, Ming Lei wrote: > > On Wed, Feb 19, 2020 at 08:36:15AM -0800, Christoph Hellwig wrote: > >> On Sat, Feb 15, 2020 at 11:21:40AM +0800, Ming Lei wrote: > >>> For some reason, device may be in one situation which can't handle > >>> FS request, so STS_RESOURCE is always returned and the FS request > >>> will be added to hctx->dispatch. However passthrough request may > >>> be required at that time for fixing the problem. If passthrough > >>> request is added to scheduler queue, there isn't any chance for > >>> blk-mq to dispatch it given we prioritize requests in hctx->dispatch. > >>> Then the FS IO request may never be completed, and IO hang is caused. > >>> > >>> So passthrough request has to be added to hctx->dispatch directly. > >>> > >>> Fix this issue by inserting passthrough request into hctx->dispatch > >>> directly. Then it becomes consistent with original legacy IO request > >>> path, in which passthrough request is always added to q->queue_head. > >> > >> Do you have a description of an actual problem this fixes? Maybe even > >> a reproducer for blktests? > >> > > > > It is reported by one RH customer in the following test case: > > > > 1) Start IO on Emulex FC host > > 2) Fail one controller, wait 5 minutes > > 3) Bring controller back online > > > > When we trace the problem, it is found that FS request started in device_add_disk() > > from scsi disk probe context stuck because scsi_queue_rq() always return > > STS_BUSY via scsi_setup_fs_cmnd() -> alua_prep_fn(). > > > > The kernel ALUA state is TRANSITIONING at that time, so it is reasonable to see > > BLK_TYPE_FS requests won't go anywhere because of the check in alua_prep_fn(). > > > > However, the passthrough request(TEST UNIT READY) is submitted from alua_rtpg_work > > when the FS request can't be dispatched to LLD. And SCSI stack should > > have been allowed to handle this passthrough rquest. But it can't reach SCSI stack > > via .queue_rq() because blk-mq won't dispatch it until hctx->dispatch is > > empty. > > > > The legacy IO request code always added passthrough request into head of q->queue_head > > directly instead of scheduler queue or sw queue, so no such issue. > > > > So far not figured out one blktests test case, but the problem is real. > > > > BTW, I just found we need the extra following change: > > > > @@ -1301,7 +1301,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, > > q->mq_ops->commit_rqs(hctx); > > > > spin_lock(&hctx->lock); > > - list_splice_init(list, &hctx->dispatch); > > + list_splice_tail_init(list, &hctx->dispatch); > > spin_unlock(&hctx->lock); > > > > Is it fine to add to tail as the requests on dispatch would be reordered? Wrt. FS request: Firstly we never guarantee that the request is dispatched in order. Secondly and more importantly, request can be added into hctx->dispatch in any order. One usual case is that request is added to hctx->dispatch concurrently when .queue_rq() fails. On the other side, in case of not concurrent adding to hctx->dispatch, after one request is added to hctx->dispatch, we always dispatch request from hctx->dispatch first, instead of dequeuing request from scheduler queue and adding them to hctx->dispatch again after .queue_rq() fails. > > A, B, C and D are on the list. Suppose A is failed and the new list would become > B, C D, A? Right, I don't see there is any issue in this way, do you see issues? Thanks, Ming
On 2/19/20 5:45 PM, Ming Lei wrote: > On Wed, Feb 19, 2020 at 03:47:50PM -0800, dongli.zhang@oracle.com wrote: >> >> >> On 2/19/20 2:10 PM, Ming Lei wrote: >>> On Wed, Feb 19, 2020 at 08:36:15AM -0800, Christoph Hellwig wrote: >>>> On Sat, Feb 15, 2020 at 11:21:40AM +0800, Ming Lei wrote: >>>>> For some reason, device may be in one situation which can't handle >>>>> FS request, so STS_RESOURCE is always returned and the FS request >>>>> will be added to hctx->dispatch. However passthrough request may >>>>> be required at that time for fixing the problem. If passthrough >>>>> request is added to scheduler queue, there isn't any chance for >>>>> blk-mq to dispatch it given we prioritize requests in hctx->dispatch. >>>>> Then the FS IO request may never be completed, and IO hang is caused. >>>>> >>>>> So passthrough request has to be added to hctx->dispatch directly. >>>>> >>>>> Fix this issue by inserting passthrough request into hctx->dispatch >>>>> directly. Then it becomes consistent with original legacy IO request >>>>> path, in which passthrough request is always added to q->queue_head. >>>> >>>> Do you have a description of an actual problem this fixes? Maybe even >>>> a reproducer for blktests? >>>> >>> >>> It is reported by one RH customer in the following test case: >>> >>> 1) Start IO on Emulex FC host >>> 2) Fail one controller, wait 5 minutes >>> 3) Bring controller back online >>> >>> When we trace the problem, it is found that FS request started in device_add_disk() >>> from scsi disk probe context stuck because scsi_queue_rq() always return >>> STS_BUSY via scsi_setup_fs_cmnd() -> alua_prep_fn(). >>> >>> The kernel ALUA state is TRANSITIONING at that time, so it is reasonable to see >>> BLK_TYPE_FS requests won't go anywhere because of the check in alua_prep_fn(). >>> >>> However, the passthrough request(TEST UNIT READY) is submitted from alua_rtpg_work >>> when the FS request can't be dispatched to LLD. And SCSI stack should >>> have been allowed to handle this passthrough rquest. But it can't reach SCSI stack >>> via .queue_rq() because blk-mq won't dispatch it until hctx->dispatch is >>> empty. >>> >>> The legacy IO request code always added passthrough request into head of q->queue_head >>> directly instead of scheduler queue or sw queue, so no such issue. >>> >>> So far not figured out one blktests test case, but the problem is real. >>> >>> BTW, I just found we need the extra following change: >>> >>> @@ -1301,7 +1301,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, >>> q->mq_ops->commit_rqs(hctx); >>> >>> spin_lock(&hctx->lock); >>> - list_splice_init(list, &hctx->dispatch); >>> + list_splice_tail_init(list, &hctx->dispatch); >>> spin_unlock(&hctx->lock); >>> >> >> Is it fine to add to tail as the requests on dispatch would be reordered? > > Wrt. FS request: > > Firstly we never guarantee that the request is dispatched in order. > > Secondly and more importantly, request can be added into hctx->dispatch > in any order. One usual case is that request is added to hctx->dispatch > concurrently when .queue_rq() fails. On the other side, in case of not > concurrent adding to hctx->dispatch, after one request is added to > hctx->dispatch, we always dispatch request from hctx->dispatch first, > instead of dequeuing request from scheduler queue and adding them to > hctx->dispatch again after .queue_rq() fails. > >> >> A, B, C and D are on the list. Suppose A is failed and the new list would become >> B, C D, A? > > Right, I don't see there is any issue in this way, do you see issues? Thank you very much for the explanation. I do not see issue if order guarantee in hctx->dispatch is not required. Dongli Zhang
diff --git a/block/blk-flush.c b/block/blk-flush.c index 3f977c517960..5cc775bdb06a 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -412,7 +412,7 @@ void blk_insert_flush(struct request *rq) */ if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { - blk_mq_request_bypass_insert(rq, false); + blk_mq_request_bypass_insert(rq, false, false); return; } diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index ca22afd47b3d..856356b1619e 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -361,13 +361,19 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, bool has_sched, struct request *rq) { - /* dispatch flush rq directly */ - if (rq->rq_flags & RQF_FLUSH_SEQ) { - spin_lock(&hctx->lock); - list_add(&rq->queuelist, &hctx->dispatch); - spin_unlock(&hctx->lock); + /* + * dispatch flush and passthrough rq directly + * + * passthrough request has to be added to hctx->dispatch directly. + * For some reason, device may be in one situation which can't + * handle FS request, so STS_RESOURCE is always returned and the + * FS request will be added to hctx->dispatch. However passthrough + * request may be required at that time for fixing the problem. If + * passthrough request is added to scheduler queue, there isn't any + * chance to dispatch it given we prioritize requests in hctx->dispatch. + */ + if ((rq->rq_flags & RQF_FLUSH_SEQ) || blk_rq_is_passthrough(rq)) return true; - } if (has_sched) rq->rq_flags |= RQF_SORTED; @@ -391,8 +397,10 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head, WARN_ON(e && (rq->tag != -1)); - if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) + if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) { + blk_mq_request_bypass_insert(rq, at_head, false); goto run; + } if (e && e->type->ops.insert_requests) { LIST_HEAD(list); diff --git a/block/blk-mq.c b/block/blk-mq.c index a12b1763508d..5f5c43ae3792 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -735,7 +735,7 @@ static void blk_mq_requeue_work(struct work_struct *work) * merge. */ if (rq->rq_flags & RQF_DONTPREP) - blk_mq_request_bypass_insert(rq, false); + blk_mq_request_bypass_insert(rq, false, false); else blk_mq_sched_insert_request(rq, true, false, false); } @@ -1677,12 +1677,16 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, * Should only be used carefully, when the caller knows we want to * bypass a potential IO scheduler on the target device. */ -void blk_mq_request_bypass_insert(struct request *rq, bool run_queue) +void blk_mq_request_bypass_insert(struct request *rq, bool at_head, + bool run_queue) { struct blk_mq_hw_ctx *hctx = rq->mq_hctx; spin_lock(&hctx->lock); - list_add_tail(&rq->queuelist, &hctx->dispatch); + if (at_head) + list_add(&rq->queuelist, &hctx->dispatch); + else + list_add_tail(&rq->queuelist, &hctx->dispatch); spin_unlock(&hctx->lock); if (run_queue) @@ -1849,7 +1853,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (bypass_insert) return BLK_STS_RESOURCE; - blk_mq_request_bypass_insert(rq, run_queue); + blk_mq_request_bypass_insert(rq, false, run_queue); return BLK_STS_OK; } @@ -1876,7 +1880,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) - blk_mq_request_bypass_insert(rq, true); + blk_mq_request_bypass_insert(rq, false, true); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret); @@ -1910,7 +1914,7 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx, if (ret != BLK_STS_OK) { if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { - blk_mq_request_bypass_insert(rq, + blk_mq_request_bypass_insert(rq, false, list_empty(list)); break; } diff --git a/block/blk-mq.h b/block/blk-mq.h index eaaca8fc1c28..c0fa34378eb2 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -66,7 +66,8 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, */ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, bool at_head); -void blk_mq_request_bypass_insert(struct request *rq, bool run_queue); +void blk_mq_request_bypass_insert(struct request *rq, bool at_head, + bool run_queue); void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, struct list_head *list);
For some reason, device may be in one situation which can't handle FS request, so STS_RESOURCE is always returned and the FS request will be added to hctx->dispatch. However passthrough request may be required at that time for fixing the problem. If passthrough request is added to scheduler queue, there isn't any chance for blk-mq to dispatch it given we prioritize requests in hctx->dispatch. Then the FS IO request may never be completed, and IO hang is caused. So passthrough request has to be added to hctx->dispatch directly. Fix this issue by inserting passthrough request into hctx->dispatch directly. Then it becomes consistent with original legacy IO request path, in which passthrough request is always added to q->queue_head. Signed-off-by: Ming Lei <ming.lei@redhat.com> --- block/blk-flush.c | 2 +- block/blk-mq-sched.c | 22 +++++++++++++++------- block/blk-mq.c | 16 ++++++++++------ block/blk-mq.h | 3 ++- 4 files changed, 28 insertions(+), 15 deletions(-)