[v2,1/8] blk-mq: use the right hctx when getting a driver tag fails
diff mbox

Message ID 1491417192.2787.11.camel@sandisk.com
State New
Headers show

Commit Message

Bart Van Assche April 5, 2017, 6:33 p.m. UTC
On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> While dispatching requests, if we fail to get a driver tag, we mark the
> hardware queue as waiting for a tag and put the requests on a
> hctx->dispatch list to be run later when a driver tag is freed. However,
> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> queues if using a single-queue scheduler with a multiqueue device. If
> blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> are processing. This means we end up using the hardware queue of the
> previous request, which may or may not be the same as that of the
> current request. If it isn't, the wrong hardware queue will end up
> waiting for a tag, and the requests will be on the wrong dispatch list,
> leading to a hang.
> 
> The fix is twofold:
> 
> 1. Make sure we save which hardware queue we were trying to get a
>    request for in blk_mq_get_driver_tag() regardless of whether it
>    succeeds or not.
> 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
>    blk_mq_hw_queue to make it clear that it must handle multiple
>    hardware queues, since I've already messed this up on a couple of
>    occasions.
> 
> This didn't appear in testing with nvme and mq-deadline because nvme has
> more driver tags than the default number of scheduler tags. However,
> with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Would the patch below be a valid alternative?

Thanks,

Bart.

[PATCH] blk-mq: Simplify blk_mq_get_driver_tag()

The blk_mq_get_driver_tag() callers either assume that *hctx is not
modified or that it points to a valid hctx pointer upon return if
tag allocation succeeded. Avoid this confusion by returning the hctx
pointer if and only if tag allocation succeeded and by only storing
the return value into hctx in those blk_mq_get_driver_tag() callers
for which the hctx pointer had not yet been computed before the
blk_mq_get_driver_tag() call.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 block/blk-mq-sched.c |  4 +++-
 block/blk-mq.c       | 24 ++++++++++--------------
 block/blk-mq.h       |  3 +--
 3 files changed, 14 insertions(+), 17 deletions(-)

-- 
2.12.0

Comments

Omar Sandoval April 5, 2017, 6:42 p.m. UTC | #1
On Wed, Apr 05, 2017 at 06:33:14PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > While dispatching requests, if we fail to get a driver tag, we mark the
> > hardware queue as waiting for a tag and put the requests on a
> > hctx->dispatch list to be run later when a driver tag is freed. However,
> > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > queues if using a single-queue scheduler with a multiqueue device. If
> > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > are processing. This means we end up using the hardware queue of the
> > previous request, which may or may not be the same as that of the
> > current request. If it isn't, the wrong hardware queue will end up
> > waiting for a tag, and the requests will be on the wrong dispatch list,
> > leading to a hang.
> > 
> > The fix is twofold:
> > 
> > 1. Make sure we save which hardware queue we were trying to get a
> >    request for in blk_mq_get_driver_tag() regardless of whether it
> >    succeeds or not.
> > 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
> >    blk_mq_hw_queue to make it clear that it must handle multiple
> >    hardware queues, since I've already messed this up on a couple of
> >    occasions.
> > 
> > This didn't appear in testing with nvme and mq-deadline because nvme has
> > more driver tags than the default number of scheduler tags. However,
> > with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
> 
> Would the patch below be a valid alternative?
> 
> Thanks,
> 
> Bart.

Hi, Bart,

This actually has the same bug as the original code, see below.

> [PATCH] blk-mq: Simplify blk_mq_get_driver_tag()
> 
> The blk_mq_get_driver_tag() callers either assume that *hctx is not
> modified or that it points to a valid hctx pointer upon return if
> tag allocation succeeded. Avoid this confusion by returning the hctx
> pointer if and only if tag allocation succeeded and by only storing
> the return value into hctx in those blk_mq_get_driver_tag() callers
> for which the hctx pointer had not yet been computed before the
> blk_mq_get_driver_tag() call.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> ---
>  block/blk-mq-sched.c |  4 +++-
>  block/blk-mq.c       | 24 ++++++++++--------------
>  block/blk-mq.h       |  3 +--
>  3 files changed, 14 insertions(+), 17 deletions(-)
> 

[snip]

>  static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
> @@ -985,7 +980,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  		struct blk_mq_queue_data bd;
>  
>  		rq = list_first_entry(list, struct request, queuelist);
> -		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
> +		if (!blk_mq_get_driver_tag(rq, false)) {

Here, we want to know what hardware queue we attempted the tag
allocation on, so this won't work.

Thanks for taking a look!

Patch
diff mbox

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a85d939ef450..bfb8bdb95a87 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -386,7 +386,9 @@  void blk_mq_sched_restart_a_queue(struct blk_mq_hw_ctx *hctx)
 static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
 				      struct request *rq, bool can_block)
 {
-	if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
+	WARN_ON_ONCE(!hctx);
+
+	if (blk_mq_get_driver_tag(rq, can_block)) {
 		blk_insert_flush(rq);
 		blk_mq_run_hw_queue(hctx, true);
 	} else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45e7f597cea3..c8e0c02dc8ca 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -857,8 +857,7 @@  static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-			   bool wait)
+struct blk_mq_hw_ctx *blk_mq_get_driver_tag(struct request *rq, bool wait)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
@@ -866,12 +865,8 @@  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		return data.hctx;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -883,10 +878,10 @@  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
+		return data.hctx;
 	}
 
-	return false;
+	return NULL;
 }
 
 static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
@@ -985,7 +980,7 @@  bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
 
@@ -999,7 +994,7 @@  bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 				 * window between the allocation failure and
 				 * adding the hardware queue to the wait queue.
 				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				if (!blk_mq_get_driver_tag(rq, false))
 					break;
 			} else {
 				break;
@@ -1020,7 +1015,7 @@  bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			struct request *nxt;
 
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1435,7 +1430,8 @@  static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	if (q->elevator)
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq, &hctx, false))
+	hctx = blk_mq_get_driver_tag(rq, false);
+	if (!hctx)
 		goto insert;
 
 	new_cookie = request_to_qc_t(hctx, rq);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..b1917fbe955c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,8 +33,7 @@  void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
-bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-				bool wait);
+struct blk_mq_hw_ctx *blk_mq_get_driver_tag(struct request *rq, bool wait);
 
 /*
  * Internal helpers for allocating/freeing the request map