blk-mq: release scheduler resource when request complete

Message ID 20230813133643.3006943-1-chengming.zhou@linux.dev (mailing list archive)
State New, archived

Commit Message

Chengming Zhou Aug. 13, 2023, 1:36 p.m. UTC
From: Chengming Zhou <zhouchengming@bytedance.com>

Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
devices and bisected it to commit 615939a2ae73 ("blk-mq: defer to the normal
submission path for post-flush requests").

We analysed the IO hang and found that two postflush requests were
waiting for each other.

The first postflush request completed the REQ_FSEQ_DATA sequence, so it
moved on to the REQ_FSEQ_POSTFLUSH sequence and was added to the flush
pending list, but blk_kick_flush() failed because the second postflush
request was still inflight, waiting in the scheduler queue.

The second postflush request waiting in the scheduler queue can't be
dispatched, because the first postflush hasn't released its scheduler
resource even though it has already completed.
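
For reference, the flush sequence stages involved here are the REQ_FSEQ_*
bits defined in block/blk-flush.c, along these lines (quoted for context,
not part of this patch):

enum {
	REQ_FSEQ_PREFLUSH	= (1 << 0), /* pre-flushing in progress */
	REQ_FSEQ_DATA		= (1 << 1), /* data write in progress */
	REQ_FSEQ_POSTFLUSH	= (1 << 2), /* post-flushing in progress */
	REQ_FSEQ_DONE		= (1 << 3),
};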

Fix it by releasing the scheduler resource when the first postflush
request completes, so the second postflush can be dispatched and
completed, which in turn lets blk_kick_flush() succeed.

[1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/

Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
---
 block/blk-mq.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Comments

Jens Axboe Aug. 13, 2023, 1:54 p.m. UTC | #1
On 8/13/23 7:36 AM, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
> devices and bisected it to commit 615939a2ae73 ("blk-mq: defer to the normal
> submission path for post-flush requests").
> 
> We analysed the IO hang and found that two postflush requests were
> waiting for each other.
> 
> The first postflush request completed the REQ_FSEQ_DATA sequence, so it
> moved on to the REQ_FSEQ_POSTFLUSH sequence and was added to the flush
> pending list, but blk_kick_flush() failed because the second postflush
> request was still inflight, waiting in the scheduler queue.
> 
> The second postflush request waiting in the scheduler queue can't be
> dispatched, because the first postflush hasn't released its scheduler
> resource even though it has already completed.
> 
> Fix it by releasing the scheduler resource when the first postflush
> request completes, so the second postflush can be dispatched and
> completed, which in turn lets blk_kick_flush() succeed.

Change looks good to me. But since we're in here:

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f14b8669ac69..5b14f18f9670 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -682,6 +682,15 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
>  
> +static void blk_mq_finish_request(struct request *rq)
> +{
> +	struct request_queue *q = rq->q;
> +
> +	if ((rq->rq_flags & RQF_USE_SCHED) &&
> +	    q->elevator->type->ops.finish_request)
> +		q->elevator->type->ops.finish_request(rq);
> +}

Any IO scheduler should set ->finish_request(), so this should just be:

static void blk_mq_finish_request(struct request *rq)
{
	struct request_queue *q = rq->q;

	if (rq->rq_flags & RQF_USE_SCHED)
		q->elevator->type->ops.finish_request(rq);
}

and it would probably be a good idea to solidify that with a:

	if (WARN_ON_ONCE(!e->ops.finish_request))
		return -EINVAL;

at the top of elv_register() like we have for insert/dispatch as well.
All 3 IO schedulers do set ->finish_request().
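
For illustration, that check could sit at the top of elv_register() next
to the existing insert/dispatch validation, roughly like this (a sketch;
the surrounding code in block/elevator.c may differ):

int elv_register(struct elevator_type *e)
{
	/* finish_request is mandatory to free scheduler resources */
	if (WARN_ON_ONCE(!e->ops.finish_request))
		return -EINVAL;

	/* insert_requests and dispatch_request are mandatory */
	if (WARN_ON_ONCE(!e->ops.insert_requests ||
			 !e->ops.dispatch_request))
		return -EINVAL;

	/* ... rest of registration unchanged ... */
}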

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f14b8669ac69..5b14f18f9670 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -682,6 +682,15 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_finish_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+
+	if ((rq->rq_flags & RQF_USE_SCHED) &&
+	    q->elevator->type->ops.finish_request)
+		q->elevator->type->ops.finish_request(rq);
+}
+
 static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -708,10 +717,6 @@ void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
-	if ((rq->rq_flags & RQF_USE_SCHED) &&
-	    q->elevator->type->ops.finish_request)
-		q->elevator->type->ops.finish_request(rq);
-
 	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
 		laptop_io_completion(q->disk->bdi);
 
@@ -1021,6 +1026,8 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 	if (blk_mq_need_time_stamp(rq))
 		__blk_mq_end_request_acct(rq, ktime_get_ns());
 
+	blk_mq_finish_request(rq);
+
 	if (rq->end_io) {
 		rq_qos_done(rq->q, rq);
 		if (rq->end_io(rq, error) == RQ_END_IO_FREE)
@@ -1075,6 +1082,8 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		if (iob->need_ts)
 			__blk_mq_end_request_acct(rq, now);
 
+		blk_mq_finish_request(rq);
+
 		rq_qos_done(rq->q, rq);
 
 		/*