
[BUGFIX,V3] block, bfq: add requeue-request hook

Message ID 20180207211920.6343-1-paolo.valente@linaro.org (mailing list archive)
State New, archived

Commit Message

Paolo Valente Feb. 7, 2018, 9:19 p.m. UTC
Commit a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ") causes all non-flush re-prepared requests for a device
to be re-inserted into the active I/O scheduler for that device. As a
consequence, an I/O scheduler may see the same request inserted again,
possibly several times, without finish_request being invoked on that
request before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this causes no harm. In contrast, it confuses a
stateful scheduler like BFQ, which keeps state for an I/O request
until the finish_request hook is invoked on that request. In
particular, BFQ may get stuck waiting forever for the number of
dispatches of the same request to be balanced by an equal number of
completions, while there will be only one completion for that
request. In this state, BFQ may refuse to serve I/O requests from
other bfq_queues. The hang reported in [1] then follows.
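
To make the imbalance concrete, here is a minimal sketch of the
accounting described above. This is not BFQ's actual code: the names
(toy_sched_data, toy_dispatch, toy_complete) are hypothetical, but the
counter discipline is the one at stake.

/*
 * Hypothetical sketch: a per-device counter that assumes exactly one
 * completion per dispatch.
 */
struct toy_sched_data {
	int rq_in_driver;	/* dispatched but not yet completed */
};

/* Invoked on every dispatch, including each re-dispatch of the same
 * re-prepared request after a requeue. */
static void toy_dispatch(struct toy_sched_data *td)
{
	td->rq_in_driver++;
}

/* Invoked only once per request, on its single completion. */
static void toy_complete(struct toy_sched_data *td)
{
	td->rq_in_driver--;
}

/*
 * If the same request is dispatched twice but completes only once,
 * rq_in_driver stays at 1 forever; a scheduler that waits for this
 * counter to drop to 0 before serving other queues then hangs.
 */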

However, the above re-prepared requests undergo a requeue, so the
requeue_request hook of the active elevator, if set, is invoked for
these requests. This commit therefore addresses the above issue by
properly implementing the requeue_request hook in BFQ.
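
Concretely, the fix has the shape sketched below: the elevator's
requeue_request and finish_request hooks both point to one handler
that drops the scheduler's references to the request, so a requeued
request is "finished", from BFQ's viewpoint, before any re-insertion.
This sketch only mirrors the ops change in the patch at the bottom of
this page; the full handler is in the diff.

/* Both hooks share one handler that releases all scheduler
 * references to rq (bfq_finish_requeue_request in the patch). */
static void bfq_finish_requeue_request(struct request *rq);

static struct elevator_type iosched_bfq_mq = {
	.ops.mq = {
		.requeue_request	= bfq_finish_requeue_request,
		.finish_request		= bfq_finish_requeue_request,
		/* remaining hooks elided */
	},
};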

[1] https://marc.info/?l=linux-block&m=151211117608676

Reported-by: Ivan Kozik <ivan@ludios.org>
Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
---
V2: contains a fix for the bug reported in [2]
V3: implements the improvement suggested in [3]

[2] https://lkml.org/lkml/2018/2/5/599
[3] https://lkml.org/lkml/2018/2/7/532

 block/bfq-iosched.c | 107 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 82 insertions(+), 25 deletions(-)

--
2.15.1

Comments

Jens Axboe Feb. 7, 2018, 10:18 p.m. UTC | #1
On 2/7/18 2:19 PM, Paolo Valente wrote:
> [...]

Thanks, applied.
Paolo Valente Feb. 8, 2018, 7:16 a.m. UTC | #2
> On 7 Feb 2018, at 23:18, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 2/7/18 2:19 PM, Paolo Valente wrote:
>> [...]
> 
> Thanks, applied.
> 

Hi Jens,
I forgot to add
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
in the patch.

Is it still possible to add it?

Thanks,
Paolo

> -- 
> Jens Axboe
Oleksandr Natalenko Feb. 9, 2018, 1:21 p.m. UTC | #3
Hi.

08.02.2018 08:16, Paolo Valente wrote:
>> On 7 Feb 2018, at 23:18, Jens Axboe <axboe@kernel.dk> wrote:
>> 
>> On 2/7/18 2:19 PM, Paolo Valente wrote:
>>> [...]
>> 
>> Thanks, applied.
>> 
> 
> Hi Jens,
> I forgot to add
> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> in the patch.
> 
> Is it still possible to add it?
> 

In addition to this, I think it would be worth considering CC'ing Greg 
to pull this fix into the 4.15 stable tree.

Oleksandr
Jens Axboe Feb. 9, 2018, 5:17 p.m. UTC | #4
On 2/9/18 6:21 AM, Oleksandr Natalenko wrote:
> Hi.
> 
> 08.02.2018 08:16, Paolo Valente wrote:
>>> On 7 Feb 2018, at 23:18, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 2/7/18 2:19 PM, Paolo Valente wrote:
>>>> [...]
>>>
>>> Thanks, applied.
>>>
>>
>> Hi Jens,
>> I forgot to add
>> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
>> in the patch.
>>
>> Is it still possible to add it?
>>
> 
> In addition to this, I think it would be worth considering CC'ing Greg 
> to pull this fix into the 4.15 stable tree.

I can't add the tested-by anymore, but it's easy enough to target for
stable after-the-fact.
Mike Galbraith Feb. 9, 2018, 5:29 p.m. UTC | #5
On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote:
> 
> In addition to this, I think it would be worth considering CC'ing Greg 
> to pull this fix into the 4.15 stable tree.

This isn't one he can cherry-pick; some munging is required, in which
case he usually wants a properly tested backport.

	-Mike
Oleksandr Natalenko Feb. 10, 2018, 8:29 a.m. UTC | #6
Hi.

On Friday 9 February 2018 18:29:39 CET Mike Galbraith wrote:
> On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote:
> > In addition to this, I think it would be worth considering CC'ing Greg
> > to pull this fix into the 4.15 stable tree.
> 
> This isn't one he can cherry-pick; some munging is required, in which
> case he usually wants a properly tested backport.
> 
> 	-Mike

Maybe this could be a good opportunity to push all the pending BFQ 
patches into the stable 4.15 branch? Because, IIUC, BFQ in 4.15 is 
currently just unusable.

Paolo?

---

block, bfq: add requeue-request hook
bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED
block, bfq: release oom-queue ref to root group on exit
block, bfq: put async queues for root bfq groups too
block, bfq: limit sectors served with interactive weight raising
block, bfq: limit tags for writes and async I/O
block, bfq: increase threshold to deem I/O as random
block, bfq: remove superfluous check in queue-merging setup
block, bfq: let a queue be merged only shortly after starting I/O
block, bfq: check low_latency flag in bfq_bfqq_save_state()
block, bfq: add missing rq_pos_tree update on rq removal
block, bfq: fix occurrences of request finish method's old name
block, bfq: consider also past I/O in soft real-time detection
block, bfq: remove batches of confusing ifdefs
Paolo Valente Feb. 12, 2018, 7:24 a.m. UTC | #7
> On 10 Feb 2018, at 09:29, Oleksandr Natalenko <oleksandr@natalenko.name> wrote:
> 
> Hi.
> 
> On Friday 9 February 2018 18:29:39 CET Mike Galbraith wrote:
>> On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote:
>>> In addition to this, I think it would be worth considering CC'ing Greg
>>> to pull this fix into the 4.15 stable tree.
>> 
>> This isn't one he can cherry-pick; some munging is required, in which
>> case he usually wants a properly tested backport.
>> 
>> 	-Mike
> 
> Maybe this could be a good opportunity to push all the pending BFQ 
> patches into the stable 4.15 branch? Because, IIUC, BFQ in 4.15 is 
> currently just unusable.
> 
> Paolo?
> 

Of course it's fine with me, and thanks, Oleksandr, for proposing
this. These commits should apply cleanly on 4.15 and, FWIW, have been
tested, by me and by BFQ users, on 4.15 too over the past few months.

Thanks,
Paolo

> [...]

Patch

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..aeca22d91101 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3823,24 +3823,26 @@  static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 		}

 		/*
-		 * We exploit the bfq_finish_request hook to decrement
-		 * rq_in_driver, but bfq_finish_request will not be
-		 * invoked on this request. So, to avoid unbalance,
-		 * just start this request, without incrementing
-		 * rq_in_driver. As a negative consequence,
-		 * rq_in_driver is deceptively lower than it should be
-		 * while this request is in service. This may cause
-		 * bfq_schedule_dispatch to be invoked uselessly.
+		 * We exploit the bfq_finish_requeue_request hook to
+		 * decrement rq_in_driver, but
+		 * bfq_finish_requeue_request will not be invoked on
+		 * this request. So, to avoid unbalance, just start
+		 * this request, without incrementing rq_in_driver. As
+		 * a negative consequence, rq_in_driver is deceptively
+		 * lower than it should be while this request is in
+		 * service. This may cause bfq_schedule_dispatch to be
+		 * invoked uselessly.
 		 *
 		 * As for implementing an exact solution, the
-		 * bfq_finish_request hook, if defined, is probably
-		 * invoked also on this request. So, by exploiting
-		 * this hook, we could 1) increment rq_in_driver here,
-		 * and 2) decrement it in bfq_finish_request. Such a
-		 * solution would let the value of the counter be
-		 * always accurate, but it would entail using an extra
-		 * interface function. This cost seems higher than the
-		 * benefit, being the frequency of non-elevator-private
+		 * bfq_finish_requeue_request hook, if defined, is
+		 * probably invoked also on this request. So, by
+		 * exploiting this hook, we could 1) increment
+		 * rq_in_driver here, and 2) decrement it in
+		 * bfq_finish_requeue_request. Such a solution would
+		 * let the value of the counter be always accurate,
+		 * but it would entail using an extra interface
+		 * function. This cost seems higher than the benefit,
+		 * being the frequency of non-elevator-private
 		 * requests very low.
 		 */
 		goto start_rq;
@@ -4515,6 +4517,8 @@  static inline void bfq_update_insert_stats(struct request_queue *q,
 					   unsigned int cmd_flags) {}
 #endif

+static void bfq_prepare_request(struct request *rq, struct bio *bio);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       bool at_head)
 {
@@ -4541,6 +4545,18 @@  static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		else
 			list_add_tail(&rq->queuelist, &bfqd->dispatch);
 	} else {
+		if (WARN_ON_ONCE(!bfqq)) {
+			/*
+			 * This should never happen. Most likely rq is
+			 * a requeued regular request, being
+			 * re-inserted without being first
+			 * re-prepared. Do a prepare, to avoid
+			 * failure.
+			 */
+			bfq_prepare_request(rq, rq->bio);
+			bfqq = RQ_BFQQ(rq);
+		}
+
 		idle_timer_disabled = __bfq_insert_request(bfqd, rq);
 		/*
 		 * Update bfqq, because, if a queue merge has occurred
@@ -4697,22 +4713,44 @@  static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
 		bfq_schedule_dispatch(bfqd);
 }

-static void bfq_finish_request_body(struct bfq_queue *bfqq)
+static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq)
 {
 	bfqq->allocated--;

 	bfq_put_queue(bfqq);
 }

-static void bfq_finish_request(struct request *rq)
+/*
+ * Handle either a requeue or a finish for rq. The things to do are
+ * the same in both cases: all references to rq are to be dropped. In
+ * particular, rq is considered completed from the point of view of
+ * the scheduler.
+ */
+static void bfq_finish_requeue_request(struct request *rq)
 {
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_data *bfqd;

-	if (!rq->elv.icq)
+	/*
+	 * Requeue and finish hooks are invoked in blk-mq without
+	 * checking whether the involved request is actually still
+	 * referenced in the scheduler. To handle this fact, the
+	 * following two checks make this function exit in case of
+	 * spurious invocations, for which there is nothing to do.
+	 *
+	 * First, check whether rq has nothing to do with an elevator.
+	 */
+	if (unlikely(!(rq->rq_flags & RQF_ELVPRIV)))
+		return;
+
+	/*
+	 * rq either is not associated with any icq, or is an already
+	 * requeued request that has not (yet) been re-inserted into
+	 * a bfq_queue.
+	 */
+	if (!rq->elv.icq || !bfqq)
 		return;

-	bfqq = RQ_BFQQ(rq);
 	bfqd = bfqq->bfqd;

 	if (rq->rq_flags & RQF_STARTED)
@@ -4727,13 +4765,14 @@  static void bfq_finish_request(struct request *rq)
 		spin_lock_irqsave(&bfqd->lock, flags);

 		bfq_completed_request(bfqq, bfqd);
-		bfq_finish_request_body(bfqq);
+		bfq_finish_requeue_request_body(bfqq);

 		spin_unlock_irqrestore(&bfqd->lock, flags);
 	} else {
 		/*
 		 * Request rq may be still/already in the scheduler,
-		 * in which case we need to remove it. And we cannot
+		 * in which case we need to remove it (this should
+		 * never happen in case of requeue). And we cannot
 		 * defer such a check and removal, to avoid
 		 * inconsistencies in the time interval from the end
 		 * of this function to the start of the deferred work.
@@ -4748,9 +4787,26 @@  static void bfq_finish_request(struct request *rq)
 			bfqg_stats_update_io_remove(bfqq_group(bfqq),
 						    rq->cmd_flags);
 		}
-		bfq_finish_request_body(bfqq);
+		bfq_finish_requeue_request_body(bfqq);
 	}

+	/*
+	 * Reset private fields. In case of a requeue, this allows
+	 * this function to correctly do nothing if it is spuriously
+	 * invoked again on this same request (see the check at the
+	 * beginning of the function). Probably, a better general
+	 * design would be to prevent blk-mq from invoking the requeue
+	 * or finish hooks of an elevator, for a request that is not
+	 * referred by that elevator.
+	 *
+	 * Resetting the following fields would break the
+	 * request-insertion logic if rq is re-inserted into a bfq
+	 * internal queue, without a re-preparation. Here we assume
+	 * that re-insertions of requeued requests, without
+	 * re-preparation, can happen only for pass_through or at_head
+	 * requests (which are not re-inserted into bfq internal
+	 * queues).
+	 */
 	rq->elv.priv[0] = NULL;
 	rq->elv.priv[1] = NULL;
 }
@@ -5426,7 +5482,8 @@  static struct elevator_type iosched_bfq_mq = {
 	.ops.mq = {
 		.limit_depth		= bfq_limit_depth,
 		.prepare_request	= bfq_prepare_request,
-		.finish_request		= bfq_finish_request,
+		.requeue_request        = bfq_finish_requeue_request,
+		.finish_request		= bfq_finish_requeue_request,
 		.exit_icq		= bfq_exit_icq,
 		.insert_requests	= bfq_insert_requests,
 		.dispatch_request	= bfq_dispatch_request,