[1/2] blk-mq: don't complete un-started request in timeout handler
diff mbox

Message ID 1489613678.2660.9.camel@sandisk.com
State New
Headers show

Commit Message

Bart Van Assche March 15, 2017, 9:35 p.m. UTC
On Thu, 2017-03-16 at 00:22 +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> > Please have another look at __blk_mq_requeue_request(). In that function
> > the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> > &rq->atomic_flags)) { ... }
> > 
> > I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> > test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> > __blk_mq_requeue_request().
> 
> OK, this race should only exist in case that the requeue happens after dispatch
> busy, because COMPLETE flag isn't set. And if the requeue is from io completion,
> no such race because COMPLETE flag is set.
> 
> One solution I thought of is to call blk_mark_rq_complete() before requeuing
> when dispatch busy happened, but that looks a bit silly. Another way is to
> set STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which looks
> reasonable too. Any comments on the 2nd solution?

Sorry but I don't think that would be sufficient. There are several other
scenarios that have not been mentioned above, e.g. that a tag gets freed and
reused after the blk_mq_check_expired() call started and before that function
has had the chance to examine any members of struct request. What is needed in
blk_mq_check_expired() is the following as a single atomic operation:
* Check whether REQ_ATOM_STARTED has been set.
* Check whether REQ_ATOM_COMPLETE has not yet been set.
* If both conditions have been met, set REQ_ATOM_COMPLETE.

I don't think there is another solution than using a single state variable to
represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
independent bits. How about the patch below?

Thanks,

Bart.


[PATCH] blk-mq: Fix request state manipulation races

Use a single state variable instead of two separate bits
REQ_ATOM_STARTED and REQ_ATOM_COMPLETE. For blk-mq, make
__blk_mq_finish_request() perform the transition from "complete" to
"not complete" instead of blk_mq_start_request(). Make sure that
blk_mq_check_expired() uses a single atomic operation to test the
"started" and "complete" states and to set the "complete" state.
---
 block/blk-core.c           |  6 +++--
 block/blk-mq.c             | 67 +++++++++++++++++-----------------------------
 block/blk-timeout.c        |  2 +-
 block/blk.h                | 21 ++++++++++-----
 drivers/scsi/virtio_scsi.c |  2 +-
 include/linux/blkdev.h     |  1 +
 6 files changed, 47 insertions(+), 52 deletions(-)

-- 
2.12.0

Comments

Ming Lei March 16, 2017, 12:07 a.m. UTC | #1
On Wed, Mar 15, 2017 at 09:35:03PM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-16 at 00:22 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> > > Please have another look at __blk_mq_requeue_request(). In that function
> > > the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> > > &rq->atomic_flags)) { ... }
> > > 
> > > I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> > > test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> > > __blk_mq_requeue_request().
> > 
> > OK, this race should only exist in case that the requeue happens after dispatch
> > busy, because COMPLETE flag isn't set. And if the requeue is from io completion,
> > no such race because COMPLETE flag is set.
> > 
> > One solution I thought of is to call blk_mark_rq_complete() before requeuing
> > when dispatch busy happened, but that looks a bit silly. Another way is to
> > set STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which looks
> > reasonable too. Any comments on the 2nd solution?
> 
> Sorry but I don't think that would be sufficient. There are several other
> scenarios that have not been mentioned above, e.g. that a tag gets freed and
> reused after the blk_mq_check_expired() call started and before that function
> has had the chance to examine any members of struct request. What is needed in
> blk_mq_check_expired() is the following as a single atomic operation:

We have dealt with this by checking COMPLETE & rq->deadline together in
blk_mq_check_expired() already:

- if new rq->deadline(set in reuse path) has been observed in the later
checking rq of blk_mq_check_expired(), it won't be timeouted because of the timing.

- if new rq->deadline(set in reuse path) hasn't been observed in the
later checking rq of blk_mq_check_expired(), that means COMPLETE flag isn't set
yet in reuse path because we have a barrier to enhance the order in
blk_mq_start_request(), so it won't be timeouted too.

So let me know what is the real race between free/reusing vs. timeout.

> * Check whether REQ_ATOM_STARTED has been set.
> * Check whether REQ_ATOM_COMPLETE has not yet been set.
> * If both conditions have been met, set REQ_ATOM_COMPLETE.
> 
> I don't think there is another solution than using a single state variable to
> represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
> independent bits. How about the patch below?

I would review it if you can confirm me that it is a real issue, :-)

Thanks,
Ming
Bart Van Assche March 16, 2017, 9:35 p.m. UTC | #2
On Thu, 2017-03-16 at 08:07 +0800, Ming Lei wrote:
> > * Check whether REQ_ATOM_STARTED has been set.
> > * Check whether REQ_ATOM_COMPLETE has not yet been set.
> > * If both conditions have been met, set REQ_ATOM_COMPLETE.
> > 
> > I don't think there is another solution than using a single state variable to
> > represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
> > independent bits. How about the patch below?
> 
> I would review it if you can confirm me that it is a real issue, :-)

Hello Ming,

I was chasing a v4.11 regression in the SCSI stack. Since my tests of today
revealed that it's probably not a block layer issue, let's proceed with your
patch.

Bart.
Ming Lei March 17, 2017, 12:07 a.m. UTC | #3
On Fri, Mar 17, 2017 at 5:35 AM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Thu, 2017-03-16 at 08:07 +0800, Ming Lei wrote:
>> > * Check whether REQ_ATOM_STARTED has been set.
>> > * Check whether REQ_ATOM_COMPLETE has not yet been set.
>> > * If both conditions have been met, set REQ_ATOM_COMPLETE.
>> >
>> > I don't think there is another solution than using a single state variable to
>> > represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
>> > independent bits. How about the patch below?
>>
>> I would review it if you can confirm me that it is a real issue, :-)
>
> Hello Ming,
>
> I was chasing a v4.11 regression in the SCSI stack. Since my tests of today
> revealed that it's probably not a block layer issue, let's proceed with your
> patch.

Yeah, it shouldn't have been related with blk-mq timeout handler which
isn't changed much
recently, and thanks for your review!



Thanks,
Ming Lei

Patch
diff mbox

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..dff857d2b86b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1305,7 +1305,7 @@  EXPORT_SYMBOL(blk_get_request);
 void blk_requeue_request(struct request_queue *q, struct request *rq)
 {
 	blk_delete_timer(rq);
-	blk_clear_rq_complete(rq);
+	atomic_set(&rq->state, REQ_NOT_STARTED);
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
@@ -2477,7 +2477,9 @@  void blk_start_request(struct request *req)
 		wbt_issue(req->q->rq_wb, &req->issue_stat);
 	}
 
-	BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));
+	WARN_ONCE(atomic_read(&req->state) != REQ_NOT_STARTED,
+		  "unexpected request state %d != %d\n",
+		  atomic_read(&req->state), REQ_NOT_STARTED);
 	blk_add_timer(req);
 }
 EXPORT_SYMBOL(blk_start_request);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..fe73d5a1ffc3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -343,7 +343,7 @@  void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	wbt_done(q->rq_wb, &rq->issue_stat);
 	rq->rq_flags = 0;
 
-	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	atomic_set(&rq->state, REQ_NOT_STARTED);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
@@ -479,7 +479,7 @@  EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-	return test_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	return atomic_read(&rq->state) == REQ_STARTED;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
@@ -505,16 +505,10 @@  void blk_mq_start_request(struct request *rq)
 	 */
 	smp_mb__before_atomic();
 
-	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
-		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
-		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+	WARN_ONCE(atomic_read(&rq->state) != REQ_NOT_STARTED,
+		  "unexpected request state %d != %d\n",
+		  atomic_read(&rq->state), REQ_NOT_STARTED);
+	atomic_set(&rq->state, REQ_STARTED);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -530,12 +524,14 @@  EXPORT_SYMBOL(blk_mq_start_request);
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum rq_state prev;
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
 
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+	prev = atomic_xchg(&rq->state, REQ_NOT_STARTED);
+	if (prev != REQ_NOT_STARTED) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -661,17 +657,7 @@  void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	/*
-	 * We know that complete is set at this point. If STARTED isn't set
-	 * anymore, then the request isn't active and the "timeout" should
-	 * just be ignored. This can happen due to the bitflag ordering.
-	 * Timeout first checks if STARTED is set, and if it is, assumes
-	 * the request is active. But if we race with completion, then
-	 * we both flags will get cleared. So check here again, and ignore
-	 * a timeout event with a request that isn't active.
-	 */
-	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
-		return;
+	WARN_ON_ONCE(atomic_read(&req->state) != REQ_COMPLETE);
 
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
@@ -682,7 +668,7 @@  void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		break;
 	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
+		atomic_set(&req->state, REQ_STARTED);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -692,27 +678,24 @@  void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	}
 }
 
+/*
+ * Check whether or not a request has expired. This function can execute
+ * concurrently with other functions that change the request state. This can
+ * result in returning a deadline (blk_mq_timeout_data.next) that occurs
+ * before a request times out. However, this is harmless because the next
+ * call of blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data) will
+ * yield the correct timeout, unless the same race occurs again.
+ */
 static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		/*
-		 * If a request wasn't started before the queue was
-		 * marked dying, kill it here or it'll go unnoticed.
-		 */
-		if (unlikely(blk_queue_dying(rq->q))) {
-			rq->errors = -EIO;
-			blk_mq_end_request(rq, rq->errors);
-		}
-		return;
-	}
-
-	if (time_after_eq(jiffies, rq->deadline)) {
-		if (!blk_mark_rq_complete(rq))
-			blk_mq_rq_timed_out(rq, reserved);
-	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
+	if (time_after_eq(jiffies, rq->deadline) &&
+	    !blk_mark_rq_complete_if_started(rq)) {
+		blk_mq_rq_timed_out(rq, reserved);
+	} else if ((!data->next_set || time_after(data->next, rq->deadline)) &&
+		   blk_mq_request_started(rq)) {
 		data->next = rq->deadline;
 		data->next_set = 1;
 	}
@@ -2821,7 +2804,7 @@  static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+		if (atomic_read(&rq->state) == REQ_COMPLETE)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a30441a200c0..9a8b44ebfb99 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -94,7 +94,7 @@  static void blk_rq_timed_out(struct request *req)
 		break;
 	case BLK_EH_RESET_TIMER:
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
+		atomic_set(&req->state, REQ_NOT_STARTED);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		/*
diff --git a/block/blk.h b/block/blk.h
index d1ea4bd9b9a3..8af5fe21e85f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -115,23 +115,32 @@  void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
-	REQ_ATOM_COMPLETE = 0,
-	REQ_ATOM_STARTED,
 	REQ_ATOM_POLL_SLEPT,
 };
 
 /*
+ * Request states. Note: REQ_STARTED is only used by the blk-mq code.
+ */
+enum rq_state {
+	REQ_NOT_STARTED,
+	REQ_STARTED,
+	REQ_COMPLETE,
+};
+
+/*
  * EH timer and IO completion will both attempt to 'grab' the request, make
- * sure that only one of them succeeds
+ * sure that only one of them succeeds. The return value 0 means that this
+ * function changed the request state from "not complete" into "complete".
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+	return atomic_xchg(&rq->state, REQ_COMPLETE) == REQ_COMPLETE;
 }
 
-static inline void blk_clear_rq_complete(struct request *rq)
+static inline int blk_mark_rq_complete_if_started(struct request *rq)
 {
-	clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+	return atomic_cmpxchg(&rq->state, REQ_STARTED, REQ_COMPLETE) !=
+		REQ_STARTED;
 }
 
 /*
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 939c47df73fa..136379097131 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -672,7 +672,7 @@  static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
 	 *
 	 * In the abort case, sc->scsi_done will do nothing, because
 	 * the block layer must have detected a timeout and as a result
-	 * REQ_ATOM_COMPLETE has been set.
+	 * rq->state == REQ_COMPLETED.
 	 */
 	virtscsi_poll_requests(vscsi);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..b286887b095e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -142,6 +142,7 @@  struct request {
 
 	int internal_tag;
 
+	atomic_t state;
 	unsigned long atomic_flags;
 
 	/* the following two fields are internal, NEVER access directly */