[5/7] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

Message ID 20171216120726.517153-6-tj@kernel.org
State New

Commit Message

Tejun Heo Dec. 16, 2017, 12:07 p.m. UTC
After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.

Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times.  This removes atomic bitops from hot paths too.

v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().

v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c         | 18 ++++++++----------
 block/blk-timeout.c    |  1 +
 include/linux/blkdev.h |  2 ++
 3 files changed, 11 insertions(+), 10 deletions(-)
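
In rough terms, the change swaps an atomic test-and-set gate for a plain flag
check. A condensed before/after sketch in C, paraphrasing the hunks quoted
below rather than copying them literally:

    /* Before: a completer also had to win an atomic bitop on
     * REQ_ATOM_COMPLETE (blk_mark_rq_complete() is a test_and_set_bit()). */
    if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
        !blk_mark_rq_complete(rq))
            __blk_mq_complete_request(rq);

    /* After: the generation/state comparison alone decides who completes. */
    if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
            __blk_mq_complete_request(rq);

    /* The one remaining job of REQ_ATOM_COMPLETE -- not firing the same
     * timeout twice -- moves to a plain, non-atomic rq_flags bit that the
     * timeout path checks before calling blk_mq_rq_timed_out(). */
    if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
        READ_ONCE(rq->gstate) == rq->aborted_gstate)
            blk_mq_rq_timed_out(rq, reserved);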

Comments

jianchao.wang Dec. 21, 2017, 3:56 a.m. UTC | #1
Hi Tejun

On 12/16/2017 08:07 PM, Tejun Heo wrote:
> After the recent updates to use generation number and state based
> synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
> to avoid firing the same timeout multiple times.
> 
> Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
> RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
> times.  This removes atomic bitops from hot paths too.
> 
> v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
> 
> v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
> ---
>  block/blk-mq.c         | 18 ++++++++----------
>  block/blk-timeout.c    |  1 +
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 88baa82..47e722b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -607,14 +607,12 @@ void blk_mq_complete_request(struct request *rq)
>  	 */
>  	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
>  		rcu_read_lock();
> -		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> -		    !blk_mark_rq_complete(rq))
> +		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>  			__blk_mq_complete_request(rq);
>  		rcu_read_unlock();
>  	} else {
>  		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> -		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> -		    !blk_mark_rq_complete(rq))
> +		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>  			__blk_mq_complete_request(rq);
>  		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
>  	}
It's worrying that even though blk_mark_rq_complete() here is intended to synchronize
with the timeout path, it also gives blk_mq_complete_request() the ability to exclude
concurrent invocations of itself. Maybe this capability should be preserved.

Thanks
Jianchao
Tejun Heo Dec. 21, 2017, 1:50 p.m. UTC | #2
Hello,

On Thu, Dec 21, 2017 at 11:56:49AM +0800, jianchao.wang wrote:
> It's worrying that even though blk_mark_rq_complete() here is intended to synchronize
> with the timeout path, it also gives blk_mq_complete_request() the ability to exclude
> concurrent invocations of itself. Maybe this capability should be preserved.

Can you explain further how that'd help?  The problem there is that if
you have two competing completions, where one isn't a timeout, there's
nothing synchronizing the reuse of the request.  IOW, the losing one
can easily complete the next recycle instance.  The atomic bitops
might feel like more protection but it's just a feeling.

Thanks.
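
A timeline sketch of the recycle race described above; the CPU labels and the
exact interleaving are illustrative, not taken from a trace:

    /*
     * CPU A (driver-initiated completion)   CPU B (normal completion, delayed)
     *
     * blk_mq_complete_request(rq)
     *   blk_mark_rq_complete() -> wins
     *   __blk_mq_complete_request(rq)
     * rq finishes, its tag is freed, and
     * the request is recycled for a new
     * I/O, clearing REQ_ATOM_COMPLETE
     *                                        blk_mq_complete_request(rq) runs
     *                                        only now; blk_mark_rq_complete()
     *                                        succeeds against the *new*
     *                                        instance and completes the
     *                                        wrong request
     */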
jianchao.wang Dec. 22, 2017, 4:02 a.m. UTC | #3
Sorry for the lack of detail in my description.

On 12/21/2017 09:50 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 21, 2017 at 11:56:49AM +0800, jianchao.wang wrote:
>> It's worrying that even though blk_mark_rq_complete() here is intended to synchronize
>> with the timeout path, it also gives blk_mq_complete_request() the ability to exclude concurrent invocations of
There could be a scenario where the driver itself stops a request with blk_mq_complete_request(), or
some other interface that invokes it, racing with the normal completion path of the same request.
For example:
a reset could be triggered through sysfs on nvme-rdma
Then the driver will cancel all the reqs, including in-flight ones.
nvme_rdma_reset_ctrl_work()
    nvme_rdma_shutdown_ctrl()
    >>>>
        if (ctrl->ctrl.queue_count > 1) {
            nvme_stop_queues(&ctrl->ctrl); //quiesce the queue
            blk_mq_tagset_busy_iter(&ctrl->tag_set,
                        nvme_cancel_request, &ctrl->ctrl); //invoke blk_mq_complete_request()
            nvme_rdma_destroy_io_queues(ctrl, shutdown);
        }
    >>>>

These operations could race with the normal completion path of the in-flight requests.
All in-flight requests should be drained first here, and there may be some other
places similar to this.
 
>> itself. Maybe this capability should be preserved.
> 
> Can you explain further how that'd help?  The problem there is that if
> you have two competing completions, where one isn't a timeout, there's
> nothing synchronizing the reuse of the request.  IOW, the losing one
> can easily complete the next recycle instance.  The atomic bitops
> might feel like more protection but it's just a feeling.

In the above case, the request may enter the requeue and end paths simultaneously.

Thanks
Jianchao
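
A minimal sketch of the "drain first" idea suggested above, assuming the
device is still able to complete its outstanding requests (the queue pointer
and placement are illustrative; this is not the actual nvme-rdma code):

    /* blk_mq_freeze_queue() waits until every in-flight request has
     * completed, so nothing here can race with the normal completion path. */
    blk_mq_freeze_queue(q);
    /* ... no request is in flight; safe to cancel/tear down ... */
    blk_mq_unfreeze_queue(q);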
Tejun Heo Jan. 8, 2018, 5:27 p.m. UTC | #4
Hello, Jianchao.

On Fri, Dec 22, 2017 at 12:02:20PM +0800, jianchao.wang wrote:
> > On Thu, Dec 21, 2017 at 11:56:49AM +0800, jianchao.wang wrote:
> >> It's worrying that even though blk_mark_rq_complete() here is
> >> intended to synchronize with the timeout path, it also gives
> >> blk_mq_complete_request() the ability to exclude concurrent invocations of
>
> There could be a scenario where the driver itself stops a request
> with blk_mq_complete_request(), or some other interface that invokes
> it, racing with the normal completion path of the same request.

But what'd prevent the completion reinitializing the request and then
the actual completion path coming in and completing the request again?

> For example:
> a reset could be triggered through sysfs on nvme-rdma
> Then the driver will cancel all the reqs, including in-flight ones.
> nvme_rdma_reset_ctrl_work()
>     nvme_rdma_shutdown_ctrl()
>     >>>>
>         if (ctrl->ctrl.queue_count > 1) {
>             nvme_stop_queues(&ctrl->ctrl); //quiesce the queue
>             blk_mq_tagset_busy_iter(&ctrl->tag_set,
>                         nvme_cancel_request, &ctrl->ctrl); //invoke blk_mq_complete_request()
>             nvme_rdma_destroy_io_queues(ctrl, shutdown);
>         }
>     >>>>
> 
> These operations could race with the normal completion path of the in-flight
> requests. All in-flight requests should be drained first here, and there may
> be some other places similar to this.

If there are any such places, they should be using an interface which
is properly synchronized, like blk_abort_request(), which btw is what
libata already does.  Otherwise, it's racy with or without these
patches.

Thanks.
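
For illustration, a sketch of routing a driver-initiated cancel through the
synchronized path instead of calling blk_mq_complete_request() directly. The
callback name is hypothetical; blk_abort_request() and
blk_mq_tagset_busy_iter() are the existing interfaces:

    /* Hypothetical busy_tag_iter_fn callback -- a sketch, not libata or
     * nvme-rdma code. blk_abort_request() feeds the request into the
     * timeout machinery, which already excludes the normal completion path. */
    static void my_drv_cancel_request(struct request *rq, void *data,
                                      bool reserved)
    {
            blk_abort_request(rq);
    }

    /* e.g. in place of the nvme_cancel_request() iteration above: */
    blk_mq_tagset_busy_iter(&ctrl->tag_set, my_drv_cancel_request, NULL);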
jianchao.wang Jan. 9, 2018, 3:08 a.m. UTC | #5
Hi Tejun

Many thanks for your kind response.

On 01/09/2018 01:27 AM, Tejun Heo wrote:
> Hello, Jianchao.
> 
> On Fri, Dec 22, 2017 at 12:02:20PM +0800, jianchao.wang wrote:
>>> On Thu, Dec 21, 2017 at 11:56:49AM +0800, jianchao.wang wrote:
>>>> It's worrying that even though blk_mark_rq_complete() here is
>>>> intended to synchronize with the timeout path, it also gives
>>>> blk_mq_complete_request() the ability to exclude concurrent invocations of
>>
>> There could be a scenario where the driver itself stops a request
>> with blk_mq_complete_request(), or some other interface that invokes
>> it, racing with the normal completion path of the same request.
> 
> But what'd prevent the completion reinitializing the request and then
> the actual completion path coming in and completing the request again?
> 
blk_mark_rq_complete() gates this and ensures that only one
__blk_mq_complete_request() is invoked.

>> For example:
>> a reset could be triggered through sysfs on nvme-rdma
>> Then the driver will cancel all the reqs, including in-flight ones.
>> nvme_rdma_reset_ctrl_work()
>>     nvme_rdma_shutdown_ctrl()
>>     >>>>
>>         if (ctrl->ctrl.queue_count > 1) {
>>             nvme_stop_queues(&ctrl->ctrl); //quiesce the queue
>>             blk_mq_tagset_busy_iter(&ctrl->tag_set,
>>                         nvme_cancel_request, &ctrl->ctrl); //invoke blk_mq_complete_request()
>>             nvme_rdma_destroy_io_queues(ctrl, shutdown);
>>         }
>>     >>>>
>>
>> These operations could race with the normal completion path of the in-flight
>> requests. All in-flight requests should be drained first here, and there may
>> be some other places similar to this.
> 
> If there are any such places, they should be using an interface which
> is properly synchronized, like blk_abort_request(), which btw is what
> libata already does.  Otherwise, it's racy with or without these
> patches.
Yes, exactly.

Thanks for your kind response again.
Jianchao

Tejun Heo Jan. 9, 2018, 3:37 a.m. UTC | #6
Hello,

On Tue, Jan 09, 2018 at 11:08:04AM +0800, jianchao.wang wrote:
> > But what'd prevent the completion reinitializing the request and then
> > the actual completion path coming in and completing the request again?
>
> blk_mark_rq_complete() gates this and ensures that only one
> __blk_mq_complete_request() is invoked.

Yeah, but then the complete flag will be cleared once completion is
done and the request is reinitialized.

Thanks.
jianchao.wang Jan. 9, 2018, 5:59 a.m. UTC | #7
Hi Tejun

Many thanks for your kind response.

On 01/09/2018 11:37 AM, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 09, 2018 at 11:08:04AM +0800, jianchao.wang wrote:
>>> But what'd prevent the completion reinitializing the request and then
>>> the actual completion path coming in and completing the request again?
>>
>> blk_mark_rq_complete() gates this and ensures that only one
>> __blk_mq_complete_request() is invoked.
> 
> Yeah, but then the complete flag will be cleared once completion is
> done and the request is reinitialized.

Yes, it is. What I mean is not to keep the blk_mark_rq_complete() and REQ_ATOM_COMPLETE
usages in blk-mq, but that as a side effect blk_mq_complete_request() can no longer
exclude concurrent invocations of itself. As you said, that scenario is racy and should
be fixed. :)

Thanks
Jianchao 

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 88baa82..47e722b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -607,14 +607,12 @@  void blk_mq_complete_request(struct request *rq)
 	 */
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-		    !blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 			__blk_mq_complete_request(rq);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
-		    !blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 			__blk_mq_complete_request(rq);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
@@ -665,8 +663,6 @@  void blk_mq_start_request(struct request *rq)
 	preempt_enable();
 
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
-		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -817,6 +813,8 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
 		return;
 
+	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
+
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -832,7 +830,6 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		 */
 		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
-		blk_clear_rq_complete(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -851,7 +848,8 @@  static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
 	might_sleep();
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) ||
+	    !test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
 	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
@@ -886,8 +884,8 @@  static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 	 * now guaranteed to see @rq->aborted_gstate and yield.  If
 	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
 	 */
-	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
-	    !blk_mark_rq_complete(rq))
+	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
+	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index d580af3..25c4ffa 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -209,6 +209,7 @@  void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	req->deadline = jiffies + req->timeout;
+	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2d6fd11..13186a7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -123,6 +123,8 @@  typedef __u32 __bitwise req_flags_t;
 /* Look at ->special_vec for the actual data payload instead of the
    bio chain. */
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
+/* timeout is expired */
+#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 19))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \