
[v2,4/7] blk-mq: Introduce blk_quiesce_queue() and blk_resume_queue()

Message ID a11bb09a-6207-f4a8-cfd4-4fe5627c1700@sandisk.com (mailing list archive)
State Not Applicable

Commit Message

Bart Van Assche Oct. 5, 2016, 4:16 a.m. UTC
On 10/01/16 15:56, Ming Lei wrote:
> If we just take the rcu/srcu read lock (or the mutex) around .queue_rq(), the
> above code no longer needs to be duplicated.

Hello Ming,

Can you have a look at the attached patch? That patch uses an srcu read 
lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has 
been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. 
Just like previous versions, this patch has been tested.

Thanks,

Bart.

Comments

Ming Lei Oct. 5, 2016, 4:32 a.m. UTC | #1
On Wed, Oct 5, 2016 at 12:16 PM, Bart Van Assche
<bart.vanassche@sandisk.com> wrote:
> On 10/01/16 15:56, Ming Lei wrote:
>>
>> If we just take the rcu/srcu read lock (or the mutex) around .queue_rq(),
>> the above code no longer needs to be duplicated.
>
>
> Hello Ming,
>
> Can you have a look at the attached patch? That patch uses an srcu read lock
> for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has been set.

That is much cleaner now.

> Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. Just like
> previous versions, this patch has been tested.

I think the QUEUE_FLAG_QUIESCING flag is still needed: we have to set
this flag to prevent new .queue_rq() calls from being started, because
synchronize_srcu() does not wait for those at all (see the
'Update-Side Primitives' section of [1]).


[1] https://lwn.net/Articles/202847/
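
A minimal sketch of that ordering (assuming the existing
BLK_MQ_S_STOPPED bit, or a new queue flag, is what keeps new
.queue_rq() calls out):

queue_for_each_hw_ctx(q, hctx, i)
        set_bit(BLK_MQ_S_STOPPED, &hctx->state);
/*
 * synchronize_srcu() only waits for read-side critical sections
 * that have already begun, so the flag must be set (and honoured
 * by the submission path) before synchronizing.
 */
queue_for_each_hw_ctx(q, hctx, i)
        synchronize_srcu(&hctx->queue_rq_srcu);
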
Bart Van Assche Oct. 5, 2016, 2:46 p.m. UTC | #2
On 10/04/16 21:32, Ming Lei wrote:
> On Wed, Oct 5, 2016 at 12:16 PM, Bart Van Assche
> <bart.vanassche@sandisk.com> wrote:
>> On 10/01/16 15:56, Ming Lei wrote:
>>>
>>> If we just take the rcu/srcu read lock (or the mutex) around .queue_rq(),
>>> the above code no longer needs to be duplicated.
>>
>> Can you have a look at the attached patch? That patch uses an srcu read lock
>> for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has been set.
>
> That is much cleaner now.
>
>> Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. Just like
>> previous versions, this patch has been tested.
>
> I think the QUEUE_FLAG_QUIESCING flag is still needed: we have to set
> this flag to prevent new .queue_rq() calls from being started, because
> synchronize_srcu() does not wait for those at all (see the
> 'Update-Side Primitives' section of [1]).
>
> [1] https://lwn.net/Articles/202847/

Hello Ming,

How about using the existing flag BLK_MQ_S_STOPPED instead of 
introducing a new QUEUE_FLAG_QUIESCING flag? From the comment above 
blk_mq_quiesce_queue() in the patch that was attached to my previous 
e-mail: "Additionally, it is not prevented that new queue_rq() calls 
occur unless the queue has been stopped first."
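
In that case the caller side would look something like the sketch below
(the restart step is an assumption about how a driver would resume, not
part of this patch):

blk_mq_stop_hw_queues(q);       /* set BLK_MQ_S_STOPPED on every hctx */
blk_mq_quiesce_queue(q);        /* wait for queue_rq() calls in flight */
/* ... reconfigure the device ... */
blk_mq_start_stopped_hw_queues(q, true);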

Thanks,

Bart.
Ming Lei Oct. 5, 2016, 4:11 p.m. UTC | #3
On Wed, Oct 5, 2016 at 10:46 PM, Bart Van Assche
<bart.vanassche@sandisk.com> wrote:
> On 10/04/16 21:32, Ming Lei wrote:
>>
>> On Wed, Oct 5, 2016 at 12:16 PM, Bart Van Assche
>> <bart.vanassche@sandisk.com> wrote:
>>>
>>> On 10/01/16 15:56, Ming Lei wrote:
>>>>
>>>>
>>>> If we just take the rcu/srcu read lock (or the mutex) around .queue_rq(),
>>>> the above code no longer needs to be duplicated.
>>>
>>>
>>> Can you have a look at the attached patch? That patch uses an srcu read
>>> lock
>>> for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has been
>>> set.
>>
>>
>> That is much cleaner now.
>>
>>> Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag. Just like
>>> previous versions, this patch has been tested.
>>
>>
>> I think the QUEUE_FLAG_QUIESCING flag is still needed: we have to set
>> this flag to prevent new .queue_rq() calls from being started, because
>> synchronize_srcu() does not wait for those at all (see the
>> 'Update-Side Primitives' section of [1]).
>>
>> [1] https://lwn.net/Articles/202847/
>
>
> Hello Ming,
>
> How about using the existing flag BLK_MQ_S_STOPPED instead of introducing a
> new QUEUE_FLAG_QUIESCING flag? From the comment above blk_mq_quiesce_queue()

That looks fine, though we also need to stop direct issue once a hw
queue becomes BLK_MQ_S_STOPPED.
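
Something along these lines in the direct-issue path of
blk_mq_make_request(), as a sketch on top of your patch:

srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
if (test_bit(BLK_MQ_S_STOPPED, &data.hctx->state) ||
    blk_mq_direct_issue_request(old_rq, &cookie) != 0)
        blk_mq_insert_request(old_rq, false, true, true);
srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);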

> in the patch that was attached to my previous e-mail: "Additionally, it is
> not prevented that new queue_rq() calls occur unless the queue has been
> stopped first."
>
> Thanks,
>
> Bart.
Sagi Grimberg Oct. 5, 2016, 6:14 p.m. UTC | #4
> Hello Ming,
>
> Can you have a look at the attached patch? That patch uses an srcu read
> lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has
> been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag.
> Just like previous versions, this patch has been tested.

Hey Bart,

Do we care about the synchronization of queue_rq and/or
blk_mq_run_hw_queue if the hctx is not stopped?

I'm wondering if we can avoid introducing new barriers in the
submission path if it's not absolutely needed.
Bart Van Assche Oct. 5, 2016, 7:05 p.m. UTC | #5
On 10/05/2016 11:14 AM, Sagi Grimberg wrote:
>> Hello Ming,
>>
>> Can you have a look at the attached patch? That patch uses an srcu read
>> lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has
>> been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag.
>> Just like previous versions, this patch has been tested.
>
> Hey Bart,
>
> Do we care about the synchronization of queue_rq and/or
> blk_mq_run_hw_queue if the hctx is not stopped?
>
> I'm wondering if we can avoid introducing new barriers in the
> submission path if it's not absolutely needed.

Hello Sagi,

I'm not sure whether the new blk_quiesce_queue() function is useful 
without stopping all hardware contexts first. In other words, in my view 
setting the BLK_MQ_F_BLOCKING flag before calling blk_quiesce_queue() is 
sufficient and I don't think that a new QUEUE_FLAG_QUIESCING flag is 
necessary.

Bart.
Sagi Grimberg Oct. 5, 2016, 7:10 p.m. UTC | #6
>>> Hello Ming,
>>>
>>> Can you have a look at the attached patch? That patch uses an srcu read
>>> lock for all queue types, whether or not the BLK_MQ_F_BLOCKING flag has
>>> been set. Additionally, I have dropped the QUEUE_FLAG_QUIESCING flag.
>>> Just like previous versions, this patch has been tested.
>>
>> Hey Bart,
>>
>> Do we care about the synchronization of queue_rq and/or
>> blk_mq_run_hw_queue if the hctx is not stopped?
>>
>> I'm wondering if we can avoid introducing new barriers in the
>> submission path if it's not absolutely needed.
>
> Hello Sagi,

Hey Bart,

>
> I'm not sure whether the new blk_quiesce_queue() function is useful
> without stopping all hardware contexts first. In other words, in my view
> setting the BLK_MQ_F_BLOCKING flag before calling blk_quiesce_queue() is
> sufficient and I don't think that a new QUEUE_FLAG_QUIESCING flag is
> necessary.

I was referring to whether we can take the srcu lock in the submission
path conditionally on the hctx being STOPPED?
Bart Van Assche Oct. 5, 2016, 9:08 p.m. UTC | #7
On 10/05/2016 12:11 PM, Sagi Grimberg wrote:
> I was referring to whether we can take the srcu lock in the submission
> path conditionally on the hctx being STOPPED?

Hello Sagi,

Regarding run-time overhead:
* rcu_read_lock() is a no-op on CONFIG_PREEMPT_NONE kernels and is
   translated into preempt_disable() on preemptible kernels. The latter
   function modifies a per-cpu variable.
* Checking BLK_MQ_S_STOPPED before taking an rcu or srcu lock is only
   safe if the BLK_MQ_S_STOPPED flag is tested in such a way that the
   compiler is told to reread the hctx state (READ_ONCE()) and if the
   compiler and CPU are told not to reorder test_bit() with the
   memory accesses in (s)rcu_read_lock(). To avoid races,
   BLK_MQ_S_STOPPED will have to be tested a second time after the lock
   has been obtained, similar to the double-checked-locking pattern
   (sketched below).
* srcu_read_lock() reads a word from the srcu structure, disables
   preemption, calls __srcu_read_lock() and re-enables preemption. The
   latter function increments two CPU-local variables and triggers a
   memory barrier (smp_mb()).
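
For reference, the double-checked pattern would look roughly as follows
(hypothetical code, not part of the attached patch):

int srcu_idx;

if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))
        return;         /* unlocked fast-path test; may race */
srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))) {
        /* the flag may have been set between the first test and
         * srcu_read_lock(), so it must be tested again under the lock */
        srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
        return;
}
/* ... invoke .queue_rq() ... */
srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);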

Swapping srcu_read_lock() and the BLK_MQ_S_STOPPED flag test will make 
the code more complicated. Going back to an implementation that calls 
rcu_read_lock() when .queue_rq() won't sleep would be easier to read and 
to verify. If I overlooked something, please let me know.

Bart.


Ming Lei Oct. 5, 2016, 10:49 p.m. UTC | #8
On Thu, Oct 6, 2016 at 5:08 AM, Bart Van Assche
<bart.vanassche@sandisk.com> wrote:
> On 10/05/2016 12:11 PM, Sagi Grimberg wrote:
>>
>> I was referring to whether we can take the srcu lock in the submission
>> path conditionally on the hctx being STOPPED?
>
>
> Hello Sagi,
>
> Regarding run-time overhead:
> * rcu_read_lock() is a no-op on CONFIG_PREEMPT_NONE kernels and is
>   translated into preempt_disable() on preemptible kernels. The latter
>   function modifies a per-cpu variable.
> * Checking BLK_MQ_S_STOPPED before taking an rcu or srcu lock is only
>   safe if the BLK_MQ_S_STOPPED flag is tested in such a way that the
>   compiler is told to reread the hctx state (READ_ONCE()) and if the
>   compiler and CPU are told not to reorder test_bit() with the
>   memory accesses in (s)rcu_read_lock(). To avoid races,
>   BLK_MQ_S_STOPPED will have to be tested a second time after the lock
>   has been obtained, similar to the double-checked-locking pattern.
> * srcu_read_lock() reads a word from the srcu structure, disables
>   preemption, calls __srcu_read_lock() and re-enables preemption. The
>   latter function increments two CPU-local variables and triggers a
>   memory barrier (smp_mb()).

We can use the srcu read lock for BLOCKING and the rcu read lock for
non-BLOCKING by putting *_read_lock() and *_read_unlock() into two
wrappers, which should minimize the cost of the srcu read lock and
unlock while keeping the code easy to read and verify.
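
E.g. a pair of wrappers along these lines (names are hypothetical):

static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
{
        if (hctx->flags & BLK_MQ_F_BLOCKING)
                *srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
        else
                rcu_read_lock();
}

static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
{
        if (hctx->flags & BLK_MQ_F_BLOCKING)
                srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
        else
                rcu_read_unlock();
}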

>
> Swapping srcu_read_lock() and the BLK_MQ_S_STOPPED flag test will make
> the code more complicated. Going back to an implementation that calls
> rcu_read_lock() when .queue_rq() won't sleep would be easier to read
> and to verify.

Yeah, I agree.

Thanks,
Ming Lei
Bart Van Assche Oct. 5, 2016, 11 p.m. UTC | #9
On 10/05/2016 03:49 PM, Ming Lei wrote:
> We can use the srcu read lock for BLOCKING and the rcu read lock for
> non-BLOCKING by putting *_read_lock() and *_read_unlock() into two
> wrappers, which should minimize the cost of the srcu read lock and
> unlock while keeping the code easy to read and verify.

Hello Ming,

The lock checking algorithms in the sparse and smatch static checkers 
are unable to deal with code of the form "if (condition) (un)lock()". So 
unless someone has a better proposal, my preference is to use the 
approach from the patch at the start of this e-mail thread.

Thanks,

Bart.


Patch

From 25f02ed7ab7b2308fd18b89d180c0c613e55d416 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanassche@sandisk.com>
Date: Tue, 27 Sep 2016 10:52:36 -0700
Subject: [PATCH] blk-mq: Introduce blk_mq_quiesce_queue()

blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have finished (this means invocation of request.end_io()).

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-mq.c         | 40 ++++++++++++++++++++++++++++++++++------
 include/linux/blk-mq.h |  3 +++
 include/linux/blkdev.h |  1 +
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d8c45de..38ae685 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -115,6 +115,23 @@  void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing queue_rq calls have finished
+ *
+ * Note: this function does not prevent that the struct request end_io()
+ * callback function is invoked. Additionally, it is not prevented that
+ * new queue_rq() calls occur unless the queue has been stopped first.
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		synchronize_srcu(&hctx->queue_rq_srcu);
+}
+EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -789,11 +806,13 @@  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(rq_list);
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
-	int queued;
+	int queued, srcu_idx;
 
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
 		return;
 
+	srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
+
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
 		cpu_online(hctx->next_cpu));
 
@@ -885,6 +904,8 @@  static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
 	}
+
+	srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
 }
 
 /*
@@ -1298,7 +1319,7 @@  static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	const int is_flush_fua = bio->bi_opf & (REQ_PREFLUSH | REQ_FUA);
 	struct blk_map_ctx data;
 	struct request *rq;
-	unsigned int request_count = 0;
+	unsigned int request_count = 0, srcu_idx;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
@@ -1341,7 +1362,7 @@  static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
-		 * We do limited pluging. If the bio can be merged, do that.
+		 * We do limited plugging. If the bio can be merged, do that.
 		 * Otherwise the existing request in the plug list will be
 		 * issued. So the plug list will have one request at most
 		 */
@@ -1361,9 +1382,12 @@  static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		blk_mq_put_ctx(data.ctx);
 		if (!old_rq)
 			goto done;
-		if (!blk_mq_direct_issue_request(old_rq, &cookie))
-			goto done;
-		blk_mq_insert_request(old_rq, false, true, true);
+
+		srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
+		if (blk_mq_direct_issue_request(old_rq, &cookie) != 0)
+			blk_mq_insert_request(old_rq, false, true, true);
+		srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
+
 		goto done;
 	}
 
@@ -1659,6 +1683,8 @@  static void blk_mq_exit_hctx(struct request_queue *q,
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
+	cleanup_srcu_struct(&hctx->queue_rq_srcu);
+
 	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
 	blk_free_flush_queue(hctx->fq);
 	sbitmap_free(&hctx->ctx_map);
@@ -1741,6 +1767,8 @@  static int blk_mq_init_hctx(struct request_queue *q,
 				   flush_start_tag + hctx_idx, node))
 		goto free_fq;
 
+	init_srcu_struct(&hctx->queue_rq_srcu);
+
 	return 0;
 
  free_fq:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 368c460d..b2ccd3c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -3,6 +3,7 @@ 
 
 #include <linux/blkdev.h>
 #include <linux/sbitmap.h>
+#include <linux/srcu.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -41,6 +42,8 @@  struct blk_mq_hw_ctx {
 
 	struct blk_mq_tags	*tags;
 
+	struct srcu_struct	queue_rq_srcu;
+
 	unsigned long		queued;
 	unsigned long		run;
 #define BLK_MQ_MAX_DISPATCH_ORDER	7
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358..8259d87 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -824,6 +824,7 @@  extern void __blk_run_queue(struct request_queue *q);
 extern void __blk_run_queue_uncond(struct request_queue *q);
 extern void blk_run_queue(struct request_queue *);
 extern void blk_run_queue_async(struct request_queue *q);
+extern void blk_mq_quiesce_queue(struct request_queue *q);
 extern int blk_rq_map_user(struct request_queue *, struct request *,
 			   struct rq_map_data *, void __user *, unsigned long,
 			   gfp_t);
-- 
2.9.3