
[2/5] blk-mq: introduce blk_mq_tagset_wait_completed_request()

Message ID 20190722053954.25423-3-ming.lei@redhat.com
State New, archived
Series blk-mq: wait until completed req's complete fn is run

Commit Message

Ming Lei July 22, 2019, 5:39 a.m. UTC
blk-mq may schedule a request's complete function to run on a remote CPU
via IPI, but it doesn't provide any way to synchronize against that
completion.

In some drivers' error handling (such as NVMe's), the hardware queue's
resources may be freed and re-allocated. If a completed request's
complete fn only runs after the hardware queue's resources have been
released, a kernel crash is triggered.

Prepare for fixing this kind of issue by introducing
blk_mq_tagset_wait_completed_request().

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c     | 32 ++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h |  1 +
 2 files changed, 33 insertions(+)
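
For illustration, a hedged sketch of how a driver's error-handling
teardown might use the new helper. Only the blk_mq_* calls are existing
or in-series API; the example_* names and the controller structure are
made up:

#include <linux/blk-mq.h>

struct example_ctrl {
	struct blk_mq_tag_set	tagset;
	struct request_queue	*queue;
};

/* busy_tag_iter_fn: abort one request that dead hardware will never finish */
static bool example_cancel_request(struct request *rq, void *data, bool reserved)
{
	if (blk_mq_request_completed(rq))
		return true;		/* already completed, leave it alone */

	blk_mq_complete_request(rq);	/* may schedule ->complete() via IPI */
	return true;
}

static void example_teardown_hw_queues(struct example_ctrl *ctrl)
{
	/* stop dispatching new requests, then abort the in-flight ones */
	blk_mq_quiesce_queue(ctrl->queue);
	blk_mq_tagset_busy_iter(&ctrl->tagset, example_cancel_request, ctrl);

	/*
	 * Some of the completions above may have been scheduled on remote
	 * CPUs via IPI.  Wait until every request in the COMPLETE state
	 * has actually run its complete fn.
	 */
	blk_mq_tagset_wait_completed_request(&ctrl->tagset);

	/* only now is it safe to free/re-allocate hardware queue resources */
}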

Comments

Bart Van Assche July 22, 2019, 3:25 p.m. UTC | #1
On 7/21/19 10:39 PM, Ming Lei wrote:
> blk-mq may schedule a request's complete function to run on a remote
> CPU via IPI, but it doesn't provide any way to synchronize against
> that completion.
> 
> In some drivers' error handling (such as NVMe's), the hardware queue's
> resources may be freed and re-allocated. If a completed request's
> complete fn only runs after the hardware queue's resources have been
> released, a kernel crash is triggered.
> 
> Prepare for fixing this kind of issue by introducing
> blk_mq_tagset_wait_completed_request().

An explanation is missing of why the block layer is modified to fix this 
instead of the NVMe driver.

Thanks,

Bart.
Ming Lei July 23, 2019, 1:06 a.m. UTC | #2
On Mon, Jul 22, 2019 at 08:25:07AM -0700, Bart Van Assche wrote:
> On 7/21/19 10:39 PM, Ming Lei wrote:
> > blk-mq may schedule a request's complete function to run on a remote
> > CPU via IPI, but it doesn't provide any way to synchronize against
> > that completion.
> > 
> > In some drivers' error handling (such as NVMe's), the hardware queue's
> > resources may be freed and re-allocated. If a completed request's
> > complete fn only runs after the hardware queue's resources have been
> > released, a kernel crash is triggered.
> > 
> > Prepare for fixing this kind of issue by introducing
> > blk_mq_tagset_wait_completed_request().
> 
> An explanation is missing of why the block layer is modified to fix this
> instead of the NVMe driver.

The commit log above explains that blk-mq has no sync mechanism wrt.
request completion, and similar issues may show up in other drivers in
the future.


Thanks,
Ming
Sagi Grimberg July 23, 2019, 8:27 p.m. UTC | #3
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Bart Van Assche July 23, 2019, 8:54 p.m. UTC | #4
On 7/22/19 6:06 PM, Ming Lei wrote:
> On Mon, Jul 22, 2019 at 08:25:07AM -0700, Bart Van Assche wrote:
>> On 7/21/19 10:39 PM, Ming Lei wrote:
>>> blk-mq may schedule a request's complete function to run on a remote
>>> CPU via IPI, but it doesn't provide any way to synchronize against
>>> that completion.
>>>
>>> In some drivers' error handling (such as NVMe's), the hardware queue's
>>> resources may be freed and re-allocated. If a completed request's
>>> complete fn only runs after the hardware queue's resources have been
>>> released, a kernel crash is triggered.
>>>
>>> Prepare for fixing this kind of issue by introducing
>>> blk_mq_tagset_wait_completed_request().
>>
>> An explanation is missing of why the block layer is modified to fix this
>> instead of the NVMe driver.
> 
> The commit log above explains that blk-mq has no sync mechanism wrt.
> request completion, and similar issues may show up in other drivers in
> the future.

That is not sufficient motivation to modify the block layer, because
there is already a way to wait until request completions have finished,
namely the request queue freeze mechanism. Have you considered using
that mechanism instead of introducing
blk_mq_tagset_wait_completed_request()?

Thanks,

Bart.
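
For reference, the freeze mechanism Bart refers to: blk_mq_freeze_queue()
blocks until every outstanding request on the queue has finished and been
freed. A minimal sketch, assuming q is a live request_queue:

#include <linux/blk-mq.h>

static void example_drain_with_freeze(struct request_queue *q)
{
	/* blocks until all outstanding requests on q have been ended */
	blk_mq_freeze_queue(q);

	/* no request is in flight and no new one can enter the queue here */

	blk_mq_unfreeze_queue(q);	/* allow new requests again */
}
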
Ming Lei July 24, 2019, 1:34 a.m. UTC | #5
On Tue, Jul 23, 2019 at 01:54:52PM -0700, Bart Van Assche wrote:
> On 7/22/19 6:06 PM, Ming Lei wrote:
> > On Mon, Jul 22, 2019 at 08:25:07AM -0700, Bart Van Assche wrote:
> > > On 7/21/19 10:39 PM, Ming Lei wrote:
> > > > blk-mq may schedule a request's complete function to run on a remote
> > > > CPU via IPI, but it doesn't provide any way to synchronize against
> > > > that completion.
> > > > 
> > > > In some drivers' error handling (such as NVMe's), the hardware queue's
> > > > resources may be freed and re-allocated. If a completed request's
> > > > complete fn only runs after the hardware queue's resources have been
> > > > released, a kernel crash is triggered.
> > > > 
> > > > Prepare for fixing this kind of issue by introducing
> > > > blk_mq_tagset_wait_completed_request().
> > > 
> > > An explanation is missing of why the block layer is modified to fix this
> > > instead of the NVMe driver.
> > 
> > The commit log above explains that blk-mq has no sync mechanism wrt.
> > request completion, and similar issues may show up in other drivers
> > in the future.
> 
> That is not sufficient motivation to modify the block layer, because
> there is already a way to wait until request completions have finished,
> namely the request queue freeze mechanism. Have you considered using
> that mechanism instead of introducing blk_mq_tagset_wait_completed_request()?

The introduced interface is used in EH, during which the aborted
requests stay in the blk-mq sw/scheduler queues, so a queue freeze would
deadlock. We simply can't use it.

Thanks,
Ming
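
A hedged sketch of that distinction; the comments are illustrative and
only the blk-mq call comes from this patch:

#include <linux/blk-mq.h>

/*
 * blk_mq_freeze_queue() waits for the queue's q_usage_counter to drain,
 * i.e. for every allocated request to be ended and freed.  During EH the
 * aborted requests are still parked in the blk-mq sw/scheduler queues
 * and cannot make progress until the queues are restarted, so that wait
 * would never return.
 */
static void example_eh_drain_completions(struct blk_mq_tag_set *set)
{
	/*
	 * The new helper only waits for requests already in the COMPLETE
	 * state to finish running their complete fn, which is exactly what
	 * must happen before the hardware queue resources are freed.
	 */
	blk_mq_tagset_wait_completed_request(set);
}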

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index da19f0bc8876..008388e82b5c 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -10,6 +10,7 @@ 
 #include <linux/module.h>
 
 #include <linux/blk-mq.h>
+#include <linux/delay.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
@@ -354,6 +355,37 @@  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
+		void *data, bool reserved)
+{
+	unsigned *count = data;
+
+	if (blk_mq_request_completed(rq))
+		(*count)++;
+	return true;
+}
+
+/**
+ * blk_mq_tagset_wait_completed_request - wait until every completed
+ * request's complete function has run
+ * @tagset:	Tag set to drain completed requests
+ *
+ * Note: This function has to be run after all IO queues are shut down
+ */
+void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
+{
+	while (true) {
+		unsigned count = 0;
+
+		blk_mq_tagset_busy_iter(tagset,
+				blk_mq_tagset_count_completed_rqs, &count);
+		if (!count)
+			break;
+		msleep(5);
+	}
+}
+EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);
+
 /**
  * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
  * @q:		Request queue to examine.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index baac2926e54a..ee0719b649b6 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -322,6 +322,7 @@  bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
+void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);