
[V10,07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive

Message ID 20200505020930.1146281-8-ming.lei@redhat.com (mailing list archive)
State New, archived
Series blk-mq: improvement CPU hotplug

Commit Message

Ming Lei May 5, 2020, 2:09 a.m. UTC
Before a CPU goes offline, check whether it is the last online CPU of the
hctx. If so, mark this hctx as inactive and wait for completion of all
in-flight IOs originated from this hctx. Meanwhile, check in
blk_mq_get_driver_tag() whether this hctx has become inactive; if so,
release the allocated tag.

This guarantees that there is no in-flight IO before shutting down the
managed IRQ line when all CPUs of this IRQ line are offline.
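
For orientation, a condensed sketch of the two sides of the mechanism,
simplified from the hunks in this patch (it is not the literal code):

    /* CPU-offline side (cpuhp callback), simplified: mark the hctx inactive,
     * then wait until no driver tag of this hctx is in use any more. */
    if (blk_mq_last_cpu_in_hctx(cpu, hctx)) {
            set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
            smp_mb__after_atomic();         /* pairs with smp_mb() below */
            while (blk_mq_tags_inflight_rqs(hctx))
                    msleep(5);
    }

    /* issue side, simplified: after a driver tag is assigned, re-check
     * BLK_MQ_S_INACTIVE and back out if the hctx has gone inactive. */
    if (!__blk_mq_get_driver_tag(rq))
            return false;
    smp_mb();       /* barrier() suffices while still on a CPU of this hctx */
    if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)) {
            blk_mq_put_driver_tag(rq);
            return false;
    }
    return true;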

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |   1 +
 block/blk-mq.c         | 124 +++++++++++++++++++++++++++++++++++++----
 include/linux/blk-mq.h |   3 +
 3 files changed, 117 insertions(+), 11 deletions(-)

Comments

Bart Van Assche May 8, 2020, 11:39 p.m. UTC | #1
On 2020-05-04 19:09, Ming Lei wrote:
> -static bool blk_mq_get_driver_tag(struct request *rq)
> +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
>  {
>  	if (rq->tag != -1)
>  		return true;
> -	return __blk_mq_get_driver_tag(rq);
> +
> +	if (!__blk_mq_get_driver_tag(rq))
> +		return false;
> +	/*
> +	 * In case that direct issue IO process is migrated to other CPU
> +	 * which may not belong to this hctx, add one memory barrier so we
> +	 * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> +	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> +	 * and driver tag assignment are run on the same CPU because
> +	 * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
> +	 * becoming offline.
> +	 *
> +	 * Process migration might happen after the check on current processor
> +	 * id, smp_mb() is implied by processor migration, so no need to worry
> +	 * about it.
> +	 */
> +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> +		smp_mb();
> +	else
> +		barrier();
> +
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> +		blk_mq_put_driver_tag(rq);
> +		return false;
> +	}
> +	return true;
>  }

How much does this patch slow down the hot path?

Can CPU migration be fixed without affecting the hot path, e.g. by using
the request queue freezing mechanism?

Thanks,

Bart.
Ming Lei May 9, 2020, 2:20 a.m. UTC | #2
On Fri, May 08, 2020 at 04:39:46PM -0700, Bart Van Assche wrote:
> On 2020-05-04 19:09, Ming Lei wrote:
> > -static bool blk_mq_get_driver_tag(struct request *rq)
> > +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
> >  {
> >  	if (rq->tag != -1)
> >  		return true;
> > -	return __blk_mq_get_driver_tag(rq);
> > +
> > +	if (!__blk_mq_get_driver_tag(rq))
> > +		return false;
> > +	/*
> > +	 * In case that direct issue IO process is migrated to other CPU
> > +	 * which may not belong to this hctx, add one memory barrier so we
> > +	 * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> > +	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> > +	 * and driver tag assignment are run on the same CPU because
> > +	 * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
> > +	 * becoming offline.
> > +	 *
> > +	 * Process migration might happen after the check on current processor
> > +	 * id, smp_mb() is implied by processor migration, so no need to worry
> > +	 * about it.
> > +	 */
> > +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> > +		smp_mb();
> > +	else
> > +		barrier();
> > +
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> > +		blk_mq_put_driver_tag(rq);
> > +		return false;
> > +	}
> > +	return true;
> >  }
> 
> How much does this patch slow down the hot path?

Basically zero cost is added to the hot path. Specifically:

> +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))

In case of direct issue, the chance that the IO process has migrated is very
small, since direct issue basically follows request allocation and the window
is quite short, so smp_mb() won't be run most of the time.

> +		smp_mb();
> +	else
> +		barrier();

So barrier() is taken most of the time, and its effect can be ignored
since it is just a compiler barrier.

> +
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {

hctx->state is checked in the hot path anyway, so this adds basically zero cost.

> +		blk_mq_put_driver_tag(rq);
> +		return false;
> +	}

> 
> Can CPU migration be fixed without affecting the hot path, e.g. by using
> the request queue freezing mechanism?

Why would we want to prevent CPU migration of the direct-issue IO process?
It may not be necessary, and it would be quite difficult:

1) preempt disable was removed previously in a cleanup patch once the
request is allocated

2) we have drivers which may set BLOCKING, so .queue_rq() may sleep

Not sure why you mention queue freezing.


Thanks,
Ming
Bart Van Assche May 9, 2020, 3:24 a.m. UTC | #3
On 2020-05-08 19:20, Ming Lei wrote:
> Not sure why you mention queue freezing.

This patch series introduces a fundamental race between modifying the
hardware queue state (BLK_MQ_S_INACTIVE) and tag allocation. The only
mechanism I know of for enforcing the order in which another thread
observes writes to different memory locations without inserting a memory
barrier in the hot path is RCU (see also The RCU-barrier menagerie;
https://lwn.net/Articles/573497/). The only existing such mechanism in
the blk-mq core I know of is queue freezing. Hence my comment about
queue freezing.

Thanks,

Bart.
Ming Lei May 9, 2020, 4:10 a.m. UTC | #4
On Fri, May 08, 2020 at 08:24:44PM -0700, Bart Van Assche wrote:
> On 2020-05-08 19:20, Ming Lei wrote:
> > Not sure why you mention queue freezing.
> 
> This patch series introduces a fundamental race between modifying the
> hardware queue state (BLK_MQ_S_INACTIVE) and tag allocation. The only

Basically there are two cases:

1) setting BLK_MQ_S_INACTIVE and driver tag allocation run on the same
CPU; then a compiler barrier is enough. That is what happens most of the time.

2) setting BLK_MQ_S_INACTIVE and driver tag allocation run on different
CPUs; then a pair of smp_mb()s is applied to prevent reordering. That only
happens when the direct-issue process has migrated; the pairing is sketched below.
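
In a minimal sketch (the identifiers are the ones used in this patch; the
real code is quoted below):

    /* CPU-offline side (slow path, runs once per hctx): */
    set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
    smp_mb__after_atomic();     /* order the set_bit before reading rq->tag */
    busy = blk_mq_tags_inflight_rqs(hctx);

    /* direct-issue side, after migrating to a CPU outside hctx->cpumask: */
    rq->tag = tag;              /* driver tag assignment */
    smp_mb();                   /* order the tag store before reading hctx->state */
    inactive = test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state);

At least one side must observe the other's store: either the offline side
sees the assigned tag and waits for it, or the issue side sees
BLK_MQ_S_INACTIVE and puts the tag back.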

Please take a look at the comment in this patch:

+       /*
+        * In case that direct issue IO process is migrated to other CPU
+        * which may not belong to this hctx, add one memory barrier so we
+        * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
+        * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
+        * and driver tag assignment are run on the same CPU because
+        * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
+        * becoming offline.
+        *
+        * Process migration might happen after the check on current processor
+        * id, smp_mb() is implied by processor migration, so no need to worry
+        * about it.
+        */

And you may find more discussion about this topic in the following thread:

https://lore.kernel.org/linux-block/20200429134327.GC700644@T590/

> mechanism I know of for enforcing the order in which another thread
> observes writes to different memory locations without inserting a memory
> barrier in the hot path is RCU (see also The RCU-barrier menagerie;
> https://lwn.net/Articles/573497/). The only existing such mechanism in
> the blk-mq core I know of is queue freezing. Hence my comment about
> queue freezing.

You didn't explain how queue freezing is used for this issue.

We are talking about CPU hotplug vs. IO. In short, when one hctx becomes
inactive (all CPUs in hctx->cpumask become offline), in-flight IO from this
hctx needs to be drained to avoid IO timeouts. Also, all requests in the
scheduler/sw queues of this hctx need to be handled correctly to avoid an
IO hang.

Queue freezing can only be applied at the request queue level, not the hctx
level. When requests can't be completed, waiting for the freeze just hangs
forever.



Thanks,
Ming
Bart Van Assche May 9, 2020, 2:18 p.m. UTC | #5
On 2020-05-08 21:10, Ming Lei wrote:
> queue freezing can only be applied on the request queue level, and not
> hctx level. When requests can't be completed, wait freezing just hangs
> for-ever.

That's indeed what I meant: freeze the entire queue instead of
introducing a new mechanism that freezes only one hardware queue at a time.

Please clarify what "when requests can't be completed" means. Are you
referring to requests that take longer than expected due to e.g. a
controller lockup or to requests that take a long time intentionally?
The former case is handled by the block layer timeout handler. I propose
to handle the latter case by introducing a new callback function pointer
in struct blk_mq_ops that aborts all outstanding requests. Request queue
freezing is such an important block layer mechanism that I think we
should require that all block drivers support freezing a request queue
in a short time.
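
A purely illustrative sketch of such a callback; the member name and
signature are assumptions, not an existing interface:

    /* hypothetical new member of struct blk_mq_ops: abort every outstanding
     * request of a hardware queue so that a queue freeze finishes promptly */
    void (*abort_requests)(struct blk_mq_hw_ctx *hctx);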

Bart.
Ming Lei May 11, 2020, 1:45 a.m. UTC | #6
On Sat, May 09, 2020 at 07:18:46AM -0700, Bart Van Assche wrote:
> On 2020-05-08 21:10, Ming Lei wrote:
> > queue freezing can only be applied on the request queue level, and not
> > hctx level. When requests can't be completed, wait freezing just hangs
> > for-ever.
> 
> That's indeed what I meant: freeze the entire queue instead of
> introducing a new mechanism that freezes only one hardware queue at a time.

No, the issue is exactly that one single hctx becomes inactive while the
other hctxs are still active and workable.

If the entire queue is frozen because some of its CPUs are offline, how can
userspace submit IO to this disk? Your suggestion just makes the disk
unusable, and that won't be accepted.

> 
> Please clarify what "when requests can't be completed" means. Are you
> referring to requests that take longer than expected due to e.g. a
> controller lockup or to requests that take a long time intentionally?

If all CPUs in one hctx->cpumask are offline, the managed irq of this hw
queue will be shut down by the genirq code, so after the managed irq is shut
down because of the CPU offlining, any in-flight IO either won't be completed
at all or can only be timed out.

Some drivers implement a timeout handler, so these in-flight requests will
eventually time out, but that is still not friendly behaviour given that the
default timeout is quite long.

Some drivers don't implement a timeout handler at all, so these IOs won't be
completed.

> The former case is handled by the block layer timeout handler. I propose
> to handle the latter case by introducing a new callback function pointer
> in struct blk_mq_ops that aborts all outstanding requests.

As I mentioned, timing out isn't friendly behavior. Also, not every driver
implements a timeout handler, or implements it well enough.

> Request queue
> freezing is such an important block layer mechanism that I think we
> should require that all block drivers support freezing a request queue
> in a short time.

Firstly, we just need to drain in-flight requests and re-submit queued
requests from one single hctx; queue-wide freezing blocks all userspace IO
unnecessarily.

Secondly, some requests may not be completed at all, so freezing can't work
because freeze_wait may hang forever.


Thanks, 
Ming
Bart Van Assche May 11, 2020, 3:20 a.m. UTC | #7
On 2020-05-10 18:45, Ming Lei wrote:
> On Sat, May 09, 2020 at 07:18:46AM -0700, Bart Van Assche wrote:
>> On 2020-05-08 21:10, Ming Lei wrote:
>>> queue freezing can only be applied on the request queue level, and not
>>> hctx level. When requests can't be completed, wait freezing just hangs
>>> for-ever.
>>
>> That's indeed what I meant: freeze the entire queue instead of
>> introducing a new mechanism that freezes only one hardware queue at a time.
> 
> No, the issue is exactly that one single hctx becomes inactive, and
> other hctx are still active and workable.
> 
> If one entire queue is frozen because of some of CPUs are offline, how
> can userspace submit IO to this disk? You suggestion justs makes the
> disk not usable, that won't be accepted.

What I meant is to freeze a request queue temporarily (until hot
unplugging of a CPU has finished). I would never suggest to freeze a
request queue forever and I think that you already knew that.

>> Please clarify what "when requests can't be completed" means. Are you
>> referring to requests that take longer than expected due to e.g. a
>> controller lockup or to requests that take a long time intentionally?
> 
> If all CPUs in one hctx->cpumask are offline, the managed irq of this hw
> queue will be shutdown by genirq code, so any in-flight IO won't be
> completed or timedout after the managed irq is shutdown because of cpu
> offline.
> 
> Some drivers may implement timeout handler, so these in-flight requests
> will be timed out, but still not friendly behaviour given the default
> timeout is too long.
> 
> Some drivers don't implement timeout handler at all, so these IO won't
> be completed.

I think that the block layer needs to be notified after the decision has
been taken to offline a CPU and before the interrupts associated with
that CPU are disabled. That would allow the block layer to freeze a
request queue without triggering any timeouts (ignoring block driver and
hardware bugs). I'm not familiar with CPU hotplugging so I don't know
whether or not such a mechanism already exists.

>> The former case is handled by the block layer timeout handler. I propose
>> to handle the latter case by introducing a new callback function pointer
>> in struct blk_mq_ops that aborts all outstanding requests.
> 
> As I mentioned, timeout isn't a friendly behavior. Or not every driver
> implements timeout handler or well enough.

What I propose is to fix those block drivers instead of complicating the
block layer core further and instead of introducing potential deadlocks
in the block layer core.

>> Request queue
>> freezing is such an important block layer mechanism that I think we
>> should require that all block drivers support freezing a request queue
>> in a short time.
> 
> Firstly, we just need to drain in-flight requests and re-submit queued
> requests from one single hctx, and queue wide freezing causes whole
> userspace IOs blocked unnecessarily.

Freezing a request queue for a short time is acceptable. As you know we
already do that when the queue depth is modified, when the write-back
throttling latency is modified and also when the I/O scheduler is changed.

> Secondly, some requests may not be completed at all, so freezing can't
> work because freeze_wait may hang forever.

If a request neither can be aborted nor completes then that's a severe
bug in the block driver that submitted the request to the block device.

Bart.
Ming Lei May 11, 2020, 3:48 a.m. UTC | #8
On Sun, May 10, 2020 at 08:20:24PM -0700, Bart Van Assche wrote:
> On 2020-05-10 18:45, Ming Lei wrote:
> > On Sat, May 09, 2020 at 07:18:46AM -0700, Bart Van Assche wrote:
> >> On 2020-05-08 21:10, Ming Lei wrote:
> >>> queue freezing can only be applied on the request queue level, and not
> >>> hctx level. When requests can't be completed, wait freezing just hangs
> >>> for-ever.
> >>
> >> That's indeed what I meant: freeze the entire queue instead of
> >> introducing a new mechanism that freezes only one hardware queue at a time.
> > 
> > No, the issue is exactly that one single hctx becomes inactive, and
> > other hctx are still active and workable.
> > 
> > If one entire queue is frozen because of some of CPUs are offline, how
> > can userspace submit IO to this disk? You suggestion justs makes the
> > disk not usable, that won't be accepted.
> 
> What I meant is to freeze a request queue temporarily (until hot
> unplugging of a CPU has finished). I would never suggest to freeze a
> request queue forever and I think that you already knew that.

But what is your motivation for freezing the queue temporarily?

I don't see how freezing the queue helps with this issue. Also, even though
it is temporary, the IO effect can still be observed on the other online CPUs.

If you want to block new allocation from the inactive hctx, that isn't
necessary, because basically no new allocation is possible once all CPUs of
this hctx are offline.

If you want to wait for completion of in-flight requests, that isn't doable,
because requests may not be completed at all once the hctx becomes inactive
and its managed interrupt is shut down.

> 
> >> Please clarify what "when requests can't be completed" means. Are you
> >> referring to requests that take longer than expected due to e.g. a
> >> controller lockup or to requests that take a long time intentionally?
> > 
> > If all CPUs in one hctx->cpumask are offline, the managed irq of this hw
> > queue will be shutdown by genirq code, so any in-flight IO won't be
> > completed or timedout after the managed irq is shutdown because of cpu
> > offline.
> > 
> > Some drivers may implement timeout handler, so these in-flight requests
> > will be timed out, but still not friendly behaviour given the default
> > timeout is too long.
> > 
> > Some drivers don't implement timeout handler at all, so these IO won't
> > be completed.
> 
> I think that the block layer needs to be notified after the decision has

I have added a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for getting the
notification, and blk_mq_hctx_notify_online() will be called before this CPU
is put offline.
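
Roughly, the hookup uses the cpuhp multi-instance API; the sketch below only
illustrates the interfaces involved and is an approximation (the state name
string is illustrative), not copied from this hunk:

    /* register startup/teardown callbacks for the new state, once at init */
    cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
                            blk_mq_hctx_notify_online,      /* CPU coming up  */
                            blk_mq_hctx_notify_offline);    /* CPU going down */

    /* and one instance per hardware queue, when the hctx is initialized */
    cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE, &hctx->cpuhp_online);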

> been taken to offline a CPU and before the interrupts associated with
> that CPU are disabled. That would allow the block layer to freeze a
> request queue without triggering any timeouts (ignoring block driver and
> hardware bugs). I'm not familiar with CPU hotplugging so I don't know
> whether or not such a mechanism already exists.

How can freezing the queue avoid triggering timeouts?

Freezing a queue basically blocks new request allocation and then waits for
completion of all in-flight requests. As I explained, either no new
allocation happens on this inactive hctx anyway, or the in-flight requests
won't be completed at all without this patch's solution.

> 
> >> The former case is handled by the block layer timeout handler. I propose
> >> to handle the latter case by introducing a new callback function pointer
> >> in struct blk_mq_ops that aborts all outstanding requests.
> > 
> > As I mentioned, timeout isn't a friendly behavior. Or not every driver
> > implements timeout handler or well enough.
> 
> What I propose is to fix those block drivers instead of complicating the
> block layer core further and instead of introducing potential deadlocks
> in the block layer core.

The deadlock you mentioned can be fixed with the help of BLK_MQ_REQ_PREEMPT.

> 
> >> Request queue
> >> freezing is such an important block layer mechanism that I think we
> >> should require that all block drivers support freezing a request queue
> >> in a short time.
> > 
> > Firstly, we just need to drain in-flight requests and re-submit queued
> > requests from one single hctx, and queue wide freezing causes whole
> > userspace IOs blocked unnecessarily.
> 
> Freezing a request queue for a short time is acceptable. As you know we
> already do that when the queue depth is modified, when the write-back
> throttling latency is modified and also when the I/O scheduler is changed.

Again, how can freezing the queue help with the issue addressed by this patchset?

> 
> > Secondly, some requests may not be completed at all, so freezing can't
> > work because freeze_wait may hang forever.
> 
> If a request neither can be aborted nor completes then that's a severe
> bug in the block driver that submitted the request to the block device.

It is hard to implement a timeout handler for every driver, or to remove
every BLK_EH_RESET_TIMER return from drivers.

Even for drivers that implement a timeout handler elegantly, it isn't
friendly to wait several dozen seconds, or more than one hundred seconds,
for IO completion during CPU hotplug. Who said that IO timeouts have to be
triggered during CPU hotplug? At least there is no such issue with
non-managed interrupts.



Thanks, 
Ming
Bart Van Assche May 11, 2020, 8:56 p.m. UTC | #9
On 2020-05-10 20:48, Ming Lei wrote:
> On Sun, May 10, 2020 at 08:20:24PM -0700, Bart Van Assche wrote:
>> What I meant is to freeze a request queue temporarily.
> 
> But what is your motivation to freeze queue temporarily?

To achieve a solution for CPU hotplugging that is much simpler than this
patch series, requires less code and hence is easier to test, debug and
maintain.

Thanks,

Bart.
Ming Lei May 12, 2020, 1:25 a.m. UTC | #10
On Mon, May 11, 2020 at 01:56:49PM -0700, Bart Van Assche wrote:
> On 2020-05-10 20:48, Ming Lei wrote:
> > On Sun, May 10, 2020 at 08:20:24PM -0700, Bart Van Assche wrote:
> >> What I meant is to freeze a request queue temporarily.
> > 
> > But what is your motivation to freeze queue temporarily?
> 
> To achieve a solution for CPU hotplugging that is much simpler than this
> patch series, requires less code and hence is easier to test, debug and
> maintain.

Yeah, it could be done via queue freezing in the following way:

1) before one CPU goes offline

- if a hctx would become inactive, freeze the whole queue and
wait for the freeze to complete

2) after the CPU is offline
- un-freeze the request queue if any hctx is inactive

The whole disk becomes unusable during that period, which can be quite long,
because freezing the queue and waiting for the freeze to finish takes at
least one RCU grace period, even when there isn't any in-flight IO.

Also, the above steps would have to be run for every request queue in a
serialized way, so the total time that IO is suspended can be very long.
That isn't reasonable.
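
For concreteness, a hypothetical sketch of that freeze-based alternative (it
is not part of this series; where the calls would live is an assumption,
shown only to illustrate why the freeze is queue-wide):

    /* hypothetical step 1: in the offline notifier, before the last CPU of
     * the hctx goes away -- this blocks new requests for the *whole* queue
     * and waits for every in-flight request on *every* hctx to complete */
    if (cpumask_test_cpu(cpu, hctx->cpumask) &&
        blk_mq_last_cpu_in_hctx(cpu, hctx))
            blk_mq_freeze_queue(hctx->queue);

    /* hypothetical step 2: once the CPU is offline (e.g. from the dead
     * notifier), let IO flow again */
    blk_mq_unfreeze_queue(hctx->queue);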

Thanks,
Ming

Patch

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ddec58743e88..dc66cb689d2f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@  static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 54c107be7a47..4a2250ac4fbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1038,11 +1038,36 @@  static bool __blk_mq_get_driver_tag(struct request *rq)
 	return true;
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
 {
 	if (rq->tag != -1)
 		return true;
-	return __blk_mq_get_driver_tag(rq);
+
+	if (!__blk_mq_get_driver_tag(rq))
+		return false;
+	/*
+	 * In case that direct issue IO process is migrated to other CPU
+	 * which may not belong to this hctx, add one memory barrier so we
+	 * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
+	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
+	 * and driver tag assignment are run on the same CPU because
+	 * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
+	 * becoming offline.
+	 *
+	 * Process migration might happen after the check on current processor
+	 * id, smp_mb() is implied by processor migration, so no need to worry
+	 * about it.
+	 */
+	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
+		smp_mb();
+	else
+		barrier();
+
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
+		blk_mq_put_driver_tag(rq);
+		return false;
+	}
+	return true;
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
@@ -1091,7 +1116,7 @@  static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 		 * Don't clear RESTART here, someone else could have set it.
 		 * At most this will cost an extra queue run.
 		 */
-		return blk_mq_get_driver_tag(rq);
+		return blk_mq_get_driver_tag(rq, false);
 	}
 
 	wait = &hctx->dispatch_wait;
@@ -1117,7 +1142,7 @@  static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	 * allocation failure and adding the hardware queue to the wait
 	 * queue.
 	 */
-	ret = blk_mq_get_driver_tag(rq);
+	ret = blk_mq_get_driver_tag(rq, false);
 	if (!ret) {
 		spin_unlock(&hctx->dispatch_wait_lock);
 		spin_unlock_irq(&wq->lock);
@@ -1218,7 +1243,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			break;
 		}
 
-		if (!blk_mq_get_driver_tag(rq)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			/*
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed. The
@@ -1250,7 +1275,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			bd.last = true;
 		else {
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1864,7 +1889,7 @@  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_dispatch_budget(hctx))
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq)) {
+	if (!blk_mq_get_driver_tag(rq, true)) {
 		blk_mq_put_dispatch_budget(hctx);
 		goto insert;
 	}
@@ -2273,13 +2298,87 @@  int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+struct count_inflight_data {
+	unsigned count;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
 {
-	return 0;
+	struct count_inflight_data *count_data = data;
+
+	/*
+	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
+	 * in blk_mq_start_request(), at that time we can't prevent this rq
+	 * from being issued.
+	 *
+	 * So check if driver tag is assigned, if yes, count this rq as
+	 * inflight.
+	 */
+	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
+		count_data->count++;
+
+	return true;
+}
+
+static bool blk_mq_inflight_rq(struct request *rq, void *data,
+			       bool reserved)
+{
+	return rq->tag >= 0;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct count_inflight_data count_data = {
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
+			blk_mq_inflight_rq, &count_data);
+	return count_data.count;
+}
+
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
 }
 
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
+	if (!blk_mq_last_cpu_in_hctx(cpu, hctx))
+		return 0;
+
+	/*
+	 * Order setting BLK_MQ_S_INACTIVE versus checking rq->tag and rqs[tag],
+	 * in blk_mq_tags_inflight_rqs.  It pairs with the smp_mb() in
+	 * blk_mq_get_driver_tag.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	smp_mb__after_atomic();
+	while (blk_mq_tags_inflight_rqs(hctx))
+		msleep(5);
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 	return 0;
 }
 
@@ -2290,12 +2389,15 @@  static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
 
-	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3763207d88eb..77bf861d72ec 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -403,6 +403,9 @@  enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,