
[1/1,RFC] blk-mq: fix queue stalling on shared hctx restart

Message ID 20171018102206.26020-1-roman.penyaev@profitbricks.com (mailing list archive)
State New, archived

Commit Message

Roman Pen Oct. 18, 2017, 10:22 a.m. UTC
Hi all,

the patch below fixes queue stalling when a shared hctx is marked for restart
(BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
belongs to a particular queue, which in fact may not need to be restarted,
thus we return from blk_mq_sched_restart() and leave the shared hctx of another
queue never restarted.

The fix is to make the shared_hctx_restart counter belong not to the queue but
to the tags, so that the counter reflects the real number of shared hctxs that
need to be restarted.

During the tests 1 hctx was used (set->nr_hw_queues = 1) and all stalled
requests were noticed in the dd->fifo_list of the mq-deadline scheduler.

A possible sequence of events:

1. Request A of queue A is inserted into dd->fifo_list of the scheduler.

2. Request B of queue A bypasses the scheduler and goes directly to
   hctx->dispatch.

3. Request C of queue B is inserted.

4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
   empty (request B is in the list) the hctx is only marked for the next restart
   and request A is left in the list (see the comment "So it's best to leave them
   there for as long as we can. Mark the hw queue as needing a restart in
   that case." in blk-mq-sched.c).

5. Eventually request B is completed/freed and blk_mq_sched_restart() is
   called, but by chance the hctx from queue B is chosen for restart and request C
   gets a chance to be dispatched.

6. Eventually request C is completed/freed and blk_mq_sched_restart() is
   called, but shared_hctx_restart for queue B is zero, so we return without
   attempting to restart the hctx from queue A; thus request A is stuck forever
   (see the model below).
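
To make the accounting failure concrete, here is a minimal userspace C model
of it (a hedged sketch with made-up names; it only mimics where
blk_mq_sched_mark_restart_hctx() places the counter and the early return in
blk_mq_sched_restart(), nothing else):

#include <stdbool.h>
#include <stdio.h>

/* Per-queue counter, as in v4.14-rc4. */
struct queue { int shared_hctx_restart; };
/* One hctx shared between both queues. */
struct hctx { bool restart_marked; };

/* Marking bumps the counter of the queue that failed to dispatch. */
static void mark_restart(struct hctx *h, struct queue *dispatching_q)
{
        if (!h->restart_marked) {
                h->restart_marked = true;
                dispatching_q->shared_hctx_restart++;
        }
}

/* A completion consults only the completing queue's counter. */
static bool sched_restart(struct queue *completing_q)
{
        if (completing_q->shared_hctx_restart == 0)
                return false;   /* early return: "nothing is marked", we believe */
        /* ...the real code would now loop over queues/hctxs to kick the marked one... */
        return true;
}

int main(void)
{
        struct queue qa = { 0 }, qb = { 0 };
        struct hctx shared = { false };

        mark_restart(&shared, &qa);     /* step 4: dispatch for queue A marks the shared hctx */
        /* step 6: request C of queue B completes */
        printf("restart attempted: %s\n", sched_restart(&qb) ? "yes" : "no");
        /* prints "no": queue B's counter is 0 although the shared hctx is marked */
        return 0;
}

With the counter moved into the shared tags, mark and restart consult the same
counter no matter which queue the completed request belongs to.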

But queue stalling is not the only problem with blk_mq_sched_restart().
My tests show that the loops through all queues and hctxs can be very costly,
even with the shared_hctx_restart counter, which aims to fix the performance
issue.  For my tests I create 128 devices with 64 hctxs each, which share the
same tag set.

The following is the fio and ftrace output for v4.14-rc4 kernel:

 READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s, mint=10058msec, maxt=10058msec
WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s, mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit     Time            Avg             s^2
  --------                  ---     ----            ---             ---
  blk_mq_sched_restart     16347    9540759 us      583.639 us      8804801 us
  blk_mq_sched_restart      7884    6073471 us      770.354 us      8780054 us
  blk_mq_sched_restart     14176    7586794 us      535.185 us      2822731 us
  blk_mq_sched_restart      7843    6205435 us      791.206 us      12424960 us
  blk_mq_sched_restart      1490    4786107 us      3212.153 us     1949753 us
  blk_mq_sched_restart      7892    6039311 us      765.244 us      2994627 us
  blk_mq_sched_restart     15382    7511126 us      488.306 us      3090912 us
  [cut]

And here are results with two patches reverted:
   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

 READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s, mint=10032msec, maxt=10032msec
WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s, mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit      Time            Avg             s^2
  --------                  ---      ----            ---             ---
  blk_mq_sched_restart      50699    8802.349 us     0.173 us        121.771 us
  blk_mq_sched_restart      50362    8740.470 us     0.173 us        161.494 us
  blk_mq_sched_restart      50402    9066.337 us     0.179 us        113.009 us
  blk_mq_sched_restart      50104    9366.197 us     0.186 us        188.645 us
  blk_mq_sched_restart      50375    9317.727 us     0.184 us        54.218 us
  blk_mq_sched_restart      50136    9311.657 us     0.185 us        446.790 us
  blk_mq_sched_restart      50103    9179.625 us     0.183 us        114.472 us
  [cut]

The timings and standard deviations are terrible, and they translate into a
significant throughput difference: 570MB/s vs 1280MB/s.

This is an RFC since the current patch fixes the queue stalling but the
performance issue still remains, and it is not clear to me whether it is better
to improve commit 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are
shared") by making percpu restart lists (to avoid looping and to dequeue an
hctx immediately) or to revert it (frankly, I did not notice any difference on
a small number of devices and hctxs, where the looping issue does not have much
impact).

--
Roman

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
---
 block/blk-mq-sched.c   | 10 +++++-----
 block/blk-mq-tag.c     |  1 +
 block/blk-mq-tag.h     |  1 +
 block/blk-mq.c         |  4 ++--
 include/linux/blkdev.h |  2 --
 5 files changed, 9 insertions(+), 9 deletions(-)

Comments

Bart Van Assche Oct. 19, 2017, 5:47 p.m. UTC | #1
On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
> the patch below fixes queue stalling when a shared hctx is marked for restart
> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
> belongs to a particular queue, which in fact may not need to be restarted,
> thus we return from blk_mq_sched_restart() and leave the shared hctx of another
> queue never restarted.
>
> The fix is to make the shared_hctx_restart counter belong not to the queue but
> to the tags, so that the counter reflects the real number of shared hctxs that
> need to be restarted.


Hello Roman,

The patch you posted looks fine to me but seeing this patch and the patch
description makes me wonder why this had not been noticed before. Are you
perhaps using a block driver that returns BLK_STS_RESOURCE more often than
other block drivers? Did you perhaps run into this with the Infiniband
network block device (IBNBD) driver? No matter what driver triggered this,
I think this bug should be fixed.

Bart.
Roman Pen Oct. 20, 2017, 9:39 a.m. UTC | #2
Hi Bart,

On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
>> the patch below fixes queue stalling when a shared hctx is marked for restart
>> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
>> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
>> belongs to a particular queue, which in fact may not need to be restarted,
>> thus we return from blk_mq_sched_restart() and leave the shared hctx of another
>> queue never restarted.
>>
>> The fix is to make the shared_hctx_restart counter belong not to the queue but
>> to the tags, so that the counter reflects the real number of shared hctxs that
>> need to be restarted.
>
> Hello Roman,
>
> The patch you posted looks fine to me but seeing this patch and the patch
> description makes me wonder why this had not been noticed before.

This is a good question, which I could not answer.  I tried to simulate the
same behaviour (completion timings, completion pinning, number of submission
queues, shared tags, etc.) on null_blk, but what I see is that
*_sched_restart() never observes 'shared_hctx_restart', literally never
(I made a counter for when we take that path and start looking for an hctx
to restart, and the counter stays 0).

That made me nervous, and then I gave up.  After some time I want to return
to that and try to reproduce the problem on something else, say nvme.

> Are you perhaps using a block driver that returns BLK_STS_RESOURCE more
> often than other block drivers? Did you perhaps run into this with the
> Infiniband network block device (IBNBD) driver?

Yep, this is IBNBD, but in these tests I tested with an mq scheduler, shared
tags and 1 hctx for each queue (block device), thus I never ran out of internal
tags and never returned BLK_STS_RESOURCE.

Indeed, unmodified IBNBD does internal tag management.  This was needed because
each queue (block device) was created with an hctx number (nr_hw_queues) equal
to the number of CPUs on the system, but a blk-mq tag set is shared only
between hctxs, not globally, which led to the need to return BLK_STS_RESOURCE
and to restart queues.

But with an mq scheduler the situation changed: 1 hctx with shared tags can be
specified for all hundreds of devices without any performance impact.
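
For reference, a hedged sketch of such a driver-side configuration (my_mq_ops
and the depth below are made-up placeholders; the point is nr_hw_queues = 1
plus a single blk_mq_tag_set reused for every device):

static struct blk_mq_tag_set tag_set = {
        .ops            = &my_mq_ops,           /* hypothetical driver ops */
        .nr_hw_queues   = 1,                    /* one hctx per device */
        .queue_depth    = 128,                  /* made-up depth */
        .numa_node      = NUMA_NO_NODE,
        .flags          = BLK_MQ_F_SHOULD_MERGE,
};

static int create_devices(int nr_devs)
{
        int i, ret;

        ret = blk_mq_alloc_tag_set(&tag_set);
        if (ret)
                return ret;

        for (i = 0; i < nr_devs; i++) {
                /* All request queues share 'tag_set'; blk-mq flags the
                 * hctxs BLK_MQ_F_TAG_SHARED once a second queue appears.
                 */
                struct request_queue *q = blk_mq_init_queue(&tag_set);

                if (IS_ERR(q))
                        return PTR_ERR(q);
                /* ...attach q to a gendisk, etc... */
        }
        return 0;
}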

Testing this configuration (1 hctx, shared tags, mq-deadline) immediately
shows these two problems: request stalling and slow loops inside
blk_mq_sched_restart().

> No matter what driver triggered this, I think this bug should be fixed.

Yes, the queue stalling can easily be fixed.  I can resend the current patch
with a shorter description which targets only this particular bug, if no one
else has objections, comments, etc.

But what bothers me is these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
a single queue if tag sets are shared") I want to ask: what was the original
problem you attempted to fix?  Likely I am missing some test scenario
which would be great to know about.

--
Roman
Bart Van Assche Oct. 20, 2017, 8:05 p.m. UTC | #3
On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
> But what bothers me is these looong loops inside blk_mq_sched_restart(),
> and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
> a single queue if tag sets are shared") I want to ask: what was the original
> problem you attempted to fix?  Likely I am missing some test scenario
> which would be great to know about.


Long loops? How many queues share the same tag set on your setup? How many
hardware queues does your block driver create per request queue?

Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
SCSI and dm-mq drivers.

Bart.
Roman Pen Oct. 23, 2017, 3:12 p.m. UTC | #4
On Fri, Oct 20, 2017 at 10:05 PM, Bart Van Assche
<Bart.VanAssche@wdc.com> wrote:
> On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
>> But what bothers me is these looong loops inside blk_mq_sched_restart(),
>> and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
>> a single queue if tag sets are shared") I want to ask: what was the original
>> problem you attempted to fix?  Likely I am missing some test scenario
>> which would be great to know about.
>
> Long loops? How many queues share the same tag set on your setup? How many
> hardware queues does your block driver create per request queue?

Yeah, ok, my mistake. I had to split both issues and should not have described
everything in one go in the first email.  So, take a look.

For my tests I create 128 queues (devices) with 64 hctxs each; all queues share
the same tag set, then I start 128 fio jobs (1 job per queue).

The following is the fio and ftrace output for v4.14-rc4 kernel
(without any changes):

 READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s, mint=10058msec, maxt=10058msec
WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s, mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit     Time            Avg             s^2
  --------                  ---     ----            ---             ---
  blk_mq_sched_restart     16347    9540759 us      583.639 us      8804801 us
  blk_mq_sched_restart      7884    6073471 us      770.354 us      8780054 us
  blk_mq_sched_restart     14176    7586794 us      535.185 us      2822731 us
  blk_mq_sched_restart      7843    6205435 us      791.206 us      12424960 us
  blk_mq_sched_restart      1490    4786107 us      3212.153 us     1949753 us    <<< !!! 3 ms on average !!!
  blk_mq_sched_restart      7892    6039311 us      765.244 us      2994627 us
  blk_mq_sched_restart     15382    7511126 us      488.306 us      3090912 us
  [cut]


And here are results with two patches reverted:

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

 READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s, mint=10032msec, maxt=10032msec
WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s, mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function                  Hit      Time            Avg             s^2
  --------                  ---      ----            ---             ---
  blk_mq_sched_restart      50699    8802.349 us     0.173 us        121.771 us
  blk_mq_sched_restart      50362    8740.470 us     0.173 us        161.494 us
  blk_mq_sched_restart      50402    9066.337 us     0.179 us        113.009 us
  blk_mq_sched_restart      50104    9366.197 us     0.186 us        188.645 us
  blk_mq_sched_restart      50375    9317.727 us     0.184 us        54.218 us
  blk_mq_sched_restart      50136    9311.657 us     0.185 us        446.790 us
  blk_mq_sched_restart      50103    9179.625 us     0.183 us        114.472 us
  [cut]

The difference is significant: 570MB/s vs 1280MB/s.  E.g. one CPU spent 3 ms
on average iterating over all queues and hctxs in order to find the hctx to
restart.  In total the CPUs spent *seconds* in the loop.  That seems
incredibly long.

> Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
> SCSI and dm-mq drivers.

You mean fairness? (some hctxs get fewer chances to be restarted).
That's why you need to restart them in RR fashion, right?

In IBNBD I also do hctx restarts in RR fashion, and for that I put each hctx
that needs to be restarted on a separate percpu list.  Probably it makes
sense to do the same here?
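
A hedged sketch of that idea transplanted into blk-mq terms (every name below,
including the hctx->restart_entry field, is hypothetical; marking enqueues in
O(1) and a completion pops the oldest entry instead of scanning all queues):

/* Hypothetical: one restart list per CPU. */
struct restart_list {
        spinlock_t lock;
        struct list_head list;                  /* marked hctxs, FIFO order */
};
static DEFINE_PER_CPU(struct restart_list, restart_lists);

/* Would run where blk_mq_sched_mark_restart_hctx() sets the RESTART bit. */
static void mark_restart(struct blk_mq_hw_ctx *hctx)
{
        struct restart_list *rl = this_cpu_ptr(&restart_lists);
        unsigned long flags;

        spin_lock_irqsave(&rl->lock, flags);
        if (list_empty(&hctx->restart_entry))   /* hypothetical new field */
                list_add_tail(&hctx->restart_entry, &rl->list);
        spin_unlock_irqrestore(&rl->lock, flags);
}

/* Would run on request completion: kick the oldest marked hctx, O(1). */
static void restart_one(void)
{
        struct restart_list *rl = this_cpu_ptr(&restart_lists);
        struct blk_mq_hw_ctx *hctx = NULL;
        unsigned long flags;

        spin_lock_irqsave(&rl->lock, flags);
        if (!list_empty(&rl->list)) {
                hctx = list_first_entry(&rl->list, struct blk_mq_hw_ctx,
                                        restart_entry);
                list_del_init(&hctx->restart_entry);
        }
        spin_unlock_irqrestore(&rl->lock, flags);
        if (hctx)
                blk_mq_run_hw_queue(hctx, true);
}

The FIFO order keeps the fairness that the RR walk was meant to provide, while
avoiding the scan over all queues and hctxs visible in the ftrace numbers above.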

--
Roman
Roman Pen Oct. 23, 2017, 4:12 p.m. UTC | #5
Hi Ming,

On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@redhat.com> wrote:
> On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
>> Hi all,
>>
>> the patch below fixes queue stalling when a shared hctx is marked for restart
>> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
>> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
>> belongs to a particular queue, which in fact may not need to be restarted,
>> thus we return from blk_mq_sched_restart() and leave the shared hctx of another
>> queue never restarted.
>>
>> The fix is to make the shared_hctx_restart counter belong not to the queue but
>> to the tags, so that the counter reflects the real number of shared hctxs that
>> need to be restarted.
>>
>> During the tests 1 hctx was used (set->nr_hw_queues = 1) and all stalled
>> requests were noticed in the dd->fifo_list of the mq-deadline scheduler.
>>
>> A possible sequence of events:
>>
>> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
>>
>> 2. Request B of queue A bypasses the scheduler and goes directly to
>>    hctx->dispatch.
>>
>> 3. Request C of queue B is inserted.
>>
>> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
>>    empty (request B is in the list) the hctx is only marked for the next restart
>>    and request A is left in the list (see the comment "So it's best to leave them
>>    there for as long as we can. Mark the hw queue as needing a restart in
>>    that case." in blk-mq-sched.c).
>>
>> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
>>    called, but by chance the hctx from queue B is chosen for restart and request C
>>    gets a chance to be dispatched.
>>
>> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
>>    called, but shared_hctx_restart for queue B is zero, so we return without
>>    attempting to restart the hctx from queue A; thus request A is stuck forever.
>>
>> But queue stalling is not the only problem with blk_mq_sched_restart().
>> My tests show that the loops through all queues and hctxs can be very costly,
>> even with the shared_hctx_restart counter, which aims to fix the performance
>> issue.  For my tests I create 128 devices with 64 hctxs each, which share the
>> same tag set.
>
> Hi Roman,
>
> I also find the performance issue with RESTART for TAG_SHARED.
>
> But from my analysis, RESTART isn't needed for TAG_SHARED
> because SCSI-MQ has handled the RESTART by itself already, so
> could you test the patch in the following link, posted days ago, to
> see if it fixes your issue?

I can say without any testing: it fixes all the issues :) You've reverted

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

with one major difference: you do not handle shared tags in a special
way and restart only the requested hctx, instead of iterating over all hctxs
in a queue.

Firstly I have to say that the queue stalling issue (#1) and the performance
issue (#2) were observed on our in-house RDMA driver IBNBD:
https://lwn.net/Articles/718181/ and I have never tried to reproduce
them on SCSI-MQ.

Then your patch breaks the RR restarts, which were introduced by commit
6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
The basic idea of that patch seems nice, but because of the possibly large
number of queues and hctxs the patch requires reimplementation.  Eventually
you should get fast hctx restarts, but in RR fashion.  According to my
understanding that does not contradict your patch.

--
Roman
Ming Lei Oct. 24, 2017, 10:04 a.m. UTC | #6
On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:
> Hi Ming,
> 
> On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@redhat.com> wrote:
> > On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
> >> Hi all,
> >>
> >> the patch below fixes queue stalling when a shared hctx is marked for restart
> >> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
> >> root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
> >> belongs to a particular queue, which in fact may not need to be restarted,
> >> thus we return from blk_mq_sched_restart() and leave the shared hctx of another
> >> queue never restarted.
> >>
> >> The fix is to make the shared_hctx_restart counter belong not to the queue but
> >> to the tags, so that the counter reflects the real number of shared hctxs that
> >> need to be restarted.
> >>
> >> During the tests 1 hctx was used (set->nr_hw_queues = 1) and all stalled
> >> requests were noticed in the dd->fifo_list of the mq-deadline scheduler.
> >>
> >> A possible sequence of events:
> >>
> >> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
> >>
> >> 2. Request B of queue A bypasses the scheduler and goes directly to
> >>    hctx->dispatch.
> >>
> >> 3. Request C of queue B is inserted.
> >>
> >> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is not
> >>    empty (request B is in the list) the hctx is only marked for the next restart
> >>    and request A is left in the list (see the comment "So it's best to leave them
> >>    there for as long as we can. Mark the hw queue as needing a restart in
> >>    that case." in blk-mq-sched.c).
> >>
> >> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
> >>    called, but by chance the hctx from queue B is chosen for restart and request C
> >>    gets a chance to be dispatched.
> >>
> >> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
> >>    called, but shared_hctx_restart for queue B is zero, so we return without
> >>    attempting to restart the hctx from queue A; thus request A is stuck forever.
> >>
> >> But queue stalling is not the only problem with blk_mq_sched_restart().
> >> My tests show that the loops through all queues and hctxs can be very costly,
> >> even with the shared_hctx_restart counter, which aims to fix the performance
> >> issue.  For my tests I create 128 devices with 64 hctxs each, which share the
> >> same tag set.
> >
> > Hi Roman,
> >
> > I also find the performance issue with RESTART for TAG_SHARED.
> >
> > But from my analysis, RESTART isn't needed for TAG_SHARED
> > because SCSI-MQ has handled the RESTART by itself already, so
> > could you test the patch in the following link, posted days ago, to
> > see if it fixes your issue?
> 
> I can say without any testing: it fixes all the issues :) You've reverted
> 
>    8e8320c9315c ("blk-mq: fix performance regression with shared tags")
>    6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
> 
> with one major difference: you do not handle shared tags in a special
> way and restart only the requested hctx, instead of iterating over all hctxs
> in a queue.

Hi Roman,

Yeah, that is unnecessary, as I explained in detail in the commit log, and it
introduces lots of overhead; I can see IOPS improved by 20%~30% in my
SCSI_DEBUG test (only 8 LUNs) by removing the blk-mq restart for TAG_SHARED.

Also it isn't needed for BLK_STS_RESOURCE returned from .get_budget().

I will post out V3 soon. 

> 
> Firstly I have to say that the queue stalling issue (#1) and the performance
> issue (#2) were observed on our in-house RDMA driver IBNBD:
> https://lwn.net/Articles/718181/ and I have never tried to reproduce
> them on SCSI-MQ.

OK, got it, then you can handle it in SCSI-MQ's way, since this way has been
used by non-MQ for a long time.  Or you need to do nothing if your driver is
SCSI based.

> 
> Then your patch breaks the RR restarts, which were introduced by commit
> 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
> The basic idea of that patch seems nice, but because of the possibly large
> number of queues and hctxs the patch requires reimplementation.  Eventually
> you should get fast hctx restarts, but in RR fashion.  According to my
> understanding that does not contradict your patch.

Firstly, this approach has been run/verified for a long time in non-mq, and
no one complained about it before.

Secondly, please see scsi_target_queue_ready() and scsi_host_queue_ready():
the sdev is added to the 'starved_list' in FIFO style, which is still fair.

So I don't think it is an issue to remove the RR restarts.

Thanks,
Ming
Bart Van Assche Nov. 1, 2017, 4:50 p.m. UTC | #7
On Mon, 2017-10-23 at 17:12 +0200, Roman Penyaev wrote:
> On Fri, Oct 20, 2017 at 10:05 PM, Bart Van Assche wrote:
> > Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in the
> > SCSI and dm-mq drivers.
>
> You mean fairness? (some hctxs get fewer chances to be restarted).
> That's why you need to restart them in RR fashion, right?


Commit 6d8c6c0f97ad not only improved fairness but also fixed a lockup
scenario: after a driver tag has been freed, not only must the hw queues of
the request queue for which a request completed be reexamined, but also the
hw queues associated with the other request queues that share the same tag
set.
Bart.
Bart Van Assche Nov. 7, 2017, 3:54 p.m. UTC | #8
On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
> the patch below fixes queue stalling when a shared hctx is marked for restart
> (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.


Since I have been able to reproduce this queue stall with an upstream driver
(ib_srp) and since I also have been able to verify that this patch fixes that
stall:

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>

Tested-by: Bart Van Assche <bart.vanassche@wdc.com>

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..a19a7f275173 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -60,10 +60,10 @@  static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_inc(&q->shared_hctx_restart);
+			atomic_inc(&tags->shared_hctx_restart);
 	} else
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
@@ -74,10 +74,10 @@  static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return false;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_dec(&q->shared_hctx_restart);
+			atomic_dec(&tags->shared_hctx_restart);
 	} else
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
@@ -312,7 +312,7 @@  void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 		 * If this is 0, then we know that no hardware queues
 		 * have RESTART marked. We're done.
 		 */
-		if (!atomic_read(&queue->shared_hctx_restart))
+		if (!atomic_read(&tags->shared_hctx_restart))
 			return;
 
 		rcu_read_lock();
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..598f8e0095ff 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -385,6 +385,7 @@  struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	atomic_set(&tags->shared_hctx_restart, 0);
 
 	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
 }
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 5cb51e53cc03..adf05c8811cd 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -11,6 +11,7 @@  struct blk_mq_tags {
 	unsigned int nr_reserved_tags;
 
 	atomic_t active_queues;
+	atomic_t shared_hctx_restart;
 
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..7639f978ea2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2120,11 +2120,11 @@  static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_inc(&q->shared_hctx_restart);
+				atomic_inc(&hctx->tags->shared_hctx_restart);
 			hctx->flags |= BLK_MQ_F_TAG_SHARED;
 		} else {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_dec(&q->shared_hctx_restart);
+				atomic_dec(&hctx->tags->shared_hctx_restart);
 			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
 		}
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2a5d52fa90f5..3852a9ea87d0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,8 +393,6 @@  struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
-	atomic_t		shared_hctx_restart;
-
 	struct blk_queue_stats	*stats;
 	struct rq_wb		*rq_wb;