Message ID | 20171018102206.26020-1-roman.penyaev@profitbricks.com (mailing list archive)
---|---
State | New, archived |
On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
> the patch below fixes queue stalling when a shared hctx is marked for
> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
> zero. The root cause is that hctxs are shared between queues, but
> 'shared_hctx_restart' belongs to a particular queue, which in fact may
> not need to be restarted; thus we return from blk_mq_sched_restart()
> and leave the shared hctx of another queue never restarted.
>
> The fix is to make the shared_hctx_restart counter belong not to the
> queue but to the tags, so that the counter reflects the real number of
> shared hctxs that need to be restarted.

Hello Roman,

The patch you posted looks fine to me, but seeing this patch and its
description makes me wonder why this had not been noticed before. Are you
perhaps using a block driver that returns BLK_STS_RESOURCE more often
than other block drivers? Did you perhaps run into this with the
InfiniBand network block device (IBNBD) driver? No matter what driver
triggered this, I think this bug should be fixed.

Bart.
Hi Bart,

On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
>> the patch below fixes queue stalling when a shared hctx is marked for
>> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
>> zero. The root cause is that hctxs are shared between queues, but
>> 'shared_hctx_restart' belongs to a particular queue, which in fact may
>> not need to be restarted; thus we return from blk_mq_sched_restart()
>> and leave the shared hctx of another queue never restarted.
>>
>> The fix is to make the shared_hctx_restart counter belong not to the
>> queue but to the tags, so that the counter reflects the real number of
>> shared hctxs that need to be restarted.
>
> Hello Roman,
>
> The patch you posted looks fine to me, but seeing this patch and its
> description makes me wonder why this had not been noticed before.

This is a good question, which I could not answer. I tried to simulate
the same behaviour (completion timings, completion pinning, number of
submission queues, shared tags, etc.) on null_blk, but what I see is that
*_sched_restart() never observes 'shared_hctx_restart', literally never
(I added a counter that is incremented when we take that path and start
looking for a hctx to restart, and the counter stays 0). That made me
nervous, and I gave up. After some time I want to return to this and try
to reproduce the problem on something else, say NVMe.

> Are you perhaps using a block driver that returns BLK_STS_RESOURCE more
> often than other block drivers? Did you perhaps run into this with the
> InfiniBand network block device (IBNBD) driver?

Yep, this is IBNBD, but in these tests I tested with an mq scheduler,
shared tags and 1 hctx for each queue (block device), thus I never run
out of internal tags and never return BLK_STS_RESOURCE.

Indeed, unmodified IBNBD does internal tag management. This was needed
because each queue (block device) was created with an hctx number
(nr_hw_queues) equal to the number of CPUs on the system, but a blk-mq
tag set is shared only between hctxs, not globally, which led to the
need to return BLK_STS_RESOURCE and restart queues. But with an mq
scheduler the situation changed: 1 hctx with shared tags can be
specified for all hundreds of devices without any performance impact.

Testing this configuration (1 hctx, shared tags, mq-deadline) immediately
shows these two problems: request stalling and slow loops inside
blk_mq_sched_restart().

> No matter what driver triggered this, I think this bug should be fixed.

Yes, the queue stalling can be easily fixed. I can resend the current
patch with a shorter description which targets only this particular bug,
if no one else has objections/comments, etc.

But what bothers me is these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq:
Restart a single queue if tag sets are shared"), I want to ask what the
original problem was that you attempted to fix. Likely I am missing some
test scenario, which would be great to know about.

--
Roman
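A minimal sketch of the configuration described above: many request
queues created from one blk_mq_tag_set with nr_hw_queues = 1, which is
what makes blk-mq mark the hctxs BLK_MQ_F_TAG_SHARED once a second queue
attaches to the set. The ibnbd_* names and the numbers here are purely
illustrative assumptions, not the real IBNBD code:

#include <linux/blk-mq.h>

/* One tag set shared by all devices; ibnbd_* names are hypothetical. */
static struct blk_mq_tag_set ibnbd_tag_set;

static int ibnbd_setup_tag_set(const struct blk_mq_ops *ops)
{
	ibnbd_tag_set.ops          = ops;
	ibnbd_tag_set.nr_hw_queues = 1;     /* 1 hctx per request queue */
	ibnbd_tag_set.queue_depth  = 128;
	ibnbd_tag_set.numa_node    = NUMA_NO_NODE;
	ibnbd_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

	return blk_mq_alloc_tag_set(&ibnbd_tag_set);
}

static struct request_queue *ibnbd_add_device(void)
{
	/*
	 * Each device gets its own request_queue, but all of them share
	 * ibnbd_tag_set.  As soon as a second queue attaches to the set,
	 * blk-mq marks the hctxs of all its queues BLK_MQ_F_TAG_SHARED.
	 */
	return blk_mq_init_queue(&ibnbd_tag_set);
}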
On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
> But what bothers me is these looong loops inside blk_mq_sched_restart(),
> and since you are the author of the original 6d8c6c0f97ad ("blk-mq:
> Restart a single queue if tag sets are shared"), I want to ask what the
> original problem was that you attempted to fix. Likely I am missing some
> test scenario, which would be great to know about.

Long loops? How many queues share the same tag set on your setup? How
many hardware queues does your block driver create per request queue?

Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in
the SCSI and dm-mq drivers.

Bart.
On Fri, Oct 20, 2017 at 10:05 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Fri, 2017-10-20 at 11:39 +0200, Roman Penyaev wrote:
>> But what bothers me is these looong loops inside blk_mq_sched_restart(),
>> and since you are the author of the original 6d8c6c0f97ad ("blk-mq:
>> Restart a single queue if tag sets are shared"), I want to ask what the
>> original problem was that you attempted to fix. Likely I am missing some
>> test scenario, which would be great to know about.
>
> Long loops? How many queues share the same tag set on your setup? How
> many hardware queues does your block driver create per request queue?

Yeah, OK, my mistake. I should have split the two issues instead of
describing everything in one go in the first email. So, take a look.
For my tests I create 128 queues (devices) with 64 hctxs each; all
queues share the same tag set, then I start 128 fio jobs (1 job per
queue).

The following is the fio and ftrace output for the v4.14-rc4 kernel
(without any changes):

   READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s,
         mint=10058msec, maxt=10058msec
  WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s,
         mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function              Hit    Time            Avg          s^2
  --------              ---    ----            ---          ---
  blk_mq_sched_restart  16347  9540759 us      583.639 us   8804801 us
  blk_mq_sched_restart  7884   6073471 us      770.354 us   8780054 us
  blk_mq_sched_restart  14176  7586794 us      535.185 us   2822731 us
  blk_mq_sched_restart  7843   6205435 us      791.206 us   12424960 us
  blk_mq_sched_restart  1490   4786107 us      3212.153 us  1949753 us   <<< !!! 3 ms on average !!!
  blk_mq_sched_restart  7892   6039311 us      765.244 us   2994627 us
  blk_mq_sched_restart  15382  7511126 us      488.306 us   3090912 us
  [cut]

And here are the results with two patches reverted:

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

   READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s,
         mint=10032msec, maxt=10032msec
  WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s,
         mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function              Hit    Time         Avg       s^2
  --------              ---    ----         ---       ---
  blk_mq_sched_restart  50699  8802.349 us  0.173 us  121.771 us
  blk_mq_sched_restart  50362  8740.470 us  0.173 us  161.494 us
  blk_mq_sched_restart  50402  9066.337 us  0.179 us  113.009 us
  blk_mq_sched_restart  50104  9366.197 us  0.186 us  188.645 us
  blk_mq_sched_restart  50375  9317.727 us  0.184 us  54.218 us
  blk_mq_sched_restart  50136  9311.657 us  0.185 us  446.790 us
  blk_mq_sched_restart  50103  9179.625 us  0.183 us  114.472 us
  [cut]

The difference is significant: 570MB/s vs 1280MB/s. E.g. one CPU spent
3 ms on average iterating over all queues and hctxs in order to find the
hctx to restart. In total, the CPUs spent *seconds* in the loop. That
seems incredibly long.

> Commit 6d8c6c0f97ad is something I came up with to fix queue lockups in
> the SCSI and dm-mq drivers.

You mean fairness? (Some hctxs get a smaller number of chances to be
restarted.) That's why you need to restart them in RR fashion, right?

In IBNBD I also do hctx restarts in RR fashion, and for that I put each
hctx which needs to be restarted on a separate per-CPU list. Probably it
makes sense to do the same here?

--
Roman
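A minimal sketch of the per-CPU restart list idea mentioned above.
The restart_entry member does not exist in struct blk_mq_hw_ctx; it and
the function names are assumptions for illustration, not the IBNBD code:

#include <linux/blk-mq.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

/*
 * One restart list per CPU: marking a hctx costs O(1), and a restart
 * dequeues a starved hctx directly instead of scanning every queue and
 * hctx that shares the tag set.  Per-CPU locks and list heads are
 * assumed initialized at module init.
 */
struct restart_list {
	spinlock_t       lock;
	struct list_head list;
};

static DEFINE_PER_CPU(struct restart_list, hctx_restart_lists);

static void mark_hctx_for_restart(struct blk_mq_hw_ctx *hctx)
{
	struct restart_list *rl = raw_cpu_ptr(&hctx_restart_lists);

	spin_lock(&rl->lock);
	if (list_empty(&hctx->restart_entry))		/* hypothetical field */
		list_add_tail(&hctx->restart_entry, &rl->list);
	spin_unlock(&rl->lock);
}

/* Called on request completion: restart exactly one starved hctx.
 * Taking from the head gives FIFO (round-robin-like) fairness. */
static void restart_one_hctx(void)
{
	struct restart_list *rl = raw_cpu_ptr(&hctx_restart_lists);
	struct blk_mq_hw_ctx *hctx = NULL;

	spin_lock(&rl->lock);
	if (!list_empty(&rl->list)) {
		hctx = list_first_entry(&rl->list, struct blk_mq_hw_ctx,
					restart_entry);
		list_del_init(&hctx->restart_entry);
	}
	spin_unlock(&rl->lock);

	if (hctx)
		blk_mq_run_hw_queue(hctx, true);
}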
Hi Ming,

On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@redhat.com> wrote:
> On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
>> Hi all,
>>
>> the patch below fixes queue stalling when a shared hctx is marked for
>> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
>> zero. The root cause is that hctxs are shared between queues, but
>> 'shared_hctx_restart' belongs to a particular queue, which in fact may
>> not need to be restarted; thus we return from blk_mq_sched_restart()
>> and leave the shared hctx of another queue never restarted.
>>
>> The fix is to make the shared_hctx_restart counter belong not to the
>> queue but to the tags, so that the counter reflects the real number of
>> shared hctxs that need to be restarted.
>>
>> During the tests 1 hctx (set->nr_hw_queues) was used and all stalled
>> requests were noticed in dd->fifo_list of the mq-deadline scheduler.
>>
>> A possible sequence of events:
>>
>> 1. Request A of queue A is inserted into dd->fifo_list of the
>>    scheduler.
>>
>> 2. Request B of queue A bypasses the scheduler and goes directly to
>>    hctx->dispatch.
>>
>> 3. Request C of queue B is inserted.
>>
>> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch
>>    is not empty (request B is in the list), the hctx is only marked
>>    for the next restart and request A is left in the list (see the
>>    comment "So it's best to leave them there for as long as we can.
>>    Mark the hw queue as needing a restart in that case." in
>>    blk-mq-sched.c).
>>
>> 5. Eventually request B is completed/freed and blk_mq_sched_restart()
>>    is called, but by chance the hctx from queue B is chosen for
>>    restart and request C gets a chance to be dispatched.
>>
>> 6. Eventually request C is completed/freed and blk_mq_sched_restart()
>>    is called, but shared_hctx_restart for queue B is zero and we
>>    return without any attempt to restart the hctx from queue A, thus
>>    request A is stuck forever.
>>
>> But stalling the queue is not the only problem with
>> blk_mq_sched_restart(). My tests show that those loops through all
>> queues and hctxs can be very costly, even with the shared_hctx_restart
>> counter, which aims to fix the performance issue. For my tests I
>> create 128 devices with 64 hctxs each, which share the same tag set.
>
> Hi Roman,
>
> I also found the performance issue with RESTART for TAG_SHARED.
>
> But from my analysis, RESTART isn't needed for TAG_SHARED because
> SCSI-MQ has handled the RESTART by itself already, so could you test
> the patch in the following link, posted days ago, to see if it fixes
> your issue?

I can say without any testing: it fixes all the issues :) You've
reverted

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

with one major difference: you do not handle shared tags in a special
way and restart only the requested hctx, instead of iterating over all
hctxs in a queue.

Firstly I have to say that the queue stalling issue (#1) and the
performance issue (#2) were observed on our in-house RDMA driver IBNBD:
https://lwn.net/Articles/718181/, and I've never tried to reproduce
them on SCSI-MQ.

Then, your patch breaks RR restarts, which were introduced by commit
6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
The basic idea of that patch seems nice, but because of the possibly
large number of queues and hctxs the patch requires reimplementation.
Eventually you should get a fast hctx restart, but in RR fashion.
According to my understanding that does not contradict your patch.

--
Roman
On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:
> Hi Ming,
>
> On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@redhat.com> wrote:
> > On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
> >> Hi all,
> >>
> >> the patch below fixes queue stalling when a shared hctx is marked for
> >> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
> >> zero. The root cause is that hctxs are shared between queues, but
> >> 'shared_hctx_restart' belongs to a particular queue, which in fact
> >> may not need to be restarted; thus we return from
> >> blk_mq_sched_restart() and leave the shared hctx of another queue
> >> never restarted.
> >>
> >> The fix is to make the shared_hctx_restart counter belong not to the
> >> queue but to the tags, so that the counter reflects the real number
> >> of shared hctxs that need to be restarted.
> >>
> >> During the tests 1 hctx (set->nr_hw_queues) was used and all stalled
> >> requests were noticed in dd->fifo_list of the mq-deadline scheduler.
> >>
> >> A possible sequence of events:
> >>
> >> 1. Request A of queue A is inserted into dd->fifo_list of the
> >>    scheduler.
> >>
> >> 2. Request B of queue A bypasses the scheduler and goes directly to
> >>    hctx->dispatch.
> >>
> >> 3. Request C of queue B is inserted.
> >>
> >> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch
> >>    is not empty (request B is in the list), the hctx is only marked
> >>    for the next restart and request A is left in the list (see the
> >>    comment "So it's best to leave them there for as long as we can.
> >>    Mark the hw queue as needing a restart in that case." in
> >>    blk-mq-sched.c).
> >>
> >> 5. Eventually request B is completed/freed and blk_mq_sched_restart()
> >>    is called, but by chance the hctx from queue B is chosen for
> >>    restart and request C gets a chance to be dispatched.
> >>
> >> 6. Eventually request C is completed/freed and blk_mq_sched_restart()
> >>    is called, but shared_hctx_restart for queue B is zero and we
> >>    return without any attempt to restart the hctx from queue A, thus
> >>    request A is stuck forever.
> >>
> >> But stalling the queue is not the only problem with
> >> blk_mq_sched_restart(). My tests show that those loops through all
> >> queues and hctxs can be very costly, even with the
> >> shared_hctx_restart counter, which aims to fix the performance
> >> issue. For my tests I create 128 devices with 64 hctxs each, which
> >> share the same tag set.
> >
> > Hi Roman,
> >
> > I also found the performance issue with RESTART for TAG_SHARED.
> >
> > But from my analysis, RESTART isn't needed for TAG_SHARED because
> > SCSI-MQ has handled the RESTART by itself already, so could you test
> > the patch in the following link, posted days ago, to see if it fixes
> > your issue?
>
> I can say without any testing: it fixes all the issues :) You've
> reverted
>
>    8e8320c9315c ("blk-mq: fix performance regression with shared tags")
>    6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
>
> with one major difference: you do not handle shared tags in a special
> way and restart only the requested hctx, instead of iterating over all
> hctxs in a queue.

Hi Roman,

Yeah, that is unnecessary, as I explained in detail in the commit log,
and it introduces lots of overhead: I can see IOPS improved by 20%~30%
in my SCSI_DEBUG test (only 8 LUNs) by removing the blk-mq restart for
TAG_SHARED. Also it isn't needed for BLK_STS_RESOURCE returned from
.get_budget().

I will post out V3 soon.
>
> Firstly I have to say that the queue stalling issue (#1) and the
> performance issue (#2) were observed on our in-house RDMA driver IBNBD:
> https://lwn.net/Articles/718181/, and I've never tried to reproduce
> them on SCSI-MQ.

OK, got it. Then you can handle it in SCSI-MQ's way, since that way has
been used by non-MQ for a long time. Or you need to do nothing if your
driver is SCSI based.

> Then, your patch breaks RR restarts, which were introduced by commit
> 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
> The basic idea of that patch seems nice, but because of the possibly
> large number of queues and hctxs the patch requires reimplementation.
> Eventually you should get a fast hctx restart, but in RR fashion.
> According to my understanding that does not contradict your patch.

Firstly, this way has been run/verified in non-mq for a long time, and
no one complained about it before.

Secondly, please see scsi_target_queue_ready() and
scsi_host_queue_ready(): the sdev is added to the 'starved_list' in FIFO
style, which is still fair.

So I don't think it is an issue to remove the RR restarts.

Thanks,
Ming
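For reference, the SCSI pattern Ming points at works roughly like this.
This is a simplified paraphrase of scsi_host_queue_ready() and
scsi_starved_list_run() in drivers/scsi/scsi_lib.c, not the verbatim
kernel code:

#include <linux/blk-mq.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* On resource exhaustion: park the device at the tail (FIFO order). */
static void starved_add(struct Scsi_Host *shost, struct scsi_device *sdev)
{
	spin_lock_irq(shost->host_lock);
	if (list_empty(&sdev->starved_entry))
		list_add_tail(&sdev->starved_entry, &shost->starved_list);
	spin_unlock_irq(shost->host_lock);
}

/* On completion: kick the device that has been starved the longest. */
static void starved_run_one(struct Scsi_Host *shost)
{
	struct scsi_device *sdev = NULL;

	spin_lock_irq(shost->host_lock);
	if (!list_empty(&shost->starved_list)) {
		sdev = list_first_entry(&shost->starved_list,
					struct scsi_device, starved_entry);
		list_del_init(&sdev->starved_entry);
	}
	spin_unlock_irq(shost->host_lock);

	if (sdev)
		blk_mq_run_hw_queues(sdev->request_queue, true);
}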
On Mon, 2017-10-23 at 17:12 +0200, Roman Penyaev wrote:
> On Fri, Oct 20, 2017 at 10:05 PM, Bart Van Assche wrote:
> > Commit 6d8c6c0f97ad is something I came up with to fix queue lockups
> > in the SCSI and dm-mq drivers.
>
> You mean fairness? (Some hctxs get a smaller number of chances to be
> restarted.) That's why you need to restart them in RR fashion, right?

Commit 6d8c6c0f97ad not only improved fairness but also fixed a lockup
scenario: after a driver tag has been freed, not only the hw queues of
the request queue for which a request completed must be reexamined, but
also the hw queues associated with the other request queues that share
the same tag set.

Bart.
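For context, this is roughly the code path being debated: a condensed
paraphrase of what blk_mq_sched_restart() does for shared tags in
v4.14-rc4, after commits 6d8c6c0f97ad and 8e8320c9315c (simplified, not
a verbatim copy of the kernel source):

/*
 * For shared tags, walk every request queue attached to the tag set,
 * round-robin starting from the current queue, and kick the first hctx
 * found with BLK_MQ_S_SCHED_RESTART set.  With 128 queues of 64 hctxs
 * this scan is what shows up in the ftrace numbers above.
 */
void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
{
	struct request_queue *const queue = hctx->queue;
	struct blk_mq_tag_set *const set = queue->tag_set;

	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
		if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
			blk_mq_sched_restart_hctx(hctx);
	} else {
		struct request_queue *q;
		struct blk_mq_hw_ctx *hctx2;
		unsigned int i;

		/*
		 * Cheap exit added by 8e8320c9315c.  Before Roman's fix
		 * this counter lives in the request_queue, which is the
		 * source of the stall discussed in this thread.
		 */
		if (!atomic_read(&queue->shared_hctx_restart))
			return;

		rcu_read_lock();
		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
					   tag_set_list) {
			queue_for_each_hw_ctx(q, hctx2, i)
				if (blk_mq_sched_restart_hctx(hctx2))
					goto done;
		}
done:
		rcu_read_unlock();
	}
}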
On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:
> the patch below fixes queue stalling when a shared hctx is marked for
> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
> zero.

Since I have been able to reproduce this queue stall with an upstream
driver (ib_srp), and since I have also been able to verify that this
patch fixes that stall:

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..a19a7f275173 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -60,10 +60,10 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_inc(&q->shared_hctx_restart);
+			atomic_inc(&tags->shared_hctx_restart);
 	} else
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
@@ -74,10 +74,10 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		return false;
 
 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
-		struct request_queue *q = hctx->queue;
+		struct blk_mq_tags *tags = hctx->tags;
 
 		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-			atomic_dec(&q->shared_hctx_restart);
+			atomic_dec(&tags->shared_hctx_restart);
 	} else
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
@@ -312,7 +312,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 		 * If this is 0, then we know that no hardware queues
 		 * have RESTART marked. We're done.
 		 */
-		if (!atomic_read(&queue->shared_hctx_restart))
+		if (!atomic_read(&tags->shared_hctx_restart))
 			return;
 
 		rcu_read_lock();
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..598f8e0095ff 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -385,6 +385,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	atomic_set(&tags->shared_hctx_restart, 0);
 
 	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
 }
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 5cb51e53cc03..adf05c8811cd 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -11,6 +11,7 @@ struct blk_mq_tags {
 	unsigned int nr_reserved_tags;
 
 	atomic_t active_queues;
+	atomic_t shared_hctx_restart;
 
 	struct sbitmap_queue bitmap_tags;
 	struct sbitmap_queue breserved_tags;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..7639f978ea2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2120,11 +2120,11 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared) {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_inc(&q->shared_hctx_restart);
+				atomic_inc(&hctx->tags->shared_hctx_restart);
 			hctx->flags |= BLK_MQ_F_TAG_SHARED;
 		} else {
 			if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-				atomic_dec(&q->shared_hctx_restart);
+				atomic_dec(&hctx->tags->shared_hctx_restart);
 			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
 		}
 	}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2a5d52fa90f5..3852a9ea87d0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,8 +393,6 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
-	atomic_t		shared_hctx_restart;
-
 	struct blk_queue_stats	*stats;
 	struct rq_wb		*rq_wb;
 
Hi all,

the patch below fixes queue stalling when a shared hctx is marked for
restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
zero. The root cause is that hctxs are shared between queues, but
'shared_hctx_restart' belongs to a particular queue, which in fact may
not need to be restarted; thus we return from blk_mq_sched_restart()
and leave the shared hctx of another queue never restarted.

The fix is to make the shared_hctx_restart counter belong not to the
queue but to the tags, so that the counter reflects the real number of
shared hctxs that need to be restarted.

During the tests 1 hctx (set->nr_hw_queues) was used and all stalled
requests were noticed in dd->fifo_list of the mq-deadline scheduler.

A possible sequence of events:

1. Request A of queue A is inserted into dd->fifo_list of the scheduler.

2. Request B of queue A bypasses the scheduler and goes directly to
   hctx->dispatch.

3. Request C of queue B is inserted.

4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is
   not empty (request B is in the list), the hctx is only marked for the
   next restart and request A is left in the list (see the comment "So
   it's best to leave them there for as long as we can. Mark the hw
   queue as needing a restart in that case." in blk-mq-sched.c).

5. Eventually request B is completed/freed and blk_mq_sched_restart() is
   called, but by chance the hctx from queue B is chosen for restart and
   request C gets a chance to be dispatched.

6. Eventually request C is completed/freed and blk_mq_sched_restart() is
   called, but shared_hctx_restart for queue B is zero and we return
   without any attempt to restart the hctx from queue A, thus request A
   is stuck forever (see the sketch below).

But stalling the queue is not the only problem with
blk_mq_sched_restart(). My tests show that those loops through all
queues and hctxs can be very costly, even with the shared_hctx_restart
counter, which aims to fix the performance issue. For my tests I create
128 devices with 64 hctxs each, which share the same tag set.
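Returning to the stall itself, here is a condensed paraphrase of the
check that goes wrong in step 6 above (not verbatim kernel code). The
hctx is shared between queue A and queue B, but the counter that gates
the restart scan is per-queue: a completion on queue B reads queue B's
counter, sees zero and returns, even though the shared hctx still has
BLK_MQ_S_SCHED_RESTART set on behalf of queue A's request:

/* Paraphrase of the broken path: the early-return check reads a
 * per-queue counter although the marked hctx is shared. */
void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
{
	struct request_queue *const queue = hctx->queue;

	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
		/* Before the fix: queue B's counter is 0, so we bail out
		 * and never reach the scan that would restart the shared
		 * hctx marked for queue A's request. */
		if (!atomic_read(&queue->shared_hctx_restart))
			return;
		/* After the fix the counter moves next to what is really
		 * shared:
		 *	if (!atomic_read(&hctx->tags->shared_hctx_restart))
		 *		return;
		 */
		/* ... scan all queues sharing the tag set ... */
	}
}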
The following is the fio and ftrace output for the v4.14-rc4 kernel:

   READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s,
         mint=10058msec, maxt=10058msec
  WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s,
         mint=10058msec, maxt=10058msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function              Hit    Time            Avg          s^2
  --------              ---    ----            ---          ---
  blk_mq_sched_restart  16347  9540759 us      583.639 us   8804801 us
  blk_mq_sched_restart  7884   6073471 us      770.354 us   8780054 us
  blk_mq_sched_restart  14176  7586794 us      535.185 us   2822731 us
  blk_mq_sched_restart  7843   6205435 us      791.206 us   12424960 us
  blk_mq_sched_restart  1490   4786107 us      3212.153 us  1949753 us
  blk_mq_sched_restart  7892   6039311 us      765.244 us   2994627 us
  blk_mq_sched_restart  15382  7511126 us      488.306 us   3090912 us
  [cut]

And here are the results with two patches reverted:

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

   READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s,
         mint=10032msec, maxt=10032msec
  WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s,
         mint=10032msec, maxt=10032msec

root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq
  Function              Hit    Time         Avg       s^2
  --------              ---    ----         ---       ---
  blk_mq_sched_restart  50699  8802.349 us  0.173 us  121.771 us
  blk_mq_sched_restart  50362  8740.470 us  0.173 us  161.494 us
  blk_mq_sched_restart  50402  9066.337 us  0.179 us  113.009 us
  blk_mq_sched_restart  50104  9366.197 us  0.186 us  188.645 us
  blk_mq_sched_restart  50375  9317.727 us  0.184 us  54.218 us
  blk_mq_sched_restart  50136  9311.657 us  0.185 us  446.790 us
  blk_mq_sched_restart  50103  9179.625 us  0.183 us  114.472 us
  [cut]

Timings and standard deviations are terrible, which leads to a
significant difference: 570MB/s vs 1280MB/s.

This is an RFC, since the current patch fixes the queue stalling but the
performance issue still remains, and it is not clear to me whether it is
better to improve commit 6d8c6c0f97ad ("blk-mq: Restart a single queue
if tag sets are shared") by making per-CPU restart lists (to avoid
looping and to dequeue a hctx immediately) or to revert it (frankly, I
did not notice any difference with a small number of devices and hctxs,
where the looping issue does not have much impact).

--
Roman

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
---
 block/blk-mq-sched.c   | 10 +++++-----
 block/blk-mq-tag.c     |  1 +
 block/blk-mq-tag.h     |  1 +
 block/blk-mq.c         |  4 ++--
 include/linux/blkdev.h |  2 --
 5 files changed, 9 insertions(+), 9 deletions(-)