
[RFC] blk-mq: fixup RESTART when queue becomes idle

Message ID 20180118204856.GA31679@redhat.com (mailing list archive)
State New, archived

Commit Message

Mike Snitzer Jan. 18, 2018, 8:48 p.m. UTC
On Thu, Jan 18 2018 at  3:11pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> > 
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
> 
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.
> 
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.
> 
> That last set of conditions better not be a common occurrence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.
> 
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.
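
(A rough sketch of the split Jens describes above, purely illustrative:
BLK_STS_NO_DEV_RESOURCE and the example_* helpers are made-up names, not
anything that exists in the tree.)

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Made-up status for "resource not owned by this device"; it only
 * illustrates the DIFFERENT return value suggested above. */
#define BLK_STS_NO_DEV_RESOURCE	((__force blk_status_t)100)

/* Hypothetical driver helpers, stubbed for illustration. */
static bool example_out_of_tags(struct blk_mq_hw_ctx *hctx) { return false; }
static bool example_iommu_full(struct request *rq) { return false; }
static void example_submit(struct request *rq) { }

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	if (example_out_of_tags(hctx))
		/* Device-owned shortage: a completion on this queue will free
		 * the resource, so the existing RESTART handling suffices. */
		return BLK_STS_RESOURCE;

	if (example_iommu_full(rq))
		/* Non-device shortage (iommu space, memory, ...): no completion
		 * will signal when it clears, so return something DIFFERENT and
		 * let the core schedule a generic delayed re-run. */
		return BLK_STS_NO_DEV_RESOURCE;

	example_submit(rq);
	return BLK_STS_OK;
}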

This is a bit confusing for me (as I see it we have 2 blk-mq drivers
trying to collaborate, so your referring to "driver" lacks precision; but
I could just be missing something)...

For Bart's test the underlying scsi-mq driver is what is regularly
hitting this case in __blk_mq_try_issue_directly():

        if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

It certainly better not be the norm (Bart's test hammering on this aside).

For starters, it'd be very useful to know if Bart is hitting the
blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is
triggering the use of blk_mq_sched_insert_request() -- I'd wager it is
due to blk_queue_quiesced() but Bart _please_ try to figure it out.
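
(If it helps, that check is trivial to instrument; the following is only a
throwaway debugging sketch for __blk_mq_try_issue_directly(), not a proposed
change:)

	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
		/* record which condition forces the fallback to
		 * blk_mq_sched_insert_request() */
		trace_printk("bypassing direct issue: stopped=%d quiesced=%d\n",
			     blk_mq_hctx_stopped(hctx),
			     blk_queue_quiesced(q));
		run_queue = false;
		bypass_insert = false;
		goto insert;
	}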

Anyway, in response to this case Bart would like the upper layer dm-mq
driver to blk_mq_delay_run_hw_queue().  Certainly is quite the hammer.

But that hammer aside, in general for this case, I'm concerned about: is
it really correct to allow an already stopped/quiesced underlying queue
to retain responsibility for processing the request?  Or would the
upper-layer dm-mq benefit from being able to retry the request on its
terms (via a "DIFFERENT" return from blk-mq core)?

Like this?  The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in
this case but...

(Bart please note that this patch applies on linux-dm.git's 'for-next';
which is just a merge of Jens' 4.16 tree and dm-4.16)
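
Roughly, the call chain the patch below operates on, and the intended
propagation of the new return value (the function names come from the diff;
the flow annotation is only a summary):

/*
 *   dm_mq_queue_rq()                          upper dm-mq queue
 *     map_request()
 *       dm_dispatch_clone_request()
 *         blk_insert_cloned_request()
 *           blk_mq_request_direct_issue()
 *             __blk_mq_try_issue_directly()   underlying scsi-mq queue
 *               hctx stopped or queue quiesced?
 *                 -> return BLK_STS_DM_REQUEUE instead of inserting
 *       BLK_STS_DM_REQUEUE -> DM_MAPIO_DELAY_REQUEUE
 *   DM_MAPIO_DELAY_REQUEUE -> blk_mq_delay_run_hw_queue(hctx, 100) [ms]
 *                             and return BLK_STS_RESOURCE to the core
 */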

Comments

Bart Van Assche Jan. 18, 2018, 8:58 p.m. UTC | #1
On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> For Bart's test the underlying scsi-mq driver is what is regularly
> hitting this case in __blk_mq_try_issue_directly():
> 
>         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

Hello Mike,

That code path is not the code path that triggered the lockups that I reported
during the past days. These lockups were all triggered by incorrect handling of
.queue_rq() returning BLK_STS_RESOURCE.

Bart.
Mike Snitzer Jan. 18, 2018, 9:23 p.m. UTC | #2
On Thu, Jan 18 2018 at  3:58P -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > For Bart's test the underlying scsi-mq driver is what is regularly
> > hitting this case in __blk_mq_try_issue_directly():
> > 
> >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> 
> Hello Mike,
> 
> That code path is not the code path that triggered the lockups that I reported
> during the past days.

If you're hitting blk_mq_sched_insert_request() then you most certainly
are hitting that code path.

If you aren't then what was your earlier email going on about?
https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html

If you were just focusing on that as one possible reason, that isn't
very helpful.  By this point you really should _know_ what is triggering
the stall based on the code paths taken.  Please use ftrace's
function_graph tracer if need be.

> These lockups were all triggered by incorrect handling of
> .queue_rq() returning BLK_STS_RESOURCE.

Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
"Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

Please try to do more work analyzing the test case that only you can
easily run (due to srp_test being a PITA).  And less time lobbying for
a change that you don't understand to _really_ be correct.

We have time to get this right, please stop hyperventilating about
"regressions".

Thanks,
Mike
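
(For context on the blk_mq_delay_run_hw_queue() question above: the earlier
dm-rq behaviour being referenced looked roughly like this in dm_mq_queue_rq();
a from-memory sketch, not the exact tree state.)

	if (map_request(tio) == DM_MAPIO_REQUEUE) {
		/* Undo dm_start_request() before requeuing */
		rq_end_stats(md, rq);
		rq_completed(md, rq_data_dir(rq), false);
		/* blind delayed re-run so a stalled queue is retried eventually */
		blk_mq_delay_run_hw_queue(hctx, 100 /* ms */);
		return BLK_STS_RESOURCE;
	}
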
Laurence Oberman Jan. 18, 2018, 9:37 p.m. UTC | #3
On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  3:58P -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is
> > > regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > > 
> > >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > 
> > Hello Mike,
> > 
> > That code path is not the code path that triggered the lockups that
> > I reported
> > during the past days.
> 
> If you're hitting blk_mq_sched_insert_request() then you most
> certainly
> are hitting that code path.
> 
> If you aren't then what was your earlier email going on about?
> https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html
> 
> If you were just focusing on that as one possible reason, that isn't
> very helpful.  By this point you really should _know_ what is
> triggering
> the stall based on the code paths taken.  Please use ftrace's
> function_graph tracer if need be.
> 
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
> 
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> 
> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).  And less time lobbying
> for
> a change that you don't understand to _really_ be correct.
> 
> We have time to get this right, please stop hyperventilating about
> "regressions".
> 
> Thanks,
> Mike

Hello Bart
I have run a good few loops of 02-mq and it's stable for me on your
tree.
I am not running the entire disconnect/re-connect loops and un-mounts
etc., for good reason:
I have 35 LUNs, so it's very impactful to lose them and have them come
back all the time.

Anyway,
I am very happy to try to reproduce this in-house so Mike and Ming can
focus on it, but I need to know if all I need to do is loop over 02-mq
over and over.

Also, please let me know what debugfs and sysfs data to capture, and I am
happy to try to help move this along.

Regards
Laurence
Bart Van Assche Jan. 18, 2018, 9:39 p.m. UTC | #4
On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  3:58P -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > > 
> > >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> >
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
> 
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
With "incorrect" I meant that queue lockups are introduced that make user
space processes unkillable. That's a severe bug.

> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).

It is not correct that I'm the only one who is able to run that software.
Anyone who is willing to merge the latest SRP initiator and target driver
patches in his or her tree can run that software in
any VM. I'm working hard
on getting the patches upstream that make it possible to run the srp-test
software on a setup that is not equipped with InfiniBand hardware.

> We have time to get this right, please stop hyperventilating about
> "regressions".

Sorry Mike but that's something I consider as an unfair comment. If Ming and
you work on patches together, it's your job to make sure that no regressions
are introduced. Instead of blaming me because I report these regressions you
should be grateful that I take the time and effort to report these regressions
early. And since you are employed by a large organization that sells Linux
support services, your employer should invest in developing test cases that
reach a higher coverage of the dm, SCSI and block layer code. I don't think
that it's normal that my tests discovered several issues that were not
discovered by Red Hat's internal test suite. That's something Red Hat has to
address.

Bart.
Mike Snitzer Jan. 18, 2018, 10:01 p.m. UTC | #5
On Thu, Jan 18 2018 at  4:39pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at  3:58P -0500,
> > Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > 
> > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > > hitting this case in __blk_mq_try_issue_directly():
> > > > 
> > > >         if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > >
> > > These lockups were all triggered by incorrect handling of
> > > .queue_rq() returning BLK_STS_RESOURCE.
> > 
> > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> 
> In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
> With "incorrect" I meant that queue lockups are introduced that make user
> space processes unkillable. That's a severe bug.

And yet Laurence cannot reproduce any such lockups with your test...

Are you absolutely certain this patch doesn't help you?
https://patchwork.kernel.org/patch/10174037/

If it doesn't then that is actually very useful to know.

> > We have time to get this right, please stop hyperventilating about
> > "regressions".
> 
> Sorry Mike but that's something I consider as an unfair comment. If Ming and
> you work on patches together, it's your job to make sure that no regressions
> are introduced. Instead of blaming me because I report these regressions you
> should be grateful that I take the time and effort to report these regressions
> early. And since you are employed by a large organization that sells Linux
> support services, your employer should invest in developing test cases that
> reach a higher coverage of the dm, SCSI and block layer code. I don't think
> that it's normal that my tests discovered several issues that were not
> discovered by Red Hat's internal test suite. That's something Red Hat has to
> address.

You have no self-awareness of just how myopic you're being about this.

I'm not ignoring or blaming you for your test no longer passing.  Far
from it.  I very much want to fix this.  But I want it fixed in a way
that doesn't paper over the real bug(s) while also introducing blind
queue runs that compromise the blk-mq RESTART code's ability to work as
intended.

I'd have thought you could appreciate this.  We need a root cause on
this, not hand-waving justifications on why problematic delayed queue
runs are correct.

Please just focus on helping Laurence get his very capable testbed to
reproduce this issue.  Once we can reproduce these "unkillable" "stalls"
in-house it'll be _much_ easier to analyze and fix.

Thanks,
Mike
Laurence Oberman Jan. 18, 2018, 10:18 p.m. UTC | #6
On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  4:39pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > > On Thu, Jan 18 2018 at  3:58P -0500,
> > > Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> > > 
> > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > > For Bart's test the underlying scsi-mq driver is what is
> > > > > regularly
> > > > > hitting this case in __blk_mq_try_issue_directly():
> > > > > 
> > > > >         if (blk_mq_hctx_stopped(hctx) ||
> > > > > blk_queue_quiesced(q))
> > > > 
> > > > These lockups were all triggered by incorrect handling of
> > > > .queue_rq() returning BLK_STS_RESOURCE.
> > > 
> > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > > "Incorrect" because it no longer runs
> > > blk_mq_delay_run_hw_queue()?
> > 
> > In what I wrote I was referring to both dm_mq_queue_rq() and
> > scsi_queue_rq().
> > With "incorrect" I meant that queue lockups are introduced that
> > make user
> > space processes unkillable. That's a severe bug.
> 
> And yet Laurence cannot reproduce any such lockups with your test...
> 
> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.
> 
> > > We have time to get this right, please stop hyperventilating
> > > about
> > > "regressions".
> > 
> > Sorry Mike but that's something I consider as an unfair comment. If
> > Ming and
> > you work on patches together, it's your job to make sure that no
> > regressions
> > are introduced. Instead of blaming me because I report these
> > regressions you
> > should be grateful that I take the time and effort to report these
> > regressions
> > early. And since you are employed by a large organization that
> > sells Linux
> > support services, your employer should invest in developing test
> > cases that
> > reach a higher coverage of the dm, SCSI and block layer code. I
> > don't think
> > that it's normal that my tests discovered several issues that were
> > not
> > discovered by Red Hat's internal test suite. That's something Red
> > Hat has to
> > address.
> 
> You have no self-awareness of just how myopic you're being about
> this.
> 
> I'm not ignoring or blaming you for your test no longer passing.  Far
> from it.  I very much want to fix this.  But I want it fixed in a way
> that doesn't paper over the real bug(s) while also introducing blind
> queue runs that compromise the blk-mq RESTART code's ability to work
> as
> intended.
> 
> I'd have thought you could appreciate this.  We need a root cause on
> this, not hand-waving justifications on why problematic delayed queue
> runs are correct.
> 
> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue.  Once we can reproduce these "unkillable"
> "stalls"
> in-house it'll be _much_ easier to analyze and fix.
> 
> Thanks,
> Mike

Hello Bart

OK, I ran 5 at once of 5 separate mount points.
I am using 4k block sizes.
It's solid and consistent for me. No stalls, no gaps.

[global]
name=02-mq
filename=fio-output-02-mq.txt
rw=randwrite
verify=md5
;rwmixread=60
;rwmixwrite=40
bs=4K
;direct=1
;numjobs=4
;time_based=1
runtime=120

[file1]
size=3G
ioengine=libaio
iodepth=16

I watch I/O and I see it ramp up but fio still runs and it kind of
shuts down.

#!/bin/bash
for i in 1 2 3 4 5
do
	cd /data-$i 
	fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 &
done


#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 

17:16:09   34  12 34431  10393      0      0 249872  36094 
17:16:10   31  10 13001   2679      0      0  57652   7980 
17:16:11   32  11 19473   4416      0      0 143248  17362 
17:16:12   31   9  8068   1606      0      0  20088   2026 
17:16:13   31   9  7718   1518      0      0  15908   1354 
17:16:14   36  14 41132   9710      0      0 686216  46661 
17:16:15   39  18 63622  21033      0      0  1246K  75108 
17:16:16   38  16 76114   9013      0      0  1146K  82338 
17:16:17   33  11 31941   2735      0      0 237340  25034 
17:16:18   36  14 23900   4982      0      1  1567K  43244 
17:16:19   37  15 55884   4974      0      4  1003K  67253 
17:16:20   28  12  7542   4975      0      0  1630K   3266 
17:16:21    8   6 14650  34721      0      0  2535K  21206 
17:16:22    2   2  6655  15839      0      0  2319K   9897 
17:16:23   11  11 37305 134030      0      0  1275K  87010 
17:16:24   23  22 59266 119350   6560   1640  1102K  78034 
17:16:25   21  17 65699 144462 148052  37013 159900  22120 
17:16:26   14   9 80700 164034 216588  54147      4      1 
#         <----CPU[HYPER]-----><----------Disks----------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:16:27   14   9 78699 162095 214412  53603      0      0 
17:16:28   14   9 75895 155352 204932  51233      0      0 
17:16:29   14   9 75377 161871 214136  53534      0      0 
17:16:30   16  11 79298 155967 206164  51541      0      0 
17:16:31   14   8 82281 165266 218652  54662      0      0 
17:16:32   14   9 88635 163302 215876  53970      0      0 
17:16:33   14   8 88755 168958 223568  55892      0      0 
17:16:34   15   9 82073 154321 203936  50984      0      0 
17:16:35   17  10 87004 161467 213308  53327      0      0 
17:16:36   15   9 86198 161247 212940  53235      4      1 
17:16:37   15   9 83429 162890 215276  53819      0      0 
17:16:38   16  10 83272 160611 212264  53066     20      2 
17:16:39   16  10 93850 168179 222376  55594      0      0 
17:16:40   15   9 86801 167507 221372  55343      0      0 
17:16:41   14   8 82688 171249 226560  56640      0      0 
17:16:42   14   8 86202 159171 210344  52586      0      0 
17:16:43   16  10 86063 156009 206328  51582      0      0 
17:16:44   16  10 90927 153961 203584  50896      0      0 
17:16:45   15   9 95084 156370 206844  51710    132     13 
17:16:46   14   8 85572 158469 209300  52326      4      1 
17:16:47   14   8 92075 158469 209600  52400      0      0 
17:16:48   16  10 95469 154772 204176  51044      0      0 
#         <----CPU[HYPER]-----><----------Disks----------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:16:49   15  10 93856 154182 203568  50897      5      5 
17:16:50   15  10 92308 151964 200680  50170      0      0 
17:16:51   16  10 94035 155076 204868  51217      0      0 
17:16:52   15   9 93949 164969 217980  54495      0      0 
17:16:53   14   9 85744 149553 197428  49357      0      0 
17:16:54   14   9 83064 155788 205684  51421      0      0 
17:16:55   16  10 83579 166084 218480  54620      0      0 
17:16:56   15   9 74766 158728 208556  52139      4      1 
17:16:57   14   9 82944 162720 215116  53779      0      0 
17:16:58   12   7 88529 163814 216440  54110      0      0 
17:16:59   14   7 85936 166723 220372  55093      4      1 
17:17:00   13   8 86701 153728 203052  50763      0      0 
17:17:01   14   9 92859 155963 205972  51493      0      0 
17:17:02   13   8 85624 143746 189756  47439      0      0 
17:17:03   13   8 85029 142540 188044  47010      0      0 
17:17:04   14   8 87291 144790 191148  47788      0      0 
17:17:05   14   8 81922 144342 190544  47636      0      0 
17:17:06   14   8 81587 161043 212576  53144      8      2 
17:17:07   15   8 85080 142285 187688  46922      0      0 
17:17:08   13   8 86980 150246 197852  49463      0      0 
17:17:09   12   8 81388 144335 190312  47578      4      1 
17:17:10   13   8 88436 146957 194040  48510      0      0 
#         <----CPU[HYPER]-----><----------Disks----------->
#Time     cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:17:11   11   7 78997 137847 181828  45457      0      0 
17:17:12   12   8 85709 148847 196708  49177      0      0 
17:17:13   17  10 97228 170646 225528  56382      0      0 
17:17:14   14   9 96032 173474 229400  57350      0      0 
17:17:15   15   9 95340 167951 221848  55462      0      0 
17:17:16   16  10 87392 157489 207748  51937     38      5 
17:17:17   16  10 84756 152753 201716  50429      0      0 
17:17:18   17  10 85109 153000 202080  50520      0      0 
17:17:19   16  10 92109 164514 217396  54349     80      5 
17:17:20   15   9 91342 155674 205804  51451      0      0 
17:17:21   12   7 73921 126577 167060  41765      0      0 
17:17:22   13   8 76916 126701 167176  41794      0      0 
17:17:23   13   8 85187 151074 199768  49942      0      0 
17:17:24   15   9 85737 153664 203168  50792      0      0 
17:17:25   15   9 87113 147636 195084  48771      0      0 
17:17:26   14   8 89440 157433 208116  52029      4      1 
17:17:27   14   9 95290 172224 227036  56759      0      0 
17:17:28   15  10 93918 167611 220164  55041      0      0 
17:17:29   13   8 82404 149274 197312  49328      0      0 
17:17:30   10   7 75961 138276 182668  45667      0      0 
17:17:31   12   8 84899 151829 200764  50191      0      0 
17:17:32   13   8 81386 141091 186444  46611      0      0 
17:17:33   15   9  100K 167489 221640  55410      0      0 
17:17:34   14   9 89601 158765 210032  52508      0      0 
17:17:35   14   9 92440 155287 205216  51304      0      0 
17:17:36   15  10 90042 155934 206168  51542      4      1 
17:17:37   14   9 91670 164456 217684  54421      0      0 
17:17:38   14   9 78612 167368 221348  55337      0      0 
17:17:39   15   9 91325 162599 214616  53654      0      0 
17:17:40   14   9 79112 151595 199984  50000     21      8 
17:17:41    9   3 32565  60936  79832  19964     32     12
Bart Van Assche Jan. 18, 2018, 10:20 p.m. UTC | #7
On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> And yet Laurence cannot reproduce any such lockups with your test...

Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
already succeeded at running an unmodified version of my tests. In one of the
e-mails Laurence sent me this morning I read that he modified these scripts
to get past a kernel module unload failure that was reported while starting
these tests. So the next step is to check which changes were made to the test
scripts and also whether the test results are still valid.

> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.

The first thing I tried this morning was to run the srp-test software against a merge
of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm code.
Since even that was not sufficient I tried to kick the queues via debugfs (for
s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
not sufficient to resolve the queue stall I reverted the following three patches
that are in Jens' tree:
* "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback"
* "blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request"
* "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy"

Only after I had done this the srp-test software ran again without triggering
dm queue lockups. Sorry but I have not yet had the time to test patch "[RFC]
blk-mq: fixup RESTART when queue becomes idle".

> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue.  Once we can reproduce these "unkillable" "stalls"
> in-house it'll be _much_ easier to analyze and fix.

OK, I will work with Laurence on this. Maybe Laurence and I should work on this
before analyzing the lockup that was mentioned above further?

Bart.
Bart Van Assche Jan. 18, 2018, 10:24 p.m. UTC | #8
On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> OK, I ran 5 at once of 5 separate mount points.
> I am using 4k block sizes
> Its solid consistent for me. No stalls no gaps.

Hi Laurence,

That's great news and thank you for having shared this information but I think
it should be mentioned that you have been running my tree in which some recent
block and dm patches were reverted
(https://github.com/bvanassche/linux/tree/block-scsi-for-next)

> 
> [global]
> name=02-mq
> filename=fio-output-02-mq.txt
> rw=randwrite
> verify=md5
> ;rwmixread=60
> ;rwmixwrite=40
> bs=4K
> ;direct=1
> ;numjobs=4
> ;time_based=1
> runtime=120
> 
> [file1]
> size=3G
> ioengine=libaio
> iodepth=16
> 
> I watch I/O and I see it ramp up but fio still runs and it kind of
> shuts down.

In test "file1" I see an I/O size of 3G. Does that mean that the patch that
should fix the sgl_alloc() issue is working?

Thanks,

Bart.
Laurence Oberman Jan. 18, 2018, 10:35 p.m. UTC | #9
On Thu, 2018-01-18 at 22:24 +0000, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> > OK, I ran 5 at once of 5 separate mount points.
> > I am using 4k block sizes
> > Its solid consistent for me. No stalls no gaps.
> 
> Hi Laurence,
> 
> That's great news and thank you for having shared this information
> but I think
> it should be mentioned that you have been running my tree in which
> some recent
> block and dm patches were reverted
> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
> 
> > 
> > [global]
> > name=02-mq
> > filename=fio-output-02-mq.txt
> > rw=randwrite
> > verify=md5
> > ;rwmixread=60
> > ;rwmixwrite=40
> > bs=4K
> > ;direct=1
> > ;numjobs=4
> > ;time_based=1
> > runtime=120
> > 
> > [file1]
> > size=3G
> > ioengine=libaio
> > iodepth=16
> > 
> > I watch I/O and I see it ramp up but fio still runs and it kind of
> > shuts down.
> 
> In test "file1" I see an I/O size of 3G. Does that mean that the
> patch that
> should fix the sgl_alloc() issue is working?
> 
> Thanks,
> 
> Bart.

Hello Bart, thank you.

OK, so booting into Mike's tree now, and then hopefully I get the lockups.
Can you give me some idea of what to look for?
I assume I/O just stops.

I want to get this happening in-house so we have an avenue to fix this.
Following getting the stall I will attend to the sgl patch test for the
SRPT side.

Please note I am not running latest SRPT right now as you know.

Regards
Back later with results
Jens Axboe Jan. 18, 2018, 10:39 p.m. UTC | #10
On 1/18/18 3:35 PM, Laurence Oberman wrote:
> On Thu, 2018-01-18 at 22:24 +0000, Bart Van Assche wrote:
>> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
>>> OK, I ran 5 at once of 5 separate mount points.
>>> I am using 4k block sizes
>>> Its solid consistent for me. No stalls no gaps.
>>
>> Hi Laurence,
>>
>> That's great news and thank you for having shared this information
>> but I think
>> it should be mentioned that you have been running my tree in which
>> some recent
>> block and dm patches were reverted
>> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
>>
>>>
>>> [global]
>>> name=02-mq
>>> filename=fio-output-02-mq.txt
>>> rw=randwrite
>>> verify=md5
>>> ;rwmixread=60
>>> ;rwmixwrite=40
>>> bs=4K
>>> ;direct=1
>>> ;numjobs=4
>>> ;time_based=1
>>> runtime=120
>>>
>>> [file1]
>>> size=3G
>>> ioengine=libaio
>>> iodepth=16
>>>
>>> I watch I/O and I see it ramp up but fio still runs and it kind of
>>> shuts down.
>>
>> In test "file1" I see an I/O size of 3G. Does that mean that the
>> patch that
>> should fix the sgl_alloc() issue is working?
>>
>> Thanks,
>>
>> Bart.
> 
> Hello Bart Thank you.
> 
> OK, so booting into Mike tree now and then hopefully I get the lockups.
> Can you give me some idea of what to look for.
> I assume I/O just stops.
> 
> I want to get this happening in-house so we have an avenue to fix this.
> Following getting the stall I will attend to the slg patch test for the
> SRPT side.
> 
> Please note I am not running latest SRPT right now as you know.

When you do have a solid test case, please please submit a blktests
test case for it! This needs to be something we can run regularly in
testing.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = rq->q;
 	bool run_queue = true;
 
-	/*
-	 * RCU or SRCU read lock is needed before checking quiesced flag.
-	 *
-	 * When queue is stopped or quiesced, ignore 'bypass_insert' from
-	 * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
-	 * and avoid driver to try to dispatch again.
-	 */
+	/* RCU or SRCU read lock is needed before checking quiesced flag */
 	if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
 		run_queue = false;
-		bypass_insert = false;
+		if (bypass_insert)
+			return BLK_STS_DM_REQUEUE;
 		goto insert;
 	}
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@  static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 	return r;
@@ -472,6 +472,7 @@  static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
  * Returns:
  * DM_MAPIO_*       : the request has been processed as indicated
  * DM_MAPIO_REQUEUE : the original request needs to be immediately requeued
+ * DM_MAPIO_DELAY_REQUEUE : the original request needs to be requeued after delay
  * < 0              : the request was completed due to failure
  */
 static int map_request(struct dm_rq_target_io *tio)
@@ -500,11 +501,11 @@  static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DM_REQUEUE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
-			if (!rq->q->mq_ops)
+			if (ret == BLK_STS_DM_REQUEUE || !rq->q->mq_ops)
 				r = DM_MAPIO_DELAY_REQUEUE;
 			else
 				r = DM_MAPIO_REQUEUE;
@@ -741,6 +742,7 @@  static int dm_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 			  const struct blk_mq_queue_data *bd)
 {
+	int r;
 	struct request *rq = bd->rq;
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
@@ -768,10 +770,13 @@  static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	tio->ti = ti;
 
 	/* Direct call is fine since .queue_rq allows allocations */
-	if (map_request(tio) == DM_MAPIO_REQUEUE) {
+	r = map_request(tio);
+	if (r == DM_MAPIO_REQUEUE || r == DM_MAPIO_DELAY_REQUEUE) {
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		if (r == DM_MAPIO_DELAY_REQUEUE)
+			blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}