diff mbox

[V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

Message ID 20180129203300.50595-1-snitzer@redhat.com (mailing list archive)
State Not Applicable

Commit Message

Mike Snitzer Jan. 29, 2018, 8:33 p.m. UTC
From: Ming Lei <ming.lei@redhat.com>

This status is returned from a driver to the block layer if a device-related
resource is unavailable, but the driver can guarantee that IO dispatch
will be triggered again in the future once the resource becomes available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls.  BLK_MQ_QUEUE_DELAY is
3 ms because both scsi-mq and nvmefc already use that magic value.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---

V4:
        - cleanup header and code comments
        - rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
        - eliminate nvmefc's queue rerun now that blk-mq does it
V3:
        - fix typo and improve documentation
        - add tested-by tag
V2:
        - add comments on the new introduced status
        - patch style fix
        - both are suggested by Christoph

 block/blk-core.c             |  1 +
 block/blk-mq.c               | 20 ++++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  5 ++---
 drivers/nvme/host/fc.c       | 12 ++----------
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    | 14 ++++++++++++++
 9 files changed, 41 insertions(+), 23 deletions(-)

Comments

Bart Van Assche Jan. 29, 2018, 9:22 p.m. UTC | #1
On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> +		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> +		 * bit is set, run queue after 10ms to avoid IO stalls
> +		 * that could otherwise occur if the queue is idle.
>  		 */
> -		if (!blk_mq_sched_needs_restart(hctx) ||
> +		needs_restart = blk_mq_sched_needs_restart(hctx);
> +		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>  	}

The above code only calls blk_mq_delay_run_hw_queue() if the following condition
is met: !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
&& (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
simplified into the following: needs_restart && ret == BLK_STS_RESOURCE && 
!(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
{
	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}

In other words, the above code will not call blk_mq_delay_run_hw_queue() if
BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is
reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
concurrently with the above code. From blk-mq-sched.c:

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
		struct request_queue *q = hctx->queue;

		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
			atomic_dec(&q->shared_hctx_restart);
	} else
		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

	return blk_mq_run_hw_queue(hctx, true);
}

The above blk_mq_run_hw_queue() call may finish either before or after
blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.

That's why I wrote in previous e-mails that this patch and also the previous
versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
into a call that may or may not happen, depending on whether or not a request
completes concurrently with request queueing. Sorry but I think that means
that the above change combined with changing BLK_STS_RESOURCE into
BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
to debug.

Plenty of e-mails have already been exchanged about versions one to four of this
patch but so far nobody has even tried to contradict the above.

Bart.
Mike Snitzer Jan. 29, 2018, 9:44 p.m. UTC | #2
On Mon, Jan 29 2018 at  4:22pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> > +		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> > +		 * bit is set, run queue after 10ms to avoid IO stalls
> > +		 * that could otherwise occur if the queue is idle.
> >  		 */
> > -		if (!blk_mq_sched_needs_restart(hctx) ||
> > +		needs_restart = blk_mq_sched_needs_restart(hctx);
> > +		if (!needs_restart ||
> >  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >  			blk_mq_run_hw_queue(hctx, true);
> > +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> >  	}
> 
> The above code only calls blk_mq_delay_run_hw_queue() if the following condition
> is met: !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> && (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
> simplified into the following: needs_restart && ret == BLK_STS_RESOURCE && 
> !(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:
> 
> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
> 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
> 
> In other words, the above code will not call blk_mq_delay_run_hw_queue() if
> BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is
> reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
> concurrently with the above code. From blk-mq-sched.c:
> 
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
> 	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> 		return false;
> 
> 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
> 		struct request_queue *q = hctx->queue;
> 
> 		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> 			atomic_dec(&q->shared_hctx_restart);
> 	} else
> 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> 
> 	return blk_mq_run_hw_queue(hctx, true);
> }
> 
> The above blk_mq_run_hw_queue() call may finish either before or after
> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.

But regardless of which wins the race, the queue will have been run.
Which is all we care about right?

> That's why I wrote in previous e-mails that this patch and also the previous
> versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
> into a call that may or may not happen, depending on whether or not a request
> completes concurrently with request queueing. Sorry but I think that means
> that the above change combined with changing BLK_STS_RESOURCE into
> BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
> sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
> to debug.
> 
> Plenty of e-mails have already been exchanged about versions one to four of this
> patch but so far nobody has even tried to contradict the above.

Sure, but we aren't mind-readers.  We're all very open to the idea that
you have noticed something we haven't.  But until your very helpful
mail I'm replying to, you hadn't been specific about the race that
concerned you.  Thanks for sharing.

I'll defer to Jens and/or Ming on whether the race you've raised is of
concern.  As I said above, I'm inclined to think it doesn't matter who
wins the race given the queue will be run again.  And when it does it
may get STS_RESOURCE again, and in that case it may then resort to
blk_mq_delay_run_hw_queue() in blk_mq_dispatch_rq_list().

Mike
Bart Van Assche Jan. 29, 2018, 9:51 p.m. UTC | #3
On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> But regardless of which wins the race, the queue will have been run.
> Which is all we care about right?

Running the queue is not sufficient. With this patch applied it can happen
that the block driver returns BLK_STS_DEV_RESOURCE, that two or more
concurrent queue runs finish before sufficient device resources are available
to execute the request, and that blk_mq_delay_run_hw_queue() does not get
called at all. If no other activity triggers a queue run, e.g. a request
completion, this will result in a queue stall.

Bart.
Mike Snitzer Jan. 29, 2018, 10:14 p.m. UTC | #4
On Mon, Jan 29 2018 at  4:51pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> > But regardless of which wins the race, the queue will have been run.
> > Which is all we care about right?
> 
> Running the queue is not sufficient. With this patch applied it can happen
> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> concurrent queue runs finish before sufficient device resources are available
> to execute the request and that blk_mq_delay_run_hw_queue() does not get
> called at all. If no other activity triggers a queue run, e.g. request
> completion, this will result in a queue stall.

If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
on a future queue run.  IIUC, that is the entire premise of
BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
resource were going to be available in the future it should return
BLK_STS_RESOURCE.

That may seem like putting a lot on a driver developer (to decide
between the 2) but I'll again defer to Jens here.  This was the approach
he was advocating be pursued.

In practice it is working.  I'd welcome you testing this V4.

Thanks,
Mike
Bart Van Assche Jan. 29, 2018, 10:22 p.m. UTC | #5
On Mon, 2018-01-29 at 17:14 -0500, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at  4:51pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
> > On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> > > But regardless of which wins the race, the queue will have been run.
> > > Which is all we care about right?
> > 
> > Running the queue is not sufficient. With this patch applied it can happen
> > that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> > concurrent queue runs finish before sufficient device resources are available
> > to execute the request and that blk_mq_delay_run_hw_queue() does not get
> > called at all. If no other activity triggers a queue run, e.g. request
> > completion, this will result in a queue stall.
> 
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run.  IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
> 
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here.  This was the approach
> he was advocating be pursued.

Hello Mike,

There was a typo in my reply: I should have referred to BLK_STS_RESOURCE
instead of BLK_STS_DEV_RESOURCE. The convention for BLK_STS_DEV_RESOURCE is
that if .queue_rq() returns that status code, the block driver guarantees
that the queue will be rerun.

Bart.
Ming Lei Jan. 30, 2018, 3:16 a.m. UTC | #6
On Tue, Jan 30, 2018 at 5:22 AM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
>> +              * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> +              * bit is set, run queue after 10ms to avoid IO stalls
>> +              * that could otherwise occur if the queue is idle.
>>                */
>> -             if (!blk_mq_sched_needs_restart(hctx) ||
>> +             needs_restart = blk_mq_sched_needs_restart(hctx);
>> +             if (!needs_restart ||
>>                   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>                       blk_mq_run_hw_queue(hctx, true);
>> +             else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> +                     blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>>       }
>
> The above code only calls blk_mq_delay_run_hw_queue() if the following condition
> is met: !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> && (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
> simplified into the following: needs_restart && ret == BLK_STS_RESOURCE &&
> !(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

No, the expression (needs_restart && (ret == BLK_STS_RESOURCE)) clearly
expresses what we want to do.

When RESTART is in effect and the driver meanwhile returns BLK_STS_RESOURCE,
we need to avoid an IO hang via blk_mq_delay_run_hw_queue().

>
> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
>         return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
>
> In other words, the above code will not call blk_mq_delay_run_hw_queue() if
> BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is
> reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
> concurrently with the above code. From blk-mq-sched.c:

That won't be an issue: any time SCHED_RESTART is cleared, the queue
is run again, so we needn't worry about that.

>
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
>         if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>                 return false;
>
>         if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
>                 struct request_queue *q = hctx->queue;
>
>                 if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>                         atomic_dec(&q->shared_hctx_restart);
>         } else
>                 clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>
>         return blk_mq_run_hw_queue(hctx, true);
> }
>
> The above blk_mq_run_hw_queue() call may finish either before or after

If the above blk_mq_run_hw_queue() finishes before blk_mq_dispatch_rq_list()
examines the flag, blk_mq_dispatch_rq_list() will see the flag cleared, and
run the queue again via the following code branch:

                if (!blk_mq_sched_needs_restart(hctx) ||
                    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                        blk_mq_run_hw_queue(hctx, true);

If blk_mq_run_hw_queue() finishes after blk_mq_dispatch_rq_list() examines
the flag, this blk_mq_run_hw_queue() will see the newly added request, and
everything is still fine. blk_mq_delay_run_hw_queue() may even be started
as well.

> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.
>
> That's why I wrote in previous e-mails that this patch and also the previous
> versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
> into a call that may or may not happen, depending on whether or not a request
> completes concurrently with request queueing. Sorry but I think that means
> that the above change combined with changing BLK_STS_RESOURCE into
> BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
> sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
> to debug.

Now that there is debugfs, it isn't difficult to track down such a stall if
the reporter is willing to cooperate.

>
> Plenty of e-mails have already been exchanged about versions one to four of this
> patch but so far nobody has even tried to contradict the above.

No, I don't see the issue, let's revisit the main change again:

+       needs_restart = blk_mq_sched_needs_restart(hctx);
+       if (!needs_restart ||
            (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
            blk_mq_run_hw_queue(hctx, true);
+       else if (needs_restart && (ret == BLK_STS_RESOURCE))
+           blk_mq_delay_run_hw_queue(hctx, 10);

If SCHED_RESTART isn't set, the queue is run immediately; otherwise, when
BLK_STS_RESOURCE is returned, we avoid an IO hang via
blk_mq_delay_run_hw_queue(hctx, XXX).

And we don't cover BLK_STS_DEV_RESOURCE above because it is clearly
documented that BLK_STS_DEV_RESOURCE is only returned by a driver if the
queue is guaranteed to be rerun once the resource is available.

Is there any hole in the above code?

Thanks,
Ming Lei
Ming Lei Jan. 30, 2018, 3:20 a.m. UTC | #7
On Tue, Jan 30, 2018 at 5:51 AM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> But regardless of which wins the race, the queue will have been run.
>> Which is all we care about right?
>
> Running the queue is not sufficient. With this patch applied it can happen
> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> concurrent queue runs finish before sufficient device resources are available
> to execute the request and that blk_mq_delay_run_hw_queue() does not get
> called at all. If no other activity triggers a queue run, e.g. request
> completion, this will result in a queue stall.

Please see the documentation of BLK_STS_DEV_RESOURCE:

+ * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
+ * related resource is unavailable, but driver can guarantee that queue
+ * will be rerun in future once the resource is available (whereby
+ * dispatching requests).

I have explained SCSI's BLK_STS_DEV_RESOURCE conversion in
another thread; let me know if you have further concerns:

https://marc.info/?l=linux-block&m=151727454815018&w=2
Ming Lei Jan. 30, 2018, 5:55 a.m. UTC | #8
On Tue, Jan 30, 2018 at 6:14 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Mon, Jan 29 2018 at  4:51pm -0500,
> Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>
>> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> > But regardless of which wins the race, the queue will have been run.
>> > Which is all we care about right?
>>
>> Running the queue is not sufficient. With this patch applied it can happen
>> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
>> concurrent queue runs finish before sufficient device resources are available
>> to execute the request and that blk_mq_delay_run_hw_queue() does not get
>> called at all. If no other activity triggers a queue run, e.g. request
>> completion, this will result in a queue stall.
>
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run.  IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
>
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here.  This was the approach
> he was advocating be pursued.

Thinking about it further, maybe you can add the following description in V5;
it should make it much easier for driver developers to follow:

When any resource allocation fails, if the driver can make sure that there is
in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE to blk-mq,
which is exactly what scsi_queue_rq() is doing.

The theory is as follows:

1) the driver returns BLK_STS_DEV_RESOURCE if it figures out there is
in-flight IO at the time a resource allocation fails

2) if all of these in-flight IOs complete before SCHED_RESTART is examined in
blk_mq_dispatch_rq_list(), SCHED_RESTART must have been cleared, so the queue
is run immediately in this case by blk_mq_dispatch_rq_list()

3) if there is any in-flight IO when/after SCHED_RESTART is examined in
blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, the queue is run immediately as handled in 2)
- otherwise, this request will be dispatched via blk_mq_sched_restart() after
some in-flight IO completes, since this request has already been added to
hctx->dispatch

And there are two invariants when a driver returns BLK_STS_DEV_RESOURCE
only if there is in-flight IO:

1) SCHED_RESTART must be zero if there is no in-flight IO
2) there has to be IO in flight whenever SCHED_RESTART is read as 1


Thanks,
Ming
John Stoffel Feb. 2, 2018, 12:26 a.m. UTC | #9
>>>>> "Bart" == Bart Van Assche <Bart.VanAssche@wdc.com> writes:

Bart> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> But regardless of which wins the race, the queue will have been run.
>> Which is all we care about right?

Bart> Running the queue is not sufficient. With this patch applied it can happen
Bart> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
Bart> concurrent queue runs finish before sufficient device resources are available
Bart> to execute the request and that blk_mq_delay_run_hw_queue() does not get
Bart> called at all. If no other activity triggers a queue run, e.g. request
Bart> completion, this will result in a queue stall.

Doesn't this argue that you really want some sort of completions to be
run in this case instead?  Instead of busy looping or waiting for a
set amount of time, just fire off a callback to run once you have the
resources available, no?

I can't provide any code examples, I don't know the code base nearly
well enough.

John
Bart Van Assche Feb. 2, 2018, 12:53 a.m. UTC | #10
On Thu, 2018-02-01 at 19:26 -0500, John Stoffel wrote:
> Doesn't this argue that you really want some sort of completions to be
> run in this case instead?  Instead of busy looping or waiting for a
> set amount of time, just fire off a callback to run once you have the
> resources available, no?

Hello John,

Rerunning the queue when the resource we ran out of is available again would
be ideal. However, I'm not sure it is possible to implement this. If a driver
performs a memory allocation while executing a request and kmalloc() fails,
then I don't know which mechanism could be used to trigger a queue rerun once
sufficient memory is available again.

Another example is the SCSI subsystem. If the .queuecommand() implementation
of a SCSI LLD returns SCSI_MLQUEUE_*_BUSY because a certain HBA condition
occurred, and the HBA does not trigger an interrupt when that condition is
cleared, how can we figure out whether or not a request can be executed other
than by retrying, which means rerunning the queue?

Thanks,

Bart.

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@  static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..dd097ca5f1e9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@  static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
 	return true;
 }
 
+#define BLK_MQ_QUEUE_DELAY	3		/* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			     bool got_budget)
 {
@@ -1167,6 +1169,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1179,7 +1182,6 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+		 * bit is set, run queue after BLK_MQ_QUEUE_DELAY (3 ms)
+		 * to avoid IO stalls if the queue is otherwise idle.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
 	}
 
 	return (queued + errors) != 0;
@@ -1762,6 +1773,7 @@  static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1824,7 +1836,7 @@  static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_lock(hctx, &srcu_idx);
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 5b94e530570c..4bc25fc4e73c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@  static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 				return BLK_STS_OK;
 			} else
 				/* requeue request */
-				return BLK_STS_RESOURCE;
+				return BLK_STS_DEV_RESOURCE;
 		}
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@  static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index b7d175e94a02..348a0cb6963a 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -404,7 +404,7 @@  static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
 	return r;
@@ -496,7 +496,7 @@  static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
@@ -769,7 +769,6 @@  static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
-		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_STS_RESOURCE;
 	}
 
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b76ba4629e02..54e679541ad6 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -35,8 +35,6 @@  enum nvme_fc_queue_flags {
 	NVME_FC_Q_LIVE,
 };
 
-#define NVMEFC_QUEUE_DELAY	3		/* ms units */
-
 #define NVME_FC_DEFAULT_DEV_LOSS_TMO	60	/* seconds */
 
 struct nvme_fc_queue {
@@ -2231,7 +2229,7 @@  nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 	 * the target device is present
 	 */
 	if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
-		goto busy;
+		return BLK_STS_RESOURCE;
 
 	if (!nvme_fc_ctrl_get(ctrl))
 		return BLK_STS_IOERR;
@@ -2311,16 +2309,10 @@  nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 				ret != -EBUSY)
 			return BLK_STS_IOERR;
 
-		goto busy;
+		return BLK_STS_RESOURCE;
 	}
 
 	return BLK_STS_OK;
-
-busy:
-	if (!(op->flags & FCOP_FLAGS_AEN) && queue->hctx)
-		blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
-
-	return BLK_STS_RESOURCE;
 }
 
 static inline blk_status_t nvme_fc_is_ready(struct nvme_fc_queue *queue,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2d973ac54b09..6a8ad60e7f09 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,20 @@  typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
+ * related resource is unavailable, but driver can guarantee that queue
+ * will be rerun in future once the resource is available (whereby
+ * dispatching requests).
+ *
+ * Difference with BLK_STS_RESOURCE:
+ * If driver isn't sure if the queue will be rerun once device resource
+ * is made available, please return BLK_STS_RESOURCE.  For example: when
+ * memory allocation, DMA Mapping or other system resource allocation
+ * fails and IO can't be submitted to device.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with