[1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE

Message ID 20180122033550.27855-2-ming.lei@redhat.com (mailing list archive)
State Deferred, archived
Delegated to: Mike Snitzer

Commit Message

Ming Lei Jan. 22, 2018, 3:35 a.m. UTC
This status is returned from the driver to the block layer when a
device-related resource has run out, but the driver can guarantee
that IO dispatch will be retriggered once the resource is available.

This patch converts some drivers to use the new return value. Meanwhile,
if a driver returns BLK_STS_RESOURCE and the BLK_MQ_S_SCHED_RESTART bit
is set, the queue is rerun after 10ms to avoid an IO hang.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  2 +-
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    |  7 +++++++
 8 files changed, 30 insertions(+), 11 deletions(-)
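
To make the intended contract concrete, here is a minimal sketch of how a
.queue_rq() implementation would choose between the two status codes (a
hypothetical "foo" driver; foo_out_of_device_tags() and foo_submit() are
illustrative names, not real APIs):

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_dev *dev = hctx->queue->queuedata;

	if (foo_out_of_device_tags(dev)) {
		/*
		 * A later completion frees a device tag and reruns the
		 * queue, so dispatch is guaranteed to be retriggered:
		 * BLK_STS_DEV_RESOURCE is safe here.
		 */
		return BLK_STS_DEV_RESOURCE;
	}

	if (foo_submit(dev, bd->rq) == -ENOMEM) {
		/*
		 * No future event is guaranteed to rerun the queue;
		 * return BLK_STS_RESOURCE and let blk-mq rerun it after
		 * 10ms while the RESTART bit is set.
		 */
		return BLK_STS_RESOURCE;
	}

	return BLK_STS_OK;
}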

Comments

Christoph Hellwig Jan. 22, 2018, 4:32 p.m. UTC | #1
> -		if (ret == BLK_STS_RESOURCE) {
> +		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {

No need for the inner parentheses here.

> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))

Same here.

> +/*
> + * This status is returned from the driver to the block layer when a
> + * device-related resource has run out, but the driver can guarantee
> + * that IO dispatch will be retriggered once the resource is available.
> + */
> +#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)

We'll need a good explanation of BLK_STS_RESOURCE vs
BLK_STS_DEV_RESOURCE, I think.

Except for these nitpicks this looks fine to me.

Bart Van Assche Jan. 22, 2018, 4:49 p.m. UTC | #2
On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  		 * - Some but not all block drivers stop a queue before
>  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
>  		 *   and dm-rq.
> +		 *
> +		 * If a driver returns BLK_STS_RESOURCE and the SCHED_RESTART
> +		 * bit is set, rerun the queue after 10ms to avoid an IO hang,
> +		 * because the queue may be idle and the RESTART mechanism
> +		 * can no longer fire.
>  		 */
> -		if (!blk_mq_sched_needs_restart(hctx) ||
> +		needs_restart = blk_mq_sched_needs_restart(hctx);
> +		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +			blk_mq_delay_run_hw_queue(hctx, 10);
>  	}

In my opinion there are two problems with the above changes:
* Only the block driver author can know what a good choice is for the time
  after which to rerun the queue. So I think moving the rerun delay (10 ms)
  constant from block drivers into the core is a step backwards instead of a
  step forwards.
* The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
  any of the queue runs triggered by freeing a tag happened concurrently. I
  don't think that there is any relationship between whether or not queue
  runs happen concurrently and the chance that driver resources become
  available. So deciding whether or not a queue should be rerun based on the
  value of the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index d9ca1dfab154..55be2550c555 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case BLK_STS_OK:
>  		break;
>  	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) == 0 &&
> -		    !scsi_device_blocked(sdev))
> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> +		if (atomic_read(&sdev->device_busy) ||
> +		    scsi_device_blocked(sdev))
> +			ret = BLK_STS_DEV_RESOURCE;
>  		break;
>  	default:
>  		/*

The above introduces two changes that have not been mentioned in the
description of this patch:
- The queue rerun delay is changed from 3 ms to 10 ms. Where is the
  explanation of this change? Does this change have a positive or negative
  performance impact?
- The above modifies a guaranteed queue rerun into a queue rerun that
  may or may not happen, depending on whether or not multiple tags get freed
  concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

Bart.

Ming Lei Jan. 23, 2018, 12:57 a.m. UTC | #3
On Mon, Jan 22, 2018 at 04:49:54PM +0000, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> > @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >  		 * - Some but not all block drivers stop a queue before
> >  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
> >  		 *   and dm-rq.
> > +		 *
> > +		 * If a driver returns BLK_STS_RESOURCE and the SCHED_RESTART
> > +		 * bit is set, rerun the queue after 10ms to avoid an IO hang,
> > +		 * because the queue may be idle and the RESTART mechanism
> > +		 * can no longer fire.
> >  		 */
> > -		if (!blk_mq_sched_needs_restart(hctx) ||
> > +		needs_restart = blk_mq_sched_needs_restart(hctx);
> > +		if (!needs_restart ||
> >  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >  			blk_mq_run_hw_queue(hctx, true);
> > +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +			blk_mq_delay_run_hw_queue(hctx, 10);
> >  	}
> 
> In my opinion there are two problems with the above changes:
> * Only the block driver author can know what a good choice is for the time

The reality is that no one knew that before; if you look at the facts, most
BLK_STS_RESOURCE returns need that delay, but no one dealt with it at all.
Both Jens and I concluded that it is a generic issue which needs a generic
solution.

On the contrary, it is much easier to evaluate whether a given STS_RESOURCE
can be converted to STS_DEV_RESOURCE; as you saw, there aren't many, because
we now call .queue_rq() only after getting the budget and a driver tag.

>   after which to rerun the queue. So I think moving the rerun delay (10 ms)
>   constant from block drivers into the core is a step backwards instead of a
>   step forwards.

As I mentioned before, when a resource runs out inside .queue_rq(), the
request is still outside blk-mq's control, so scheduling a queue run from
inside .queue_rq() isn't correct: by the time __blk_mq_run_hw_queue() runs,
the request may not yet be visible for dispatch.

Even though the RCU lock is held during dispatch, preemption or an interrupt
can still happen, so it is simply wrong to depend on timing to make sure
__blk_mq_run_hw_queue() can see the request in this situation.
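
Concretely, the ordering hazard looks like this (an illustrative timeline,
not literal code, for the case where the driver schedules the delayed run
itself):

  CPU0: .queue_rq()                      CPU1: kblockd worker
  -----------------                      --------------------
  blk_mq_delay_run_hw_queue()
  return BLK_STS_RESOURCE
                                         __blk_mq_run_hw_queue()
                                         (hctx->dispatch is still empty,
                                          so nothing is dispatched)
  list_splice_init(list, &hctx->dispatch)
  ... no further queue run is scheduled, so the request can hang ...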

Alternatively, the driver could reinsert the request into the scheduler
queue, but doing that in the driver would be a much bigger change and would
introduce unnecessary cost in the meantime.

> * The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
>   any of the queue runs triggered by freeing a tag happened concurrently. I
>   don't think that there is any relationship between whether or not queue
>   runs happen concurrently and the chance that driver resources become
>   available. So deciding whether or not a queue should be rerun based on the
>   value of the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

The fact is simple: once BLK_MQ_S_SCHED_RESTART is set,
blk_mq_dispatch_rq_list() won't call blk_mq_run_hw_queue() any more and
depends on a request completion to rerun the queue. If a driver returning
STS_RESOURCE can't make sure the queue will be restarted in the future, we
have to call blk_mq_delay_run_hw_queue(hctx, 10) to avoid an IO hang.
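
For reference, the completion-side restart path is roughly the following (a
simplified sketch of blk_mq_sched_restart_hctx() from this era, with the
shared-tags handling trimmed):

/*
 * Called from the request completion path. The queue is only rerun if
 * dispatch previously gave up and set the RESTART bit; with no request
 * in flight this never fires, which is the hang described above.
 */
static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
	blk_mq_run_hw_queue(hctx, true);
	return true;
}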

> 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index d9ca1dfab154..55be2550c555 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  	case BLK_STS_OK:
> >  		break;
> >  	case BLK_STS_RESOURCE:
> > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > -		    !scsi_device_blocked(sdev))
> > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > +		if (atomic_read(&sdev->device_busy) ||
> > +		    scsi_device_blocked(sdev))
> > +			ret = BLK_STS_DEV_RESOURCE;
> >  		break;
> >  	default:
> >  		/*
> 
> The above introduces two changes that have not been mentioned in the
> description of this patch:
> - The queue rerun delay is changed from 3 ms to 10 ms. Where is the
>   explanation of this change? Does this change have a positive or negative
>   performance impact?

How can that be an issue for SCSI? The rerun delay is only triggered
when there are no in-flight requests in the SCSI queue.

> - The above modifies a guaranteed queue rerun into a queue rerun that
>   may or may not happen, depending on whether or not multiple tags get freed
>   concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

This patch only moves the rerun delay from SCSI's .queue_rq() into blk-mq;
I don't understand what you mean above, so please explain a bit.

I know there is a race here in SCSI, but it has nothing to do with this
patch.
Bart Van Assche Jan. 23, 2018, 4:17 p.m. UTC | #4
On 01/22/18 16:57, Ming Lei wrote:
> Even though the RCU lock is held during dispatch, preemption or an interrupt
> can still happen, so it is simply wrong to depend on timing to make sure
> __blk_mq_run_hw_queue() can see the request in this situation.

It is very unlikely that this race will ever be hit because that race 
exists for less than one microsecond and the smallest timeout that can 
be specified for delayed queue rerunning is one millisecond. Let's 
address this race if anyone ever finds a way to hit it.

>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index d9ca1dfab154..55be2550c555 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>   	case BLK_STS_OK:
>>>   		break;
>>>   	case BLK_STS_RESOURCE:
>>> -		if (atomic_read(&sdev->device_busy) == 0 &&
>>> -		    !scsi_device_blocked(sdev))
>>> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>>> +		if (atomic_read(&sdev->device_busy) ||
>>> +		    scsi_device_blocked(sdev))
>>> +			ret = BLK_STS_DEV_RESOURCE;
>>>   		break;
>>>   	default:
>>>   		/*
>>
>> The above introduces two changes that have not been mentioned in the
>> description of this patch:
>> - The queue rerun delay is changed from 3 ms to 10 ms. Where is the
>>    explanation of this change? Does this change have a positive or negative
>>    performance impact?
> 
> How can that be an issue for SCSI? The rerun delay is only triggered
> when there are no in-flight requests in the SCSI queue.

That change will result in more scsi_queue_rq() calls and hence in higher
CPU usage. How much the CPU usage increases will depend on the CPU time
required by the LLD .queuecommand() callback if that function gets invoked.

Bart.

Ming Lei Jan. 23, 2018, 4:26 p.m. UTC | #5
On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> 
> 
> On 01/22/18 16:57, Ming Lei wrote:
> > Even though the RCU lock is held during dispatch, preemption or an interrupt
> > can still happen, so it is simply wrong to depend on timing to make sure
> > __blk_mq_run_hw_queue() can see the request in this situation.
> 
> It is very unlikely that this race will ever be hit because that race exists
> for less than one microsecond and the smallest timeout that can be specified
> for delayed queue rerunning is one millisecond. Let's address this race if
> anyone ever finds a way to hit it.

Please don't depend on timing, which is often fragile, when we can make
this correct with a generic approach. Also, we should avoid making every
driver follow this pattern for most STS_RESOURCE cases, right?

> 
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index d9ca1dfab154..55be2550c555 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > >   	case BLK_STS_OK:
> > > >   		break;
> > > >   	case BLK_STS_RESOURCE:
> > > > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > > > -		    !scsi_device_blocked(sdev))
> > > > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > > > +		if (atomic_read(&sdev->device_busy) ||
> > > > +		    scsi_device_blocked(sdev))
> > > > +			ret = BLK_STS_DEV_RESOURCE;
> > > >   		break;
> > > >   	default:
> > > >   		/*
> > > 
> > > The above introduces two changes that have not been mentioned in the
> > > description of this patch:
> > > - The queue rerun delay is changed from 3 ms to 10 ms. Where is the
> > >    explanation of this change? Does this change have a positive or negative
> > >    performance impact?
> > 
> > How can that be an issue for SCSI? The rerun delay is only triggered
> > when there are no in-flight requests in the SCSI queue.
> 
> That change will result in more scsi_queue_rq() calls and hence in higher
> CPU usage. How much the CPU usage increases will depend on the CPU time
> required by the LLD .queuecommand() callback if that function gets invoked.

No, this patch won't increase CPU usage for SCSI; the only change is to move
the blk_mq_delay_run_hw_queue() call out of SCSI's .queue_rq(), and the delay
becomes 10 ms.

Thanks,
Ming

Bart Van Assche Jan. 23, 2018, 4:37 p.m. UTC | #6
On 01/23/18 08:26, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
>> On 01/22/18 16:57, Ming Lei wrote:
>>> Even though the RCU lock is held during dispatch, preemption or an interrupt
>>> can still happen, so it is simply wrong to depend on timing to make sure
>>> __blk_mq_run_hw_queue() can see the request in this situation.
>>
>> It is very unlikely that this race will ever be hit because that race exists
>> for less than one microsecond and the smallest timeout that can be specified
>> for delayed queue rerunning is one millisecond. Let's address this race if
>> anyone ever finds a way to hit it.
> 
> Please don't depend on timing, which is often fragile, when we can make
> this correct with a generic approach. Also, we should avoid making every
> driver follow this pattern for most STS_RESOURCE cases, right?

Wouldn't it be better to fix that race without changing the block layer
API, e.g. by using call_rcu() for delayed queue runs? As you know,
call_rcu() will only call the specified function after a grace period.
Since pushing back requests onto the dispatch list happens with the RCU
lock held, using call_rcu() for delayed queue runs would be sufficient to
address this race.

Bart.

Ming Lei Jan. 23, 2018, 4:41 p.m. UTC | #7
On Tue, Jan 23, 2018 at 08:37:26AM -0800, Bart Van Assche wrote:
> On 01/23/18 08:26, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> > > On 01/22/18 16:57, Ming Lei wrote:
> > > > Even though the RCU lock is held during dispatch, preemption or an interrupt
> > > > can still happen, so it is simply wrong to depend on timing to make sure
> > > > __blk_mq_run_hw_queue() can see the request in this situation.
> > > 
> > > It is very unlikely that this race will ever be hit because that race exists
> > > for less than one microsecond and the smallest timeout that can be specified
> > > for delayed queue rerunning is one millisecond. Let's address this race if
> > > anyone ever finds a way to hit it.
> > 
> > Please don't depend on timing, which is often fragile, when we can make
> > this correct with a generic approach. Also, we should avoid making every
> > driver follow this pattern for most STS_RESOURCE cases, right?
> 
> Wouldn't it be better to fix that race without changing the block layer API,
> e.g. by using call_rcu() for delayed queue runs? As you know, call_rcu() will

Could you explain where to call call_rcu()? call_rcu() can't be used in
the IO path at all.

> only call the specified function after a grace period. Since pushing back
> requests onto the dispatch list happens with the RCU lock held, using
> call_rcu() for delayed queue runs would be sufficient to address this race.
Bart Van Assche Jan. 23, 2018, 4:47 p.m. UTC | #8
On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> Could you explain where to call call_rcu()? call_rcu() can't be used in
> the IO path at all.

Can you explain what makes you think that call_rcu() can't be used in the
I/O path? As you know call_rcu() invokes a function asynchronously. From
Documentation/RCU/Design/Requirements/Requirements.html:

  The <tt>call_rcu()</tt> function may be used in a number of
  situations where neither <tt>synchronize_rcu()</tt> nor
  <tt>synchronize_rcu_expedited()</tt> would be legal,
  including within preempt-disable code, <tt>local_bh_disable()</tt> code,
  interrupt-disable code, and interrupt handlers.

Bart.

Ming Lei Jan. 23, 2018, 4:49 p.m. UTC | #9
On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > Could you explain where to call call_rcu()? call_rcu() can't be used in
> > the IO path at all.
> 
> Can you explain what makes you think that call_rcu() can't be used in the
> I/O path? As you know call_rcu() invokes a function asynchronously. From
> Documentation/RCU/Design/Requirements/Requirements.html:
> 
>   The <tt>call_rcu()</tt> function may be used in a number of
>   situations where neither <tt>synchronize_rcu()</tt> nor
>   <tt>synchronize_rcu_expedited()</tt> would be legal,
>   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
>   interrupt-disable code, and interrupt handlers.

OK, supposing that is true, do you want to convert most STS_RESOURCE uses
in all drivers to this approach? Why can't we use the generic and simple
approach in this patch?
Bart Van Assche Jan. 23, 2018, 4:54 p.m. UTC | #10
On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > Could you explain where to call call_rcu()? call_rcu() can't be used in
> > > the IO path at all.
> > 
> > Can you explain what makes you think that call_rcu() can't be used in the
> > I/O path? As you know call_rcu() invokes a function asynchronously. From
> > Documentation/RCU/Design/Requirements/Requirements.html:
> > 
> >   The <tt>call_rcu()</tt> function may be used in a number of
> >   situations where neither <tt>synchronize_rcu()</tt> nor
> >   <tt>synchronize_rcu_expedited()</tt> would be legal,
> >   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
> >   interrupt-disable code, and interrupt handlers.
> 
> > OK, supposing that is true, do you want to convert most STS_RESOURCE uses
> > in all drivers to this approach? Why can't we use the generic and simple
> > approach in this patch?

Please reread my proposal. I did not propose to change any block drivers.
What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
That's a generic and simple approach.
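
A minimal sketch of that proposal (illustrative only: the run_rcu and
run_delay_msecs fields do not exist in struct blk_mq_hw_ctx and would have to
be added, and real code would also have to guarantee at most one pending
call_rcu() per hctx):

static void blk_mq_run_hw_queue_rcu(struct rcu_head *head)
{
	struct blk_mq_hw_ctx *hctx =
		container_of(head, struct blk_mq_hw_ctx, run_rcu);

	/*
	 * RCU callbacks run in softirq context; defer the actual queue
	 * run to kblockd as before, honoring the requested delay.
	 */
	kblockd_schedule_delayed_work(&hctx->run_work,
			msecs_to_jiffies(hctx->run_delay_msecs));
}

void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx,
			       unsigned long msecs)
{
	hctx->run_delay_msecs = msecs;
	/*
	 * Wait one grace period so that a requeue done under the RCU
	 * read lock in the dispatch path is visible before the queue is
	 * rerun.
	 */
	call_rcu(&hctx->run_rcu, blk_mq_run_hw_queue_rcu);
}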

Bart.

Ming Lei Jan. 23, 2018, 4:59 p.m. UTC | #11
On Tue, Jan 23, 2018 at 04:54:02PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > > Could you explain where to call call_rcu()? call_rcu() can't be used in
> > > > the IO path at all.
> > > 
> > > Can you explain what makes you think that call_rcu() can't be used in the
> > > I/O path? As you know call_rcu() invokes a function asynchronously. From
> > > Documentation/RCU/Design/Requirements/Requirements.html:
> > > 
> > >   The <tt>call_rcu()</tt> function may be used in a number of
> > >   situations where neither <tt>synchronize_rcu()</tt> nor
> > >   <tt>synchronize_rcu_expedited()</tt> would be legal,
> > >   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
> > >   interrupt-disable code, and interrupt handlers.
> > 
> > OK, supposing that is true, do you want to convert most STS_RESOURCE uses
> > in all drivers to this approach? Why can't we use the generic and simple
> > approach in this patch?
> 
> Please reread my proposal. I did not propose to change any block drivers.
> What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
> such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
> That's a generic and simple approach.

How is that enough to fix the IO hang when a driver returns STS_RESOURCE
and the queue is idle? If you want to follow dm-rq's previous approach of
calling blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs
to be applied to other drivers too, right?

Unfortunately most STS_RESOURCE users don't apply this trick, but they
need to be handled.

The 'blk-mq: introduce BLK_STS_DEV_RESOURCE' patch can fix all of these
cases.
Bart Van Assche Jan. 23, 2018, 10:01 p.m. UTC | #12
On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> How is that enough to fix the IO hang when a driver returns STS_RESOURCE
> and the queue is idle? If you want to follow dm-rq's previous approach of
> calling blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs
> to be applied to other drivers too, right?
> 
> Unfortunately most STS_RESOURCE users don't apply this trick, but they
> need to be handled.
> 
> The 'blk-mq: introduce BLK_STS_DEV_RESOURCE' patch can fix all of these
> cases.

The goal of my proposal was only to address the race between running the
queue and adding requests back to the dispatch list. Regarding the I/O hangs
that can occur if a block driver returns BLK_STS_RESOURCE: since all of these
can be addressed by inserting blk_mq_delay_run_hw_queue() calls in the
affected block drivers, I prefer to modify the block drivers instead of
making the blk-mq core even more complicated.
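
The per-driver alternative would look roughly like what scsi-mq and dm-rq
already did before this patch (a sketch; the foo_* names and
FOO_REQUEUE_DELAY_MS are hypothetical):

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_dev *dev = hctx->queue->queuedata;

	if (foo_out_of_resources(dev)) {
		/*
		 * The driver itself guarantees a future queue run
		 * instead of relying on the blk-mq core to do so.
		 */
		blk_mq_delay_run_hw_queue(hctx, FOO_REQUEUE_DELAY_MS);
		return BLK_STS_RESOURCE;
	}

	if (foo_submit(dev, bd->rq))
		return BLK_STS_IOERR;

	return BLK_STS_OK;
}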

Thanks,

Bart.

Ming Lei Jan. 24, 2018, 2:31 a.m. UTC | #13
On Tue, Jan 23, 2018 at 10:01:37PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> > How is that enough to fix the IO hang when a driver returns STS_RESOURCE
> > and the queue is idle? If you want to follow dm-rq's previous approach of
> > calling blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs
> > to be applied to other drivers too, right?
> > 
> > Unfortunately most STS_RESOURCE users don't apply this trick, but they
> > need to be handled.
> > 
> > The 'blk-mq: introduce BLK_STS_DEV_RESOURCE' patch can fix all of these
> > cases.
> 
> The goal of my proposal was only to address the race between running the
> queue and adding requests back to the dispatch list. Regarding the I/O hangs
> that can occur if a block driver returns BLK_STS_RESOURCE: since all of these
> can be addressed by inserting blk_mq_delay_run_hw_queue() calls in the
> affected block drivers, I prefer to modify the block drivers instead of
> making the blk-mq core even more complicated.

IMO, this change doesn't make the blk-mq code more complicated, and it is
well documented; see the change in the block layer:
 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 include/linux/blk_types.h    | 18 ++++++++++++++++++

Also, 21 of those lines are comments, so only 17 lines of code change are
needed in the block layer.

If blk_mq_delay_run_hw_queue() is inserted into each driver instead, the
change becomes more complicated: since call_rcu() has to be used, you need
to figure out a way to provide the callback and its parameter. You even
have to document the change in each driver.

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE drivers/ | wc -l
42

There are at least 42 uses of BLK_STS_RESOURCE in drivers; in theory a
call_rcu()-based blk_mq_delay_run_hw_queue() would have to be inserted at
every one of those BLK_STS_RESOURCE return sites.

I leave the decision to Jens and the driver maintainers.

Thanks,
Ming


Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@  static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..e91d688792a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1257,8 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1282,18 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If a driver returns BLK_STS_RESOURCE and the SCHED_RESTART
+		 * bit is set, rerun the queue after 10ms to avoid an IO hang,
+		 * because the queue may be idle and the RESTART mechanism
+		 * can no longer fire.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1774,7 @@  static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1837,7 @@  static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_lock(hctx, &srcu_idx);
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@  static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 				return BLK_STS_OK;
 			} else
 				/* requeue request */
-				return BLK_STS_RESOURCE;
+				return BLK_STS_DEV_RESOURCE;
 		}
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@  static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..e0bb56723b9b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -500,7 +500,7 @@  static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..d76f1744a7ca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,13 @@  typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * This status is returned from the driver to the block layer when a
+ * device-related resource has run out, but the driver can guarantee
+ * that IO dispatch will be retriggered once the resource is available.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with