Message ID | 20180122033550.27855-2-ming.lei@redhat.com (mailing list archive)
---|---
State | Not Applicable
Headers | show
> -	if (ret == BLK_STS_RESOURCE) {
> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {

No need for the inner braces here.

> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))

Same here.

> +/*
> + * This status is returned from driver to block layer if device related
> + * resource is run out of, but driver can guarantee that IO dispatch is
> + * triggered in future when the resource is available.
> + */
> +#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)

We'll need some good explanation of BLK_STS_RESOURCE vs
BLK_STS_DEV_RESOURCE, I think. Except for these nitpicks this looks fine
to me.
On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> 	 * - Some but not all block drivers stop a queue before
> 	 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
> 	 *   and dm-rq.
> +	 *
> +	 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> +	 * bit is set, run queue after 10ms for avoiding IO hang
> +	 * because the queue may be idle and the RESTART mechanism
> +	 * can't work any more.
> 	 */
> -	if (!blk_mq_sched_needs_restart(hctx) ||
> +	needs_restart = blk_mq_sched_needs_restart(hctx);
> +	if (!needs_restart ||
> 	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> 		blk_mq_run_hw_queue(hctx, true);
> +	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +		blk_mq_delay_run_hw_queue(hctx, 10);
> }

In my opinion there are two problems with the above changes:
* Only the block driver author can know what a good choice is for the time
  after which to rerun the queue. So I think moving the rerun delay (10 ms)
  constant from block drivers into the core is a step backwards instead of a
  step forwards.
* The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
  any of the queue runs triggered by freeing a tag happened concurrently. I
  don't think that there is any relationship between queue runs happening
  concurrently or not and the chance that driver resources become available.
  So deciding whether or not a queue should be rerun based on the value of
  the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index d9ca1dfab154..55be2550c555 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> 	case BLK_STS_OK:
> 		break;
> 	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) == 0 &&
> -		    !scsi_device_blocked(sdev))
> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> +		if (atomic_read(&sdev->device_busy) ||
> +		    scsi_device_blocked(sdev))
> +			ret = BLK_STS_DEV_RESOURCE;
> 		break;
> 	default:
> 		/*

The above introduces two changes that have not been mentioned in the
description of this patch:
- The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
  explanation of this change? Does this change have a positive or negative
  performance impact?
- The above modifies a guaranteed queue rerun into a queue rerun that
  may or may not happen, depending on whether or not multiple tags get freed
  concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

Bart.
On Mon, Jan 22, 2018 at 04:49:54PM +0000, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> > @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > 	 * - Some but not all block drivers stop a queue before
> > 	 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
> > 	 *   and dm-rq.
> > +	 *
> > +	 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> > +	 * bit is set, run queue after 10ms for avoiding IO hang
> > +	 * because the queue may be idle and the RESTART mechanism
> > +	 * can't work any more.
> > 	 */
> > -	if (!blk_mq_sched_needs_restart(hctx) ||
> > +	needs_restart = blk_mq_sched_needs_restart(hctx);
> > +	if (!needs_restart ||
> > 	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > 		blk_mq_run_hw_queue(hctx, true);
> > +	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +		blk_mq_delay_run_hw_queue(hctx, 10);
> > }
>
> In my opinion there are two problems with the above changes:
> * Only the block driver author can know what a good choice is for the time

The reality is that no one knew that before: if you look at the code, most
BLK_STS_RESOURCE returns need this handling, but no driver dealt with it at
all. Both Jens and I concluded that it is a generic issue, which needs a
generic solution.

On the contrary, it is much easier to evaluate whether one STS_RESOURCE can
be converted to STS_DEV_RESOURCE. As you saw, there aren't many such cases,
because we now call .queue_rq() only after getting the budget and the driver
tag.

> after which to rerun the queue. So I think moving the rerun delay (10 ms)
> constant from block drivers into the core is a step backwards instead of a
> step forwards.

As I mentioned before, if a resource runs out inside .queue_rq(), the
request is still out of the control of blk-mq, and scheduling a queue run
from inside .queue_rq() isn't correct, because when __blk_mq_run_hw_queue()
runs, the request may not yet be visible for dispatch.

Even though the RCU lock is held during dispatch, preemption or an interrupt
can happen too, so it is simply wrong to depend on the timing to make sure
__blk_mq_run_hw_queue() can see the request in this situation.

Alternatively, the driver could reinsert the request into the scheduler
queue, but doing that in the driver would involve a much bigger change and
would introduce unnecessary cost.

> * The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
> any of the queue runs triggered by freeing a tag happened concurrently. I
> don't think that there is any relationship between queue runs happening
> concurrently or not and the chance that driver resources become available.
> So deciding whether or not a queue should be rerun based on the value of
> the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

The fact is simple: once BLK_MQ_S_SCHED_RESTART is set,
blk_mq_dispatch_rq_list() won't call blk_mq_run_hw_queue() any more, and
depends on request completion to rerun the queue. If the driver can't make
sure the queue will be restarted in the future when it returns STS_RESOURCE,
we have to call blk_mq_delay_run_hw_queue(hctx, 10) to avoid an IO hang.

> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index d9ca1dfab154..55be2550c555 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > 	case BLK_STS_OK:
> > 		break;
> > 	case BLK_STS_RESOURCE:
> > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > -		    !scsi_device_blocked(sdev))
> > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > +		if (atomic_read(&sdev->device_busy) ||
> > +		    scsi_device_blocked(sdev))
> > +			ret = BLK_STS_DEV_RESOURCE;
> > 		break;
> > 	default:
> > 		/*
>
> The above introduces two changes that have not been mentioned in the
> description of this patch:
> - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
> explanation of this change? Does this change have a positive or negative
> performance impact?

How can that be an issue for SCSI? The rerunning delay is only triggered
when there aren't any in-flight requests in the SCSI queue.

> - The above modifies a guaranteed queue rerun into a queue rerun that
> may or may not happen, depending on whether or not multiple tags get freed
> concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

This patch only moves the rerunning delay from SCSI's .queue_rq() into
blk-mq, so I don't know what you mean above; please explain a bit. I know
there is a race here in SCSI, but it has nothing to do with this patch.
On 01/22/18 16:57, Ming Lei wrote:
> Even though the RCU lock is held during dispatch, preemption or an interrupt
> can happen too, so it is simply wrong to depend on the timing to make
> sure __blk_mq_run_hw_queue() can see the request in this situation.

It is very unlikely that this race will ever be hit because that race exists
for less than one microsecond and the smallest timeout that can be specified
for delayed queue rerunning is one millisecond. Let's address this race if
anyone ever finds a way to hit it.

>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index d9ca1dfab154..55be2550c555 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> 	case BLK_STS_OK:
>>> 		break;
>>> 	case BLK_STS_RESOURCE:
>>> -		if (atomic_read(&sdev->device_busy) == 0 &&
>>> -		    !scsi_device_blocked(sdev))
>>> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>>> +		if (atomic_read(&sdev->device_busy) ||
>>> +		    scsi_device_blocked(sdev))
>>> +			ret = BLK_STS_DEV_RESOURCE;
>>> 		break;
>>> 	default:
>>> 		/*
>>
>> The above introduces two changes that have not been mentioned in the
>> description of this patch:
>> - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
>> explanation of this change? Does this change have a positive or negative
>> performance impact?
>
> How can that be an issue for SCSI? The rerunning delay is only triggered
> when there aren't any in-flight requests in the SCSI queue.

That change will result in more scsi_queue_rq() calls and hence in higher
CPU usage. By how much the CPU usage is increased will depend on the CPU
time required by the LLD .queuecommand() callback if that function gets
invoked.

Bart.
On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> On 01/22/18 16:57, Ming Lei wrote:
> > Even though the RCU lock is held during dispatch, preemption or an interrupt
> > can happen too, so it is simply wrong to depend on the timing to make
> > sure __blk_mq_run_hw_queue() can see the request in this situation.
>
> It is very unlikely that this race will ever be hit because that race exists
> for less than one microsecond and the smallest timeout that can be specified
> for delayed queue rerunning is one millisecond. Let's address this race if
> anyone ever finds a way to hit it.

Please don't depend on the timing, which is often fragile, when we can make
this correct with a generic approach. Also, we should avoid making every
driver follow this way of dealing with most STS_RESOURCE returns, right?

> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index d9ca1dfab154..55be2550c555 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > > 	case BLK_STS_OK:
> > > > 		break;
> > > > 	case BLK_STS_RESOURCE:
> > > > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > > > -		    !scsi_device_blocked(sdev))
> > > > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > > > +		if (atomic_read(&sdev->device_busy) ||
> > > > +		    scsi_device_blocked(sdev))
> > > > +			ret = BLK_STS_DEV_RESOURCE;
> > > > 		break;
> > > > 	default:
> > > > 		/*
> > >
> > > The above introduces two changes that have not been mentioned in the
> > > description of this patch:
> > > - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
> > > explanation of this change? Does this change have a positive or negative
> > > performance impact?
> >
> > How can that be an issue for SCSI? The rerunning delay is only triggered
> > when there aren't any in-flight requests in the SCSI queue.
>
> That change will result in more scsi_queue_rq() calls and hence in higher
> CPU usage. By how much the CPU usage is increased will depend on the CPU
> time required by the LLD .queuecommand() callback if that function gets
> invoked.

No, this patch won't increase CPU usage on SCSI. The only change is to move
the blk_mq_delay_run_hw_queue() call out of SCSI's .queue_rq(), and the
delay becomes 10 ms.

Thanks,
Ming
On 01/23/18 08:26, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
>> On 01/22/18 16:57, Ming Lei wrote:
>>> Even though the RCU lock is held during dispatch, preemption or an interrupt
>>> can happen too, so it is simply wrong to depend on the timing to make
>>> sure __blk_mq_run_hw_queue() can see the request in this situation.
>>
>> It is very unlikely that this race will ever be hit because that race exists
>> for less than one microsecond and the smallest timeout that can be specified
>> for delayed queue rerunning is one millisecond. Let's address this race if
>> anyone ever finds a way to hit it.
>
> Please don't depend on the timing, which is often fragile, when we can make
> this correct with a generic approach. Also, we should avoid making every
> driver follow this way of dealing with most STS_RESOURCE returns, right?

Wouldn't it be better to fix that race without changing the block layer API,
e.g. by using call_rcu() for delayed queue runs? As you know, call_rcu()
will only call the specified function after a grace period. Since pushing
back requests onto the dispatch list happens with the RCU lock held, using
call_rcu() for delayed queue runs would be sufficient to address this race.

Bart.
On Tue, Jan 23, 2018 at 08:37:26AM -0800, Bart Van Assche wrote:
> On 01/23/18 08:26, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> > > On 01/22/18 16:57, Ming Lei wrote:
> > > > Even though the RCU lock is held during dispatch, preemption or an interrupt
> > > > can happen too, so it is simply wrong to depend on the timing to make
> > > > sure __blk_mq_run_hw_queue() can see the request in this situation.
> > >
> > > It is very unlikely that this race will ever be hit because that race exists
> > > for less than one microsecond and the smallest timeout that can be specified
> > > for delayed queue rerunning is one millisecond. Let's address this race if
> > > anyone ever finds a way to hit it.
> >
> > Please don't depend on the timing, which is often fragile, when we can make
> > this correct with a generic approach. Also, we should avoid making every
> > driver follow this way of dealing with most STS_RESOURCE returns, right?
>
> Wouldn't it be better to fix that race without changing the block layer API,
> e.g. by using call_rcu() for delayed queue runs? As you know, call_rcu() will

Could you explain where to call call_rcu()? call_rcu() can't be used in the
IO path at all.

> only call the specified function after a grace period. Since pushing back
> requests onto the dispatch list happens with the RCU lock held, using
> call_rcu() for delayed queue runs would be sufficient to address this race.
On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> Could you explain where to call call_rcu()? call_rcu() can't be used in the
> IO path at all.

Can you explain what makes you think that call_rcu() can't be used in the
I/O path? As you know, call_rcu() invokes a function asynchronously. From
Documentation/RCU/Design/Requirements/Requirements.html:

    The <tt>call_rcu()</tt> function may be used in a number of
    situations where neither <tt>synchronize_rcu()</tt> nor
    <tt>synchronize_rcu_expedited()</tt> would be legal, including
    within preempt-disable code, <tt>local_bh_disable()</tt> code,
    interrupt-disable code, and interrupt handlers.

Bart.
On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > Could you explain where to call call_rcu()? call_rcu() can't be used in the
> > IO path at all.
>
> Can you explain what makes you think that call_rcu() can't be used in the
> I/O path? As you know, call_rcu() invokes a function asynchronously. From
> Documentation/RCU/Design/Requirements/Requirements.html:
>
>     The <tt>call_rcu()</tt> function may be used in a number of
>     situations where neither <tt>synchronize_rcu()</tt> nor
>     <tt>synchronize_rcu_expedited()</tt> would be legal, including
>     within preempt-disable code, <tt>local_bh_disable()</tt> code,
>     interrupt-disable code, and interrupt handlers.

OK, suppose that is true: do you want to change most of the STS_RESOURCE
returns in all drivers to this way? Why can't we use the generic and simple
approach in this patch?
On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > Could you explain where to call call_rcu()? call_rcu() can't be used in the
> > > IO path at all.
> >
> > Can you explain what makes you think that call_rcu() can't be used in the
> > I/O path? As you know, call_rcu() invokes a function asynchronously. From
> > Documentation/RCU/Design/Requirements/Requirements.html:
> >
> >     The <tt>call_rcu()</tt> function may be used in a number of
> >     situations where neither <tt>synchronize_rcu()</tt> nor
> >     <tt>synchronize_rcu_expedited()</tt> would be legal, including
> >     within preempt-disable code, <tt>local_bh_disable()</tt> code,
> >     interrupt-disable code, and interrupt handlers.
>
> OK, suppose that is true: do you want to change most of the STS_RESOURCE
> returns in all drivers to this way? Why can't we use the generic and simple
> approach in this patch?

Please reread my proposal. I did not propose to change any block drivers.
What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
That's a generic and simple approach.

Bart.
On Tue, Jan 23, 2018 at 04:54:02PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > > Can you explain what makes you think that call_rcu() can't be used in the
> > > I/O path? As you know, call_rcu() invokes a function asynchronously. From
> > > Documentation/RCU/Design/Requirements/Requirements.html:
> > >
> > >     The <tt>call_rcu()</tt> function may be used in a number of
> > >     situations where neither <tt>synchronize_rcu()</tt> nor
> > >     <tt>synchronize_rcu_expedited()</tt> would be legal, including
> > >     within preempt-disable code, <tt>local_bh_disable()</tt> code,
> > >     interrupt-disable code, and interrupt handlers.
> >
> > OK, suppose that is true: do you want to change most of the STS_RESOURCE
> > returns in all drivers to this way? Why can't we use the generic and simple
> > approach in this patch?
>
> Please reread my proposal. I did not propose to change any block drivers.
> What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
> such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
> That's a generic and simple approach.

How is that enough to fix the IO hang when a driver returns STS_RESOURCE
and the queue is idle? If you want to follow dm-rq's previous way of calling
blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs to be
applied to other drivers too, right?

Unfortunately most users of STS_RESOURCE don't use this trick, but they
need to be handled.

The patch 'blk-mq: introduce BLK_STS_DEV_RESOURCE' can fix all these cases.
On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> How is that enough to fix the IO hang when a driver returns STS_RESOURCE
> and the queue is idle? If you want to follow dm-rq's previous way of calling
> blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs to be
> applied to other drivers too, right?
>
> Unfortunately most users of STS_RESOURCE don't use this trick, but they
> need to be handled.
>
> The patch 'blk-mq: introduce BLK_STS_DEV_RESOURCE' can fix all these cases.

The goal of my proposal was to address the race between running the queue and
adding requests back to the dispatch queue only. Regarding the I/O hangs that
can occur if a block driver returns BLK_STS_RESOURCE: since all of these can
be addressed by inserting blk_mq_delay_run_hw_queue() calls in the affected
block drivers, I prefer to modify the block drivers instead of making the
blk-mq core even more complicated.

Thanks,

Bart.
On Tue, Jan 23, 2018 at 10:01:37PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> > How is that enough to fix the IO hang when a driver returns STS_RESOURCE
> > and the queue is idle? If you want to follow dm-rq's previous way of calling
> > blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs to be
> > applied to other drivers too, right?
> >
> > Unfortunately most users of STS_RESOURCE don't use this trick, but they
> > need to be handled.
> >
> > The patch 'blk-mq: introduce BLK_STS_DEV_RESOURCE' can fix all these cases.
>
> The goal of my proposal was to address the race between running the queue and
> adding requests back to the dispatch queue only. Regarding the I/O hangs that
> can occur if a block driver returns BLK_STS_RESOURCE: since all of these can
> be addressed by inserting blk_mq_delay_run_hw_queue() calls in the affected
> block drivers, I prefer to modify the block drivers instead of making the
> blk-mq core even more complicated.

IMO, this change doesn't make the blk-mq code more complicated, and it is
well documented; see the change in the block layer:

 block/blk-core.c          |  1 +
 block/blk-mq.c            | 19 +++++++++++++++----
 include/linux/blk_types.h | 18 ++++++++++++++++++

21 of those lines are comments, so only 17 lines of code change are needed
in the block layer.

If blk_mq_delay_run_hw_queue() is inserted into the drivers instead, the
change can be a bit complicated: since call_rcu() has to be used, you need
to figure out a way to provide the callback and its parameter, and you
would have to document the change in each driver too.

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE drivers/ | wc -l
42

There are at least 42 uses of BLK_STS_RESOURCE in drivers; in theory you
would have to insert call_rcu(blk_mq_delay_run_hw_queue) at every
BLK_STS_RESOURCE in drivers.

I leave the decision to Jens and the driver maintainers.

Thanks,
Ming
diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..e91d688792a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
+		 * bit is set, run queue after 10ms for avoiding IO hang
+		 * because the queue may be idle and the RESTART mechanism
+		 * can't work any more.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1774,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1837,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_lock(hctx, &srcu_idx);
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 			return BLK_STS_OK;
 		} else
 			/* requeue request */
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 	}
 }
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..e0bb56723b9b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -500,7 +500,7 @@ static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..d76f1744a7ca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,13 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * This status is returned from driver to block layer if device related
+ * resource is run out of, but driver can guarantee that IO dispatch is
+ * triggered in future when the resource is available.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
This status is returned from driver to block layer if device related
resource is run out of, but driver can guarantee that IO dispatch is
triggered in future when the resource is available.

This patch converts some drivers to use this return value. Meantime, if a
driver returns BLK_STS_RESOURCE and S_SCHED_RESTART is marked, the queue is
run after 10ms for avoiding IO hang.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  2 +-
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    |  7 +++++++
 8 files changed, 30 insertions(+), 11 deletions(-)